[ 
https://issues.apache.org/jira/browse/ARROW-12513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377582#comment-17377582
 ] 

Micah Kornfield commented on ARROW-12513:
-----------------------------------------

> This makes sense for min/max. 

This doesn't necessarily make sense even for dictionary values.  It only makes 
sense if all values in the dictionary are referenced.  Otherwise the bounds 
still end up being loose (there is or was an open bug on this).  
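
To illustrate, here is a minimal pure-Python sketch of the problem. It is not 
the Arrow C++ writer code; the data and the names `dictionary`, `indices`, 
`bounds_from_dictionary` and `bounds_from_referenced` are made up for the 
example. It assumes a dictionary-encoded column is a list of values plus a 
list of indices, with None standing in for a null slot.

```python
# Hypothetical dictionary-encoded column: four dictionary entries,
# but the indices only ever reference "bar" (1) and "foo" (2).
dictionary = ["apple", "bar", "foo", "zebra"]
indices = [1, 2, None, 1, 2]  # None marks a null slot


def bounds_from_dictionary(dictionary):
    """Min/max over every dictionary entry, referenced or not."""
    return min(dictionary), max(dictionary)


def bounds_from_referenced(dictionary, indices):
    """Min/max over only the entries that some non-null index points at."""
    used = [dictionary[i] for i in indices if i is not None]
    if not used:
        return None  # all-null column: no meaningful bounds
    return min(used), max(used)


loose = bounds_from_dictionary(dictionary)
tight = bounds_from_referenced(dictionary, indices)

# Nulls live in the indices, not in the dictionary, so they have to be
# counted there.
null_count = sum(1 for i in indices if i is None)

print(loose)       # ('apple', 'zebra')
print(tight)       # ('bar', 'foo')
print(null_count)  # 1
```

The loose bounds are still valid (every referenced value lies inside them), 
just wider than necessary. The null count, by contrast, cannot be derived 
from the dictionary at all.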

> [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics 
> for dictionary-encoded array with nulls
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12513
>                 URL: https://issues.apache.org/jira/browse/ARROW-12513
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet, Python
>    Affects Versions: 1.0.1, 2.0.0, 3.0.0
>         Environment: RHEL6
>            Reporter: David Beach
>            Assignee: Kirill Lykov
>            Priority: Critical
>              Labels: parquet-statistics
>
> When writing a Table as Parquet, when the table contains columns represented 
> as dictionary-encoded arrays, those columns show an incorrect null_count of 0 
> in the Parquet metadata.  If the same data is saved without 
> dictionary-encoding the array, then the null_count is correct.
> Confirmed bug with PyArrow 1.0.1, 2.0.0, and 3.0.0.
> NOTE: I'm a PyArrow user, but I believe this bug is actually in the C++ 
> implementation of the Arrow/Parquet writer.
> h3. Setup
> {code:python}
> import pyarrow as pa
> from pyarrow import parquet{code}
> h3. Bug
> (writes a dictionary encoded Arrow array to parquet)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> array1dict = array1.dictionary_encode()
> assert array1dict.null_count == 5
> table = pa.Table.from_arrays([array1dict], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count  # RESULT: 0 (WRONG!){code}
> h3. Correct
> (writes same data without dictionary encoding the Arrow array)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> table = pa.Table.from_arrays([array1], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count  # RESULT: 5 (CORRECT)
> {code}
>  



