[ 
https://issues.apache.org/jira/browse/ARROW-12513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377493#comment-17377493
 ] 

Wes McKinney commented on ARROW-12513:
--------------------------------------

I agree that we should definitely compute accurate statistics from the 
observed dictionary values and account for any nulls in the dictionary 
indices. Writing a null count of 0 whenever the dictionary values are used to 
compute the statistics was an oversight on my part.
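
A minimal sketch of the idea, written as a pure-Python model rather than the 
actual Arrow C++ writer code. The helper names dictionary_encode and 
column_statistics are hypothetical and exist only for illustration. It assumes 
nulls are stored in the indices and never in the dictionary, so min/max come 
from the dictionary values that are actually referenced, while the null count 
comes from the indices:

```python
from typing import List, Optional, Tuple


def dictionary_encode(values: List[Optional[str]]) -> Tuple[List[str], List[Optional[int]]]:
    """Toy dictionary encoding: nulls stay in the indices, not the dictionary."""
    dictionary: List[str] = []
    positions = {}
    indices: List[Optional[int]] = []
    for value in values:
        if value is None:
            indices.append(None)  # a null is recorded as a null index
            continue
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def column_statistics(dictionary: List[str], indices: List[Optional[int]]) -> dict:
    """Min/max from the observed dictionary values, null count from the indices."""
    used = {i for i in indices if i is not None}
    observed = [dictionary[i] for i in sorted(used)]
    return {
        "min": min(observed) if observed else None,
        "max": max(observed) if observed else None,
        # Counting nulls in the dictionary alone would always give 0 here.
        "null_count": sum(1 for i in indices if i is None),
    }


dictionary, indices = dictionary_encode([None, "foo", "bar"] * 5)
stats = column_statistics(dictionary, indices)
print(dictionary)  # ['foo', 'bar']
print(stats)       # {'min': 'bar', 'max': 'foo', 'null_count': 5}
```

Looking only at the dictionary (['foo', 'bar']) gives no information about 
nulls, which matches the null_count=0 reported in the issue below.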

> [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics 
> for dictionary-encoded array with nulls
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12513
>                 URL: https://issues.apache.org/jira/browse/ARROW-12513
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet, Python
>    Affects Versions: 1.0.1, 2.0.0, 3.0.0
>         Environment: RHEL6
>            Reporter: David Beach
>            Assignee: Kirill Lykov
>            Priority: Critical
>              Labels: parquet-statistics
>
> When writing a Table as Parquet, when the table contains columns represented 
> as dictionary-encoded arrays, those columns show an incorrect null_count of 0 
> in the Parquet metadata.  If the same data is saved without 
> dictionary-encoding the array, then the null_count is correct.
> Confirmed bug with PyArrow 1.0.1, 2.0.0, and 3.0.0.
> NOTE: I'm a PyArrow user, but I believe this bug is actually in the C++ 
> implementation of the Arrow/Parquet writer.
> h3. Setup
> {code:python}
> import pyarrow as pa
> from pyarrow import parquet
> {code}
> h3. Bug
> (writes a dictionary encoded Arrow array to parquet)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> array1dict = array1.dictionary_encode()
> assert array1dict.null_count == 5
> table = pa.Table.from_arrays([array1dict], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count  # RESULT: 0 (WRONG!)
> {code}
> h3. Correct
> (writes same data without dictionary encoding the Arrow array)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> table = pa.Table.from_arrays([array1], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count  # RESULT: 5 (CORRECT)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
