David Beach created ARROW-12513:
-----------------------------------

             Summary: Parquet Writer always puts null_count=0 in Parquet 
statistics for dictionary-encoded array with nulls
                 Key: ARROW-12513
                 URL: https://issues.apache.org/jira/browse/ARROW-12513
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Parquet, Python
    Affects Versions: 3.0.0, 2.0.0, 1.0.1
         Environment: RHEL6
            Reporter: David Beach


When writing a Table to Parquet, columns represented as dictionary-encoded 
arrays report an incorrect null_count of 0 in the Parquet column statistics. 
If the same data is written without dictionary-encoding the array, the 
null_count is correct.

Confirmed bug with PyArrow 1.0.1, 2.0.0, and 3.0.0.

NOTE: I'm a PyArrow user, but I believe this bug is actually in the C++ 
implementation of the Arrow/Parquet writer.
h3. Setup
{code:python}
import pyarrow as pa
from pyarrow import parquet
{code}
h3. Bug

(writes a dictionary-encoded Arrow array to Parquet)
{code:python}
array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
assert array1.null_count == 5
array1dict = array1.dictionary_encode()
assert array1dict.null_count == 5
table = pa.Table.from_arrays([array1dict], ["mycol"])
parquet.write_table(table, "testtable.parquet")
meta = parquet.read_metadata("testtable.parquet")
meta.row_group(0).column(0).statistics.null_count  # RESULT: 0 (WRONG!)
{code}
h3. Correct

(writes the same data without dictionary-encoding the Arrow array)
{code:python}
array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
assert array1.null_count == 5
table = pa.Table.from_arrays([array1], ["mycol"])
parquet.write_table(table, "testtable.parquet")
meta = parquet.read_metadata("testtable.parquet")
meta.row_group(0).column(0).statistics.null_count  # RESULT: 5 (CORRECT)
{code}
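h3. Workaround

Until the writer is fixed, one possible workaround (a sketch, not an official recommendation) is to cast dictionary-encoded columns back to their value type before writing, so the statistics are computed on the plain array. This assumes {{Array.cast}} supports the dictionary-to-string cast, which it does in the affected PyArrow versions:
{code:python}
import pyarrow as pa
from pyarrow import parquet

array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
array1dict = array1.dictionary_encode()

# Decode the dictionary back to a plain string array; null_count survives.
plain = array1dict.cast(pa.string())
assert plain.null_count == 5

# Writing the decoded column produces correct Parquet statistics,
# at the cost of losing dictionary encoding in the file.
table = pa.Table.from_arrays([plain], ["mycol"])
parquet.write_table(table, "testtable.parquet")
{code}
The trade-off is file size: dropping the dictionary encoding can make the Parquet file larger for low-cardinality columns.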
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
