David Beach created ARROW-12513:
-----------------------------------

             Summary: Parquet Writer always puts null_count=0 in Parquet statistics for dictionary-encoded array with nulls
                 Key: ARROW-12513
                 URL: https://issues.apache.org/jira/browse/ARROW-12513
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Parquet, Python
    Affects Versions: 3.0.0, 2.0.0, 1.0.1
         Environment: RHEL6
            Reporter: David Beach

When a Table is written as Parquet and it contains columns represented as dictionary-encoded arrays, those columns show an incorrect null_count of 0 in the Parquet metadata. If the same data is saved without dictionary-encoding the array, the null_count is correct.

Confirmed bug with PyArrow 1.0.1, 2.0.0, and 3.0.0.

NOTE: I'm a PyArrow user, but I believe this bug is actually in the C++ implementation of the Arrow/Parquet writer.

h3. Setup
{code:python}
import pyarrow as pa
from pyarrow import parquet
{code}

h3. Bug (writes a dictionary-encoded Arrow array to Parquet)
{code:python}
array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
assert array1.null_count == 5
array1dict = array1.dictionary_encode()
assert array1dict.null_count == 5
table = pa.Table.from_arrays([array1dict], ["mycol"])
parquet.write_table(table, "testtable.parquet")
meta = parquet.read_metadata("testtable.parquet")
meta.row_group(0).column(0).statistics.null_count  # RESULT: 0 (WRONG!)
{code}

h3. Correct (writes the same data without dictionary-encoding the Arrow array)
{code:python}
array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
assert array1.null_count == 5
table = pa.Table.from_arrays([array1], ["mycol"])
parquet.write_table(table, "testtable.parquet")
meta = parquet.read_metadata("testtable.parquet")
meta.row_group(0).column(0).statistics.null_count  # RESULT: 5 (CORRECT)
{code}