Florian Jetter created ARROW-7732:
-------------------------------------

             Summary: [Python][C++] Parquet statistics wrong for pandas Categorical
                 Key: ARROW-7732
                 URL: https://issues.apache.org/jira/browse/ARROW-7732
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
   Affects Versions: 0.15.1, 0.16.0
            Reporter: Florian Jetter

h3. Observed behaviour

Statistics for categorical data are identical for all row groups and refer to the entire {{CategoricalDtype}} instead of the data included in the row group.

h3. Expected behaviour

The row group statistics should only include data which is part of the actual row group, not the entire {{CategoricalDtype}}.

h3. Minimal example

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
table = pa.Table.from_pandas(test_df)

# chunk_size=1 puts each of the two rows into its own row group
pq.write_table(
    table,
    "test_parquet",
    chunk_size=1,
)

test_parquet = pq.ParquetFile("test_parquet")
# Statistics of the first row group, which only contains the value "1"
test_parquet.metadata.row_group(0).column(0).statistics
{code}

{code:java}
Out[1]:
<pyarrow._parquet.Statistics object at 0x1163b5280>
  has_min_max: True
  min: 1
  max: 42
  null_count: 0
  distinct_count: 0
  num_values: 1
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8
{code}

Expected would be {{min: 1}} {{max: 1}} instead of {{max: 42}} for the first row group.

Tested with
pandas==1.0.0
pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / essentially 0.16.0)
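
To make the difference between observed and expected statistics concrete, here is a small pure-Python model of a dictionary-encoded column. It is only a sketch of the behaviour described above, not pyarrow's actual implementation: the helper {{row_group_min_max}} and its parameters are hypothetical names introduced for illustration, and the "observed" value simply assumes that min/max are taken over all categories.

```python
from typing import List, Optional, Tuple

MinMax = Optional[Tuple[str, str]]


def row_group_min_max(
    categories: List[str], codes: List[int], chunk_size: int
) -> List[Tuple[MinMax, MinMax]]:
    """Return (expected, dictionary_based) min/max for each row group.

    `categories` plays the role of the CategoricalDtype, `codes` are the
    per-row indices into it (negative code = null). Hypothetical helper.
    """
    # Model of the reported behaviour: min/max over the whole dictionary,
    # which is the same for every row group.
    dictionary_based = (min(categories), max(categories)) if categories else None

    results = []
    for start in range(0, len(codes), chunk_size):
        # Decode only the values that are actually stored in this row group.
        values = [categories[c] for c in codes[start:start + chunk_size] if c >= 0]
        expected = (min(values), max(values)) if values else None
        results.append((expected, dictionary_based))
    return results


# Same data as in the minimal example: two rows, one row per row group.
stats = row_group_min_max(categories=["1", "42"], codes=[0, 1], chunk_size=1)
for i, (expected, dictionary_based) in enumerate(stats):
    print(f"row group {i}: expected={expected} dictionary_based={dictionary_based}")
```

In this model the first row group has an expected min/max of {{("1", "1")}}, while the dictionary-based result is {{("1", "42")}} for every row group, which matches the statistics shown in the output above.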