[jira] [Created] (ARROW-7732) [Python][C++] Parquet statistics wrong for pandas Categorical

Florian Jetter (Jira) Fri, 31 Jan 2020 02:23:37 -0800

Florian Jetter created ARROW-7732:
-------------------------------------

             Summary: [Python][C++] Parquet statistics wrong for pandas Categorical
                 Key: ARROW-7732
                 URL: https://issues.apache.org/jira/browse/ARROW-7732
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1, 0.16.0
            Reporter: Florian Jetter



h3. Observed behaviour

Statistics for categorical data are identical across all row groups and describe 
the entire {{CategoricalDtype}} instead of the data contained in each row group.
h3. Expected behaviour

The statistics of a row group should cover only the data that is actually part 
of that row group, not the entire {{CategoricalDtype}}.
h3. Minimal example
{code:python}
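import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Two rows, each holding a different category.
test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
table = pa.Table.from_pandas(test_df)
# chunk_size=1 writes each row into its own row group.
pq.write_table(
    table,
    "test_parquet",
    chunk_size=1,
)
test_parquet = pq.ParquetFile("test_parquet")
# Statistics of the first row group, which contains only the value "1".
test_parquet.metadata.row_group(0).column(0).statistics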
{code}
{code:java}
Out[1]:
<pyarrow._parquet.Statistics object at 0x1163b5280>
  has_min_max: True
  min: 1
  max: 42
  null_count: 0
  distinct_count: 0
  num_values: 1
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8
{code}
Expected for the first row group: {{min: 1}} and {{max: 1}}, not {{max: 42}}.
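
To make the expectation concrete, here is a minimal pure-Python sketch of the per-row-group min/max that the report asks for. The helper {{expected_row_group_stats}} is hypothetical and not part of pyarrow. It assumes that rows are split into consecutive chunks of {{chunk_size}}, that nulls are ignored, and that plain Python string comparison is an adequate stand-in for the ordering used in the statistics (true for the ASCII values in this example).

```python
# Hypothetical helper, not a pyarrow API: computes the (min, max) pair each
# row group would be expected to report if its statistics covered only the
# values actually stored in that row group.
def expected_row_group_stats(values, chunk_size):
    stats = []
    for start in range(0, len(values), chunk_size):
        # Values of one row group; None stands in for nulls, which are skipped.
        chunk = [v for v in values[start:start + chunk_size] if v is not None]
        if chunk:
            stats.append((min(chunk), max(chunk)))
        else:
            # A row group with only nulls has no min/max.
            stats.append((None, None))
    return stats


# Same data and chunk size as in the minimal example.
values = ["1", "42"]
expected = expected_row_group_stats(values, chunk_size=1)
print(expected)  # [('1', '1'), ('42', '42')]
```

With the reported behaviour, both row groups instead show min "1" and max "42", which are the bounds of the whole set of categories.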

 

Tested with:
 pandas==1.0.0
 pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master, essentially 0.16.0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
