[ https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029822#comment-17029822 ]
Francois Saint-Jacques commented on PARQUET-1783:
-------------------------------------------------
There's a [TODO|https://github.com/apache/arrow/blob/0326ea34b63ae399582a99d60f0d23cc03aaa628/cpp/src/parquet/column_writer.cc#L1179-L1183] about it.
> [C++] Parquet statistics wrong for dictionary type
> --------------------------------------------------
>
> Key: PARQUET-1783
> URL: https://issues.apache.org/jira/browse/PARQUET-1783
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.6.0
> Reporter: Florian Jetter
> Priority: Major
>
> h3. Observed behaviour
> Statistics for categorical data are identical across all row groups and
> describe the entire {{CategoricalDtype}} rather than the data contained in
> each row group.
> h3. Expected behaviour
> The row group statistics should cover only the data that is part of the
> actual row group, not the entire {{CategoricalDtype}}.
> h3. Minimal example
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
> table = pa.Table.from_pandas(test_df)
> pq.write_table(
>     table,
>     "test_parquet",
>     chunk_size=1,
> )
> test_parquet = pq.ParquetFile("test_parquet")
> test_parquet.metadata.row_group(0).column(0).statistics
> {code}
> {code:java}
> Out[1]:
> <pyarrow._parquet.Statistics object at 0x1163b5280>
> has_min_max: True
> min: 1
> max: 42
> null_count: 0
> distinct_count: 0
> num_values: 1
> physical_type: BYTE_ARRAY
> logical_type: String
> converted_type (legacy): UTF8
> {code}
> Expected: {{min: 1}} and {{max: 1}} (instead of {{max: 42}}) for the first
> row group.
>
> Tested with
> pandas==1.0.0
> pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master /
> essentially 0.16.0)
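
For illustration only, the difference between the observed and the expected statistics can be sketched in plain Python, without pyarrow. The helper names {{expected_row_group_stats}} and {{dictionary_wide_stats}} are hypothetical and are not part of any library; they only model the behaviour described in the report, under the assumption that min/max are compared as strings (the column is BYTE_ARRAY / String).

```python
from typing import List, Optional, Sequence, Tuple

Stats = Optional[Tuple[str, str]]


def expected_row_group_stats(
    values: Sequence[Optional[str]], row_group_size: int
) -> List[Stats]:
    """Expected behaviour: (min, max) per row group, computed only from
    the values that actually fall into that row group (nulls ignored)."""
    stats: List[Stats] = []
    for start in range(0, len(values), row_group_size):
        chunk = [v for v in values[start:start + row_group_size] if v is not None]
        # A row group that holds only nulls has no min/max.
        stats.append((min(chunk), max(chunk)) if chunk else None)
    return stats


def dictionary_wide_stats(
    categories: Sequence[str], num_row_groups: int
) -> List[Stats]:
    """Observed behaviour as described in the report: every row group
    carries the min/max of the whole dictionary (all categories)."""
    if not categories:
        return [None] * num_row_groups
    whole = (min(categories), max(categories))
    return [whole] * num_row_groups


# Data from the minimal example: two values, one value per row group.
values = ["1", "42"]
expected = expected_row_group_stats(values, row_group_size=1)
observed = dictionary_wide_stats(sorted(set(values)), num_row_groups=len(expected))

for i, (exp, obs) in enumerate(zip(expected, observed)):
    print(f"row group {i}: expected={exp} observed={obs}")
```

With the data from the minimal example, the first row group should report min and max of "1", whereas the dictionary-wide computation yields a max of "42" for every row group.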