coastalwhite opened a new issue, #476:
URL: https://github.com/apache/parquet-format/issues/476
### Describe the enhancement requested
At the moment, the semantics of the `ColumnChunk`-level statistics for
nested columns are not clear to me.
They appear to be based on the leaf column (which makes sense to me), but
`null_count` (and probably `distinct_count`) seems to be based partly on
the nesting levels.
```python
import polars as pl
import io
import pyarrow.parquet as pq
df = pl.DataFrame([
    pl.Series('a', [[1, 2, 3], None], pl.Array(pl.Int32, 3)),
])
f = io.BytesIO()
pq.write_table(df.to_arrow(), f)
f.seek(0)
pq.read_metadata(f).row_group(0).column(0).statistics
```
```console
<pyarrow._parquet.Statistics object at 0x7ffe9bd626b0>
has_min_max: True
min: 1
max: 3
null_count: 1
distinct_count: None
num_values: 3
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
```
I would expect the `null_count` to equal `3` here if it were based on the
leaf column, since the single null array spans three leaf slots.
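To make the two readings concrete, here is a minimal plain-Python sketch. It assumes the column is stored with the usual three-level list encoding (maximum definition level 3) and that a writer might count one null per level entry below that maximum. Both are assumptions on my part, not something the spec or pyarrow confirms here, and `definition_levels` is a hypothetical helper, not a pyarrow API.

```python
# Assumed maximum definition level for: optional list -> repeated group -> optional element.
MAX_DEF_LEVEL = 3
ARRAY_WIDTH = 3  # width of the fixed-size array in the example above


def definition_levels(rows):
    """Hypothetical helper: one definition level per entry written for the leaf column."""
    levels = []
    for row in rows:
        if row is None:
            # A null list produces a single entry, not one entry per leaf slot.
            levels.append(0)
        else:
            for value in row:
                # 2 = list present but element null, 3 = element present.
                levels.append(2 if value is None else MAX_DEF_LEVEL)
    return levels


rows = [[1, 2, 3], None]
levels = definition_levels(rows)

# Reading 1 (assumed writer behaviour): count entries below the maximum level.
num_values = sum(1 for d in levels if d == MAX_DEF_LEVEL)
level_based_null_count = sum(1 for d in levels if d < MAX_DEF_LEVEL)

# Reading 2 (what I expected): count null leaf slots, so a null array counts ARRAY_WIDTH times.
leaf_based_null_count = sum(
    ARRAY_WIDTH if row is None else sum(v is None for v in row) for row in rows
)

print(levels, num_values, level_based_null_count, leaf_based_null_count)
```

Under these assumptions the first reading reproduces `num_values: 3` and `null_count: 1` from the output above, while the second gives the `3` I expected.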
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]