[I] Document the behavior of Nested ColumnChunk statistics [parquet-format]

via GitHub Tue, 07 Jan 2025 02:41:06 -0800


coastalwhite opened a new issue, #476:
URL: https://github.com/apache/parquet-format/issues/476


   ### Describe the enhancement requested
   
   At the moment, the semantics of the `ColumnChunk`-level statistics for 
nested columns are not clear to me.
   
   The statistics appear to be based on the leaf column (which makes sense to 
me), but `null_count` (and probably `distinct_count`) seems to be based 
partly on the nesting levels.
   
   ```python
   import polars as pl
   import io
   import pyarrow.parquet as pq
   
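   # One fixed-width array column with two rows: a full array and a null array.
   df = pl.DataFrame([
       pl.Series('a', [[1, 2, 3], None], pl.Array(pl.Int32, 3)),
   ])
   
   # Write the table to an in-memory Parquet file.
   f = io.BytesIO()
   pq.write_table(df.to_arrow(), f)
   
   # Read back the statistics of the only column chunk in the first row group.
   f.seek(0)
   print(pq.read_metadata(f).row_group(0).column(0).statistics)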
   ```
   
   ```console
   <pyarrow._parquet.Statistics object at 0x7ffe9bd626b0>
     has_min_max: True
     min: 1
     max: 3
     null_count: 1
     distinct_count: None
     num_values: 3
     physical_type: INT32
     logical_type: None
     converted_type (legacy): NONE
   ```
   
   I would expect `null_count` to equal `3` here if it were based on the 
leaf column, since the null row spans three leaf slots.
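
To make the two readings concrete, here is a minimal sketch in plain Python of the counting conventions that could produce `1` versus `3`. It assumes that a null array is recorded as a single entry with no leaf value, which is my reading of the output above and not something the format documentation confirms. The function names are hypothetical and are not part of pyarrow, polars, or the Parquet format.

```python
from typing import List, Optional

# Hypothetical helpers for illustration only; not a pyarrow or Parquet API.
Row = Optional[List[Optional[int]]]

WIDTH = 3  # fixed array width, as in pl.Array(pl.Int32, 3)


def null_count_by_level_entry(rows: List[Row]) -> int:
    """Count one null per entry that carries no leaf value.

    Assumption: a null array contributes a single entry, no matter how wide
    the array is.
    """
    count = 0
    for row in rows:
        if row is None:
            count += 1  # the whole null array counts once
        else:
            count += sum(1 for value in row if value is None)
    return count


def null_count_by_leaf_slot(rows: List[Row], width: int = WIDTH) -> int:
    """Count one null per leaf slot, so a null array counts `width` times."""
    count = 0
    for row in rows:
        if row is None:
            count += width  # every slot of the null array counts
        else:
            count += sum(1 for value in row if value is None)
    return count


rows: List[Row] = [[1, 2, 3], None]
print(null_count_by_level_entry(rows))  # 1, the reported null_count
print(null_count_by_leaf_slot(rows))    # 3, the value expected in this issue
```

Under the first convention the result depends on the nesting structure, under the second only on the leaf values, which is the ambiguity the documentation would need to resolve.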


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

