[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758266#comment-17758266 ]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

GregoryKimball commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1690724411

Thank you @emkornfield for suggesting this change, @pitrou for your [comment](https://github.com/apache/parquet-format/pull/197#discussion_r1301338683) and @mapleFU, @wgtmac, @gszadovszky, @etseidl for the discussion.

In the libcudf [chunked parquet reader](https://docs.rapids.ai/api/libcudf/stable/classcudf_1_1io_1_1chunked__parquet__reader), we would benefit greatly from having `SizeStatistics` added to `ColumnIndex`, such as:

```
ColumnMetaData:
  optional SizeStatistics size_estimate_statistics;
ColumnIndex:
  optional list<SizeStatistics> size_estimate_statistics;
```

We would benefit from having page-level values for `unencoded_variable_width_stored_bytes` because they would help us step through a row group to yield consistently sized table "chunks". We created the chunked reader to read row groups that explode to >10-100 GB tables when decompressed and decoded. The `repetition_definition_level_histograms` field is also useful for estimating the row count per page and aligning the pages between ColumnChunks.

We don't need to track `FullSizeStatistics` in our use case. The histograms and `unencoded_variable_width_stored_bytes` at the page level will suffice.

Thank you for your help!

> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
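
As a rough illustration of the chunked-reading use case in the comment above, the following sketch shows how page-level size estimates could be used to split a row group into chunks of roughly equal decoded size. It is not libcudf's implementation: the function name `plan_chunks` and its inputs are hypothetical, and it assumes the per-page values (for example, the proposed page-level `unencoded_variable_width_stored_bytes`) have already been read into a plain list for a single column. A real reader would also have to account for fixed-width data and align page boundaries across ColumnChunks.

```python
from typing import List, Tuple


def plan_chunks(page_sizes: List[int], byte_budget: int) -> List[Tuple[int, int]]:
    """Group consecutive pages into half-open [start, end) ranges.

    page_sizes: hypothetical per-page decoded-size estimates in bytes.
    byte_budget: target upper bound on the estimated size of one chunk.
    """
    if byte_budget <= 0:
        raise ValueError("byte_budget must be positive")

    chunks: List[Tuple[int, int]] = []
    start = 0      # index of the first page in the current chunk
    running = 0    # estimated bytes accumulated in the current chunk

    for i, size in enumerate(page_sizes):
        # Close the current chunk if adding this page would exceed the budget.
        # A single page larger than the budget still forms its own chunk.
        if i > start and running + size > byte_budget:
            chunks.append((start, i))
            start = i
            running = 0
        running += size

    # Emit the trailing chunk, if any pages remain.
    if start < len(page_sizes):
        chunks.append((start, len(page_sizes)))
    return chunks


if __name__ == "__main__":
    # Six pages with made-up size estimates and a 100-byte budget.
    print(plan_chunks([40, 50, 30, 120, 10, 10], 100))
```

The greedy pass keeps pages in file order, so each chunk maps to a contiguous page range; without page-level statistics, the sizes in `page_sizes` would only be known after decoding.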