[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758117#comment-17758117 ]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303197041

##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /**
+   * Repetition and definition level histograms for the pages.
+   *
+   * This contains some redundancy with null_counts, however, to accommodate the
+   * widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
> If you are not reading a parquet file in the streaming fashion, why SizeStatistics in the column-chunk level is not enough? The pages of different columns are not aligned and you somehow will end up with reading the entire column chunk.

@wgtmac just because the pages aren't aligned doesn't mean I have to read them all :wink: In a large row group with small pages, the non-alignment can be minimized and there can still be a win from not reading unnecessary pages.

As to why the column-chunk level sizing info isn't enough, I have files whose un-encoded size is over 40X larger than their on-disk size, due primarily to the vast savings from dictionary encoding. So a 1GB row group could potentially blow up to 40GB when fully decoded. In the constrained environment of a GPU that's not tenable. Being able to know in advance which pages I can read and decode while still keeping everything on the GPU is very beneficial. To get this sizing information now, we have to read and decompress every page, doing most of the work of decoding the file just to find the total size of all the byte arrays.
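As a rough illustration of the planning step described above, the sketch below groups consecutive pages into batches whose decoded size fits a fixed memory budget. It assumes the reader already has a list of per-page decoded byte sizes available from metadata; the function name `plan_batches`, its parameters, and the sample numbers are hypothetical and are not part of the Parquet format or of this PR.

```python
from typing import List


def plan_batches(page_decoded_sizes: List[int], budget_bytes: int) -> List[List[int]]:
    """Group consecutive page indices into batches that fit a memory budget.

    page_decoded_sizes: hypothetical per-page decoded sizes, in bytes,
    assumed to be known from metadata before any page is read.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for idx, size in enumerate(page_decoded_sizes):
        if size > budget_bytes:
            # A single page that cannot fit can never be scheduled.
            raise ValueError(
                f"page {idx} needs {size} bytes, over the {budget_bytes}-byte budget"
            )
        if current and current_bytes + size > budget_bytes:
            # Adding this page would overflow the budget: close the batch.
            batches.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# Five pages with made-up decoded sizes and a 10-byte budget.
print(plan_batches([4, 4, 4, 10, 2], 10))  # [[0, 1], [2], [3], [4]]
```

With sizes known up front, the plan is built in one pass over metadata alone; without them, the same sizes could only be found by reading and decompressing each page first.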
I'd prefer not to have to make 2 passes through the file :smile:

> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major

--
This message was sent by Atlassian Jira
(v8.20.10#820010)