etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303197041
##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts; however, to accommodate the
+ * widest range of readers both should be populated.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;
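To make the "redundancy with null_counts" concrete: for a flat optional column (max definition level 1), a page's null count can be recovered by summing the definition-level histogram buckets below the max level. A minimal sketch, assuming a histogram laid out as counts indexed by definition level (the field layout here is an assumption, not the final spec):

```python
def null_count_from_def_histogram(def_level_histogram, max_def_level):
    """For a non-nested column, values whose definition level is below
    max_def_level are null, so summing those buckets reproduces the
    per-page null count that null_counts already stores.
    Layout assumption: def_level_histogram[i] = count of values at level i."""
    return sum(def_level_histogram[:max_def_level])

# Example page: 3 nulls (definition level 0), 7 non-null values (level 1).
print(null_count_from_def_histogram([3, 7], max_def_level=1))  # -> 3
```

For nested columns the mapping from levels to "null at the leaf" depends on the schema, which is exactly why populating both fields helps readers that don't want to do that bookkeeping.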
Review Comment:
> If you are not reading a parquet file in the streaming fashion, why
SizeStatistics in the column-chunk level is not enough? The pages of different
columns are not aligned and you somehow will end up with reading the entire
column chunk.
@wgtmac just because the pages aren't aligned doesn't mean I have to read
them all :wink: In a large row group with small pages, the non-alignment can be
minimized and there can still be a win from not reading unnecessary pages.
As to why the column-chunk level sizing info isn't enough: I have files
where the un-encoded size is over 40X larger than the on-disk size, due
primarily to vast savings from the dictionary encoding. So a 1GB row
environment of a GPU that's not tenable. Being able to know in advance which
pages I can read and decode while still keeping everything on the GPU is very
beneficial. To get this sizing information now, we have to read and decompress
every page, doing most of the work of decoding the file just to find the total
size of all the byte arrays. I'd prefer not to have to make 2 passes through
the file :smile:
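The use case above can be sketched in a few lines: given per-page un-encoded sizes (which page-level size statistics would expose; the field and function names below are hypothetical, not part of the proposal), a reader can pick the pages that fit a fixed GPU memory budget without decompressing anything first.

```python
def pages_within_budget(unencoded_page_sizes, budget_bytes):
    """Greedily select a prefix of pages whose fully decoded size fits
    within budget_bytes. The sizes would come from page-level size
    statistics in the column index (hypothetical field), replacing the
    current two-pass approach of decompressing every page just to learn
    the total decoded byte-array size."""
    selected, total = [], 0
    for page_idx, size in enumerate(unencoded_page_sizes):
        if total + size > budget_bytes:
            break
        selected.append(page_idx)
        total += size
    return selected

# Three pages decoding to 3 GiB each against a 7 GiB budget: only the
# first two fit.
GIB = 1 << 30
print(pages_within_budget([3 * GIB, 3 * GIB, 3 * GIB], 7 * GIB))  # -> [0, 1]
```

With a 40X blow-up, a 1GB row group decoding to ~40GB could be processed in budget-sized slices of pages instead of all at once.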
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]