etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303197041
##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts; however, to accommodate the
+ * widest range of readers both should be populated.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;
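To make the "redundancy with null_counts" concrete: for a flat optional column (max definition level 1), a page's null count can be recovered by summing the definition-level histogram buckets below the max level. A minimal sketch, assuming a histogram laid out as counts indexed by definition level (the field layout here is an assumption, not the final spec):

```python
def null_count_from_def_histogram(def_level_histogram, max_def_level):
    """For a non-nested column, values whose definition level is below
    max_def_level are null, so summing those buckets reproduces the
    per-page null count that null_counts already stores.
    Layout assumption: def_level_histogram[i] = count of values at level i."""
    return sum(def_level_histogram[:max_def_level])

# Example page: 3 nulls (definition level 0), 7 non-null values (level 1).
print(null_count_from_def_histogram([3, 7], max_def_level=1))  # -> 3
```

For nested columns the mapping from levels to "null at the leaf" depends on the schema, which is exactly why populating both fields helps readers that don't want to do that bookkeeping.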
Review Comment:
> If you are not reading a parquet file in the streaming fashion, why
SizeStatistics in the column-chunk level is not enough? The pages of different
columns are not aligned and you somehow will end up with reading the entire
column chunk.
@wgtmac just because the pages aren't aligned doesn't mean I have to read
them all :wink: In a large row group with small pages, the non-alignment can be
minimized and there can still be a win from not reading unnecessary pages.
As to why the column-chunk level sizing info isn't enough: I have files
where the un-encoded size is over 40X larger than the on-disk size, due
primarily to vast savings from the dictionary encoding. So a 1GB row
environment of a GPU that's not tenable. Being able to know in advance which
pages I can read and decode while still keeping everything on the GPU is very
beneficial. To get this sizing information now, we have to read and decompress
every page, doing most of the work of decoding the file just to find the total
size of all the byte arrays. I'd prefer not to have to make 2 passes through
the file :smile:
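The use case above can be sketched in a few lines: given per-page un-encoded sizes (which page-level size statistics would expose; the field and function names below are hypothetical, not part of the proposal), a reader can pick the pages that fit a fixed GPU memory budget without decompressing anything first.

```python
def pages_within_budget(unencoded_page_sizes, budget_bytes):
    """Greedily select a prefix of pages whose fully decoded size fits
    within budget_bytes. The sizes would come from page-level size
    statistics in the column index (hypothetical field), replacing the
    current two-pass approach of decompressing every page just to learn
    the total decoded byte-array size."""
    selected, total = [], 0
    for page_idx, size in enumerate(unencoded_page_sizes):
        if total + size > budget_bytes:
            break
        selected.append(page_idx)
        total += size
    return selected

# Three pages decoding to 3 GiB each against a 7 GiB budget: only the
# first two fit.
GIB = 1 << 30
print(pages_within_budget([3 * GIB, 3 * GIB, 3 * GIB], 7 * GIB))  # -> [0, 1]
```

With a 40X blow-up, a 1GB row group decoding to ~40GB could be processed in budget-sized slices of pages instead of all at once.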
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]