[GitHub] [parquet-format] GregoryKimball commented on pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

via GitHub Wed, 23 Aug 2023 15:25:50 -0700


GregoryKimball commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1690724411


   Thank you @emkornfield for suggesting this change, @pitrou for your [comment](https://github.com/apache/parquet-format/pull/197#discussion_r1301338683),
 and @mapleFU, @wgtmac, @gszadovszky, @etseidl for the discussion.
   
   In the libcudf [chunked Parquet reader](https://docs.rapids.ai/api/libcudf/stable/classcudf_1_1io_1_1chunked__parquet__reader),
 we would benefit greatly from having `SizeStatistics` added to `ColumnIndex`, 
for example:
   ```
   ColumnMetaData:
   optional SizeStatistics size_estimate_statistics;
   
   ColumnIndex:
   optional list<SizeStatistics> size_estimate_statistics;
   ```
   
   Page-level values for `unencoded_variable_width_stored_bytes` would help us 
step through a row group and yield consistently sized table "chunks". We 
created the chunked reader to read row groups that expand into tables of more 
than 10-100 GB once decompressed and decoded.
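
A minimal sketch of how per-page byte counts could drive chunking, assuming the page-level `unencoded_variable_width_stored_bytes` values for one column are already available as a list of integers. The function name `plan_chunks` and the greedy single-column strategy are hypothetical illustrations, not the libcudf implementation:

```python
from typing import List, Tuple


def plan_chunks(page_bytes: List[int], byte_limit: int) -> List[Tuple[int, int]]:
    """Greedily group consecutive pages into [start, end) index ranges.

    page_bytes[i] is the estimated decoded size of page i (hypothetical input,
    e.g. taken from per-page size statistics). Each range stays at or under
    byte_limit, except that a single page larger than the limit still forms
    its own one-page range.
    """
    if byte_limit <= 0:
        raise ValueError("byte_limit must be positive")

    chunks: List[Tuple[int, int]] = []
    start = 0      # index of the first page in the current chunk
    running = 0    # estimated bytes accumulated in the current chunk
    for i, size in enumerate(page_bytes):
        # Close the current chunk if adding this page would exceed the limit,
        # but never emit an empty chunk.
        if i > start and running + size > byte_limit:
            chunks.append((start, i))
            start = i
            running = 0
        running += size
    if start < len(page_bytes):
        chunks.append((start, len(page_bytes)))
    return chunks
```

For example, `plan_chunks([40, 40, 40, 200, 10], 100)` groups the first two pages together and leaves the oversized fourth page in a chunk of its own. A real reader would also have to account for fixed-width data and for multiple columns at once.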
   
   The `repetition_definition_level_histograms` field is also useful for 
estimating the row count per page and for aligning pages between ColumnChunks. 
We don't need to track `FullSizeStatistics` for our use case; the histograms 
and `unencoded_variable_width_stored_bytes` at the page level will suffice.
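
A rough sketch of the row-count idea, under these assumptions: entry `i` of a page's repetition level histogram counts the values at repetition level `i`, pages begin on row boundaries, and an empty histogram means the column is not repeated. Since a value at repetition level 0 starts a new row, the first histogram entry gives the number of rows in the page. All function names here are hypothetical:

```python
from bisect import bisect_right
from typing import List


def rows_in_page(rep_level_histogram: List[int], num_values: int) -> int:
    """Estimate the number of rows that start in a page.

    Assumption: rep_level_histogram[i] counts values at repetition level i.
    An empty histogram is treated as a non-repeated column, where every
    value (including nulls) corresponds to exactly one row.
    """
    if not rep_level_histogram:
        return num_values
    return rep_level_histogram[0]


def first_row_indexes(rows_per_page: List[int]) -> List[int]:
    """Return the index of the first row of each page within the column chunk."""
    starts: List[int] = []
    total = 0
    for n in rows_per_page:
        starts.append(total)
        total += n
    return starts


def page_for_row(first_rows: List[int], row: int) -> int:
    """Find the page that contains the given row, using the page start rows."""
    if not first_rows or row < 0:
        raise ValueError("row is outside the column chunk")
    return bisect_right(first_rows, row) - 1
```

Computing `first_row_indexes` for each column chunk makes it possible to line up pages across columns: for a target row, `page_for_row` returns the page to start decoding from in each chunk.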
   
   Thank you for your help!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.