[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

via GitHub Tue, 12 Sep 2023 06:46:38 -0700


JFinis commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1323069846



##########
src/main/thrift/parquet.thrift:
##########
@@ -764,6 +810,14 @@ struct ColumnMetaData {
    * in a single I/O.
    */
   15: optional i32 bloom_filter_length;
+
+  /**

Review Comment:
   (not related to this line, but to `ColumnMetaData` in general)
   
   For completeness, we might also want to add the equivalents of 
`unencoded_byte_array_data_bytes` and `num_entries` for the dictionary page (if 
present) to `ColumnMetaData`, i.e., 
`dictionary_unencoded_byte_array_data_bytes` and `num_dictionary_entries`. 
   
   This way, readers could estimate how much memory the dictionary of a column 
chunk will take. This can inform decisions such as whether to load the 
dictionary up front and pre-filter on it.
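
   As a rough sketch of how a reader might use such fields (they are only 
proposed here and do not exist in `parquet.thrift`; the `DictionaryStats` 
holder, the helper functions, and the 4-byte per-entry offset overhead are all 
assumptions made for illustration):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DictionaryStats:
    """Hypothetical holder for the two fields proposed in this comment."""
    num_dictionary_entries: Optional[int] = None
    dictionary_unencoded_byte_array_data_bytes: Optional[int] = None


def estimate_dictionary_memory(stats: DictionaryStats,
                               per_entry_overhead: int = 4) -> Optional[int]:
    """Estimate the in-memory size of a decoded BYTE_ARRAY dictionary.

    Assumes the reader stores the raw value bytes plus one fixed-size
    offset per entry (4 bytes by default). Returns None when the writer
    did not provide the statistics.
    """
    if (stats.num_dictionary_entries is None
            or stats.dictionary_unencoded_byte_array_data_bytes is None):
        return None
    return (stats.dictionary_unencoded_byte_array_data_bytes
            + stats.num_dictionary_entries * per_entry_overhead)


def should_preload_dictionary(stats: DictionaryStats, budget_bytes: int) -> bool:
    """Load the dictionary up front only if its estimate fits the budget."""
    estimate = estimate_dictionary_memory(stats)
    return estimate is not None and estimate <= budget_bytes


stats = DictionaryStats(num_dictionary_entries=1000,
                        dictionary_unencoded_byte_array_data_bytes=20000)
print(estimate_dictionary_memory(stats))
print(should_preload_dictionary(stats, budget_bytes=32000))
```

   A reader without these statistics (both fields unset) gets no estimate and 
falls back to not pre-loading, which mirrors how optional Thrift fields would 
behave for files written by older writers.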
   
   I'm not suggesting that this is a must-have for this PR, or at all, so 
feel free to drop this suggestion. I just wanted to point out that if we are 
already providing tools for size estimation, the dictionary is currently not 
accounted for.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
