JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1323069846
########## src/main/thrift/parquet.thrift: ########## @@ -764,6 +810,14 @@ struct ColumnMetaData { * in a single I/O. */ 15: optional i32 bloom_filter_length; + + /** Review Comment: (not related to this line, but to `ColumnMetaData` in general) For completeness-reasons, we might also want to add `unencoded_byte_array_data_bytes` and `num_entries` for the dictionary page (if existent) into the ColumnMetadata, i.e., `dictionary_unencoded_byte_array_data_bytes` and `num_dictionary_entries`. This way, readers could plan how much memory the dictionary of a column chunk will take. This can help in decisions whether, e.g., to load the dictionary up-front to perform pre-filtering on the dictionary. I'm not suggesting that this is a must-have for this commit or at all, so feel free to drop this issue. I just wanted to voice that if we already want to provide tools for size estimation, the dictionary is currently not really accounted for. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org