[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764196#comment-17764196 ]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

JFinis commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1323069846


##########
src/main/thrift/parquet.thrift:
##########
@@ -764,6 +810,14 @@ struct ColumnMetaData {
    * in a single I/O.
    */
   15: optional i32 bloom_filter_length;
+
+  /**

Review Comment:
   (Not related to this line, but to `ColumnMetaData` in general.)

   For completeness, we might also want to add `unencoded_byte_array_data_bytes` and `num_entries` for the dictionary page (if one exists) to `ColumnMetaData`, i.e., `dictionary_unencoded_byte_array_data_bytes` and `num_dictionary_entries`. This way, readers could plan how much memory the dictionary of a column chunk will take. That can inform decisions such as whether to load the dictionary up front to perform pre-filtering on it, and it also helps to right-size the buffer that will hold the dictionary.

   I'm not suggesting that this is a must-have for this commit, or at all, so feel free to drop this issue. I just wanted to point out that if we already want to provide tools for size estimation, the dictionary is currently not really accounted for.

> [Format] Add statistics that reflect decoded size to metadata
> --------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
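
For illustration, a minimal sketch of how the dictionary-level fields suggested in the review comment might look in parquet.thrift. The field IDs (16, 17), types, and doc comments here are assumptions made for this sketch only; they are not part of PR #197 or the merged Parquet specification.

struct ColumnMetaData {
  // ... existing fields elided ...

  /** Existing field, shown in the diff context above. */
  15: optional i32 bloom_filter_length;

  /**
   * Sketch only: total unencoded (decoded) size in bytes of the BYTE_ARRAY
   * values held in this column chunk's dictionary page, if one exists.
   * Lets readers estimate the memory needed for the decoded dictionary.
   */
  16: optional i64 dictionary_unencoded_byte_array_data_bytes;

  /**
   * Sketch only: number of entries in the dictionary page, if one exists.
   * Useful for right-sizing dictionary buffers and for deciding whether
   * loading the dictionary up front for pre-filtering is worthwhile.
   */
  17: optional i32 num_dictionary_entries;
}

Under this sketch, a reader could compare dictionary_unencoded_byte_array_data_bytes against its memory budget before fetching the dictionary page for pre-filtering.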