[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705153#comment-17705153 ]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

wgtmac commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148704782


##########
src/main/thrift/parquet.thrift:
##########
@@ -223,6 +223,17 @@ struct Statistics {
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   I agree with putting non-null data and null data into separate fields. The space taken by null values can have a significant impact on the memory footprint, so I want to use these statistics to derive a good batch size while reading data.

   It also makes sense to store un-encoded bytes only for variable-length types (in the Parquet spec, that means only the BYTE_ARRAY type). But that is not easy to use in these cases:
   - Get the total raw size of the file (i.e. the size of all columns).
   - Get the total raw size of some selected columns.
   - Get the total raw size of selected columns in some row groups.
   - ...

   We need to look at different levels of metadata, or even perform some computation, to gather the information required above. So my suggestion is to write the raw size info for every data type (with the logical type considered) and store/aggregate it at the page and column-chunk levels (or even the file level?). That would make life easier, as the time spent in the planning stage is critical for some analytics use cases.


> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
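
For illustration only, the three lookups listed in the review comment (whole file, selected columns, selected columns in some row groups) could reduce to a single sum if a raw size were stored for every column chunk. The sketch below is a minimal model under that assumption, not the proposed format or any Parquet library API: `ColumnChunkSize`, `raw_bytes`, and `total_raw_size` are hypothetical names, and the sample numbers are made up.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class ColumnChunkSize:
    """Hypothetical per-column-chunk record; not a real Parquet structure."""
    column: str
    raw_bytes: int  # assumed: un-encoded size of the chunk's values, in bytes


# A row group is modelled as a plain list of column-chunk records.
RowGroup = List[ColumnChunkSize]


def total_raw_size(
    row_groups: List[RowGroup],
    columns: Optional[Iterable[str]] = None,
    row_group_indices: Optional[Iterable[int]] = None,
) -> int:
    """Sum raw_bytes over the selected columns and row groups.

    Passing None for a selector means "all columns" or "all row groups".
    """
    wanted = None if columns is None else set(columns)
    indices = range(len(row_groups)) if row_group_indices is None else row_group_indices
    total = 0
    for i in indices:
        for chunk in row_groups[i]:
            if wanted is None or chunk.column in wanted:
                total += chunk.raw_bytes
    return total


# Made-up sample metadata: two row groups, two columns each.
row_groups = [
    [ColumnChunkSize("id", 800), ColumnChunkSize("name", 5000)],
    [ColumnChunkSize("id", 400), ColumnChunkSize("name", 2600)],
]

whole_file = total_raw_size(row_groups)
name_only = total_raw_size(row_groups, columns=["name"])
name_in_second_group = total_raw_size(row_groups, columns=["name"], row_group_indices=[1])
```

With sizes recorded only for BYTE_ARRAY columns, the same queries would additionally need the value counts and physical type widths of the fixed-width columns, which is the extra planning-stage computation the comment wants to avoid.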