[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705161#comment-17705161 ]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148748310


##########
src/main/thrift/parquet.thrift:
##########
@@ -223,6 +223,17 @@ struct Statistics {
    */
   5: optional binary max_value;
   6: optional binary min_value;
+  /** The number of bytes the row/group or page would take if encoded with plain-encoding */
+  7: optional i64 plain_encoded_bytes;

Review Comment:
   > We need to look at different levels of metadata or even perform some computation to gather the information required above. So my point is to write the raw size info for every data type (with logical type considered) and store/aggregate them into page and column-chunk levels (or even file level?). That would make life easier as the time spent in the planning stage is critical to some analytics use cases.

   @wgtmac would the following changes suffice to address your concerns?

   1. Change the name of the fields to `logical_stored_value_bytes` and define the byte count for each logical type. For Decimal, I'd propose using the size its underlying representation would take with plain encoding (a BYTE_ARRAY in this case); for consistency, I think this means BYTE_ARRAY should also use the amount of space PLAIN_ENCODING would take.
   2. Extract the three fields into a new struct, something like `SizeEstimationStatistics`.
   3. In addition to placing this struct into Statistics (which takes care of column-level and page-level stats), also put it onto RowGroup? I'd hesitate to put it at the file level because this seems out of character with other metadata, and summing across row groups should be lightweight compared to the overhead of parsing the FileMetadata anyway.
   4. (Optional) If we were really concerned about optimizations, we could convert the histogram to a cumulative distribution function, which would avoid summing to get leaf-nulls.
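
To make points 3 and 4 concrete, here is a minimal Python sketch. It is not part of the PR: it assumes the histogram in question is a list of counts indexed by definition level, with the last entry counting non-null leaf values, and all function names are hypothetical.

```python
from itertools import accumulate


def to_cumulative(histogram):
    """Running totals: cumulative[d] is the number of values with level <= d."""
    return list(accumulate(histogram))


def leaf_nulls_from_histogram(histogram):
    # Assumption: every level below the maximum means the leaf is null,
    # so the null count is the sum of all entries except the last.
    return sum(histogram[:-1])


def leaf_nulls_from_cumulative(cumulative):
    # With running totals, the same count is a single lookup.
    return cumulative[-2] if len(cumulative) > 1 else 0


def total_bytes(row_group_byte_counts):
    # Point 3: a file-level figure is just a sum over the row groups.
    return sum(row_group_byte_counts)


histogram = [3, 2, 10]  # hypothetical counts for levels 0, 1, 2
cumulative = to_cumulative(histogram)
print(leaf_nulls_from_histogram(histogram), leaf_nulls_from_cumulative(cumulative))
```

The trade-off in this sketch is that the cumulative form makes the null count a lookup, while the count at any single level then needs a subtraction of adjacent entries.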


> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)