[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705095#comment-17705095 ]
ASF GitHub Bot commented on PARQUET-2261: ----------------------------------------- wgtmac commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148580752 ########## src/main/thrift/parquet.thrift: ########## @@ -223,6 +223,17 @@ struct Statistics { */ 5: optional binary max_value; 6: optional binary min_value; + /** The number of bytes the row/group or page would take if encoded with plain-encoding */ + 7: optional i64 plain_encoded_bytes; Review Comment: I'd rather do this for all primitive types. Otherwise we may end up with different if/else approaches for different types which complicates the logic. ########## src/main/thrift/parquet.thrift: ########## @@ -223,6 +223,17 @@ struct Statistics { */ 5: optional binary max_value; 6: optional binary min_value; + /** The number of bytes the row/group or page would take if encoded with plain-encoding */ + 7: optional i64 plain_encoded_bytes; Review Comment: However, I hold a different opinion that null should be considered here. For example, if most of values are null, we may under-estimate the size and make the split planning much more imprecise. Although we have `null_count` somewhere (e.g. Statistics or ColumnIndex) but they are optional and may be missing even if `plain_encoded_bytes` here is provided. If we can enforce `null_count` with `plain_encoded_bytes` to appear together, then this is not an issue. I am not sure if we can slightly modify the definition of `plain_encoded_bytes` to mean the total bytes as if the data is all non-null and plain-encoded. For numeric types, a null value consumes sizeof(T) bytes. For FLBA type, we can simply assume a null value still consumes its original fixed-length. For BYTE_ARRAY type, a null value consume empty length. That said, I am still open to follow the proposed approach from @emkornfield because it is simple to understand. ########## src/main/thrift/parquet.thrift: ########## @@ -223,6 +223,17 @@ struct Statistics { */ 5: optional binary max_value; 6: optional binary min_value; + /** The number of bytes the row/group or page would take if encoded with plain-encoding */ + 7: optional i64 plain_encoded_bytes; + /** + * When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_leve) + * where each element represens the count of the number of times that level occurs in the page/column chunk. + */ + 8: optional list<i64> repetition_level_histogram; Review Comment: These levels are rather complicated for high-level applications, we may need to provide some utility to translate to norms (e.g. struct/map/array/null) that are much easier to understand. > [Format] Add statistics that reflect decoded size to metadata > ------------------------------------------------------------- > > Key: PARQUET-2261 > URL: https://issues.apache.org/jira/browse/PARQUET-2261 > Project: Parquet > Issue Type: Improvement > Components: parquet-format > Reporter: Micah Kornfield > Assignee: Micah Kornfield > Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)