[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705104#comment-17705104 ]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148598484

##########
src/main/thrift/parquet.thrift:
##########
@@ -223,6 +223,17 @@ struct Statistics {
    */
   5: optional binary max_value;
   6: optional binary min_value;
+  /** The number of bytes the row/group or page would take if encoded with plain-encoding */
+  7: optional i64 plain_encoded_bytes;

Review Comment:
   I think we end up with if/else statements anyway once we start accounting for logical types, so I am warming to the idea of only storing bytes for variable-width types. I originally thought plain encoding was useful because it is already a well-understood concept within Parquet, but it might not add value here.

   I agree nulls can take up a significant portion of space, but I don't think that should be baked into this field; it should instead be handled as a computation on the histograms below. My rationale for keeping them separate is that the space taken really depends on the memory representation, which can differ by reader. For the null indicator itself, some systems use 1 bit per value, some use one byte, and some use 0 because they use sentinel values in data fields. Similarly, some systems reserve space in data representations for null values and some do not. Whether a system leaves space can also vary depending on the level at which a null occurs in a nested structure.

   I think we should provide guidance that writers should populate all three of the newly proposed fields so estimates can be accurate in the presence of nulls.
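   To make the point about reader-dependent null handling concrete, here is a minimal sketch of how a reader might turn a value count and a null count into a memory estimate for a flat fixed-width column. Everything in it is an assumption for illustration: `NullLayout`, `estimate_fixed_width_bytes`, and its parameters are hypothetical names, not fields from parquet.thrift or any reader's API, and nested structures are ignored entirely.

```python
from enum import Enum


class NullLayout(Enum):
    """Hypothetical ways a reader might track nulls in memory."""
    BITMAP = "bitmap"      # 1 bit per value
    BYTEMAP = "bytemap"    # 1 byte per value
    SENTINEL = "sentinel"  # no separate indicator; a reserved data value marks null


def estimate_fixed_width_bytes(num_values, null_count, value_width,
                               layout, reserve_null_slots):
    """Estimate in-memory bytes for a flat fixed-width column.

    All parameters are illustrative inputs, not fields from parquet.thrift.
    """
    if not 0 <= null_count <= num_values:
        raise ValueError("null_count must be between 0 and num_values")

    # A sentinel layout needs a data slot for every value, null or not.
    keeps_slots = reserve_null_slots or layout is NullLayout.SENTINEL
    data_slots = num_values if keeps_slots else num_values - null_count
    data_bytes = data_slots * value_width

    if layout is NullLayout.BITMAP:
        indicator_bytes = (num_values + 7) // 8  # round up to whole bytes
    elif layout is NullLayout.BYTEMAP:
        indicator_bytes = num_values
    else:
        indicator_bytes = 0

    return data_bytes + indicator_bytes


# Same column statistics, three different reader representations.
bitmap_reserved = estimate_fixed_width_bytes(1000, 400, 8, NullLayout.BITMAP, True)
bytemap_compact = estimate_fixed_width_bytes(1000, 400, 8, NullLayout.BYTEMAP, False)
sentinel = estimate_fixed_width_bytes(1000, 400, 8, NullLayout.SENTINEL, False)
```

   With the same 1000 values and 400 nulls, the three calls give different totals, which is why the null cost seems better derived by each reader from counts than stored as a single precomputed byte size.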
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)