[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705195#comment-17705195 ]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148822969

##########
src/main/thrift/parquet.thrift:
##########
@@ -190,6 +190,35 @@ enum FieldRepetitionType {
   /** The field is repeated and can contain 0 or more values */
   REPEATED = 2;
 }
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data.
+ */
+struct SizeEstimationStatistics {
+   /**
+    * The number of logic bytes needed to store present/non-null values.
+    * Unless specified below, the computed size is the size it would take to plain-encode the underlying
+    * physical type.
+    * Special calculations:
+    *   - Enum: plain-encoded BYTE_ARRAY size
+    *   - Integers (same size used for signed and unsigned): int8 - 1 bytes, int16 - 2
+    *   - Decimal - Each value is assumed to take the minimal number of bytes necessary to encode
+    *     the precision of the decimal value.
+    *   - Nested types (lists, nested groups and maps) - No additional size for these structures
+    *     are accounted for in this field, instead the histogram fields below can be
+    *     be used to estimate overhead to recreate these structures.
+    */
+   1: optional i64 logical_value_byte_storage;
+   /**
+    * When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_level+1)
+    * where each element represens the number of time the repetition level was observed in the data.
+    */
+   2: optional list<i64> repetition_level_histogram;

Review Comment:
   There are a few things to consider here:
   1. What happens if the max rep/def level is zero (should we require these)? This also relates to whether the size should be max_dep_level + 1 or max_dep_level. The first allows readers to sanity check that the statistics sum to num_values; the second does not.
   2. Should we require variable size bytes if the column doesn't have any (0 is an acceptable value here)?
   3. It has kind of been drilled into me that anyone who puts a required field in a message that lives long enough will live to regret it. I'd prefer to document that writers should populate the relevant fields (and be specific about when we believe they are relevant).

> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
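
As a rough illustration of the sanity check raised in point 1 of the review comment, the Python sketch below builds a level histogram with (max level + 1) buckets and confirms that the counts sum to num_values. The function names, the sample levels, and the use of Python are assumptions made for illustration only; none of this is part of parquet-format or of the PR.

```python
from typing import List, Sequence


def level_histogram(levels: Sequence[int], max_level: int) -> List[int]:
    """Count how often each level in 0..max_level occurs.

    Hypothetical helper: returns max_level + 1 buckets, so bucket i holds
    the number of times level i was observed.
    """
    histogram = [0] * (max_level + 1)
    for level in levels:
        if not 0 <= level <= max_level:
            raise ValueError(f"level {level} is outside 0..{max_level}")
        histogram[level] += 1
    return histogram


def histogram_matches_num_values(histogram: Sequence[int], num_values: int) -> bool:
    """Reader-side check: with max_level + 1 buckets, counts must sum to num_values."""
    return sum(histogram) == num_values


# Made-up repetition levels for a column whose max repetition level is 1.
rep_levels = [0, 1, 1, 0, 0, 1]
hist = level_histogram(rep_levels, max_level=1)
consistent = histogram_matches_num_values(hist, num_values=len(rep_levels))
```

If a writer stored only max level buckets (dropping one), the stored counts would no longer be required to sum to num_values, so a reader could not run this check. When the max level is zero, the histogram collapses to a single bucket equal to num_values.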