[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705175#comment-17705175 ]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148770596

##########
src/main/thrift/parquet.thrift:
##########
@@ -190,6 +190,35 @@ enum FieldRepetitionType {
   /** The field is repeated and can contain 0 or more values */
   REPEATED = 2;
 }
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data.
+ */
+struct SizeEstimationStatistics {
+  /**
+   * The number of logic bytes needed to store present/non-null values.
+   * Unless specified below, the computed size is the size it would take to plain-encode the underlying
+   * physical type.
+   * Special calculations:
+   *   - Enum: plain-encoded BYTE_ARRAY size
+   *   - Integers (same size used for signed and unsigned): int8 - 1 bytes, int16 - 2
+   *   - Decimal - Each value is assumed to take the minimal number of bytes necessary to encode
+   *     the precision of the decimal value.
+   *   - Nested types (lists, nested groups and maps) - No additional size for these structures
+   *     are accounted for in this field, instead the histogram fields below can be
+   *     be used to estimate overhead to recreate these structures.
+   */
+  1: optional i64 logical_value_byte_storage;

Review Comment:
   Still up for discussion: my preference here is now to change this field to store only the variable-width bytes, excluding lengths, and let readers compute the size as they see fit based on the type of the column.

   CC @mapleFU @wgtmac, since it isn't clear to me that the comment thread is preserved across commits.

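For illustration only, here is a rough sketch of what "let readers compute size based on the type of the column" could look like on the reader side. It assumes the field ends up holding just the variable-width byte count (lengths excluded), as proposed in the review comment. The function name `estimate_plain_size` and its parameters are hypothetical and are not part of parquet-format or any Parquet library. The per-type widths are the plain-encoding widths of the Parquet physical types (BOOLEAN bit-packed, 4-byte length prefix for BYTE_ARRAY).

```python
import math

# Plain-encoded width in bytes of each fixed-width Parquet physical type.
FIXED_WIDTH_BYTES = {"INT32": 4, "INT64": 8, "INT96": 12, "FLOAT": 4, "DOUBLE": 8}

# Plain encoding prefixes every BYTE_ARRAY value with a 4-byte length.
BYTE_ARRAY_LENGTH_PREFIX = 4


def estimate_plain_size(physical_type, non_null_count,
                        variable_width_bytes=0, type_length=None):
    """Estimate the plain-encoded size in bytes of a column's non-null values.

    Hypothetical helper: `variable_width_bytes` stands in for the proposed
    statistic (BYTE_ARRAY payload bytes, lengths excluded).
    """
    if physical_type == "BOOLEAN":
        # Plain-encoded booleans are bit-packed, one bit per value.
        return math.ceil(non_null_count / 8)
    if physical_type == "BYTE_ARRAY":
        # The reader adds the length prefixes back itself.
        return variable_width_bytes + BYTE_ARRAY_LENGTH_PREFIX * non_null_count
    if physical_type == "FIXED_LEN_BYTE_ARRAY":
        if type_length is None:
            raise ValueError("type_length is required for FIXED_LEN_BYTE_ARRAY")
        return type_length * non_null_count
    return FIXED_WIDTH_BYTES[physical_type] * non_null_count
```

A reader that prefers a different in-memory layout (for example 8-byte offsets, or 1 byte per boolean) could swap in its own per-value widths, which is the flexibility the proposal is after.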
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)