[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705195#comment-17705195
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148822969


##########
src/main/thrift/parquet.thrift:
##########
@@ -190,6 +190,35 @@ enum FieldRepetitionType {
   /** The field is repeated and can contain 0 or more values */
   REPEATED = 2;
 }
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data.
+ */ 
+struct SizeEstimationStatistics {
+   /** 
+    * The number of logic bytes needed to store present/non-null values.
+    * Unless specified below, the computed size is the size it would take to plain-encode the underlying
+    * physical type.
+    * Special calculations:
+    *  - Enum: plain-encoded BYTE_ARRAY size
+    *  - Integers (same size used for signed and unsigned): int8 - 1 bytes, int16 - 2
+    *  - Decimal - Each value is assumed to take the minimal number of bytes necessary to encode
+    *    the precision of the decimal value.
+    *  - Nested types (lists, nested groups and maps) - No additional size for these structures
+    *    are accounted for in this field, instead the histogram fields below can be
+    *    be used to estimate overhead to recreate these structures.
+    */
+   1: optional i64 logical_value_byte_storage;
+   /** 
+     * When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_level+1)
+     * where each element represens the number of time the repetition level was observed in the data.
+     */
+   2: optional list<i64> repetition_level_histogram;

Review Comment:
   There are a few things to consider here:
   1. What happens if the max rep/def level is zero (should we require these 
fields)? This also relates to whether the size should be max_def_level + 1 or 
max_def_level. The first allows readers to sanity-check that the statistics 
sum to num_values; the second does not.
   2. Should we require variable size bytes if the column doesn't have any (0 
is an acceptable value here)?
   3. It has kind of been drilled into me that, for any message that lives 
long enough, one will live to regret having a required field in it. I'd prefer 
to document that writers should populate relevant fields (and be specific 
about when we believe they are relevant).
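
   To make the trade-off in point 1 concrete, here is a minimal Python sketch 
of the "max level + 1" layout and the reader-side check it enables. The 
function names and the sample levels are hypothetical, chosen only for 
illustration; nothing here is taken from the PR or from an existing Parquet 
implementation.

```python
from collections import Counter
from typing import List, Sequence


def level_histogram(levels: Sequence[int], max_level: int) -> List[int]:
    """Count how often each level in 0..max_level occurs (size = max_level + 1)."""
    counts = Counter(levels)
    unexpected = [lvl for lvl in counts if not 0 <= lvl <= max_level]
    if unexpected:
        raise ValueError(f"levels outside 0..{max_level}: {sorted(unexpected)}")
    return [counts.get(level, 0) for level in range(max_level + 1)]


def histogram_is_consistent(
    histogram: Sequence[int], max_level: int, num_values: int
) -> bool:
    """Reader-side check that relies on every level, including 0, having a bucket."""
    return len(histogram) == max_level + 1 and sum(histogram) == num_values


# Hypothetical repetition levels for a column whose max repetition level is 1.
rep_levels = [0, 1, 1, 0, 0, 1]
hist = level_histogram(rep_levels, max_level=1)
```

   With a bucket for every level the counts add up to the number of levels 
written, so `histogram_is_consistent(hist, 1, len(rep_levels))` holds. If the 
bucket for level 0 is dropped (the "max level" layout), the sum no longer 
matches and the check is not available. When the max level is zero the 
histogram degenerates to a single bucket equal to num_values, which is one 
reason a writer might reasonably omit it.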





> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
