[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

via GitHub Tue, 28 Mar 2023 19:49:12 -0700


emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1151338163



##########
src/main/thrift/parquet.thrift:
##########
@@ -190,6 +190,41 @@ enum FieldRepetitionType {
   /** The field is repeated and can contain 0 or more values */
   REPEATED = 2;
 }
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data.
+ *
+ * Writers should populate all fields in this struct except for the exceptions listed per field.
+ */
+struct SizeEstimationStatistics {
+   /**
+    * The number of logical physical bytes stored for BYTE_ARRAY data values. Logical bytes refers to the number
+    * of bytes needed if no special encoding is used. This is exclusive of the bytes needed
+    * to store the length of each byte array. In other words, this field is equivelant to the the (size of
+    * PLAIN-ENCODING the byte array values) - (4 bytes * number of values written). To determine logical sizes
+    * of other other types readers can use schema information multiplied by the number of non-null values.
+    * The number of non-null values can be inferred from the histograms below.
+    *
+    * For example if column chunk is dictionary encoded with a dictionary ["a", "bc", "cde"] and a data page
+    * has indexes [0, 0, 1, 2].  This value is expected to be 7 (1 + 1 + 2 + 3).
+    *
+    * This option should only be set for physical and logical types that would use BYTE_ARRAY when encoded with PLAIN encoding.
+    */
+   1: optional i64 logical_variable_width_stored_bytes;
+   /**
+     * When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_level+1)
+     * where each element represens the number of time the repetition level was observed in the data.
+     *
+     * This value is optional if max_repetition_level is 0.
+     */
+   2: optional list<i64> repetition_level_histogram;
+   /**
+    * Same as  repetition_level_histogram except for definition levels.
+    *
+    * This value is optional when max_definition_level is 0. 
+    */ 
+   3: optional list<i64> definition_level_histogram;

Review Comment:
   It might pay to illustrate exact queries, but if the goal is just to answer the question "is there any null element at a particular nesting level?", I think the definition level histogram by itself gives that information.
   
   Take nested lists where both the lists and the elements can be nullable at each level.  IIRC, the definition levels would be represented as follows:
   0 - null top level list
   1 - empty top level list
   2 - null nested list
   3 - empty nested list
   4 - null leaf element
   5 - present leaf element
   
   So if the query is for top level list `is null`, one could prune when `def_level[0] == 0` (where `def_level[i]` is the histogram count for definition level `i`).  For `is not null` one could prune if `def_level[0] == num_values` for the page (i.e. all values are null).
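   
   As a rough sketch of those two checks (Python purely for illustration; the function names are hypothetical and not part of any Parquet API, and I'm assuming `num_values` equals the sum of the histogram, since every value in the page carries exactly one definition level):

```python
from typing import Sequence


def can_prune_top_level_is_null(def_hist: Sequence[int]) -> bool:
    """Page can be skipped for `top_level_list IS NULL` when no value
    was written at definition level 0 (null top level list)."""
    return def_hist[0] == 0


def can_prune_top_level_is_not_null(def_hist: Sequence[int]) -> bool:
    """Page can be skipped for `top_level_list IS NOT NULL` when every
    value sits at definition level 0.

    Assumption: num_values for the page is the sum of the histogram.
    """
    num_values = sum(def_hist)
    return def_hist[0] == num_values


# Histogram indexed by definition level 0..5 for the nested-list example.
mixed_page = [0, 1, 0, 2, 1, 6]     # no null top level lists
all_null_page = [4, 0, 0, 0, 0, 0]  # only null top level lists
```

   With `mixed_page` only the `is null` check allows pruning, and with `all_null_page` only the `is not null` check does.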
   
   I believe similar logic holds for `def_level[2]`, but it could get more complicated depending on the semantics: whether a null top level element should imply that the nested list is also null, or whether an empty top level list implies the nested list should be considered null (either way it should still be derivable by using histogram indices 0, 1 and 2).
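   
   To make the "derivable from indices 0, 1 and 2" point concrete, here is a minimal sketch (Python for illustration only; the function and flag names are hypothetical, and which semantics are right is exactly the open question):

```python
from typing import Sequence


def nested_list_null_count(
    def_hist: Sequence[int],
    null_parent_counts_as_null: bool = False,
    empty_parent_counts_as_null: bool = False,
) -> int:
    """Count values for which the nested list is considered null.

    Level 2 (null nested list) always counts. Levels 0 (null top level
    list) and 1 (empty top level list) count only under the
    corresponding, assumed, semantics.
    """
    count = def_hist[2]
    if null_parent_counts_as_null:
        count += def_hist[0]
    if empty_parent_counts_as_null:
        count += def_hist[1]
    return count


def can_prune_nested_is_null(def_hist: Sequence[int], **semantics: bool) -> bool:
    """Page can be skipped for `nested_list IS NULL` when nothing counts as null."""
    return nested_list_null_count(def_hist, **semantics) == 0


page = [3, 1, 0, 2, 1, 6]  # histogram indexed by definition level 0..5
```

   For `page`, the strict reading (only level 2 counts) allows pruning, while treating a null top level list as implying a null nested list does not.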
   
   One thing the joint histogram (pairs of rep/def level counts) could give you is the number of first list elements that are null, but I'm not sure how useful that is.  I would need to think about other queries the joint histogram would enable (or, if you have more examples of queries that should be supported, we can figure out if one is needed).
   
   



