pitrou commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1301322378


##########
src/main/thrift/parquet.thrift:
##########
@@ -191,6 +191,62 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * Tracks a histogram of repetition and definition levels for either a page or column chunk. 
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in memory 
+  *   2. For filter push-down on nulls at various levels of nested structures and 
+  *      list lengths.
+  */ 
+struct RepetitionDefinitionLevelHistogram {
+   /** 
+     * When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_level+1) 
+     * where each element represents the number of time the repetition level was observed in the data.
+     *
+     * This value should not be written if max_repetition_level is 0.
+     **/
+   1: optional list<i64> repetition_level_histogram;
+   /**
+    * Same as repetition_level_histogram except for definition levels.
+    *
+    * This value should not be written when max_definition_level is 0. 

Review Comment:
   Nit: symmetry
   ```suggestion
       * This value should not be written if max_definition_level is 0. 
   ```



##########
src/main/thrift/parquet.thrift:
##########
@@ -191,6 +191,62 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * Tracks a histogram of repetition and definition levels for either a page or column chunk. 
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in memory 
+  *   2. For filter push-down on nulls at various levels of nested structures and 
+  *      list lengths.
+  */ 
+struct RepetitionDefinitionLevelHistogram {
+   /** 
+     * When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_level+1) 
+     * where each element represents the number of time the repetition level was observed in the data.

Review Comment:
   ```suggestion
     * When present, there is expected to be one element corresponding to each repetition (i.e. size=max repetition_level+1) 
     * where each element represents the number of times the repetition level was observed in the data.
   ```
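   For illustration, a minimal hypothetical sketch (not part of this PR) of how a writer could accumulate such a histogram, with one bucket per level value so the list has `max_repetition_level + 1` entries:
   ```java
   // Hypothetical sketch: accumulate a repetition level histogram with one
   // bucket per level value, i.e. size = max_repetition_level + 1.
   public class LevelHistogramSketch {
     static long[] buildHistogram(int maxLevel, int[] levels) {
       long[] histogram = new long[maxLevel + 1];
       for (int level : levels) {
         histogram[level]++;  // count how often each level value was observed
       }
       return histogram;
     }

     public static void main(String[] args) {
       // A page of a list column with max_repetition_level = 1:
       // 0 starts a new record, 1 continues the current list.
       int[] repetitionLevels = {0, 1, 1, 0, 0, 1};
       System.out.println(java.util.Arrays.toString(buildHistogram(1, repetitionLevels)));
       // prints [3, 3]
     }
   }
   ```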



##########
src/main/thrift/parquet.thrift:
##########
@@ -191,6 +191,62 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * Tracks a histogram of repetition and definition levels for either a page or column chunk. 
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in memory 
+  *   2. For filter push-down on nulls at various levels of nested structures and 
+  *      list lengths.
+  */ 
+struct RepetitionDefinitionLevelHistogram {
+   /** 
+     * When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_level+1) 
+     * where each element represents the number of time the repetition level was observed in the data.
+     *
+     * This value should not be written if max_repetition_level is 0.
+     **/
+   1: optional list<i64> repetition_level_histogram;
+   /**
+    * Same as repetition_level_histogram except for definition levels.
+    *
+    * This value should not be written when max_definition_level is 0. 
+    **/ 
+   2: optional list<i64> definition_level_histogram;
+ }
+
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data written. This is useful for readers to estimate how much memory is needed 
+ * to reconstruct data in their memory model and for fine grained filter pushdown on nested
+ * structures (the histogram contained in this structure can help determine the 
+ * number of nulls at a particular nesting level).
+ *
+ * Writers should populate all fields in this struct except for the exceptions listed per field.
+ */ 
+struct SizeEstimationStatistics {

Review Comment:
   "Estimation" can be misleading here. Are the values below exact or can they 
be approximate?



##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /** 
+    * Repetition and definition level histograms for the pages.  
+    *
+    * This contains some redundancy with null_counts, however, to accommodate the
+    * widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms; 

Review Comment:
   Why `RepetitionDefinitionLevelHistogram` and not `SizeEstimationStatistics`? What is the rationale?
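   As an aside, a hypothetical sketch (not from this PR) of the redundancy with `null_counts` mentioned in the comment above: for a flat optional column (max_definition_level = 1, max_repetition_level = 0), the per-page null count can be recovered from the definition level histogram.
   ```java
   // Hypothetical sketch: for a flat optional column (max_definition_level = 1,
   // max_repetition_level = 0), the per-page null count duplicated in null_counts
   // equals the histogram entries for definition levels below the maximum.
   public class NullCountFromHistogram {
     public static void main(String[] args) {
       long[] definitionLevelHistogram = {4, 96};  // 4 nulls, 96 defined values
       int maxDefinitionLevel = 1;

       long nullCount = 0;
       for (int level = 0; level < maxDefinitionLevel; level++) {
         nullCount += definitionLevelHistogram[level];
       }
       System.out.println(nullCount);  // 4
     }
   }
   ```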



##########
src/main/thrift/parquet.thrift:
##########
@@ -191,6 +191,62 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * Tracks a histogram of repetition and definition levels for either a page or column chunk. 
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in memory 
+  *   2. For filter push-down on nulls at various levels of nested structures and 
+  *      list lengths.
+  */ 
+struct RepetitionDefinitionLevelHistogram {
+   /** 
+     * When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_level+1) 
+     * where each element represents the number of time the repetition level was observed in the data.
+     *
+     * This value should not be written if max_repetition_level is 0.
+     **/
+   1: optional list<i64> repetition_level_histogram;
+   /**
+    * Same as repetition_level_histogram except for definition levels.
+    *
+    * This value should not be written when max_definition_level is 0. 
+    **/ 
+   2: optional list<i64> definition_level_histogram;
+ }
+
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data written. This is useful for readers to estimate how much memory is needed 
+ * to reconstruct data in their memory model and for fine grained filter pushdown on nested
+ * structures (the histogram contained in this structure can help determine the 
+ * number of nulls at a particular nesting level).
+ *
+ * Writers should populate all fields in this struct except for the exceptions listed per field.
+ */ 
+struct SizeEstimationStatistics {
+   /** 
+    * The number of physical bytes stored for BYTE_ARRAY data values assuming no encoding. This is exclusive of the 
+    * bytes needed to store the length of each byte array. In other words, this field is equivalent to the `(size of 
+    * PLAIN-ENCODING the byte array values) - (4 bytes * number of values written)`. To determine unencoded sizes 
+    * of other types readers can use schema information multiplied by the number of non-null and null values.
+    * The number of null/non-null values can be inferred from the histograms below.
+    *
+    * For example if column chunk is dictionary encoded with a dictionary ["a", "bc", "cde"] and a data page 
+    * has indexes [0, 0, 1, 2].  This value is expected to be 7 (1 + 1 + 2 + 3).
+    *
+    * This field should only be set for types that use BYTE_ARRAY as their physical type.
+    */
+   1: optional i64 unencoded_variable_width_stored_bytes;
+   /**
+    *
+    * This field should not be written when maximum definition and repetition level are both 0.

Review Comment:
   ```suggestion
       * Repetition and definition level histograms for this data page 
   ```



##########
src/main/thrift/parquet.thrift:
##########
@@ -191,6 +191,62 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * Tracks a histogram of repetition and definition levels for either a page or column chunk. 

Review Comment:
   Nit: simplify wording?
   ```suggestion
    * A histogram of repetition and definition levels for either a page or column chunk. 
   ```



##########
src/main/thrift/parquet.thrift:
##########
@@ -191,6 +191,62 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * Tracks a histogram of repetition and definition levels for either a page or column chunk. 
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in memory 
+  *   2. For filter push-down on nulls at various levels of nested structures and 
+  *      list lengths.
+  */ 
+struct RepetitionDefinitionLevelHistogram {
+   /** 
+     * When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_level+1) 
+     * where each element represents the number of time the repetition level was observed in the data.
+     *
+     * This value should not be written if max_repetition_level is 0.
+     **/
+   1: optional list<i64> repetition_level_histogram;
+   /**
+    * Same as repetition_level_histogram except for definition levels.
+    *
+    * This value should not be written when max_definition_level is 0. 
+    **/ 
+   2: optional list<i64> definition_level_histogram;
+ }
+
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data written. This is useful for readers to estimate how much memory is needed 
+ * to reconstruct data in their memory model and for fine grained filter pushdown on nested
+ * structures (the histogram contained in this structure can help determine the 
+ * number of nulls at a particular nesting level).
+ *
+ * Writers should populate all fields in this struct except for the exceptions listed per field.
+ */ 
+struct SizeEstimationStatistics {
+   /** 
+    * The number of physical bytes stored for BYTE_ARRAY data values assuming no encoding. This is exclusive of the 
+    * bytes needed to store the length of each byte array. In other words, this field is equivalent to the `(size of 
+    * PLAIN-ENCODING the byte array values) - (4 bytes * number of values written)`. To determine unencoded sizes 
+    * of other types readers can use schema information multiplied by the number of non-null and null values.
+    * The number of null/non-null values can be inferred from the histograms below.
+    *
+    * For example if column chunk is dictionary encoded with a dictionary ["a", "bc", "cde"] and a data page 
+    * has indexes [0, 0, 1, 2].  This value is expected to be 7 (1 + 1 + 2 + 3).
+    *
+    * This field should only be set for types that use BYTE_ARRAY as their physical type.
+    */
+   1: optional i64 unencoded_variable_width_stored_bytes;
+   /**
+    *
+    * This field should not be written when maximum definition and repetition level are both 0.
+    *
+    * This field applies to all types.
+    */ 
+   2: optional RepetitionDefinitionLevelHistogram repetition_definition_level_histogram;

Review Comment:
   Why make this optional when the two members in `RepetitionDefinitionLevelHistogram` are already optional? 



##########
src/main/thrift/parquet.thrift:
##########
@@ -191,6 +191,62 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * Tracks a histogram of repetition and definition levels for either a page or column chunk. 
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in memory 
+  *   2. For filter push-down on nulls at various levels of nested structures and 
+  *      list lengths.
+  */ 
+struct RepetitionDefinitionLevelHistogram {
+   /** 
+     * When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_level+1) 
+     * where each element represents the number of time the repetition level was observed in the data.
+     *
+     * This value should not be written if max_repetition_level is 0.
+     **/
+   1: optional list<i64> repetition_level_histogram;
+   /**
+    * Same as repetition_level_histogram except for definition levels.
+    *
+    * This value should not be written when max_definition_level is 0. 
+    **/ 
+   2: optional list<i64> definition_level_histogram;
+ }
+
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data written. This is useful for readers to estimate how much memory is needed 
+ * to reconstruct data in their memory model and for fine grained filter pushdown on nested
+ * structures (the histogram contained in this structure can help determine the 
+ * number of nulls at a particular nesting level).
+ *
+ * Writers should populate all fields in this struct except for the exceptions listed per field.
+ */ 
+struct SizeEstimationStatistics {
+   /** 
+    * The number of physical bytes stored for BYTE_ARRAY data values assuming no encoding. This is exclusive of the 
+    * bytes needed to store the length of each byte array. In other words, this field is equivalent to the `(size of 
+    * PLAIN-ENCODING the byte array values) - (4 bytes * number of values written)`. To determine unencoded sizes 
+    * of other types readers can use schema information multiplied by the number of non-null and null values.
+    * The number of null/non-null values can be inferred from the histograms below.
+    *
+    * For example if column chunk is dictionary encoded with a dictionary ["a", "bc", "cde"] and a data page 
+    * has indexes [0, 0, 1, 2].  This value is expected to be 7 (1 + 1 + 2 + 3).

Review Comment:
   ```suggestion
    * For example, if a column chunk is dictionary-encoded with dictionary ["a", "bc", "cde"],
    * and a data page contains the indices [0, 0, 1, 2], then this value for that data page
    * should be 7 (1 + 1 + 2 + 3).
   ```
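   For reference, a small hypothetical sketch of that computation (not part of this PR): it sums the unencoded BYTE_ARRAY payload bytes referenced by a dictionary-encoded page, excluding the 4-byte length prefixes PLAIN encoding would add.
   ```java
   // Hypothetical sketch of the worked example above: unencoded BYTE_ARRAY bytes
   // for a dictionary-encoded data page, excluding the 4-byte PLAIN length prefixes.
   public class UnencodedByteArraySize {
     public static void main(String[] args) {
       byte[][] dictionary = {"a".getBytes(), "bc".getBytes(), "cde".getBytes()};
       int[] pageIndices = {0, 0, 1, 2};

       long unencodedVariableWidthBytes = 0;
       for (int index : pageIndices) {
         unencodedVariableWidthBytes += dictionary[index].length;
       }
       System.out.println(unencodedVariableWidthBytes);  // 7 = 1 + 1 + 2 + 3
     }
   }
   ```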
   


