[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

via GitHub Wed, 06 Sep 2023 09:56:50 -0700


JFinis commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1317565318



##########
src/main/thrift/parquet.thrift:
##########
@@ -191,6 +191,74 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * A histogram of repetition and definition levels for either a page or column
+  * chunk.
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in
+  *   memory
+  *
+  *   2. For filter push-down on nulls at various levels of nested
+  *   structures and list lengths.
+  */
+struct RepetitionDefinitionLevelHistogram {
+   /**
+    * When present, there is expected to be one element corresponding to each
+    * repetition (i.e. size=max repetition_level+1) where each element

Review Comment:
   Why +1? Shouldn't this have one element if `repetition_level == 1`?
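
For context on the `+1`, here is a minimal Python sketch. It assumes levels take every integer value from 0 up to the maximum, inclusive, so a maximum of 1 would give two buckets. The helper name `level_histogram` and the sample levels are made up for illustration and are not taken from the PR.

```python
from typing import List


def level_histogram(levels: List[int], max_level: int) -> List[int]:
    """Count how often each level in 0..max_level (inclusive) occurs."""
    histogram = [0] * (max_level + 1)  # one bucket per possible level, including 0
    for level in levels:
        histogram[level] += 1
    return histogram


# Hypothetical repetition levels for a column with max_repetition_level == 1:
# 0 starts a new record, 1 continues the current list.
rep_levels = [0, 1, 1, 0, 0, 1]
rep_histogram = level_histogram(rep_levels, max_level=1)
print(rep_histogram)  # [3, 3]
```

Under that assumption, bucket 0 counts record starts and bucket 1 counts list continuations, which is one reading of why the quoted comment says `size = max repetition_level + 1`.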



##########
src/main/thrift/parquet.thrift:
##########
@@ -191,6 +191,74 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * A histogram of repetition and definition levels for either a page or column
+  * chunk.
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in
+  *   memory
+  *
+  *   2. For filter push-down on nulls at various levels of nested
+  *   structures and list lengths.
+  */
+struct RepetitionDefinitionLevelHistogram {
+   /**
+    * When present, there is expected to be one element corresponding to each
+    * repetition (i.e. size=max repetition_level+1) where each element
+    * represents the number of times the repetition level was observed in the
+    * data.
+    *
+    * This field may be omitted if max_repetition_level is 0.
+    **/
+   1: optional list<i64> repetition_level_histogram;
+   /**
+    * Same as repetition_level_histogram except for definition levels.
+    *
+    * This field may be omitted if max_definition_level is 0 or 1.
+    **/
+   2: optional list<i64> definition_level_histogram;
+ }
+
+/**
+ * A structure for capturing metadata for estimating the unencoded,
+ * uncompressed size of data written. This is useful for readers to estimate
+ * how much memory is needed to reconstruct data in their memory model and for
+ * fine grained filter pushdown on nested structures (the histogram contained
+ * in this structure can help determine the number of nulls at a particular
+ * nesting level).
+ *
+ * Writers should populate all fields in this struct except for the exceptions
+ * listed per field.
+ */
+struct SizeStatistics {
+   /**
+    * The number of physical bytes stored for BYTE_ARRAY data values assuming
+    * no encoding. This is exclusive of the bytes needed to store the length of
+    * each byte array. In other words, this field is equivalent to the `(size
+    * of PLAIN-ENCODING the byte array values) - (4 bytes * number of values
+    * written)`. To determine unencoded sizes of other types readers can use
+    * schema information multiplied by the number of non-null and null values.
+    * The number of null/non-null values can be inferred from the histograms
+    * below.
+    *
+    * For example, if a column chunk is dictionary-encoded with dictionary
+    * ["a", "bc", "cde"], and a data page contains the indices [0, 0, 1, 2],
+    * then this value for that data page should be 7 (1 + 1 + 2 + 3).
+    *
+    * This field should only be set for types that use BYTE_ARRAY as their
+    * physical type.
+    */
+   1: optional i64 unencoded_byte_array_data_bytes;
+   /**
+    *
+    * Repetition and definition level histograms for this data page

Review Comment:
   ```suggestion
       * Repetition and definition level histograms for this data page or column chunk
   ```
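
The dictionary example in the quoted doc comment for `unencoded_byte_array_data_bytes` can be checked with a short Python sketch. The variable names are illustrative only. The 4-byte length prefix per value is the PLAIN encoding overhead that the doc comment subtracts.

```python
# Values from the doc comment: dictionary ["a", "bc", "cde"], indices [0, 0, 1, 2].
dictionary = [b"a", b"bc", b"cde"]
indices = [0, 0, 1, 2]

# Raw bytes of the values the page refers to, excluding any length prefixes.
unencoded_byte_array_data_bytes = sum(len(dictionary[i]) for i in indices)

# PLAIN encoding of BYTE_ARRAY stores a 4-byte length before each value.
plain_encoded_size = sum(4 + len(dictionary[i]) for i in indices)

print(unencoded_byte_array_data_bytes)  # 7
print(plain_encoded_size - 4 * len(indices))  # 7
```

Both expressions agree, matching the `(size of PLAIN-ENCODING the byte array values) - (4 bytes * number of values written)` wording in the diff.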



##########
src/main/thrift/parquet.thrift:
##########
@@ -764,6 +845,14 @@ struct ColumnMetaData {
    * in a single I/O.
    */
   15: optional i32 bloom_filter_length;
+
+  /**
+   * Optional statistics to help estimate total memory when converted to in
+   * memory representations. The histogram contained on these statistics can

Review Comment:
   ```suggestion
      * memory representations. The histograms contained on these statistics can
   ```



##########
src/main/thrift/parquet.thrift:
##########
@@ -191,6 +191,74 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * A histogram of repetition and definition levels for either a page or column
+  * chunk.
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in
+  *   memory
+  *
+  *   2. For filter push-down on nulls at various levels of nested
+  *   structures and list lengths.
+  */
+struct RepetitionDefinitionLevelHistogram {
+   /**
+    * When present, there is expected to be one element corresponding to each
+    * repetition (i.e. size=max repetition_level+1) where each element
+    * represents the number of times the repetition level was observed in the
+    * data.
+    *
+    * This field may be omitted if max_repetition_level is 0.
+    **/
+   1: optional list<i64> repetition_level_histogram;
+   /**
+    * Same as repetition_level_histogram except for definition levels.
+    *
+    * This field may be omitted if max_definition_level is 0 or 1.
+    **/
+   2: optional list<i64> definition_level_histogram;
+ }
+
+/**
+ * A structure for capturing metadata for estimating the unencoded,
+ * uncompressed size of data written. This is useful for readers to estimate
+ * how much memory is needed to reconstruct data in their memory model and for
+ * fine grained filter pushdown on nested structures (the histogram contained
+ * in this structure can help determine the number of nulls at a particular
+ * nesting level).
+ *
+ * Writers should populate all fields in this struct except for the exceptions
+ * listed per field.
+ */
+struct SizeStatistics {
+   /**
+    * The number of physical bytes stored for BYTE_ARRAY data values assuming
+    * no encoding. This is exclusive of the bytes needed to store the length of
+    * each byte array. In other words, this field is equivalent to the `(size
+    * of PLAIN-ENCODING the byte array values) - (4 bytes * number of values
+    * written)`. To determine unencoded sizes of other types readers can use
+    * schema information multiplied by the number of non-null and null values.
+    * The number of null/non-null values can be inferred from the histograms
+    * below.
+    *
+    * For example, if a column chunk is dictionary-encoded with dictionary
+    * ["a", "bc", "cde"], and a data page contains the indices [0, 0, 1, 2],
+    * then this value for that data page should be 7 (1 + 1 + 2 + 3).
+    *
+    * This field should only be set for types that use BYTE_ARRAY as their
+    * physical type.
+    */
+   1: optional i64 unencoded_byte_array_data_bytes;
+   /**
+    *
+    * Repetition and definition level histograms for this data page
+    *
+    * This field applies to all types.

Review Comment:
   ```suggestion
   ```
   
   I guess this is implied.



##########
src/main/thrift/parquet.thrift:
##########
@@ -529,7 +597,15 @@ struct DataPageHeader {
   /** Encoding used for repetition levels **/
   4: required Encoding repetition_level_encoding;
 
-  /** Optional statistics for the data in this page**/
+  /**
+   *  Optional statistics for the data in this page.

Review Comment:
   ```suggestion
      * Optional statistics for the data in this page.
   ```



##########
src/main/thrift/parquet.thrift:
##########
@@ -191,6 +191,74 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * A histogram of repetition and definition levels for either a page or column
+  * chunk.
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in
+  *   memory

Review Comment:
   ```suggestion
     *   1. Estimating the size of the data when materialized in memory
   ```



##########
src/main/thrift/parquet.thrift:
##########
@@ -583,7 +659,12 @@ struct DataPageHeaderV2 {
   If missing it is considered compressed */
   7: optional bool is_compressed = true;
 
-  /** optional statistics for the data in this page **/
+  /** 
+   * optional statistics for the data in this page 

Review Comment:
   ```suggestion
      * Optional statistics for the data in this page 
   ```
   The comment on the other field also starts with an upper-case letter.



##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /**
+    * Repetition and definition level histograms for the pages.
+    *
+    * This contains some redundancy with null_counts, however, to accommodate
+    * the widest range of readers both should be populated when either the max
+    * definition and repetition level meet the requirements specified in
+    * RepetitionDefinitionLevelHistogram.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms

Review Comment:
   Why do you group RepetitionLevelHistogram and DefinitionLevelHistogram into
a struct? This way, we have additional encoding overhead, as we have to store a
list of structs instead of two lists of integers. This is considerably costlier
to decode.
   
   It also changes the memory layout of the deserialized classes. I would
prefer a columnar layout with two lists of ints instead of one list of structs,
so that algorithms operating on only one of them can benefit from a tighter
memory layout. We also wouldn't need an object per entry then.
   
   Given that the number of pages could be high, such optimization 
considerations seem prudent.
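
One possible reading of the "two lists of ints" layout is sketched below in Python. It assumes the per-page histograms are concatenated into a single flat list per level type, with a fixed width of `max_level + 1` entries per page. That concatenation scheme and the helper names `flatten_histograms` and `page_histogram` are assumptions for illustration and are not part of the PR.

```python
from typing import List


def flatten_histograms(per_page: List[List[int]]) -> List[int]:
    """Concatenate equally sized per-page histograms into one flat list."""
    return [count for page in per_page for count in page]


def page_histogram(flat: List[int], page_index: int, max_level: int) -> List[int]:
    """Slice one page's histogram back out of the flat list."""
    width = max_level + 1  # number of buckets per page
    start = page_index * width
    return flat[start:start + width]


# Hypothetical definition level histograms for three pages, max_definition_level == 2.
per_page_def = [[2, 0, 5], [0, 1, 9], [4, 4, 0]]
flat_def = flatten_histograms(per_page_def)

print(flat_def)                        # [2, 0, 5, 0, 1, 9, 4, 4, 0]
print(page_histogram(flat_def, 1, 2))  # [0, 1, 9]
```

With this layout a reader that only needs definition levels touches one contiguous list of integers, and no per-page object is required.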



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
