[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318576796


##
src/main/thrift/parquet.thrift:
##
@@ -191,6 +191,73 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * A histogram of repetition and definition levels for either a page or column
+  * chunk.
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in memory
+  *
+  *   2. For filter push-down on nulls at various levels of nested
+  *   structures and list lengths.
+  */
+struct RepetitionDefinitionLevelHistogram {
+   /**
+* When present, there is expected to be one element corresponding to each
+* repetition (i.e. size=max repetition_level+1) where each element
+* represents the number of times the repetition level was observed in the
+* data.
+*
+* This field may be omitted if max_repetition_level is 0.
+**/
+   1: optional list<i64> repetition_level_histogram;
+   /**
+* Same as repetition_level_histogram except for definition levels.
+*
+* This field may be omitted if max_definition_level is 0 or 1.
+**/
+   2: optional list<i64> definition_level_histogram;
+ }
+
+/**
+ * A structure for capturing metadata for estimating the unencoded,
+ * uncompressed size of data written. This is useful for readers to estimate
+ * how much memory is needed to reconstruct data in their memory model and for
+ * fine grained filter pushdown on nested structures (the histogram contained
+ * in this structure can help determine the number of nulls at a particular
+ * nesting level).
+ *
+ * Writers should populate all fields in this struct except for the exceptions
+ * listed per field.
+ */
+struct SizeStatistics {
+   /**
+* The number of physical bytes stored for BYTE_ARRAY data values assuming
+* no encoding. This is exclusive of the bytes needed to store the length of
+* each byte array. In other words, this field is equivalent to the `(size
+* of PLAIN-ENCODING the byte array values) - (4 bytes * number of values
+* written)`. To determine unencoded sizes of other types readers can use
+* schema information multiplied by the number of non-null and null values.
+* The number of null/non-null values can be inferred from the histograms
+* below.
+*
+* For example, if a column chunk is dictionary-encoded with dictionary
+* ["a", "bc", "cde"], and a data page contains the indices [0, 0, 1, 2],
+* then this value for that data page should be 7 (1 + 1 + 2 + 3).
+*
+* This field should only be set for types that use BYTE_ARRAY as their
+* physical type.
+*/
+   1: optional i64 unencoded_byte_array_data_bytes;

Review Comment:
   I guess in another case, like when we're trying to decode a dictionary-encoded
column into an in-memory dictionary (as arrow/arrow-rs and Velox do), this might
be a bit tricky.
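
   As a rough illustration (a sketch added for clarity, not from the PR or the
spec): how the 7-byte figure in the docstring is derived, and what a reader
that keeps the dictionary in memory would roughly need instead.

```python
# Illustrative sketch only: the unencoded_byte_array_data_bytes figure for the
# example page, versus a rough footprint for a dictionary-preserving reader.
dictionary = ["a", "bc", "cde"]   # dictionary page values
indices = [0, 0, 1, 2]            # data page: dictionary indices

# unencoded_byte_array_data_bytes: PLAIN size of the logical values minus the
# 4-byte length prefixes -> 1 + 1 + 2 + 3 = 7.
unencoded = sum(len(dictionary[i]) for i in indices)
assert unencoded == 7

# A reader that materializes plain strings needs roughly `unencoded` bytes of
# string payload (plus offsets/validity). A reader that preserves the
# dictionary (e.g. an Arrow DictionaryArray) instead needs roughly:
dict_payload = sum(len(v) for v in dictionary)   # 6 bytes of dictionary data
index_width = 4                                  # assume int32 indices
dict_preserving = dict_payload + index_width * len(indices)
print(unencoded, dict_preserving)                # 7 vs 22 for this toy page
```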






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318413890


##
src/main/thrift/parquet.thrift:
##
@@ -191,6 +191,73 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * A histogram of repetition and definition levels for either a page or column
+  * chunk.
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in memory
+  *
+  *   2. For filter push-down on nulls at various levels of nested
+  *   structures and list lengths.
+  */
+struct RepetitionDefinitionLevelHistogram {
+   /**
+* When present, there is expected to be one element corresponding to each
+* repetition (i.e. size=max repetition_level+1) where each element
+* represents the number of times the repetition level was observed in the
+* data.
+*
+* This field may be omitted if max_repetition_level is 0.
+**/
+   1: optional list<i64> repetition_level_histogram;
+   /**
+* Same as repetition_level_histogram except for definition levels.
+*
+* This field may be omitted if max_definition_level is 0 or 1.
+**/
+   2: optional list<i64> definition_level_histogram;
+ }
+
+/**
+ * A structure for capturing metadata for estimating the unencoded,
+ * uncompressed size of data written. This is useful for readers to estimate
+ * how much memory is needed to reconstruct data in their memory model and for
+ * fine grained filter pushdown on nested structures (the histogram contained
+ * in this structure can help determine the number of nulls at a particular
+ * nesting level).
+ *
+ * Writers should populate all fields in this struct except for the exceptions
+ * listed per field.
+ */
+struct SizeStatistics {
+   /**
+* The number of physical bytes stored for BYTE_ARRAY data values assuming
+* no encoding. This is exclusive of the bytes needed to store the length of
+* each byte array. In other words, this field is equivalent to the `(size
+* of PLAIN-ENCODING the byte array values) - (4 bytes * number of values
+* written)`. To determine unencoded sizes of other types readers can use
+* schema information multiplied by the number of non-null and null values.
+* The number of null/non-null values can be inferred from the histograms
+* below.
+*
+* For example, if a column chunk is dictionary-encoded with dictionary
+* ["a", "bc", "cde"], and a data page contains the indices [0, 0, 1, 2],
+* then this value for that data page should be 7 (1 + 1 + 2 + 3).
+*
+* This field should only be set for types that use BYTE_ARRAY as their
+* physical type.
+*/
+   1: optional i64 unencoded_byte_array_data_bytes;

Review Comment:
   Nice catch. I guess @etseidl would like to pre-allocate the page once for
decompress-and-decode, since they're on GPU?
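
   A possible reading of this (a hedged sketch, not taken from any existing
implementation): with `unencoded_byte_array_data_bytes` and the value count
known up front, a decoder can size the output buffer for a PLAIN BYTE_ARRAY
page once, before decompress-and-decode.

```python
# Illustrative sketch only: pre-size a PLAIN BYTE_ARRAY output buffer from
# SizeStatistics instead of growing it while decoding.
def plain_byte_array_buffer_size(unencoded_byte_array_data_bytes: int,
                                 num_values: int) -> int:
    # PLAIN layout: a 4-byte length prefix plus the payload for each value.
    return unencoded_byte_array_data_bytes + 4 * num_values

# For the toy page above (4 values, 7 payload bytes):
print(plain_byte_array_buffer_size(7, 4))  # 23
```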






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318347248


##
src/main/thrift/parquet.thrift:
##
@@ -191,6 +191,73 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * A histogram of repetition and definition levels for either a page or column
+  * chunk.
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in memory
+  *
+  *   2. For filter push-down on nulls at various levels of nested
+  *   structures and list lengths.
+  */
+struct RepetitionDefinitionLevelHistogram {
+   /**
+* When present, there is expected to be one element corresponding to each
+* repetition (i.e. size=max repetition_level+1) where each element
+* represents the number of times the repetition level was observed in the
+* data.
+*
+* This field may be omitted if max_repetition_level is 0.
+**/
+   1: optional list<i64> repetition_level_histogram;
+   /**
+* Same as repetition_level_histogram except for definition levels.
+*
+* This field may be omitted if max_definition_level is 0 or 1.
+**/
+   2: optional list<i64> definition_level_histogram;
+ }
+
+/**
+ * A structure for capturing metadata for estimating the unencoded,
+ * uncompressed size of data written. This is useful for readers to estimate
+ * how much memory is needed to reconstruct data in their memory model and for
+ * fine grained filter pushdown on nested structures (the histogram contained
+ * in this structure can help determine the number of nulls at a particular
+ * nesting level).
+ *
+ * Writers should populate all fields in this struct except for the exceptions
+ * listed per field.
+ */
+struct SizeStatistics {
+   /**
+* The number of physical bytes stored for BYTE_ARRAY data values assuming
+* no encoding. This is exclusive of the bytes needed to store the length of
+* each byte array. In other words, this field is equivalent to the `(size
+* of PLAIN-ENCODING the byte array values) - (4 bytes * number of values
+* written)`. To determine unencoded sizes of other types readers can use
+* schema information multiplied by the number of non-null and null values.
+* The number of null/non-null values can be inferred from the histograms
+* below.
+*
+* For example, if a column chunk is dictionary-encoded with dictionary
+* ["a", "bc", "cde"], and a data page contains the indices [0, 0, 1, 2],
+* then this value for that data page should be 7 (1 + 1 + 2 + 3).
+*
+* This field should only be set for types that use BYTE_ARRAY as their
+* physical type.
+*/
+   1: optional i64 unencoded_byte_array_data_bytes;

Review Comment:
   > It also seems off to me that it assumes readers can't preserve dictionaries
   
   Hmmm, right, maybe the dictionary size should be taken into account. It would
be fine to read the dictionary into plain data (like Binary / String), but
reading into dictionary indices would require just the dictionary's raw binary
size. Should we reserve some info for the dictionary in `ColumnChunk`
@emkornfield (or is this not required)?






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-06 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1317589791


##
src/main/thrift/parquet.thrift:
##
@@ -191,6 +191,74 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * A histogram of repetition and definition levels for either a page or column
+  * chunk.
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in
+  *   memory
+  *
+  *   2. For filter push-down on nulls at various levels of nested
+  *   structures and list lengths.
+  */
+struct RepetitionDefinitionLevelHistogram {
+   /**
+* When present, there is expected to be one element corresponding to each
+* repetition (i.e. size=max repetition_level+1) where each element

Review Comment:
   If max-rep-level == 1, it could have rep==0 and rep==1 (so the histogram has
two elements).
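
   For reference, a toy example of the "size = max level + 1" convention
(illustrative only, not taken from the PR):

```python
from collections import Counter

# Illustrative only: an optional list<int32> column with
# max_repetition_level = 1 and max_definition_level = 3.
# Rows: [1, 2], NULL, [], [3]  ->  (rep, def) pairs after Dremel shredding:
levels = [(0, 3), (1, 3),  # [1, 2]
          (0, 0),          # NULL list
          (0, 1),          # [] (empty list)
          (0, 3)]          # [3]

max_rep, max_def = 1, 3
rep_counts = Counter(r for r, _ in levels)
def_counts = Counter(d for _, d in levels)

# One bucket per possible level, i.e. len == max_level + 1.
repetition_level_histogram = [rep_counts[i] for i in range(max_rep + 1)]
definition_level_histogram = [def_counts[i] for i in range(max_def + 1)]
print(repetition_level_histogram)  # [4, 1]
print(definition_level_histogram)  # [1, 1, 0, 3]
```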






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-23 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303307879


##
src/main/thrift/parquet.thrift:
##
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts
+  /**
+* Repetition and definition level histograms for the pages.
+*
+* This contains some redundancy with null_counts, however, to accommodate the
+* widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   I think a `list` (or an optional field) for the uncompressed size, with the
same length as the `OffsetIndex`, would suit your case. You can draft it first;
since this patch needs implementations in two languages as a PoC, it might take
some time. And you can first store it in key-value metadata to test whether
this works :-)






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-23 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303273552


##
src/main/thrift/parquet.thrift:
##
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts
+  /**
+* Repetition and definition level histograms for the pages.
+*
+* This contains some redundancy with null_counts, however, to accommodate the
+* widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   Hmm, first of all, the PageIndex might not be in the "footer", because there
is some flexibility in where to put it (each row group has a `(length, offset)`
pair for the column and offset index).
   
   Estimating batch size is important, but I wonder whether page-level
statistics in the "index" or "footer" might be a bit weird (because we might
already have them per page). If you want it, I think you can try to draft a new
pull request in this repo, and maybe put the statistics in the footer or the
index.
   
   I've searched the project:
   
   1. `OffsetIndex` has a compressed size, but it's actually for IO.
   2. `ColumnMetaData` has `encoding_stats`, but that's per encoding.
   
   I think neither 1 nor 2 is a perfect fit here. And as for user-defined
extended info, we can even encode the user-defined stats in
`key_value_metadata` as a base64 or base85 string.
   
   Welcome to draft a pull request in this project.
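
   As a sketch of the `key_value_metadata` idea (the key name and JSON layout
below are invented for illustration; they are not part of the format):

```python
import base64
import json

# Illustrative only: stash custom per-page stats in key_value_metadata as a
# base64 string, then recover them on the read side.
page_unencoded_sizes = {"col.x": [1024, 980, 1101], "col.y": [4096, 4100, 3988]}

blob = base64.b64encode(json.dumps(page_unencoded_sizes).encode("utf-8"))
key_value_metadata = {"my.app.page_unencoded_sizes": blob.decode("ascii")}

# A reader that knows the key can recover the stats unchanged:
raw = key_value_metadata["my.app.page_unencoded_sizes"]
assert json.loads(base64.b64decode(raw)) == page_unencoded_sizes
```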






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-23 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303273552


##
src/main/thrift/parquet.thrift:
##
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts
+  /**
+* Repetition and definition level histograms for the pages.
+*
+* This contains some redundancy with null_counts, however, to accommodate the
+* widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   Hmm, first of all, the PageIndex might not be in the "footer", because there
is some flexibility in where to put it (each row group has a `(length, offset)`
pair for the column and offset index).
   
   Estimating batch size is important, but I wonder whether page-level
statistics in the "index" or "footer" might be a bit weird (because we might
already have them per page). If you want it, I think you can try to draft a new
pull request in this repo, and maybe put the statistics in the footer or the
index.
   
   I've searched the project:
   
   1. `OffsetIndex` has a compressed size, but it's actually for IO.
   2. `ColumnMetaData` has `encoding_stats`, but that's per encoding.
   
   Welcome to draft it here. And we can even encode the user-defined stats in
`key_value_metadata` as a base64 or base85 string.






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-23 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303273552


##
src/main/thrift/parquet.thrift:
##
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts
+  /**
+* Repetition and definition level histograms for the pages.
+*
+* This contains some redundancy with null_counts, however, to accommodate the
+* widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   Hmm, first of all, the PageIndex might not be in the "footer", because there
is some flexibility in where to put it (each row group has its own
`(length, offset)` pair).
   
   Estimating batch size is important, but I wonder whether page-level
statistics in the "index" or "footer" might be a bit weird (because we might
already have them per page). If you want it, I think you can try to draft a new
pull request in this repo, and maybe put the statistics in the footer or the
index.
   
   I've searched the project:
   
   1. `OffsetIndex` has a compressed size, but it's actually for IO.
   2. `ColumnMetaData` has `encoding_stats`, but that's per encoding.
   
   Welcome to draft it here. And we can even encode the user-defined stats in
`key_value_metadata` as a base64 or base85 string.






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-23 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1302631559


##
src/main/thrift/parquet.thrift:
##
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts
+  /**
+* Repetition and definition level histograms for the pages.
+*
+* This contains some redundancy with null_counts, however, to accommodate the
+* widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   Personally I don't think we need an uncompressed size in `OffsetIndex` or any
`PageIndex`. If we did that, then maybe lots of `Statistics`, like encoding,
compression, etc., would all become part of `OffsetIndex`, because once you want
to estimate sizes, encoding and compression also matter. I think it's kind of
like turning `OffsetIndex` into a group of `PageHeader`s. So currently I think
building it into `OffsetIndex` is maybe not a good idea.






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-23 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1302631559


##
src/main/thrift/parquet.thrift:
##
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts
+  /**
+* Repetition and definition level histograms for the pages.
+*
+* This contains some redundancy with null_counts, however, to accommodate the
+* widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   Personally I don't think we need an uncompressed size in `OffsetIndex` or any
`PageIndex`. If we did that, then maybe lots of `Statistics`, like encoding,
compression, etc., would all become part of `OffsetIndex`, because once you want
to estimate sizes, encoding and compression also matter. I think it's kind of
like turning `OffsetIndex` into a group of `PageHeader`s. So currently I'm not
sure why we need this.






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-23 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1302631559


##
src/main/thrift/parquet.thrift:
##
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts
+  /**
+* Repetition and definition level histograms for the pages.
+*
+* This contains some redundancy with null_counts, however, to accommodate the
+* widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   Personally I don't think we need an uncompressed size in `OffsetIndex` or any
`PageIndex`. If we did that, then maybe lots of `Statistics`, like encoding,
compression, etc., would all become part of `OffsetIndex`, because once you want
to estimate sizes, encoding and compression also matter. I think it's kind of
like turning `OffsetIndex` into a group of `PageHeader`s. So currently I'm -1 on
this.






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-23 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1302631559


##
src/main/thrift/parquet.thrift:
##
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts
+  /**
+* Repetition and definition level histograms for the pages.
+*
+* This contains some redundancy with null_counts, however, to accommodate the
+* widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   Personally I don't think we need an uncompressed size in `OffsetIndex` or any
`PageIndex`. If we did that, then maybe lots of `Statistics`, like encoding,
compression, etc., would all become part of `OffsetIndex`. I think it's kind of
like turning `OffsetIndex` into a group of `PageHeader`s. So currently I'm -1 on
this.






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-22 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1302020927


##
src/main/thrift/parquet.thrift:
##
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts
+  /**
+* Repetition and definition level histograms for the pages.
+*
+* This contains some redundancy with null_counts, however, to accommodate the
+* widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   (But to be honest... if `max rep == 0, max def == 1`, this is useless, and I
think the configuration here would be a bit hacky... But I can try to PoC it and
see what it costs.)






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-22 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1302020927


##
src/main/thrift/parquet.thrift:
##
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts
+  /**
+* Repetition and definition level histograms for the pages.
+*
+* This contains some redundancy with null_counts, however, to accommodate the
+* widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   (But to be honest... if `max rep == 0, max def == 1`, this is useless, and I
think the configuration here would be a bit hacky...)






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-22 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1302011495


##
src/main/thrift/parquet.thrift:
##
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts
+  /**
+* Repetition and definition level histograms for the pages.
+*
+* This contains some redundancy with null_counts, however, to accommodate the
+* widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   I think `SizeEstimationStatistics` is already in `ColumnChunk`; would a
page-level `unencoded_variable_width_stored_bytes` help with pruning? The
`repetition_definition_level_histograms` here can help with pruning like
`struct.list_child == null`, after all.
   
   If we want this, it would be even better to put
`unencoded_variable_width_stored_bytes` into `OffsetIndex`. It's much more like
`compressed_page_size` there.
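
   A hedged sketch of that kind of pruning (the threshold is schema-derived;
the names are invented for illustration): a page can be skipped for a
`struct.list_child IS NOT NULL` predicate when its definition-level histogram
shows that no value reaches the level at which `list_child` is present.

```python
# Illustrative only: page-level pruning from a definition-level histogram.
def page_can_match_not_null(definition_level_histogram, present_def_level):
    """True if at least one value reaches `present_def_level`, i.e. the nested
    field is non-null somewhere in the page; otherwise the page can be pruned
    for an IS NOT NULL predicate on that field."""
    return any(count > 0
               for level, count in enumerate(definition_level_histogram)
               if level >= present_def_level)

# Suppose max_def_level = 3 and `list_child` is only present at def level 3:
assert not page_can_match_not_null([5, 2, 7, 0], present_def_level=3)  # prune
assert page_can_match_not_null([5, 2, 1, 6], present_def_level=3)      # keep
```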






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-22 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1302018886


##
src/main/thrift/parquet.thrift:
##
@@ -191,6 +191,64 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * A histogram of repetition and definition levels for either a page or column chunk.
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in memory 

Review Comment:
   Oh, I got it, since we can deduce the size of the parent layer... That's great.
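
   One possible formula, sketched under assumptions (Arrow-style layout with
int32 leaves; a reading of the discussion, not something the PR specifies):
the rep histogram gives the number of lists, the def histogram gives the
null/empty/non-null counts per level, and together with the leaf width (or
`unencoded_byte_array_data_bytes` for BYTE_ARRAY columns) they bound the
materialized size.

```python
# Illustrative estimate only: memory needed to materialize a page of
# list<int32> into an Arrow-like layout, from the rep/def histograms.
def estimate_list_int32_bytes(rep_hist, def_hist, max_def):
    num_level_pairs = sum(def_hist)        # upper bound on child slots
    num_lists = rep_hist[0]                # rep == 0 starts a new row
    num_leaf_values = def_hist[max_def]    # only the max def level carries data
    values_bytes = 4 * num_leaf_values     # int32 payload
    offsets_bytes = 4 * (num_lists + 1)    # list offsets buffer
    validity_bytes = (num_level_pairs + 7) // 8 + (num_lists + 7) // 8
    return values_bytes + offsets_bytes + validity_bytes

# With histograms [4, 1] and [1, 1, 0, 3] (max_def = 3):
print(estimate_list_int32_bytes([4, 1], [1, 1, 0, 3], max_def=3))  # 34
```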






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-22 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1302013630


##
src/main/thrift/parquet.thrift:
##
@@ -191,6 +191,64 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * A histogram of repetition and definition levels for either a page or column chunk.
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in memory 
+  *   2. For filter push-down on nulls at various levels of nested structures and
+  *  list lengths.
+  */ 
+struct RepetitionDefinitionLevelHistogram {
+   /** 
+ * When present, there is expected to be one element corresponding
+ to each repetition (i.e. size=max repetition_level+1) 

Review Comment:
   ```suggestion
* to each repetition (i.e. size=max repetition_level+1) 
   ```
   
   I'm not sure whether this would be better or not...






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-22 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1302013630


##
src/main/thrift/parquet.thrift:
##
@@ -191,6 +191,64 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * A histogram of repetition and definition levels for either a page or column chunk.
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in memory 
+  *   2. For filter push-down on nulls at various levels of nested structures and
+  *  list lengths.
+  */ 
+struct RepetitionDefinitionLevelHistogram {
+   /** 
+ * When present, there is expected to be one element corresponding
+ to each repetition (i.e. size=max repetition_level+1) 

Review Comment:
   ```suggestion
* to each repetition (i.e. size=max repetition_level+1) 
   ```



##
src/main/thrift/parquet.thrift:
##
@@ -191,6 +191,64 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * A histogram of repetition and definition levels for either a page or column chunk.
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in memory 

Review Comment:
   (2) can easily be done with these statistics, but I wonder how we can
estimate the size of the data with the rep/def histogram... Is there a formula?



##
src/main/thrift/parquet.thrift:
##
@@ -974,6 +1052,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts
+  /**
+* Repetition and definition level histograms for the pages.
+*
+* This contains some redundancy with null_counts, however, to accommodate the

Review Comment:
   ```suggestion
   * This contains some redundancy with null_counts, however, to accommodate the
   ```






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-22 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1302011495


##
src/main/thrift/parquet.thrift:
##
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts
+  /**
+* Repetition and definition level histograms for the pages.
+*
+* This contains some redundancy with null_counts, however, to accommodate the
+* widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   I think `SizeEstimationStatistics` is already in `ColumnChunk`; would a
page-level `unencoded_variable_width_stored_bytes` help with pruning?
   
   If we want this, it would be even better to put
`unencoded_variable_width_stored_bytes` into `OffsetIndex`. It's much more like
`compressed_page_size` there.






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-22 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1302011495


##
src/main/thrift/parquet.thrift:
##
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts
+  /**
+* Repetition and definition level histograms for the pages.
+*
+* This contains some redundancy with null_counts, however, to accommodate the
+* widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   I think `SizeEstimationStatistics` is already in `ColumnChunk`; would a
page-level `unencoded_variable_width_stored_bytes` help with pruning?
   
   If we want this, it would be even better to put
`unencoded_variable_width_stored_bytes` into `OffsetIndex`.


