[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

via GitHub Thu, 07 Sep 2023 01:23:16 -0700


JFinis commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318260892



##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /**
+    * Repetition and definition level histograms for the pages.
+    *
+    * This contains some redundancy with null_counts, however, to accommodate
+    * the widest range of readers both should be populated when either the max
+    * definition and repetition level meet the requirements specified in
+    * RepetitionDefinitionLevelHistogram.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms

Review Comment:
   
   
   > For page level encoding, if we are really concerned about this, what do
   > you think about just having a flat list of size (max_level + 1) * number of
   > pages. This would be the best in terms of memory compactness for memory
   > optimizations.
   
   We could do that as well. That adds a bit more implementation complexity,
   but I would still argue it is not complex enough to worry about. The
   arithmetic is simple, and getting the values out takes just one or a few
   lines of code.
   
   My gut feeling is that we would still want column-major order in this list 
to stay in line with the rest of the column index, but the main issue is gone 
with this design, so the distinction isn't that important anymore.
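
To illustrate how little arithmetic is involved, here is a minimal sketch in Python. The function names are hypothetical and not part of any Parquet library. It also assumes that "column-major" here means all pages' counts for level 0 come first, then all pages' counts for level 1, and so on; that reading is an assumption, and the alternative (each page's histogram stored contiguously) is shown alongside it.

```python
from typing import List


def page_major_index(page: int, level: int, max_level: int) -> int:
    """Flat index when each page's histogram is stored contiguously."""
    return page * (max_level + 1) + level


def level_major_index(page: int, level: int, num_pages: int) -> int:
    """Flat index when each level's per-page counts are stored contiguously
    (the assumed meaning of "column-major" in this sketch)."""
    return level * num_pages + page


def page_histogram_page_major(flat: List[int], page: int, max_level: int) -> List[int]:
    """Extract one page's histogram from the page-major layout: one slice."""
    width = max_level + 1
    return flat[page * width:(page + 1) * width]


def page_histogram_level_major(flat: List[int], page: int, max_level: int) -> List[int]:
    """Extract one page's histogram from the level-major layout: one strided slice."""
    num_pages = len(flat) // (max_level + 1)
    return flat[page::num_pages]


# Example with max_level = 2 and two pages:
#   page 0 histogram: [5, 1, 0]
#   page 1 histogram: [2, 2, 4]
page_major = [5, 1, 0, 2, 2, 4]
level_major = [5, 2, 1, 2, 0, 4]

hist_a = page_histogram_page_major(page_major, page=1, max_level=2)
hist_b = page_histogram_level_major(level_major, page=1, max_level=2)
```

In either layout the list has (max_level + 1) * number of pages entries, and reading one page's histogram is a single slice, so the choice of order mainly affects consistency with the other per-page lists rather than complexity.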
   



