[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

via GitHub Wed, 06 Sep 2023 21:08:16 -0700


emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318059860



##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /**
+    * Repetition and definition level histograms for the pages.
+    *
+    * This contains some redundancy with null_counts, however, to accommodate
+    * the widest range of readers both should be populated when either the max
+    * definition and repetition level meet the requirements specified in
+    * RepetitionDefinitionLevelHistogram.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> 
repetition_definition_level_histograms

Review Comment:
   I think it is worth potentially separating out two parts here:
   1.  thrift representation and its inherent serialization overhead.
   2. organization of the histograms.
   
   For 1, the optimal from a space perspective is a single list, as proposed in 
the 
[comment](https://github.com/apache/parquet-format/pull/197#discussion_r1317866553)
 above since we know the size of each histogram.  In terms of complexity I 
actually don't think it adds much but it is slightly less ergonomic to code 
against.  I'm actually somewhat inclined to take this approach if others are OK 
with it.
   
   2. organization of the histograms - If the goal on the page index is 
primarily filtering then organized as @JFinis suggests would likely be better, 
because most filters would likely be translated into a mix/max 
definition/repetition level and comparing that against a specific bucket in the 
histogram.  Of course this makes using repetition/definition levels on the page 
level harder to user for size estimation.  The other downside, I see is the 
organization doesn't maintain consistency with any of the other elements in the 
index.
   
   @gszadovszky was item one a consideration in the original index design (e.g. 
why wasn't a list of structs used?), it would be good to be consistent with the 
original philosophy here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

Reply via email to