[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

via GitHub Thu, 07 Sep 2023 02:50:39 -0700


JFinis commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318367540



##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /**
+    * Repetition and definition level histograms for the pages.
+    *
+    * This contains some redundancy with null_counts, however, to accommodate
+    * the widest range of readers both should be populated when either the max
+    * definition and repetition level meet the requirements specified in
+    * RepetitionDefinitionLevelHistogram.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms

Review Comment:
   > @gszadovszky was item one a consideration in the original index design 
(e.g. why wasn't a list of structs used?), it would be good to be consistent 
with the original philosophy here.
   
   I would guess that memory locality in particular was a design consideration 
that led to the columnar design. The indexes are meant to be searched by 
either linear or binary search. With these access patterns, the operation is 
**way** faster if we have one contiguous block of memory. If we had an 
indirection through structs or lists, we would basically incur a cache miss on 
each access, even in a linear search. That can make the search slower by 
orders of magnitude. As an index is supposed to be a structure for performing 
look-ups swiftly, an order-of-magnitude slow-down on look-up is not good.
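
   To make the two layouts concrete, here is a minimal sketch in Python. All 
names (`max_values`, `PageStats`, `first_candidate_page`) are hypothetical and 
only mirror the shape of `ColumnIndex`; they are not taken from any Parquet 
implementation. Python will not reproduce the cache behaviour, so this only 
illustrates the access pattern, not the speed difference.

```python
from array import array
from bisect import bisect_left
from dataclasses import dataclass

# Columnar layout: one contiguous array per field, indexed by page number.
# Assumption: pages are sorted ascending, so per-page max values are sorted.
max_values = array("q", [9, 19, 29, 39])   # per-page max value
null_counts = array("q", [0, 2, 0, 1])     # per-page null count


def first_candidate_page(max_values, target):
    """Index of the first page whose max >= target (len(max_values) if none).

    The binary search touches a single flat buffer.
    """
    return bisect_left(max_values, target)


# List-of-structs layout: every page's statistics live in a separate object.
@dataclass
class PageStats:
    max_value: int
    null_count: int


pages = [PageStats(m, n) for m, n in zip(max_values, null_counts)]


def first_candidate_page_structs(pages, target):
    """Same search, but each probe first follows a reference to a struct."""
    lo, hi = 0, len(pages)
    while lo < hi:
        mid = (lo + hi) // 2
        if pages[mid].max_value < target:  # extra indirection per probe
            lo = mid + 1
        else:
            hi = mid
    return lo
```

   Both functions return the same page index; the difference is that the 
columnar version only ever reads from one contiguous buffer, which is the 
property I would expect the original design to have been after.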



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
