JFinis commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318273187


##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /**
+    * Repetition and definition level histograms for the pages.
+    *
+    * This contains some redundancy with null_counts, however, to accommodate
+    * the widest range of readers both should be populated when either the max
+    * definition and repetition level meet the requirements specified in
+    * RepetitionDefinitionLevelHistogram.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms

Review Comment:
   Concerning implementation complexity: I agree that ease of implementation 
is a factor, but I would argue that the columnar encoding isn't much more 
complex. The `ColumnIndex` is a column-major data structure by design; all of 
its other fields are stored column-major. The current design breaks with that 
and would make it a half-column-major, half-row-major data structure, and I 
would argue that this mixing of layouts actually makes the code more complex.
   
   Yes, the layout would no longer be exactly the same as in `Statistics`, but 
this is already true for all the other fields in the `ColumnIndex` (e.g., 
min/max are also not in a struct in the `ColumnIndex`).
   
   Finally, anyone who wants to handle `Statistics` and the `ColumnIndex` with 
the same code can transform one into the other in a few lines. When iterating 
through the pages, they can even reuse the same lists for this transformation: 
instead of creating 200+ lists, they can retain one pair of definition level 
and repetition level lists and refill them for each page. I would argue that 
this is not a level of complexity we should be afraid of, given that people 
handling these lists already have to understand things like Dremel encoding, 
which is a whole different level of complexity.
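   
   To make the "few lines" claim concrete, here is a rough sketch of that transformation. It is only an illustration under assumptions that are not part of this PR: the function name `iter_page_histograms` is hypothetical, and I assume a layout in which the counts of all pages are concatenated into one flat list per level type, so that page `p` occupies the slice `[p * (max_level + 1), (p + 1) * (max_level + 1))`. Any other columnar layout would only change the index arithmetic.

```python
from typing import Iterator, List, Tuple


def iter_page_histograms(
    flat_rep: List[int],
    flat_def: List[int],
    max_rep_level: int,
    max_def_level: int,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield one (repetition, definition) histogram pair per page.

    Hypothetical helper: assumes each flat list is the concatenation of
    the per-page histograms, each of width max_level + 1.
    """
    rep_width = max_rep_level + 1
    def_width = max_def_level + 1
    if len(flat_rep) % rep_width or len(flat_def) % def_width:
        raise ValueError("flat histogram length is not a multiple of its width")
    num_pages = len(flat_rep) // rep_width
    if num_pages != len(flat_def) // def_width:
        raise ValueError("histograms disagree on the number of pages")

    # One pair of lists is allocated up front and refilled for every page.
    rep_hist = [0] * rep_width
    def_hist = [0] * def_width
    for page in range(num_pages):
        rep_hist[:] = flat_rep[page * rep_width:(page + 1) * rep_width]
        def_hist[:] = flat_def[page * def_width:(page + 1) * def_width]
        yield rep_hist, def_hist
```

   Because the same two lists are yielded on every iteration, a caller that wants to keep a page's histograms beyond the current loop step has to copy them (e.g. `list(rep_hist)`).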
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
