JFinis commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318273187
##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts, however, to accommodate
+ * the widest range of readers both should be populated when either the max
+ * definition and repetition level meet the requirements specified in
+ * RepetitionDefinitionLevelHistogram.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram>
repetition_definition_level_histograms
Review Comment:
Concerning implementation complexity: I agree that ease of implementation is a
factor, but I would argue that the columnar encoding isn't much more complex.
The `ColumnIndex` is a column-major data structure by design; all of its other
fields are stored column-major. The current design breaks with this and would
make it a half-column-major, half-row-major data structure, and I would argue
that this mixing of designs actually makes the code more complex.
Yes, the layout would no longer be exactly the same as in `Statistics`, but
this is already true for all the other fields in the `ColumnIndex` (e.g.,
min/max are also not in a struct in the `ColumnIndex`).
Finally, if one wants to use the same code to handle `Statistics` and the
`ColumnIndex`, one can transform one into the other in a few lines of code.
When iterating through the pages, this transformation can even reuse the same
lists: instead of creating 200+ lists, one can retain a single pair of
definition level and repetition level lists and refill it for each page. I
would argue that this is not a level of complexity we should be afraid of,
given that people have to understand things like Dremel encoding when handling
these lists, which is a whole different level of complexity.
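A minimal sketch of the transformation described above. It assumes a
hypothetical column-major layout in which the histograms for all pages are
stored in one flat list per level type, with the histogram of page `i`
occupying the contiguous run of `max_level + 1` entries starting at
`i * (max_level + 1)`. The function name, parameter names, and layout are
illustrative assumptions, not part of this PR or of any Parquet library.

```python
from typing import Iterator, List, Tuple


def iter_page_histograms(
    rep_flat: List[int],
    def_flat: List[int],
    max_rep_level: int,
    max_def_level: int,
    num_pages: int,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (repetition, definition) histograms page by page.

    Assumed layout (hypothetical): each flat list holds the per-page
    histograms back to back, so page i starts at i * (max_level + 1).
    The two yielded lists are reused for every page, so callers must
    copy them if they need to keep a page's values.
    """
    rep_width = max_rep_level + 1
    def_width = max_def_level + 1
    if len(rep_flat) != num_pages * rep_width:
        raise ValueError("repetition histogram length does not match page count")
    if len(def_flat) != num_pages * def_width:
        raise ValueError("definition histogram length does not match page count")

    # One pair of buffers, allocated once and refilled for each page.
    rep_buf = [0] * rep_width
    def_buf = [0] * def_width

    for page in range(num_pages):
        rep_start = page * rep_width
        for level in range(rep_width):
            rep_buf[level] = rep_flat[rep_start + level]
        def_start = page * def_width
        for level in range(def_width):
            def_buf[level] = def_flat[def_start + level]
        yield rep_buf, def_buf
```

Code written against the per-page, `Statistics`-style shape can then consume
the yielded pair directly. Because the buffers are reused, a caller that wants
to retain the histograms of an earlier page has to copy them before advancing
to the next page.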