emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318059860
##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts, however, to accommodate
+ * the widest range of readers both should be populated when either the max
+ * definition and repetition level meet the requirements specified in
+ * RepetitionDefinitionLevelHistogram.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram>
repetition_definition_level_histograms
Review Comment:
I think it is worth potentially separating out two parts here:
1. thrift representation and its inherent serialization overhead.
2. organization of the histograms.
For 1, the optimal from a space perspective is a single list, as proposed in
the
[comment](https://github.com/apache/parquet-format/pull/197#discussion_r1317866553)
above since we know the size of each histogram. In terms of complexity I
actually don't think it adds much but it is slightly less ergonomic to code
against. I'm actually somewhat inclined to take this approach if others are OK
with it.
2. organization of the histograms - If the goal on the page index is
primarily filtering then organized as @JFinis suggests would likely be better,
because most filters would likely be translated into a mix/max
definition/repetition level and comparing that against a specific bucket in the
histogram. Of course this makes using repetition/definition levels on the page
level harder to user for size estimation. The other downside, I see is the
organization doesn't maintain consistency with any of the other elements in the
index.
@gszadovszky was item one a consideration in the original index design (e.g.
why wasn't a list of structs used?), it would be good to be consistent with the
original philosophy here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]