wgtmac commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1317993757
##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts, however, to accommodate
+ * the widest range of readers both should be populated when either the max
+ * definition and repetition level meet the requirements specified in
+ * RepetitionDefinitionLevelHistogram.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram>
repetition_definition_level_histograms
Review Comment:
Just want to add that usually the implementation does not directly expose
thrift object to the end user and instead it wraps or converts to a
language-native form. Using a nested list thrift definition may involve more
conversion if a page-level histogram is required and in turn introduces more
memory overhead.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]