JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318367540
##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /**
+   * Repetition and definition level histograms for the pages.
+   *
+   * This contains some redundancy with null_counts, however, to accommodate
+   * the widest range of readers both should be populated when either the max
+   * definition and repetition level meet the requirements specified in
+   * RepetitionDefinitionLevelHistogram.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms

Review Comment:
   > @gszadovszky was item one a consideration in the original index design (e.g. why wasn't a list of structs used?), it would be good to be consistent with the original philosophy here.

   I would guess that memory locality in particular was a design consideration that led to a columnar design. The indexes are meant to be searched by either linear or binary search. With these access patterns, the operation will be **way** faster if all the values being searched sit in one contiguous block of memory. If we had an indirection through structs or lists, we would basically incur a cache miss on each access, even in a linear search. That can make the search slower by orders of magnitude. Since an index is supposed to be a structure for performing look-ups swiftly, an order-of-magnitude slow-down on look-up is not good.
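To make the layout argument concrete, here is a minimal sketch of the two shapes being compared. It is illustrative only and is not taken from parquet-format or any Parquet reader: `PageStats`, `find_page_columnar`, `find_page_structs`, and the sample statistics are all hypothetical, and it assumes pages are sorted ascending by their min value. CPython will not reproduce the cache behaviour of a native reader; the point is only that the columnar variant searches one contiguous int64 buffer, while the struct variant has to dereference a separate object on every probe.

```python
from array import array
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page struct (row layout): one heap object per page."""
    min_value: int
    max_value: int
    null_count: int


def find_page_columnar(min_values: array, key: int) -> int:
    """Binary search over one contiguous int64 buffer (columnar layout).

    Returns the index of the last page whose min is <= key, or -1 if none.
    Assumes min_values is sorted ascending.
    """
    return bisect_right(min_values, key) - 1


def find_page_structs(pages: list, key: int) -> int:
    """Same search over a list of structs (row layout).

    Every probe follows a reference to a separate PageStats object
    before it can compare, which is the indirection discussed above.
    """
    lo, hi = 0, len(pages)
    while lo < hi:
        mid = (lo + hi) // 2
        if pages[mid].min_value <= key:
            lo = mid + 1
        else:
            hi = mid
    return lo - 1


# Made-up page statistics, sorted by min value.
pages = [PageStats(0, 9, 0), PageStats(10, 19, 2), PageStats(20, 29, 0)]

# Columnar form: one typed, contiguous array per field.
min_values = array("q", (p.min_value for p in pages))
null_counts = array("q", (p.null_count for p in pages))
```

Both functions return the same page index for any key; only the memory access pattern differs. In the columnar form a search touches `min_values` alone, so unrelated fields such as `null_counts` never have to be pulled into cache.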