JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1317578394
########## src/main/thrift/parquet.thrift: ########## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5: optional list<i64> null_counts + /** + * Repetition and definition level histograms for the pages. + * + * This contains some redundancy with null_counts, however, to accommodate + * the widest range of readers both should be populated when either the max + * definition and repetition level meet the requirements specified in + * RepetitionDefinitionLevelHistogram. + **/ + 6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms Review Comment: The way you encode the values here is wasteful, both in the thrift size but even more in the deserialized size, as you store a lot of small lists. I would propose not grouping repetition & definition level into a struct, especially in the column index, so we don't need a struct per page. Furthermore, I would actually propose to even "turn the lists around", i.e.: ``` optional list<list<i64>> repetition_level_histogram optional list<list<i64>> definition_level_histogram ``` Here, the outer list would actually be the list of levels and the inner list the list of pages. So, if we have two possible levels and 5 pages, this would store e.g. `[[3,5,2,1,3],[4,5,6,7,8]]`. This way, the encoding will be way tighter and also more columnar, helping with efficient processing. The reason for this is that most of the time, we will have way more pages than levels. E.g., let's say we have 100 pages and 5 repetition and definition levels. With your current proposal, you would need 201 lists (one outer, 100 for definition levels, 100 for repetition levels) and 100 structs to represent the deserialized form. Especially in languages like Java, each of these objects has a considerable overhead. With my proposed encoding, you would only store 12 lists (two outer, 5 inner for definition levels, 5 inner for repetition levels). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org