JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318260892
########## src/main/thrift/parquet.thrift: ########## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5: optional list<i64> null_counts + /** + * Repetition and definition level histograms for the pages. + * + * This contains some redundancy with null_counts, however, to accommodate + * the widest range of readers both should be populated when either the max + * definition and repetition level meet the requirements specified in + * RepetitionDefinitionLevelHistogram. + **/ + 6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms Review Comment: > For page level encoding, if we are really concerned about this, what do you think about just having a flat list of size (max_level + 1) * number of pages. This would be the best in terms of memory compactness for memory optimizations. We could do that as well. That adds a tiny bit more of implementation complexity, but I would still argue it is not complex enough that we would need to worry. The arithmetic will be pretty simple and getting the values out will just be one or a few lines of code. My gut feeling is that we would still want column-major order in this list to stay in line with the rest of the column index, but the main issue is gone with this design, so the distinction isn't that important anymore. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org