etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1317881962
##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts, however, to accommodate
+ * the widest range of readers both should be populated when either the max
+ * definition and repetition level meet the requirements specified in
+ * RepetitionDefinitionLevelHistogram.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram>
repetition_definition_level_histograms
Review Comment:
I'm not disagreeing with you @JFinis, but allow me to play devil's advocate.
It seems that lately the consensus is moving towards keeping the implementation
conceptually simple at the expense of some increased overhead. While the effect
of the new structures on the index sizes can be considerable, in terms of
overall file size, the effect of the new structures is truly negligible. Your
proposal is instead leading in the opposite direction, adding complexity to the
implementation to save on overhead. And it adds the additional complexity of
changing the indexing of the ColumnIndex members for a subset of fields; users
will have to remember that page index is not the major index for the
histograms. For an individual file, I'm not sure the added complexity is worth
the savings.
One use case I can think of that could cause problems is one in which you
have a large set of hive partitioned files. If the files number in the
thousands, then the aggregate cost of reading the page indexes could start to
add up quickly. But if one uses a parallel query engine like spark to consume
the files, this additional overhead shouldn't be too bad. If one wishes to
cache the indexes to speed up queries, the additional memory could be a
concern. But one could counter for that case that perhaps simply materializing
the indexes as thrift objects is not the most performant approach, and you
would be better served reading the indexes and then transforming them into a
more memory efficient form to keep in RAM.
I think in terms of the 3 options from
https://github.com/apache/parquet-format/pull/197#issuecomment-1700406992, this
lies somewhere between option 2 and option 3. The most recent proposal from
@emkornfield is to move in the opposite direction towards option 1. Your
proposal isn't bad (quite the opposite, I think it's clever), just swimming
upstream a bit.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]