[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

via GitHub Wed, 06 Sep 2023 15:11:53 -0700


etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1317881962



##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /**
+    * Repetition and definition level histograms for the pages.
+    *
+    * This contains some redundancy with null_counts, however, to accommodate
+    * the widest range of readers both should be populated when either the max
+    * definition and repetition level meet the requirements specified in
+    * RepetitionDefinitionLevelHistogram.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> 
repetition_definition_level_histograms

Review Comment:
   I'm not disagreeing with you @JFinis, but allow me to play devil's advocate. 
It seems that lately the consensus is moving towards keeping the implementation 
conceptually simple at the expense of some increased overhead. While the effect 
of the new structures on the index sizes can be considerable, in terms of 
overall file size, the effect of the new structures is truly negligible. Your 
proposal is instead leading in the opposite direction, adding complexity to the 
implementation to save on overhead. And it adds the additional complexity of 
changing the indexing of the ColumnIndex members for a subset of fields; users 
will have to remember that page index is not the major index for the 
histograms. For an individual file, I'm not sure the added complexity is worth 
the savings.
   
   One use case I can think of that could cause problems is one in which you 
have a large set of hive partitioned files. If the files number in the 
thousands, then the aggregate cost of reading the page indexes could start to 
add up quickly. But if one uses a parallel query engine like spark to consume 
the files, this additional overhead shouldn't be too bad. If one wishes to 
cache the indexes to speed up queries, the additional memory could be a 
concern.  But one could counter for that case that perhaps simply materializing 
the indexes as thrift objects is not the most performant approach, and you 
would be better served reading the indexes and then transforming them into a 
more memory efficient form to keep in RAM.
   
   I think in terms of the 3 options from 
https://github.com/apache/parquet-format/pull/197#issuecomment-1700406992, this 
lies somewhere between option 2 and option 3. The most recent proposal from 
@emkornfield is to move in the opposite direction towards option 1. Your 
proposal isn't bad (quite the opposite, I think it's clever), just swimming 
upstream a bit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

Reply via email to