JFinis commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318275562
##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts, however, to accommodate
+ * the widest range of readers both should be populated when either the max
+ * definition and repetition level meet the requirements specified in
+ * RepetitionDefinitionLevelHistogram.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram>
repetition_definition_level_histograms
Review Comment:
> For an individual file, I'm not sure the added complexity is worth the
savings.
Sidenote: I am maintaining a proprietary implementation and I honestly think
that the savings would be worth it for our use case. And our use case is not
super special (using Iceberg for storing insane amounts of data, both in the
number of rows and in the number of individual tables); I guess a lot of others
have a similar use case, so I guess the savings will be worth it for many. Let
me explain why I think this is the case:
Not only does a list-of-lists design take up more space; thrift-decoding a flat list of integers is also much faster than thrift-decoding a list of structs of lists. For example, this is the C++ code for decoding the `null_counts` of the column index (I used a proprietary thrift compiler that generates slightly different code than the default implementation, but it is effectively equivalent):
```cpp
xfer += iprot->readListBegin(type, size);
this->null_counts = container.allocateSpan<int64_t>(size);
for (uint32_t i = 0; i < size; ++i) {
  xfer += iprot->readI64(this->null_counts[i]);
}
xfer += iprot->readListEnd();
```
So the hot loop is as efficient as it gets: it just reads I64s. With structs
and nested lists, you have to:
* allocate lists & structs
* read and check the isset flags for all nested optional fields (both lists
are declared optional)
* recurse between the different read implementations
You could argue that this is still negligible time. But you do this for
each row group, and maybe for multiple columns of a row group, so you can
easily run this code 100,000 times when processing larger tables. Especially
with formats like Delta Lake and Iceberg, files are often smaller than we would
wish (e.g., a small insert generates a small file for just that insert).
Thus, we sometimes have row groups with just hundreds of rows in them. For such
row groups, thrift-decoding the metadata can actually become slower than
reading the actual rows in some cases. So thrift-decoding performance does
matter, at least in my application.
> If one wishes to cache the indexes to speed up queries, the additional
memory could be a concern
It just so happens that my application does this :). So yes, the more memory
we waste, the fewer indexes we can keep in memory and the fewer cache hits we
get, which would be a shame if it were preventable.
Yes, we could store the thrift-encoded data instead. But as laid out above,
thrift-decoding also becomes more costly with a list-of-lists design. A
ColumnIndex is used, e.g., for a point access (equality predicate on a sorted,
partitioned table), and in that case decoding speed matters, as users expect a
very quick answer for such a query that returns a single row.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]