JFinis commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318275562
##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts, however, to accommodate
+ * the widest range of readers both should be populated when either the max
+ * definition and repetition level meet the requirements specified in
+ * RepetitionDefinitionLevelHistogram.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram>
repetition_definition_level_histograms
Review Comment:
> For an individual file, I'm not sure the added complexity is worth the
savings.
Sidenote: I am maintaining a proprietary implementation and I honestly think
that the savings would be worth it for our use case. And our use case is not
super special (using Iceberg for storing insane amounts of data, both in the
number of rows and in the number of individual tables); I guess a lot of others
have a similar use case, so I guess the savings will be worth it for many. Let
me explain why I think this is the case:
Not only does a list-of-lists design take up more space; thrift-decoding a flat list of integers is also much faster than thrift-decoding a list of structs of lists. For example, this is the C++ code for decoding the `null_counts` of the column index (I used a proprietary thrift compiler that generates slightly different code than the default implementation, but it is effectively equivalent):
```cpp
xfer += iprot->readListBegin(type, size);
this->null_counts = container.allocateSpan<int64_t>(size);
for (uint32_t i = 0; i < size; ++i) {
  xfer += iprot->readI64(this->null_counts[i]);
}
xfer += iprot->readListEnd();
```
So the hot loop is as efficient as it gets: it just reads I64s. With structs
and nested lists, you have to:
* allocate lists & structs
* read and check the isset flags for all nested optional fields (both lists
are declared optional)
* recurse between the different read implementations
You could argue that this is still negligible time. But you do this for
each row group, and maybe for multiple columns of a row group, so you can
easily run this code 100,000 times when processing larger tables. Especially
with formats like Delta Lake and Iceberg, files are often smaller than we would
wish (e.g., a small insert generates a small file for just that insert).
Thus, we sometimes have row groups with just hundreds of rows in them. For such
row groups, thrift-decoding the metadata can actually become slower than
reading the actual rows in some cases. So thrift-decoding performance does
matter, at least in my application.
> If one wishes to cache the indexes to speed up queries, the additional
memory could be a concern
It just so happens that my application does this :). So yes, the more memory
we waste, the fewer indexes we can keep in memory and the fewer cache hits we
get, which would be a shame if it were preventable.
Yes, we could store the thrift-encoded data instead. But as laid out above,
thrift-decoding also becomes more costly with a list-of-lists design. A
ColumnIndex is used, e.g., for a point access (equality predicate on a sorted,
partitioned table), and in that case decoding speed matters, as users expect a
very quick answer for such a query that returns a single row.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]