JFinis commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1317578394
##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts, however, to accommodate
+ * the widest range of readers both should be populated when either the max
+ * definition and repetition level meet the requirements specified in
+ * RepetitionDefinitionLevelHistogram.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram>
repetition_definition_level_histograms
Review Comment:
The way you encode the values here is wasteful, both in the thrift size but
even more in the deserialized size, as you store a lot of small lists. Given
that the column index can become big due to many pages, I would consider this
inefficiency an issue.
I would propose not grouping repetition & definition level into a struct,
especially in the column index, so we don't need a struct per page.
Furthermore, I would actually propose to even "turn the lists around", i.e.:
```
optional list<list<i64>> repetition_level_histogram
optional list<list<i64>> definition_level_histogram
```
Here, the **outer** list would actually be the list of levels and the
**inner** list the list of pages. So, if we have two possible levels and 5
pages, this would store e.g. `[[3,5,2,1,3],[4,5,6,7,8]]`.
This way, the encoding will be way tighter and also more columnar, helping
with efficient processing.
The reason for this is that most of the time, we will have way more pages
than levels. E.g., let's say we have 100 pages and 5 repetition and definition
levels.
With your current proposal, you would need 201 lists (one outer, 100 for
definition levels, 100 for repetition levels) and 100 structs to represent the
deserialized form. Especially in languages like Java, each of these objects has
a considerable overhead. E.g., an ArrayList has 16 bytes overhead and a struct
has 8, so you need a whopping 4016 bytes just to encode the structs and lists
without counting the elements in them.
With my proposed encoding, you would only store 12 lists (two outer, 5 inner
for definition levels, 5 inner for repetition levels) and no structs, having
only 192 bytes overhead.
In thrift, the overhead is lower, but it is still existent and costs space
and performance.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]