JFinis commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1317578394
##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts, however, to accommodate
+ * the widest range of readers both should be populated when either the max
+ * definition and repetition level meet the requirements specified in
+ * RepetitionDefinitionLevelHistogram.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram>
repetition_definition_level_histograms
Review Comment:
The way you encode the values here is wasteful, both in the thrift size but
even more in the deserialized size, as you store a lot of small lists. Given
that the column index can become big due to many pages, I would consider this
inefficiency an issue.
I would propose not grouping repetition & definition level into a struct,
especially in the column index, so we don't need a struct per page.
Furthermore, I would actually propose to even "turn the lists around", i.e.:
```
optional list<list<i64>> repetition_level_histogram
optional list<list<i64>> definition_level_histogram
```
Here, the **outer** list would actually be the list of levels and the
**inner** list the list of pages. So, if we have two possible levels and 5
pages, this would store e.g. `[[3,5,2,1,3],[4,5,6,7,8]]`.
This way, the encoding will be way tighter and also more columnar, helping
with efficient processing.
The reason for this is that most of the time, we will have way more pages
than levels. E.g., let's say we have 100 pages and 5 repetition and definition
levels.
With your current proposal, you would need 201 lists (one outer, 100 for
definition levels, 100 for repetition levels) and 100 structs to represent the
deserialized form. Especially in languages like Java, each of these objects has
a considerable overhead.
With my proposed encoding, you would only store 12 lists (two outer, 5 inner
for definition levels, 5 inner for repetition levels).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]