clairemcginty commented on code in PR #3098:
URL: https://github.com/apache/parquet-java/pull/3098#discussion_r1920639113
##########
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java:
##########
@@ -378,6 +379,11 @@ public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(Contains<T> conta
indices -> IndexIterator.all(getPageCount()));
}
+ @Override
+ public PrimitiveIterator.OfInt visit(Size size) {
+ return IndexIterator.all(getPageCount());
Review Comment:
cool, I'll implement it 👍 To check my understanding, the rep- and def-level
histograms we have access to here are implemented as flat `List<Long>` and
represent [the levels for all pages concatenated
together](https://github.com/apache/parquet-format/blob/apache-parquet-format-2.10.0/src/main/thrift/parquet.thrift#L1054-L1066):
```java
/**
* Contains repetition level histograms for each page
* concatenated together. The repetition_level_histogram field on
* SizeStatistics contains more details.
*
* When present the length should always be (number of pages *
* (max_repetition_level + 1)) elements.
*
* Element 0 is the first element of the histogram for the first page.
* Element (max_repetition_level + 1) is the first element of the histogram
* for the second page.
**/
```
So I'll need to break up the flat lists into per-page histograms in order to
perform per-page filtering here. But a
[comment](https://github.com/apache/parquet-java/blob/apache-parquet-1.15.0/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java#L664-L675)
in ColumnIndexBuilder indicates that we don't have access to
maxRepetitionLevel here.
I guess if every page's histogram is the same size and we know that
[{rep,def}LevelHistogram.size() % pageCount ==
0](https://github.com/apache/parquet-java/blob/apache-parquet-1.15.0/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java#L664),
I could just divide the total histogram size by pageCount to get the size of
each individual histogram?
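
Roughly what I have in mind, as a standalone sketch rather than the actual change: `HistogramSplitter` and `splitByPage` are hypothetical names, not existing parquet-java APIs, and it assumes every page contributes the same number of entries (max level + 1) to the flat list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of parquet-java.
final class HistogramSplitter {

  private HistogramSplitter() {}

  /**
   * Splits a flat, page-concatenated level histogram into one list per page.
   * Assumes each page contributes the same number of entries (maxLevel + 1),
   * so the per-page size can be derived without knowing maxLevel.
   */
  static List<List<Long>> splitByPage(List<Long> flatHistogram, int pageCount) {
    if (pageCount <= 0) {
      throw new IllegalArgumentException("pageCount must be positive: " + pageCount);
    }
    if (flatHistogram.size() % pageCount != 0) {
      throw new IllegalArgumentException(
          "histogram size " + flatHistogram.size()
              + " is not a multiple of pageCount " + pageCount);
    }

    // Entries per page; under the assumption above this equals maxLevel + 1.
    int perPage = flatHistogram.size() / pageCount;

    List<List<Long>> pages = new ArrayList<>(pageCount);
    for (int page = 0; page < pageCount; page++) {
      int start = page * perPage;
      // subList returns a view over the flat list, so nothing is copied.
      pages.add(flatHistogram.subList(start, start + perPage));
    }
    return pages;
  }
}
```

For example, a flat list `[3, 1, 0, 2, 2, 5]` with `pageCount = 2` would split into `[3, 1, 0]` and `[2, 2, 5]`, which also implies a max level of 2 for that column.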