clairemcginty commented on code in PR #3098:
URL: https://github.com/apache/parquet-java/pull/3098#discussion_r1920639113


##########
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java:
##########
@@ -378,6 +379,11 @@ public <T extends Comparable<T>> PrimitiveIterator.OfInt 
visit(Contains<T> conta
           indices -> IndexIterator.all(getPageCount()));
     }
 
+    @Override
+    public PrimitiveIterator.OfInt visit(Size size) {
+      return IndexIterator.all(getPageCount());

Review Comment:
   cool, I'll implement it 👍 To check my understanding, the rep- and def-level 
histograms we have access to here are implemented as flat `List<Long>` and 
represent [the levels for all pages concatenated 
together](https://github.com/apache/parquet-format/blob/apache-parquet-format-2.10.0/src/main/thrift/parquet.thrift#L1054-L1066):
   
   ```java
     /**
      * Contains repetition level histograms for each page
      * concatenated together.  The repetition_level_histogram field on
      * SizeStatistics contains more details.
      *
      * When present the length should always be (number of pages *
      * (max_repetition_level + 1)) elements.
      *
      * Element 0 is the first element of the histogram for the first page.
      * Element (max_repetition_level + 1) is the first element of the histogram
      * for the second page.
      **/
   ```
   
   So I'll need to break up the flat lists into per-page histograms in order to 
perform per-page filtering here. But a 
[comment](https://github.com/apache/parquet-java/blob/apache-parquet-1.15.0/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java#L664-L675)
 in ColumnIndexBuilder indicates that we don't have access to 
maxRepetitionLevel here. 
   
   I guess if all histograms across all pages are the same size, and the 
[{rep,def}LevelHistogram.size() % pageCount != 
0](https://github.com/apache/parquet-java/blob/apache-parquet-1.15.0/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java#L664)
 check means the total size is always a multiple of pageCount, I could just 
divide the total histogram size by pageCount to get the size of each 
individual histogram (i.e. max level + 1)?
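
   Roughly what I have in mind, as a sketch only: `HistogramSplitter` and 
`splitByPage` are hypothetical names, not existing parquet-java APIs, and it 
assumes every page's histogram has the same length, so the per-page size can be 
derived from the flat list without knowing the max level:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of parquet-java.
final class HistogramSplitter {

  /**
   * Splits a flat, page-concatenated level histogram into one list per page.
   * Assumes all pages contribute the same number of entries (max level + 1).
   */
  public static List<List<Long>> splitByPage(List<Long> flatHistogram, int pageCount) {
    if (pageCount <= 0 || flatHistogram.size() % pageCount != 0) {
      throw new IllegalArgumentException(
          "histogram size " + flatHistogram.size() + " is not a multiple of pageCount " + pageCount);
    }
    // Derived per-page length; equals max level + 1 under the assumption above.
    int levelsPerPage = flatHistogram.size() / pageCount;
    List<List<Long>> perPage = new ArrayList<>(pageCount);
    for (int page = 0; page < pageCount; page++) {
      int start = page * levelsPerPage;
      // subList returns a view over the flat list, so no values are copied.
      perPage.add(flatHistogram.subList(start, start + levelsPerPage));
    }
    return perPage;
  }
}
```

   With that, `splitByPage(repLevelHistogram, getPageCount()).get(i)` would be 
the repetition-level histogram for page `i`, and the same call would work for 
the definition levels.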



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

