shangxinli commented on a change in pull request #1566:
URL: https://github.com/apache/iceberg/pull/1566#discussion_r507854338



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetReader.java
##########
@@ -130,13 +139,23 @@ private void advance() {
 
       PageReadStore pages;
       try {
-        pages = reader.readNextRowGroup();
+        // Because of the issue of PARQUET-1901, we cannot blindly call 
readNextFilteredRowGroup()
+        if (hasRecordFilter) {
+          pages = reader.readNextFilteredRowGroup();
+        } else {
+          pages = reader.readNextRowGroup();
+        }
       } catch (IOException e) {
         throw new RuntimeIOException(e);
       }
 
+      long blockRowCount = blocks.get(nextRowGroup).getRowCount();
+      Preconditions.checkState(blockRowCount >= pages.getRowCount(),
+              "Number of values in the block, %s, does not great or equal 
number of values after filtering, %s",
+              blockRowCount, pages.getRowCount());
       long rowPosition = rowGroupsStartRowPos[nextRowGroup];

Review comment:
       @shardulm94, after looking into the code deeper, I think there could be 
two solutions here. 1) Change Parquet to let readNextFilteredGroup() return not 
only the filtered pages but also the record counts it skipped. 2) In the 
ReadConf constructor, when deciding 'shouldRead', we add ColumnIndex filter. 
This will ensure if a row group is called by readNextFilteredGroup(), there 
will always be at least 1 page returned and no implicitly advance internally 
without Iceberg knowing it. Here is the place for determining the 'shouldRead' 
boolean variable. 
https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L106
   
   For option 1, we will need to wait for the new release of Parquet. I am not 
sure releasing 1.10.2 is an option for Parquet. If not, we have to wait for 
Parquet 1.12.0. For option 2, it requires a pretty significant change. And we 
have to make sure the implementation here is consistent with Parquet' 
ColumnIndex filter implementation. 
   
   Please share your thought and we can discuss. 
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to