shardulm94 commented on a change in pull request #1566:
URL: https://github.com/apache/iceberg/pull/1566#discussion_r505914331



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetReader.java
##########
@@ -130,13 +139,23 @@ private void advance() {
 
       PageReadStore pages;
       try {
-        pages = reader.readNextRowGroup();
+        // Because of the issue of PARQUET-1901, we cannot blindly call readNextFilteredRowGroup()
+        if (hasRecordFilter) {
+          pages = reader.readNextFilteredRowGroup();
+        } else {
+          pages = reader.readNextRowGroup();
+        }
       } catch (IOException e) {
         throw new RuntimeIOException(e);
       }
 
+      long blockRowCount = blocks.get(nextRowGroup).getRowCount();
      Preconditions.checkState(blockRowCount >= pages.getRowCount(),
              "Number of values in the block, %s, is not greater than or equal to the number of values after filtering, %s",
              blockRowCount, pages.getRowCount());
       long rowPosition = rowGroupsStartRowPos[nextRowGroup];

Review comment:
       This edge case is probably possible with a carefully crafted file, e.g. consider the min/max values of a single column:
   Row Group: `[1, 1000]`
   Pages: `[[1, 10], [900, 1000]]`
   The expression `column = 500` matches the row group but none of its pages, so `readNextFilteredRowGroup()` automatically reads the next row group without the `nextRowGroup` counter being updated.
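
   To make this concrete, here is a minimal sketch of the loop shape the concern applies to, written against plain parquet-mr APIs (`ParquetFileReader`, `PageReadStore`, `BlockMetaData`). It is illustrative only, not the PR's actual code; the class, method name, and exception are made up for the sketch:

```java
import java.io.IOException;
import java.util.List;

import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;

class FilteredRowGroupScanSketch {

  // Illustrative only: shows how readNextFilteredRowGroup() can get ahead of a
  // caller-maintained row-group counter when page-level filtering drops every page.
  static void scan(ParquetFileReader reader, boolean hasRecordFilter) throws IOException {
    List<BlockMetaData> blocks = reader.getRowGroups();  // row groups in file order
    int nextRowGroup = 0;                                // caller-side counter

    while (nextRowGroup < blocks.size()) {
      // If the row-group stats match the filter but no page does (e.g. the [1, 1000] /
      // [[1, 10], [900, 1000]] example above), readNextFilteredRowGroup() moves on
      // internally and returns the NEXT row group's pages; nextRowGroup stays put.
      PageReadStore pages = hasRecordFilter
          ? reader.readNextFilteredRowGroup()
          : reader.readNextRowGroup();

      if (pages == null) {
        return; // no more row groups
      }

      long expectedRows = blocks.get(nextRowGroup).getRowCount();
      if (pages.getRowCount() > expectedRows) {
        // Mirrors the precondition in the diff. Note it only detects the skip when the
        // group the reader silently moved to yields MORE rows than the expected block.
        throw new IllegalStateException(String.format(
            "Row count after filtering (%d) exceeds row count of row group %d (%d)",
            pages.getRowCount(), nextRowGroup, expectedRows));
      }

      // ... consume `pages` as the contents of row group `nextRowGroup` ...
      nextRowGroup += 1;
    }
  }
}
```

   In other words, the row-count check catches the skip only when the group the reader moved to yields more rows than the expected block, so positions derived from `rowGroupsStartRowPos[nextRowGroup]` could still silently go out of sync.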



