[GitHub] [iceberg] shardulm94 commented on a change in pull request #1566: Parquet: Support Page Skipping in Iceberg Parquet Reader

GitBox Thu, 15 Oct 2020 14:25:25 -0700


shardulm94 commented on a change in pull request #1566:
URL: https://github.com/apache/iceberg/pull/1566#discussion_r505861612




##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetReader.java
##########
@@ -130,13 +139,23 @@ private void advance() {
 
       PageReadStore pages;
       try {
-        pages = reader.readNextRowGroup();
+        // Because of the issue of PARQUET-1901, we cannot blindly call 
readNextFilteredRowGroup()
+        if (hasRecordFilter) {
+          pages = reader.readNextFilteredRowGroup();
+        } else {
+          pages = reader.readNextRowGroup();
+        }
       } catch (IOException e) {
         throw new RuntimeIOException(e);
       }
 
+      long blockRowCount = blocks.get(nextRowGroup).getRowCount();
+      Preconditions.checkState(blockRowCount >= pages.getRowCount(),
+              "Number of values in the block, %s, does not great or equal 
number of values after filtering, %s",
+              blockRowCount, pages.getRowCount());
       long rowPosition = rowGroupsStartRowPos[nextRowGroup];

Review comment:
       After reading more about Parquet and page skipping, I don't think my 
comment above holds true in the general case. Seems like `nextRowGroup` counter 
should still accurately depict the current row group being processed. The only 
edge case that I see is in 
https://github.com/apache/parquet-mr/blob/fe9ab5d22c602776f44b027ad3cc05609eddba54/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L967
 where a whole row group may be skipped if the column index allows all the 
pages in the row group to be skipped. In such a case, 
`readNextFilteredRowGroup` will automatically advance to the next block without 
our knowledge and hence without the counter being updated. This will be an 
issue for all further row group skipping and row postition calculations in 
Iceberg's ParquetReader. However, I am not sure if there can be such a case 
where Iceberg's row group skipping is not able to skip the row group, but 
Parquet's page skipping ends put skipping all page and consequ
 ently the whole row group. 

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetReader.java
##########
@@ -130,13 +139,23 @@ private void advance() {
 
       PageReadStore pages;
       try {
-        pages = reader.readNextRowGroup();
+        // Because of the issue of PARQUET-1901, we cannot blindly call 
readNextFilteredRowGroup()
+        if (hasRecordFilter) {
+          pages = reader.readNextFilteredRowGroup();

Review comment:
       I read more into the code and I think I understand the benefit here. So 
a Parquet row group is made up of multiple pages, previously we were only 
skipping unmatched row groups, now we can skip unmatched row groups and 
unmatched pages. Let me know if this seems correct.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] shardulm94 commented on a change in pull request #1566: Parquet: Support Page Skipping in Iceberg Parquet Reader

Reply via email to