[GitHub] [iceberg] shangxinli commented on a change in pull request #1566: Parquet: Support Page Skipping in Iceberg Parquet Reader

GitBox Wed, 28 Oct 2020 07:31:24 -0700


shangxinli commented on a change in pull request #1566:
URL: https://github.com/apache/iceberg/pull/1566#discussion_r513490614




##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetReader.java
##########
@@ -130,13 +139,23 @@ private void advance() {
 
       PageReadStore pages;
       try {
-        pages = reader.readNextRowGroup();
+        // Because of the issue of PARQUET-1901, we cannot blindly call readNextFilteredRowGroup()
+        if (hasRecordFilter) {
+          pages = reader.readNextFilteredRowGroup();
+        } else {
+          pages = reader.readNextRowGroup();
+        }
       } catch (IOException e) {
         throw new RuntimeIOException(e);
       }
 
+      long blockRowCount = blocks.get(nextRowGroup).getRowCount();
+      Preconditions.checkState(blockRowCount >= pages.getRowCount(),
+              "Number of values in the block, %s, is not greater than or equal to the number of values after filtering, %s",
+              blockRowCount, pages.getRowCount());
       long rowPosition = rowGroupsStartRowPos[nextRowGroup];

Review comment:
       @shardulm94, I talked to the Parquet community (PARQUET-1927) and it
seems we don't need a Parquet change anymore. What we can do is pass the filter
to the ParquetFileReader constructor, which applies page-level stats, row group
stats, dictionary and, in the future, bloom filter filtering. All subsequent
calls in ReadConf, such as reader.getRowGroups() and
reader.getFilteredRecordCount(), then return filtered values. This also
simplifies the ReadConf constructor and ParquetReader/VectorizedParquetReader.
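
To illustrate the idea, here is a minimal, self-contained sketch. It is not
parquet-mr code: RowGroupSketch, FilteredReaderSketch and getRecordCount() are
hypothetical stand-ins, and the filter here works only on row group min/max
stats (no page-level, dictionary or bloom filter pruning). It only shows why
filtering once in the constructor means every later call already sees filtered
values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for row group metadata: a row count plus min/max stats of one column.
final class RowGroupSketch {
  final long rowCount;
  final int min;
  final int max;

  RowGroupSketch(long rowCount, int min, int max) {
    this.rowCount = rowCount;
    this.min = min;
    this.max = max;
  }
}

// Hypothetical stand-in for a file reader that receives its filter at construction time.
final class FilteredReaderSketch {
  private final List<RowGroupSketch> rowGroups;

  FilteredReaderSketch(List<RowGroupSketch> allRowGroups, Predicate<RowGroupSketch> mightMatch) {
    // Apply the filter exactly once, up front; row groups that cannot match are dropped here.
    List<RowGroupSketch> kept = new ArrayList<>();
    for (RowGroupSketch group : allRowGroups) {
      if (mightMatch.test(group)) {
        kept.add(group);
      }
    }
    this.rowGroups = Collections.unmodifiableList(kept);
  }

  // Callers never see the unfiltered list, so they need no filtering logic of their own.
  List<RowGroupSketch> getRowGroups() {
    return rowGroups;
  }

  // Row count over the surviving row groups only.
  long getRecordCount() {
    long total = 0;
    for (RowGroupSketch group : rowGroups) {
      total += group.rowCount;
    }
    return total;
  }

  public static void main(String[] args) {
    List<RowGroupSketch> all = Arrays.asList(
        new RowGroupSketch(100, 0, 9),
        new RowGroupSketch(200, 10, 19),
        new RowGroupSketch(300, 20, 29));
    int wanted = 15; // stands in for a predicate such as "col = 15"
    FilteredReaderSketch reader =
        new FilteredReaderSketch(all, g -> g.min <= wanted && wanted <= g.max);
    System.out.println(reader.getRowGroups().size() + " row group(s), "
        + reader.getRecordCount() + " rows");
  }
}
```

With this shape, code like ReadConf would only ask the reader for its row
groups and counts, and the hasRecordFilter branch in advance() would not be
needed in the caller.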




