shangxinli commented on a change in pull request #1566:
URL: https://github.com/apache/iceberg/pull/1566#discussion_r507854338
##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetReader.java
##########
@@ -130,13 +139,23 @@ private void advance() {
PageReadStore pages;
try {
- pages = reader.readNextRowGroup();
+ // Because of the issue of PARQUET-1901, we cannot blindly call
readNextFilteredRowGroup()
+ if (hasRecordFilter) {
+ pages = reader.readNextFilteredRowGroup();
+ } else {
+ pages = reader.readNextRowGroup();
+ }
} catch (IOException e) {
throw new RuntimeIOException(e);
}
+ long blockRowCount = blocks.get(nextRowGroup).getRowCount();
+ Preconditions.checkState(blockRowCount >= pages.getRowCount(),
+ "Number of values in the block, %s, does not great or equal
number of values after filtering, %s",
+ blockRowCount, pages.getRowCount());
long rowPosition = rowGroupsStartRowPos[nextRowGroup];
Review comment:
@shardulm94, after looking into the code deeper, I think there could be
two solutions here. 1) Change Parquet to let readNextFilteredGroup() return not
only the filtered pages but also the record counts it skipped. 2) In the
ReadConf constructor, when deciding 'shouldRead', we add ColumnIndex filter.
This will ensure if a row group is called by readNextFilteredGroup(), there
will always be at least 1 page returned and no implicitly advance internally
without Iceberg knowing it. Here is the place for determining the 'shouldRead'
boolean variable.
https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L106
For option 1, we will need to wait for the new release of Parquet. I am not
sure releasing 1.10.2 is an option for Parquet. If not, we have to wait for
Parquet 1.12.0. For option 2, it requires a pretty significant change. And we
have to make sure the implementation here is consistent with Parquet'
ColumnIndex filter implementation.
Please share your thought and we can discuss.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]