[
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217525#comment-17217525
]
Gabor Szadovszky commented on PARQUET-1927:
-------------------------------------------
I get it now. Thanks for explaining.
I guess you already know about
[ParquetFileReader.getRecordCount()|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L821]
and
[ParquetFileReader.getFilteredRecordCount()|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L829].
These values cover the whole file rather than the current row group, so they
might not be suitable for Iceberg, but these are the ones parquet-mr uses at
higher levels.
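
As a rough illustration of how these two getters could yield a file-level skipped count, here is a minimal sketch. RecordCounts is a hypothetical stand-in interface (not part of parquet-mr) that mirrors the two ParquetFileReader method names so the example compiles on its own. It assumes getFilteredRecordCount() reports the rows that remain after ColumnIndex filtering.

```java
// Hypothetical stand-in for the two ParquetFileReader getters named above,
// so this sketch compiles without parquet-mr on the classpath.
interface RecordCounts {
    long getRecordCount();          // total rows in the row groups being read
    long getFilteredRecordCount();  // rows remaining after ColumnIndex filtering (assumed)
}

final class SkippedRecords {
    private SkippedRecords() {}

    // File-level number of rows dropped by ColumnIndex filtering.
    static long skippedInFile(RecordCounts counts) {
        return counts.getRecordCount() - counts.getFilteredRecordCount();
    }
}
```

This gives one number for the whole file, so it does not say how many records were skipped in any particular row group.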
> ColumnIndex should provide number of records skipped
> -----------------------------------------------------
>
> Key: PARQUET-1927
> URL: https://issues.apache.org/jira/browse/PARQUET-1927
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.11.0
> Reporter: Xinli Shang
> Priority: Major
> Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found that we need Parquet to tell us
> how many records were skipped due to ColumnIndex filtering. When rowCount is
> 0, readNextFilteredRowGroup() just advances to the next row group without
> telling the caller. See the code here:
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>
> Iceberg reads Parquet records with an iterator. Its hasNext() contains the
> following check:
> valuesRead + skippedValues < totalValues
> See
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115]).
>
> So without knowing the number of skipped values, it is hard to decide whether
> hasNext() should return true.
>
> Currently, we can work around this with a flag: when
> readNextFilteredRowGroup() returns null, we consider the whole file done, and
> hasNext() just returns false.
>
>
>
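
The flag-based workaround described in the issue could look roughly like the sketch below. It is not Iceberg's or parquet-mr's actual code. FilteredRowGroupSource and FlagBasedRecordIterator are hypothetical names, and a row group is modelled as a plain List. The only behaviour taken from the issue is that the read call returns null once the file is exhausted and silently passes over row groups that were filtered out entirely.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for the reader: returns the next filtered row group,
// or null once the whole file has been consumed.
interface FilteredRowGroupSource<T> {
    List<T> readNextFilteredRowGroup();
}

// Iterator that relies on an "exhausted" flag instead of comparing
// valuesRead + skippedValues against totalValues.
final class FlagBasedRecordIterator<T> implements Iterator<T> {
    private final FilteredRowGroupSource<T> source;
    private Iterator<T> current = Collections.emptyIterator();
    private boolean exhausted = false;

    FlagBasedRecordIterator(FilteredRowGroupSource<T> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        // Pull row groups until one yields a record or the source reports the end.
        while (!current.hasNext() && !exhausted) {
            List<T> rowGroup = source.readNextFilteredRowGroup();
            if (rowGroup == null) {
                exhausted = true;  // null means the whole file is done
            } else {
                current = rowGroup.iterator();
            }
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

With this shape the iterator never needs a skipped-record count, at the cost of not being able to report how many records the filter removed.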