[jira] [Created] (PARQUET-1927) ColumnIndex should provide number of records skipped

Xinli Shang (Jira) Sat, 17 Oct 2020 07:43:57 -0700

Xinli Shang created PARQUET-1927:
------------------------------------

             Summary: ColumnIndex should provide number of records skipped 
                 Key: PARQUET-1927
                 URL: https://issues.apache.org/jira/browse/PARQUET-1927
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-mr
    Affects Versions: 1.11.0
            Reporter: Xinli Shang
             Fix For: 1.12.0



When integrating the Parquet ColumnIndex, I found that we need Parquet to tell 
us how many records were skipped due to ColumnIndex filtering. When rowCount is 
0, readNextFilteredRowGroup() just advances to the next row group without 
telling the caller. See the code here: 
[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]

 

Iceberg reads Parquet records with an iterator. Its hasNext() contains the 
following check:

valuesRead + skippedValues < totalValues

See 
([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115]).
 

So without knowing the number of skipped values, it is hard for hasNext() to 
determine whether any records remain. 
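
To illustrate the problem, here is a minimal sketch of the check with made-up 
numbers. It assumes totalValues is the total row count of the file; the class 
name CounterCheckSketch and the row counts are hypothetical and not taken from 
Iceberg or parquet-mr.

```java
// Hypothetical model of the counter-based check quoted above.
final class CounterCheckSketch {

    // Same condition as the hasNext() snippet: more records are expected
    // as long as the rows read plus the rows known to be skipped fall
    // short of the total.
    static boolean hasNext(long valuesRead, long skippedValues, long totalValues) {
        return valuesRead + skippedValues < totalValues;
    }

    public static void main(String[] args) {
        // Assumed file: 3 row groups of 100 rows each. The filter removes
        // every row of the second group, so the reader moves past it
        // silently and the caller only ever sees 200 rows.
        long totalValues = 300;
        long valuesRead = 200;

        // The skipped rows are not reported: the check stays true even
        // though the file is exhausted.
        System.out.println(hasNext(valuesRead, 0, totalValues));

        // If the 100 skipped rows were reported, the check would end.
        System.out.println(hasNext(valuesRead, 100, totalValues));
    }
}
```

In this sketch the first call stays true forever, which is why the caller needs 
either the skipped count or some other end-of-file signal.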

 

Currently, we can work around this by using a flag. When 
readNextFilteredRowGroup() returns null, we consider the whole file done. Then 
hasNext() just returns false. 
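
A rough sketch of that flag-based workaround follows. It does not use the real 
parquet-mr or Iceberg classes: RowGroupSource, RecordIterator and sourceOf are 
hypothetical stand-ins, and a row group is modeled as a plain List. The only 
behavior taken from the description above is that readNextFilteredRowGroup() 
moves past empty row groups silently and returns null at the end of the file.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class FlagBasedIteratorSketch {

    // Hypothetical stand-in for the reader: returns the next non-empty
    // filtered row group, or null once the file is exhausted.
    interface RowGroupSource<T> {
        List<T> readNextFilteredRowGroup();
    }

    static final class RecordIterator<T> implements Iterator<T> {
        private final RowGroupSource<T> source;
        private Iterator<T> current = null;
        private boolean exhausted = false; // the flag from the workaround

        RecordIterator(RowGroupSource<T> source) {
            this.source = source;
        }

        @Override
        public boolean hasNext() {
            // No counters: fetch row groups until one has records or the
            // source signals end of file by returning null.
            while (!exhausted && (current == null || !current.hasNext())) {
                List<T> group = source.readNextFilteredRowGroup();
                if (group == null) {
                    exhausted = true;
                } else {
                    current = group.iterator();
                }
            }
            return !exhausted;
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return current.next();
        }
    }

    // Builds a source over already-filtered groups. Empty groups are
    // dropped without notifying the caller, mirroring the rowCount == 0
    // case described in this issue.
    static <T> RowGroupSource<T> sourceOf(List<List<T>> groups) {
        Deque<List<T>> pending = new ArrayDeque<>(groups);
        return () -> {
            while (!pending.isEmpty()) {
                List<T> group = pending.poll();
                if (!group.isEmpty()) {
                    return group;
                }
            }
            return null;
        };
    }
}
```

With this shape, hasNext() never depends on how many records were skipped, at 
the cost of reading the next row group eagerly inside hasNext().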

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
