[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

Xinli Shang (Jira) Tue, 20 Oct 2020 10:02:09 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217774#comment-17217774
 ]


Xinli Shang commented on PARQUET-1927:
--------------------------------------

That is correct [~gszadovszky]! We need a finer-grained filter count in row 
group level to let the iterator to use. Do you think it makes sense that we add 
the API for that? 

If yes, do you think we can release the 1.11.2 version? I see usually no more 
release after 1.xx.1. 

 

> ColumnIndex should provide number of records skipped 
> -----------------------------------------------------
>
>                 Key: PARQUET-1927
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1927
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.11.0
>            Reporter: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found we need to know from Parquet 
> that how many records that we skipped due to ColumnIndex filtering. When 
> rowCount is 0, readNextFilteredRowGroup() just advance to next without 
> telling the caller. See code here 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> In Iceberg, it reads Parquet record with an iterator. The hasNext() has the 
> following code():
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).]
>  
> So without knowing the skipped values, it is hard to determine hasNext() or 
> not. 
>  
> Currently, we can workaround by using a flag. When readNextFilteredRowGroup() 
> returns null, we consider it is done for the whole file. Then hasNext() just 
> retrun false. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

Reply via email to