[ https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218176#comment-17218176 ]

Gabor Szadovszky commented on PARQUET-1927:
-------------------------------------------

I think it is fine to extend the current API if one of our clients requires it. 
Also, Iceberg might be the first client that already uses 1.11. This extension 
might be useful for others (e.g. Hive, Spark).

Releasing 1.11.2 depends on the issues we would like to fix in it. If they are 
regressions introduced in 1.11 and they are severe, we clearly would like to 
release the fix in a maintenance release. So the question is whether this issue 
is severe enough and lacks a proper workaround.

> ColumnIndex should provide number of records skipped 
> -----------------------------------------------------
>
>                 Key: PARQUET-1927
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1927
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.11.0
>            Reporter: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found we need to know from Parquet 
> how many records were skipped due to ColumnIndex filtering. When rowCount is 
> 0, readNextFilteredRowGroup() just advances to the next row group without 
> telling the caller. See the code here: 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> In Iceberg, Parquet records are read with an iterator. Its hasNext() uses the 
> following check:
> valuesRead + skippedValues < totalValues
> See 
> [https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115]
>  
> So without knowing the number of skipped values, hasNext() cannot determine 
> whether more records remain. 
>  
> Currently, we can work around this with a flag: when 
> readNextFilteredRowGroup() returns null, we consider the whole file done, and 
> hasNext() simply returns false. 
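The flag-based workaround described in the quoted issue can be sketched roughly
as follows. This is a minimal, self-contained Java model, not the actual Iceberg
or parquet-mr code: FilteredReadIterator and the long[] of row counts are
hypothetical stand-ins for ParquetFileReader.readNextFilteredRowGroup(), which
returns null once the file is exhausted.

```java
import java.util.Arrays;
import java.util.Iterator;

// Models an Iceberg-style record iterator that cannot rely on
// valuesRead + skippedValues < totalValues, because skipped counts are
// unknown. Instead, a "finished" flag is set when the (simulated)
// readNextFilteredRowGroup() returns null.
class FilteredReadIterator implements Iterator<Long> {
    private final Iterator<Long> rowGroups; // stand-in for readNextFilteredRowGroup()
    private Long current;                   // the row group fetched but not yet returned
    private boolean finished = false;       // the workaround flag

    FilteredReadIterator(long... rowGroupRowCounts) {
        this.rowGroups = Arrays.stream(rowGroupRowCounts).boxed().iterator();
    }

    @Override
    public boolean hasNext() {
        if (finished) {
            return false;
        }
        if (current == null) {
            // A null result means readNextFilteredRowGroup() ran out of row
            // groups: treat the whole file as done instead of comparing counts.
            current = rowGroups.hasNext() ? rowGroups.next() : null;
            if (current == null) {
                finished = true;
            }
        }
        return !finished;
    }

    @Override
    public Long next() {
        hasNext(); // prime the next row group if needed
        Long result = current;
        current = null;
        return result;
    }
}
```

The point of the sketch is only the termination logic: null from the reader is
the sole end-of-file signal, so no skipped-record count is needed.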



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
