[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

Xinli Shang (Jira) Wed, 21 Oct 2020 07:50:41 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218325#comment-17218325
 ]


Xinli Shang commented on PARQUET-1927:
--------------------------------------

The workaround I can think of is to apply ColumnIndex to each row group, 
something like (columnIndex, rowGroup) => recordCount, before calling 
readNextFilteredRowGroup() in Iceberg. If recordCount is 0, we skip calling 
readNextFilteredRowGroup() for that row group. This ensures that 
readNextFilteredRowGroup() never advances to the next row group without 
Iceberg's knowledge. But this workaround has several issues: 1) It is not 
trivial to implement, because we would need to evaluate every type of filter 
against the ColumnIndex, which largely duplicates the implementation in 
Parquet. 2) The two implementations (in Parquet and in Iceberg) have to stay 
consistent. If either one has a bug, Iceberg can end up in an unknown state. 
3) It requires other adopters, such as Hive and Spark, to write their own 
implementations too.
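
A minimal sketch of the pre-check described above, written without any 
Parquet dependency. RowGroup, Plan, and plan() are hypothetical names for 
illustration only, and the matchingRecords function stands in for the 
(columnIndex, rowGroup) => recordCount evaluation, which is the part that 
would have to duplicate Parquet's filter logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

public class RowGroupPrecheck {
    /** Hypothetical stand-in for row group metadata; not a Parquet class. */
    public static final class RowGroup {
        public final int index;
        public final long totalRecords;

        public RowGroup(int index, long totalRecords) {
            this.index = index;
            this.totalRecords = totalRecords;
        }
    }

    /** Outcome of the pre-check: row groups worth reading, plus records filtered out. */
    public static final class Plan {
        public final List<Integer> rowGroupsToRead = new ArrayList<>();
        public long skippedRecords = 0;
    }

    /**
     * Evaluates the filter against each row group up front.
     * matchingRecords is an assumed callback returning how many records
     * of a row group survive ColumnIndex filtering.
     */
    public static Plan plan(List<RowGroup> rowGroups,
                            ToLongFunction<RowGroup> matchingRecords) {
        Plan plan = new Plan();
        for (RowGroup rg : rowGroups) {
            long matching = matchingRecords.applyAsLong(rg);
            // Everything that does not match counts as skipped.
            plan.skippedRecords += rg.totalRecords - matching;
            if (matching > 0) {
                // Only these row groups would be read with the filtered reader.
                plan.rowGroupsToRead.add(rg.index);
            }
        }
        return plan;
    }
}
```

With such a plan the caller decides for each row group whether to read or 
skip it, so the reader never advances past an empty row group on its own.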

This is not a regression because ColumnIndex is a new feature in 1.11.x. But I 
think releasing 1.11.2 would be better, because it helps the adoption of 
1.11.x: ColumnIndex is one of the major features in 1.11.x. 

 

> ColumnIndex should provide number of records skipped 
> -----------------------------------------------------
>
>                 Key: PARQUET-1927
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1927
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.11.0
>            Reporter: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found that we need Parquet to tell 
> us how many records were skipped due to ColumnIndex filtering. When 
> rowCount is 0, readNextFilteredRowGroup() just advances to the next row 
> group without telling the caller. See the code here 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> Iceberg reads Parquet records with an iterator. Its hasNext() contains the 
> following condition:
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115]).
>  
> So without knowing the number of skipped values, it is hard to decide what 
> hasNext() should return. 
>  
> Currently, we can work around this with a flag. When 
> readNextFilteredRowGroup() returns null, we consider the whole file done, 
> and hasNext() just returns false. 
>  
>  
>  
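
The flag-based workaround from the quoted description could look roughly 
like the sketch below. It has no Parquet dependency: FlaggedRecordIterator 
is a hypothetical name, and the Supplier stands in for 
readNextFilteredRowGroup(), yielding the record count of the next non-empty 
row group, or null once the file is finished.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

public class FlaggedRecordIterator implements Iterator<Long> {
    // Assumed stand-in for the reader: record count of the next row group,
    // or null when there are no more row groups.
    private final Supplier<Long> nextRowGroupSize;
    private long remainingInGroup = 0;
    private long valuesRead = 0;
    private boolean exhausted = false; // the "done" flag

    public FlaggedRecordIterator(Supplier<Long> nextRowGroupSize) {
        this.nextRowGroupSize = nextRowGroupSize;
    }

    @Override
    public boolean hasNext() {
        // No skipped-record arithmetic: rely on the null return instead.
        while (remainingInGroup == 0 && !exhausted) {
            Long size = nextRowGroupSize.get();
            if (size == null) {
                exhausted = true;
            } else {
                remainingInGroup = size;
            }
        }
        return remainingInGroup > 0;
    }

    @Override
    public Long next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        remainingInGroup--;
        // Returns the ordinal of the record as a placeholder for real data.
        return valuesRead++;
    }
}
```

This avoids the valuesRead + skippedValues < totalValues comparison, so the 
iterator terminates correctly even when the reader silently skips row 
groups. It still cannot report how many records were skipped.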



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
