[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

Gabor Szadovszky (Jira) Tue, 27 Oct 2020 02:01:02 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221260#comment-17221260
 ]


Gabor Szadovszky commented on PARQUET-1927:
-------------------------------------------

{quote}ParquetFileReader.getFilteredRecordCount() cannot be used because 
Iceberg applied RowGroup stats filter and Dictionary filter also.{quote}
I don't see why that is a problem. {{ParquetFileReader}} filters the row groups 
based on stats and dictionaries (and bloom filters from 1.12) in its constructor, 
so {{getFilteredRecordCount}} is executed on the already filtered row groups. I am 
curious why the currently available values are not suitable for Iceberg. 
The parquet-mr high-level API (the record readers) relies on these values, and if 
they are not correct for Iceberg, that might point to issues inside parquet-mr as 
well. (I don't think this is the case, though. We have a lot of unit tests at 
the different API levels.)
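
To illustrate the order of filtering described above, here is a minimal, self-contained sketch. It does not use parquet-mr; the {{RowGroup}} and {{Reader}} classes, their fields and the sample numbers are hypothetical stand-ins. Only the method names {{getRecordCount}} and {{getFilteredRecordCount}} mirror the real reader, and the model assumes that row-group-level filters run once in the constructor and that the filtered count is then summed over the surviving row groups only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FilteredCountSketch {
    // Hypothetical model of a row group (not a parquet-mr class).
    static final class RowGroup {
        final long rowCount;                  // rows stored in the row group
        final boolean passesRowGroupFilters;  // survives stats/dictionary filtering
        final long rowsMatchingColumnIndex;   // rows left after page-level filtering

        RowGroup(long rowCount, boolean passesRowGroupFilters, long rowsMatchingColumnIndex) {
            this.rowCount = rowCount;
            this.passesRowGroupFilters = passesRowGroupFilters;
            this.rowsMatchingColumnIndex = rowsMatchingColumnIndex;
        }
    }

    // Hypothetical reader: row groups are dropped once, in the constructor.
    static final class Reader {
        private final List<RowGroup> blocks = new ArrayList<>();

        Reader(List<RowGroup> allRowGroups) {
            for (RowGroup g : allRowGroups) {
                if (g.passesRowGroupFilters) {
                    blocks.add(g);
                }
            }
        }

        // Rows in the row groups that survived row-group-level filtering.
        long getRecordCount() {
            long total = 0;
            for (RowGroup g : blocks) {
                total += g.rowCount;
            }
            return total;
        }

        // Rows left after column index filtering, counted on surviving row groups only.
        long getFilteredRecordCount() {
            long total = 0;
            for (RowGroup g : blocks) {
                total += g.rowsMatchingColumnIndex;
            }
            return total;
        }
    }

    // Made-up sample data: one row group is dropped by row-group-level filters,
    // and one surviving row group has no rows matching the column index.
    static List<RowGroup> sampleFile() {
        return Arrays.asList(
            new RowGroup(100, true, 40),
            new RowGroup(50, false, 50),
            new RowGroup(200, true, 0),
            new RowGroup(150, true, 80));
    }

    public static void main(String[] args) {
        Reader reader = new Reader(sampleFile());
        System.out.println("rows in surviving row groups: " + reader.getRecordCount());
        System.out.println("rows after column index filtering: " + reader.getFilteredRecordCount());
    }
}
```

Under these assumptions the dropped row group contributes to neither count, and the filtered count (120 for the sample data) is the number of records a caller would actually receive, so it could serve as the total an iterator compares against.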

{quote}I think what we can do is to make getRowRanges() public.{quote}
I would rather not make this public. The {{RowRanges}} class is not designed 
for public use. If it is really necessary, I would rather expose the required 
values instead.

> ColumnIndex should provide number of records skipped 
> -----------------------------------------------------
>
>                 Key: PARQUET-1927
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1927
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.11.0
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found that we need to know from Parquet 
> how many records were skipped due to ColumnIndex filtering. When 
> rowCount is 0, readNextFilteredRowGroup() just advances to the next row group 
> without telling the caller. See the code here: 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> In Iceberg, Parquet records are read with an iterator. Its hasNext() 
> contains the following check:
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115]).
>  
> So without knowing the number of skipped values, it is hard to determine 
> what hasNext() should return. 
>  
> Currently, we can work around this by using a flag: when 
> readNextFilteredRowGroup() returns null, we consider the whole file done, 
> and hasNext() just returns false. 
>  
>  
>  
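
The flag-based workaround from the issue description can be sketched as follows. This is a self-contained illustration, not Iceberg or parquet-mr code: {{RowGroupSource}}, {{RecordIterator}} and {{countRecords}} are hypothetical names, and the source only imitates the described behavior of silently skipping row groups with no matching rows and returning null at the end of the file.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class SkippingIteratorSketch {
    // Hypothetical source of filtered row groups; each long[] holds the
    // records of one row group that remain after filtering.
    static final class RowGroupSource {
        private final Deque<long[]> groups = new ArrayDeque<>();

        RowGroupSource(long[]... rowGroups) {
            groups.addAll(Arrays.asList(rowGroups));
        }

        // Skips empty row groups without telling the caller; null means end of file.
        long[] nextFilteredRowGroup() {
            while (!groups.isEmpty()) {
                long[] group = groups.poll();
                if (group.length > 0) {
                    return group;
                }
            }
            return null;
        }
    }

    // Iterator that relies on a flag instead of counting skipped values.
    static final class RecordIterator implements Iterator<Long> {
        private final RowGroupSource source;
        private long[] current = new long[0];
        private int position = 0;
        private boolean exhausted = false; // set once the source returns null

        RecordIterator(RowGroupSource source) {
            this.source = source;
        }

        @Override
        public boolean hasNext() {
            advanceIfNeeded();
            return !exhausted;
        }

        @Override
        public Long next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return current[position++];
        }

        // Loads the next non-empty row group when the current one is used up.
        private void advanceIfNeeded() {
            if (exhausted || position < current.length) {
                return;
            }
            long[] next = source.nextFilteredRowGroup();
            if (next == null) {
                exhausted = true;
            } else {
                current = next;
                position = 0;
            }
        }
    }

    // Convenience helper: counts the records the iterator yields.
    static long countRecords(long[]... rowGroups) {
        RecordIterator it = new RecordIterator(new RowGroupSource(rowGroups));
        long count = 0;
        while (it.hasNext()) {
            it.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The middle row group is empty after filtering and is skipped silently.
        System.out.println(countRecords(new long[] {1, 2}, new long[0], new long[] {3}));
    }
}
```

With the flag, hasNext() no longer depends on how many values were skipped, at the cost of loading the next row group lazily inside hasNext().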



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
