[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

Gabor Szadovszky (Jira) Mon, 26 Oct 2020 02:52:01 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220590#comment-17220590
 ]


Gabor Szadovszky commented on PARQUET-1927:
-------------------------------------------

[[email protected]], sorry to keep bothering you with my ideas, but it seems I 
still do not get the concept.

As far as I understand, Iceberg keeps reading rows until it reaches the 
total number of rows in the row group or the file (not sure which one). The 
number of (filtered) rows is available for both the row group 
([PageReadStore.getRowCount()|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/page/PageReadStore.java#L44])
 and the whole file 
([ParquetFileReader.getFilteredRecordCount()|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L829]).
 I am not sure why you try to align the number of rows already read with the 
number of filtered rows instead of using the proper number for the total. 
(Instead of {{valuesRead + skippedValues < totalValues}}, you may use 
{{valuesRead < totalFilteredValues}}.)
Of course, if you have to use the total number of (filtered) rows in the file, 
you have to calculate the filtering for all row groups before starting to read 
any value, but you have to do that anyway, so I don't think it should be a 
problem.
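
To make the suggested condition concrete, here is a minimal sketch of an 
iterator whose hasNext() compares the rows read against the filtered total 
only. The class name, the constructor argument and the row indices it returns 
are hypothetical; this is not Iceberg's or parquet-mr's actual code. The 
per-row-group counts stand in for the values that 
PageReadStore.getRowCount() is described as providing above, and their sum 
plays the role of totalFilteredValues.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a record iterator; it only models the counting logic.
final class FilteredCountIterator implements Iterator<Long> {
  // Total number of rows that survive filtering (assumed to be known up front).
  private final long totalFilteredValues;
  // Number of rows handed out so far.
  private long valuesRead = 0;

  // Each list entry is the filtered row count of one row group; an entry may be 0
  // when filtering drops the whole row group.
  FilteredCountIterator(List<Long> filteredRowCountsPerRowGroup) {
    long sum = 0;
    for (long count : filteredRowCountsPerRowGroup) {
      sum += count;
    }
    this.totalFilteredValues = sum;
  }

  @Override
  public boolean hasNext() {
    // No skippedValues term: skipped rows are never part of the filtered total.
    return valuesRead < totalFilteredValues;
  }

  @Override
  public Long next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    // A real reader would return a record here; the sketch returns the row index.
    return valuesRead++;
  }
}
```

Under these assumptions, a row group whose filtered count is 0 contributes 
nothing to the total, so the caller does not need to be told how many rows 
were skipped.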

Meanwhile, if you think the API change is required, I am happy to review the 
related PR.

> ColumnIndex should provide number of records skipped 
> -----------------------------------------------------
>
>                 Key: PARQUET-1927
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1927
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.11.0
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found that we need to know from 
> Parquet how many records were skipped due to ColumnIndex filtering. When 
> rowCount is 0, readNextFilteredRowGroup() just advances to the next row 
> group without telling the caller. See the code here: 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> Iceberg reads Parquet records with an iterator. Its hasNext() has the 
> following code:
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115]).
>  
> So without knowing the number of skipped values, it is hard to determine 
> whether hasNext() should return true. 
>  
> Currently, we can work around this by using a flag. When 
> readNextFilteredRowGroup() returns null, we consider the whole file done. 
> Then hasNext() just returns false. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
