[ 
https://issues.apache.org/jira/browse/PARQUET-2170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575562#comment-17575562
 ] 

Ivan Sadikov commented on PARQUET-2170:
---------------------------------------

I will update the description later, and I would like to open a PR to fix the 
issue. I think we just need to check whether the column set is empty when 
checking paths in ColumnIndexFilter, but I need to confirm this.
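
As a rough illustration of that idea, here is a simplified model. It is not 
the actual ColumnIndexFilter code: the class, method, and column names are 
made up, and the candidate rows are hard-coded. Treating an empty path set as 
"evaluate the filter anyway" is an assumption that still needs confirming.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of the path check; all names here are hypothetical.
public class EmptyProjectionSketch {

    // Stand-in for column index evaluation: for each filter column,
    // the row indexes that the index cannot rule out.
    static final Map<String, List<Integer>> CANDIDATE_ROWS =
        Map.of("id", List.of(2, 3, 7));

    // Behaviour described in the issue: a filter column that is not in
    // the set of projected paths yields no rows at all.
    static List<Integer> rowsAsReported(String filterColumn, Set<String> paths) {
        if (!paths.contains(filterColumn)) {
            return List.of();
        }
        return CANDIDATE_ROWS.get(filterColumn);
    }

    // Assumed guard: only treat the column as missing when the path set
    // is non-empty, so an empty projection still evaluates the filter.
    static List<Integer> rowsWithGuard(String filterColumn, Set<String> paths) {
        if (!paths.isEmpty() && !paths.contains(filterColumn)) {
            return List.of();
        }
        return CANDIDATE_ROWS.get(filterColumn);
    }

    public static void main(String[] args) {
        Set<String> emptyProjection = Set.of();
        // Without the guard the filter + count case sees 0 rows.
        System.out.println(rowsAsReported("id", emptyProjection).size()); // 0
        // With the guard the three candidate rows are kept.
        System.out.println(rowsWithGuard("id", emptyProjection).size());  // 3
    }
}
```

In this model a non-empty projection that omits the filter column behaves the 
same with and without the guard; only the empty-projection case changes.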

> Empty projection returns the wrong number of rows when column index is enabled
> ------------------------------------------------------------------------------
>
>                 Key: PARQUET-2170
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2170
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>            Reporter: Ivan Sadikov
>            Priority: Major
>
> Discovered in Spark: when returning an empty projection from a Parquet file 
> with filter pushdown enabled (typically a filter followed by a count), 
> parquet-mr returns the wrong number of rows if the column index is enabled. 
> When the column index feature is disabled, the result is correct.
>  
> This happens due to the following:
>  # ParquetFileReader::getFilteredRowCount() 
> ([https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L851])
>  selects row ranges to calculate the row count when column index is enabled.
>  # In ColumnIndexFilter 
> ([https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L80])
>  we filter row ranges and pass the set of paths, which in this case is empty.
>  # When evaluating the filter, if the column path is not in the set, we 
> return an empty list of rows 
> ([https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L178]),
>  which is always the case for an empty projection.
>  # This results in the library reporting an incorrect number of records.
>
> I will provide the full repro later.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
