Ivan Sadikov created PARQUET-2170:
-------------------------------------

             Summary: Empty projection returns the wrong number of rows when 
column index is enabled
                 Key: PARQUET-2170
                 URL: https://issues.apache.org/jira/browse/PARQUET-2170
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
            Reporter: Ivan Sadikov


Discovered in Spark: when reading an empty projection from a Parquet file 
with filter pushdown enabled (typically a filter followed by a count), parquet-mr 
returns the wrong number of rows if the column index is enabled. When the column index 
feature is disabled, the result is correct.

This happens due to the following:
 # ParquetFileReader::getFilteredRowCount() 
([https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L851])
 selects row ranges to calculate the row count when column index is enabled.
 # In ColumnIndexFilter 
([https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L80])
 we filter row ranges and pass the set of paths, which in this case is empty.
 # When evaluating the filter, if the column path is not in the set, we would 
return an empty list of rows 
([https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L178]),
 which is always the case for an empty projection.
 # This results in the incorrect number of records reported by the library.
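
The steps above can be illustrated with a small, self-contained toy model. This is only a sketch and not parquet-mr code: the class EmptyProjectionSketch, the method filteredRowCount, and the single filter column "id" are hypothetical names made up for illustration. It assumes a predicate of the form "id > threshold" and models the row-range calculation as a plain count.

```java
import java.util.Collections;
import java.util.Set;

// Toy model of the behaviour described in the list above.
// All names here are hypothetical; none of this is the parquet-mr API.
public class EmptyProjectionSketch {

    // The only column referenced by the (assumed) filter "id > threshold".
    static final String FILTER_COLUMN = "id";

    // Counts the rows that survive the filter, given the projected column paths.
    static long filteredRowCount(long[] idValues, long threshold, Set<String> projectedPaths) {
        if (!projectedPaths.contains(FILTER_COLUMN)) {
            // Models step 3: the filter column is not in the set of paths,
            // so an empty set of rows is returned. With an empty projection
            // this branch is always taken.
            return 0;
        }
        long count = 0;
        for (long value : idValues) {
            if (value > threshold) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long[] ids = {1, 2, 3, 4, 5};

        // Filter column is projected: rows 3, 4 and 5 match.
        System.out.println(filteredRowCount(ids, 2, Collections.singleton("id"))); // prints 3

        // Empty projection (filter + count): the same filter now reports no rows.
        System.out.println(filteredRowCount(ids, 2, Collections.<String>emptySet())); // prints 0
    }
}
```

In this model the second call shows the symptom: the data and the filter are unchanged, but the empty set of paths alone changes the reported row count.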

I will provide the full repro later.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
