[ 
https://issues.apache.org/jira/browse/IMPALA-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer resolved IMPALA-14123.
--------------------------------------
    Fix Version/s: Impala 5.0.0
       Resolution: Fixed

> Allow Iceberg predicate push down on non-partition columns
> ----------------------------------------------------------
>
>                 Key: IMPALA-14123
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14123
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Csaba Ringhofer
>            Assignee: Csaba Ringhofer
>            Priority: Major
>              Labels: iceberg, impala-iceberg, performance
>             Fix For: Impala 5.0.0
>
>
> Since IMPALA-11591, Impala calls Iceberg's planFiles() only when at least 
> one predicate can be pushed down for a partition column:
> https://github.com/apache/impala/blob/006f9ba589cc7d104bc81927ebba647cafc0d39b/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java#L828
> This is a useful optimization in general: if Iceberg can't skip many files 
> based on stats, calling planFiles() makes planning slower without any 
> benefit. In some cases, though, many files could be skipped during planning, 
> and doing so would make the query much faster. Even without it, Impala would 
> usually not read the whole data file, as Parquet min/max filtering would 
> drop row groups based on the footer, but knowing at planning time which 
> files can be skipped would lead to a more efficient plan.
> Some cases where planning-time stat filtering could be useful:
> 1. sorting columns
> 2. columns with some hidden relation to sorting or partition columns
> 3. cases where Impala's scanner can't do file-level stat filtering - the 
> obvious examples are Avro files, but Parquet/ORC min/max filtering also has 
> limitations
> Some examples for 2 ("hidden partitioning/clustering"):
> a. table is partitioned by country of a point, filter is on x/y coordinates
> b. table is sorted by send_date, but filtered by receive_date - the two are 
> likely to correlate
>  
> My idea is to provide a way to configure how Impala behaves (with the 
> exception of explicit sorting columns, where this info is already available):
> 1. create a query option that enables calling planFiles() in all cases - this 
> is also useful to make Impala more consistent with other engines; for 
> example, when benchmarking, running Impala with both values can give useful 
> insights
> 2. add a table property that lists the columns where predicates should be 
> pushed down to Iceberg - this could also be computed automatically by a tool 
> that checks overlaps among file min/max ranges
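
A minimal sketch of how the two proposed knobs could combine with the existing partition-column check, in Java since the frontend is Java. All names here (PlanFilesDecision, shouldCallPlanFiles, the alwaysPlanFiles flag, the pushdownColumns set) are hypothetical and not taken from Impala's code base. The real check lives in IcebergScanPlanner and may be structured quite differently.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Impala.
final class PlanFilesDecision {
  private PlanFilesDecision() {}

  /**
   * Decides whether planning should call Iceberg's planFiles().
   *
   * @param predicateColumns columns referenced by predicates that could be
   *                         pushed down to Iceberg
   * @param partitionColumns partition columns (the existing condition)
   * @param sortColumns      explicit sorting columns, assumed to be known
   *                         from table metadata
   * @param pushdownColumns  columns listed in the assumed table property
   * @param alwaysPlanFiles  value of the assumed query option
   */
  static boolean shouldCallPlanFiles(
      List<String> predicateColumns,
      Set<String> partitionColumns,
      Set<String> sortColumns,
      Set<String> pushdownColumns,
      boolean alwaysPlanFiles) {
    // Proposal 1: the query option forces planFiles() in all cases.
    if (alwaysPlanFiles) return true;
    for (String col : predicateColumns) {
      // Existing behavior: a predicate on a partition column qualifies.
      if (partitionColumns.contains(col)) return true;
      // Sorting columns qualify without any extra configuration.
      if (sortColumns.contains(col)) return true;
      // Proposal 2: columns explicitly listed in the table property.
      if (pushdownColumns.contains(col)) return true;
    }
    // No qualifying predicate: skip planFiles() to keep planning fast.
    return false;
  }
}
```

With example b above, a table sorted by send_date that lists receive_date in the table property would take the planFiles() path for a filter on receive_date, while a filter on an unrelated, unlisted column would not unless the query option is set.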



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
