[ 
https://issues.apache.org/jira/browse/IMPALA-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17986201#comment-17986201
 ] 

ASF subversion and git services commented on IMPALA-14123:
----------------------------------------------------------

Commit 1d640905912944ea05deaa3453cb6a85013b2e54 in impala's branch 
refs/heads/master from Csaba Ringhofer
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1d6409059 ]

IMPALA-14123: Allow forcing predicate push down to Iceberg

Since IMPALA-11591 Impala tries to avoid  pushing down predicates to
Iceberg unless it is necessary (timetravel) or is likely to be useful
(at least 1 partition column is involved in predicates). While this
makes planning faster, it may miss opportunities to skip files during
planning.

This patch adds table property impala.iceberg.push_down_hint that
expects a comma separated list of column names and leads to push
down to Iceberg when there is a predicate on any of these columns.
Users can set this manually, while in the future Impala or other tools
may be able to set it automatically, e.g. during COMPUTE STATS if
there are many files with non-overlapping min/max stats for a given
column.

Note that in most cases when Iceberg can skip files the Parquet/ORC
scanner would also skip most of the data based on stat filtering. The
benefit of doing it during planning is reading less footers and a
"smaller" query plan.

Change-Id: I8eb4ab5204c20b3991fdf305d7317f4023904a0f
Reviewed-on: http://gerrit.cloudera.org:8080/22995
Reviewed-by: Csaba Ringhofer <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Allow Iceberg predicate push down on non-partition columns
> ----------------------------------------------------------
>
>                 Key: IMPALA-14123
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14123
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Csaba Ringhofer
>            Assignee: Csaba Ringhofer
>            Priority: Major
>              Labels: iceberg, impala-iceberg, performance
>
> Since IMPALA-11591 Impala only calls Iceberg's planFiles() when at least one 
> predicate can be pushed down for a partition column.  
> https://github.com/apache/impala/blob/006f9ba589cc7d104bc81927ebba647cafc0d39b/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java#L828
> This is a useful optimization in general because if Iceberg can't skip many 
> files based on stats then calling planFiles() makes planning slower without 
> benefit. There are some cases though when many files could be skipped during 
> planning and doing this would make the query much faster. In most cases 
> Impala would still not read the whole data file, as Parquet min/max filtering 
> would drop row groups based on the footer, but knowing this during planning 
> would lead to a more efficient plan.
> Some cases where planning time stat filtering could be useful:
> 1. sorting columns
> 2. column with some hidden relation to sorting or partition columns
> 3. cases when Impala's scanner can't do file level stat filtering - the 
> obvious examples are Avro files, but Parquet/ORC min/max filtering also has 
> limitations
> some examples for 2 ("hidden partitioning/clustering"):
> a. table is partitioned by country of a point, filter is on x/y coordinates
> b. table is sorted by send_date, but filtered by receive_date - the two are 
> likely to correlate
>  
> My idea is to provide a way to configure how Impala behaves (with the 
> exception of explicit sorting columns, where this info is available):
> 1. create a query option that enables calling planFiles() in all cases - this 
> is also useful to make Impala more consistent with other engines, for example 
> when benchmarking running Impala with both values can give useful insights
> 2. add a table property that lists the columns where predicates should be 
> pushed down to Iceberg - this could be also computed automatically by a tool 
> that checks overlaps among file min/max ranges



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to