[
https://issues.apache.org/jira/browse/IMPALA-15111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zoltán Borók-Nagy updated IMPALA-15111:
---------------------------------------
Description:
Currently, Impala only generates min/max runtime filters from equi-join
predicates on hash joins (a.col = b.col). For inequality join predicates (e.g.,
a.col <= b.col), min/max filters are only supported when the right side is a
non-correlated scalar subquery.
This means range-overlap join patterns like:
{noformat}
SELECT ...
FROM fact_table a
INNER JOIN dim_table b ON a.key = b.key
AND a.start_time <= b.end_time
AND a.end_time >= b.start_time
{noformat}
do not benefit from min/max runtime filters on start_time/end_time, even though
the build side's min/max bounds could effectively prune row groups and pages on
the probe side.
*Proposed Enhancement*:
Extend RuntimeFilterGenerator to recognize inequality join predicates of the
form probe.col {<=, <, >=, >} build.col on hash joins and generate min/max
runtime filters for the probe-side column. Specifically:
- For probe.col <= build.col: the build side computes max(build.col) and the
probe side filters rows/row groups where probe.col > max(build.col).
- For probe.col >= build.col: the build side computes min(build.col) and the
probe side filters rows/row groups where probe.col < min(build.col).
This would enable Parquet row group/page-level pruning via overlap checks for
range-join patterns, which are common in time-series and event-correlation
workloads.
*Implementation Notes*:
- In RuntimeFilterGenerator.java, the hash join path currently only calls
isSqlEquivalencePredicate(). Add handling for BinaryPredicate with <=, <, >=, >
operators where one side is bound to the build and the other to the probe.
- Generate TRuntimeFilterType.MIN_MAX filters only (bloom filters don't apply
to inequality predicates).
- The build-side aggregation (computing min/max of the build column) is
already how min/max filters work for equi-joins, the same mechanism can be
reused.
- On the scan side, the existing EvaluateOverlapForRowGroup() and page-level
overlap logic should work as-is once the filter is generated and assigned to
the scan node.
was:
Currently, Impala only generates min/max runtime filters from equi-join
predicates on hash joins (a.col = b.col). For inequality join predicates (e.g.,
a.col <= b.col), min/max filters are only supported when the right side is a
non-correlated scalar subquery.
This means range-overlap join patterns like:
{noformat}
SELECT ...
FROM fact_table a
INNER JOIN dim_table b ON a.key = b.key
AND a.start_time <= b.end_time
AND a.end_time >= b.start_time
{noformat}
do not benefit from min/max runtime filters on start_time/end_time, even though
the build side's min/max bounds could effectively prune row groups and pages on
the probe side.
*Proposed Enhancement*:
Extend RuntimeFilterGenerator to recognize inequality join predicates of the
form probe.col {<=, <, >=, >} build.col on hash joins and generate min/max
runtime filters for the probe-side column. Specifically:
- For probe.col <= build.col: the build side computes max(build.col) and the
probe side filters rows/row groups where probe.col > max(build.col).
- For probe.col >= build.col: the build side computes min(build.col) and the
probe side filters rows/row groups where probe.col < min(build.col).
This would enable Parquet row group/page-level pruning via overlap checks for
range-join patterns, which are common in time-series and event-correlation
workloads.
*Implementation Notes*:
- In RuntimeFilterGenerator.java, the hash join path currently only calls
isSqlEquivalencePredicate(). Add handling for BinaryPredicate with <=, <, >=, >
operators where one side is bound to the build and the other to the probe.
- Generate TRuntimeFilterType.MIN_MAX filters only (bloom filters don't apply
to inequality predicates).
- The build-side aggregation (computing min/max of the build column) is
already how min/max filters work for equi-joins — the same mechanism can be
reused.
- On the scan side, the existing EvaluateOverlapForRowGroup() and page-level
overlap logic should work as-is once the filter is generated and assigned to
the scan node.
> Generate min/max runtime filters for inequality join predicates on hash joins
> -----------------------------------------------------------------------------
>
> Key: IMPALA-15111
> URL: https://issues.apache.org/jira/browse/IMPALA-15111
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Reporter: Zoltán Borók-Nagy
> Priority: Major
>
> Currently, Impala only generates min/max runtime filters from equi-join
> predicates on hash joins (a.col = b.col). For inequality join predicates
> (e.g., a.col <= b.col), min/max filters are only supported when the right
> side is a non-correlated scalar subquery.
> This means range-overlap join patterns like:
> {noformat}
> SELECT ...
> FROM fact_table a
> INNER JOIN dim_table b ON a.key = b.key
> AND a.start_time <= b.end_time
> AND a.end_time >= b.start_time
> {noformat}
> do not benefit from min/max runtime filters on start_time/end_time, even
> though the build side's min/max bounds could effectively prune row groups and
> pages on the probe side.
> *Proposed Enhancement*:
> Extend RuntimeFilterGenerator to recognize inequality join predicates of
> the form probe.col {<=, <, >=, >} build.col on hash joins and generate
> min/max runtime filters for the probe-side column. Specifically:
> - For probe.col <= build.col: the build side computes max(build.col) and
> the probe side filters rows/row groups where probe.col > max(build.col).
> - For probe.col >= build.col: the build side computes min(build.col) and
> the probe side filters rows/row groups where probe.col < min(build.col).
> This would enable Parquet row group/page-level pruning via overlap checks for
> range-join patterns, which are common in time-series and event-correlation
> workloads.
> *Implementation Notes*:
> - In RuntimeFilterGenerator.java, the hash join path currently only calls
> isSqlEquivalencePredicate(). Add handling for BinaryPredicate with <=, <, >=,
> > operators where one side is bound to the build and the other to the probe.
> - Generate TRuntimeFilterType.MIN_MAX filters only (bloom filters don't
> apply to inequality predicates).
> - The build-side aggregation (computing min/max of the build column) is
> already how min/max filters work for equi-joins, the same mechanism can be
> reused.
> - On the scan side, the existing EvaluateOverlapForRowGroup() and
> page-level overlap logic should work as-is once the filter is generated and
> assigned to the scan node.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]