[ 
https://issues.apache.org/jira/browse/IMPALA-15111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Borók-Nagy updated IMPALA-15111:
---------------------------------------
    Description: 
Currently, Impala only generates min/max runtime filters from equi-join 
predicates on hash joins (a.col = b.col). For inequality join predicates (e.g., 
a.col <= b.col), min/max filters are only supported when the right side is a 
non-correlated scalar subquery.

This means range-overlap join patterns like:

{noformat}
  SELECT ...
  FROM fact_table a JOIN dim_table b ON
    a.key = b.key AND
    a.start_time <= b.end_time AND
    a.end_time >= b.start_time
{noformat}
do not benefit from min/max runtime filters on start_time/end_time, even though 
the build side's min/max bounds could effectively prune row groups and pages on 
the probe side.

*Proposed Enhancement*:

  Extend RuntimeFilterGenerator to recognize inequality join predicates of the 
form probe.col {<=, <, >=, >} build.col on hash joins and generate min/max 
runtime filters for the probe-side column. Specifically:

  - For probe.col <= build.col: the build side computes max(build.col) and the 
probe side filters rows/row groups where probe.col > max(build.col).
  - For probe.col >= build.col: the build side computes min(build.col) and the 
probe side filters rows/row groups where probe.col < min(build.col).

This would enable Parquet row group/page-level pruning via overlap checks for 
range-join patterns, which are common in time-series and event-correlation 
workloads.

*Implementation Notes*:

  - In RuntimeFilterGenerator.java, the hash join path currently only calls 
isSqlEquivalencePredicate(). Add handling for BinaryPredicate with <=, <, >=, > 
operators where one side is bound to the build and the other to the probe.
  - Generate TRuntimeFilterType.MIN_MAX filters only (bloom filters don't apply 
to inequality predicates).
  - The build-side aggregation (computing min/max of the build column) is 
already how min/max filters work for equi-joins, the same mechanism can be 
reused.
  - On the scan side, the existing EvaluateOverlapForRowGroup() and page-level 
overlap logic should work as-is once the filter is generated and assigned to 
the scan node.


  was:
Currently, Impala only generates min/max runtime filters from equi-join 
predicates on hash joins (a.col = b.col). For inequality join predicates (e.g., 
a.col <= b.col), min/max filters are only supported when the right side is a 
non-correlated scalar subquery.

This means range-overlap join patterns like:

{noformat}
  SELECT ...
  FROM fact_table a
  INNER JOIN dim_table b ON a.key = b.key
    AND a.start_time <= b.end_time
    AND a.end_time >= b.start_time
{noformat}
do not benefit from min/max runtime filters on start_time/end_time, even though 
the build side's min/max bounds could effectively prune row groups and pages on 
the probe side.

*Proposed Enhancement*:

  Extend RuntimeFilterGenerator to recognize inequality join predicates of the 
form probe.col {<=, <, >=, >} build.col on hash joins and generate min/max 
runtime filters for the probe-side column. Specifically:

  - For probe.col <= build.col: the build side computes max(build.col) and the 
probe side filters rows/row groups where probe.col > max(build.col).
  - For probe.col >= build.col: the build side computes min(build.col) and the 
probe side filters rows/row groups where probe.col < min(build.col).

This would enable Parquet row group/page-level pruning via overlap checks for 
range-join patterns, which are common in time-series and event-correlation 
workloads.

*Implementation Notes*:

  - In RuntimeFilterGenerator.java, the hash join path currently only calls 
isSqlEquivalencePredicate(). Add handling for BinaryPredicate with <=, <, >=, > 
operators where one side is bound to the build and the other to the probe.
  - Generate TRuntimeFilterType.MIN_MAX filters only (bloom filters don't apply 
to inequality predicates).
  - The build-side aggregation (computing min/max of the build column) is 
already how min/max filters work for equi-joins, the same mechanism can be 
reused.
  - On the scan side, the existing EvaluateOverlapForRowGroup() and page-level 
overlap logic should work as-is once the filter is generated and assigned to 
the scan node.



> Generate min/max runtime filters for inequality join predicates on hash joins
> -----------------------------------------------------------------------------
>
>                 Key: IMPALA-15111
>                 URL: https://issues.apache.org/jira/browse/IMPALA-15111
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>
> Currently, Impala only generates min/max runtime filters from equi-join 
> predicates on hash joins (a.col = b.col). For inequality join predicates 
> (e.g., a.col <= b.col), min/max filters are only supported when the right 
> side is a non-correlated scalar subquery.
> This means range-overlap join patterns like:
> {noformat}
>   SELECT ...
>   FROM fact_table a JOIN dim_table b ON
>     a.key = b.key AND
>     a.start_time <= b.end_time AND
>     a.end_time >= b.start_time
> {noformat}
> do not benefit from min/max runtime filters on start_time/end_time, even 
> though the build side's min/max bounds could effectively prune row groups and 
> pages on the probe side.
> *Proposed Enhancement*:
>   Extend RuntimeFilterGenerator to recognize inequality join predicates of 
> the form probe.col {<=, <, >=, >} build.col on hash joins and generate 
> min/max runtime filters for the probe-side column. Specifically:
>   - For probe.col <= build.col: the build side computes max(build.col) and 
> the probe side filters rows/row groups where probe.col > max(build.col).
>   - For probe.col >= build.col: the build side computes min(build.col) and 
> the probe side filters rows/row groups where probe.col < min(build.col).
> This would enable Parquet row group/page-level pruning via overlap checks for 
> range-join patterns, which are common in time-series and event-correlation 
> workloads.
> *Implementation Notes*:
>   - In RuntimeFilterGenerator.java, the hash join path currently only calls 
> isSqlEquivalencePredicate(). Add handling for BinaryPredicate with <=, <, >=, 
> > operators where one side is bound to the build and the other to the probe.
>   - Generate TRuntimeFilterType.MIN_MAX filters only (bloom filters don't 
> apply to inequality predicates).
>   - The build-side aggregation (computing min/max of the build column) is 
> already how min/max filters work for equi-joins, the same mechanism can be 
> reused.
>   - On the scan side, the existing EvaluateOverlapForRowGroup() and 
> page-level overlap logic should work as-is once the filter is generated and 
> assigned to the scan node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to