sgrebnov opened a new pull request, #21694:
URL: https://github.com/apache/datafusion/pull/21694

   ## Which issue does this PR close?
   
   Partially addresses https://github.com/apache/datafusion/issues/13156 (inner 
joins only; outer joins require additional work)
   
   When the DataFusion optimizer pushes filter predicates into `TableScan` 
nodes (e.g. via `FilterPushdown`), the unparser's 
`try_transform_to_simple_table_scan_with_filters` extracts those filters and 
then always folds them into the `JOIN ON` clause. This is problematic when the 
extracted filters contain subquery expressions (scalar subqueries, `IN`, 
`EXISTS`), because some SQL backends — notably BigQuery — reject subqueries 
inside `JOIN ON` predicates.
   
   This currently breaks 5 TPC-H queries (Q2, Q16, Q17, Q18, Q21) when unparsed 
SQL is sent to BigQuery.
   
   We did attempt to fix this (https://github.com/apache/datafusion/pull/13496) 
by moving **all** filters to `WHERE`, which broke `LEFT`/`RIGHT`/`FULL` join 
semantics (moving a filter from `ON` to `WHERE` changes the result for outer 
joins, as demonstrated in https://github.com/apache/datafusion/pull/13132).
   
   ## What changes are included in this PR?
   
   For **inner joins only**, `table_scan_filters` extracted by 
`try_transform_to_simple_table_scan_with_filters` are now placed in the `WHERE` 
clause instead of the `JOIN ON` clause. This is safe because `ON` and `WHERE` 
are semantically equivalent for inner joins.
   
   For non-inner joins (`LEFT`, `RIGHT`, `FULL`), the existing behavior is 
preserved — filters remain in `JOIN ON` — since moving them to `WHERE` would 
change query semantics.
   
   ## Are these changes tested?
   
   Yes.
   
   - Added a test case in `test_join_with_table_scan_filters` that constructs 
an inner join where the right side has a `table_scan_with_filters` containing a 
scalar subquery. Verifies the subquery predicate appears in `WHERE`, not `JOIN 
ON`.
   - Updated existing snapshots in `test_join_with_table_scan_filters` 
reflecting that `table_scan_filters` now appear in `WHERE` for inner joins.
   
   ## Are there any user-facing changes?
   
   SQL generated by the unparser for inner joins may now place `TableScan` 
pushdown filters in the `WHERE` clause instead of the `JOIN ON` clause (similar 
to changes in `test_join_with_table_scan_filters`)
   
   ## Alternatives considered
   
   An alternative approach considered is to introduce a 
`supports_subquery_in_join_predicate` dialect flag that only moves 
subquery-containing filters to `WHERE` when the dialect opts in (e.g. 
`BigQueryDialect`), preserving existing behavior for all other dialects.
   
   Example implementation: https://github.com/spiceai/datafusion/pull/151
   
   Current approach was chosen due to 
   - simplicity: no new dialect flag, fewer code paths, simple change.
   - performance (potentially): as noted in 
[#13156](https://github.com/apache/datafusion/issues/13156), placing filters in 
`WHERE` can trigger filter pushdown on the target backend, which is a potential 
performance win.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to