Stepan Stepanishchev created FLINK-38916:
--------------------------------------------
Summary: MultiJoin produces incorrect results for OR join
conditions when parallelism is greater than 1
Key: FLINK-38916
URL: https://issues.apache.org/jira/browse/FLINK-38916
Project: Flink
Issue Type: Bug
Components: Table SQL / Planner
Affects Versions: 2.2.0, 2.1.0, 2.0.0
Reporter: Stepan Stepanishchev
[This PRĀ |https://github.com/apache/flink/pull/27415] adds integration tests
that expose incorrect results produced by the MultiJoin operator for queries
with OR conditions in join predicates.
For example, the following query:
{code:sql}
SELECT u.user_id, u.name, o.order_id, p.payment_id, p.price
FROM Users u
JOIN Orders o ON u.user_id = o.user_id
JOIN Payments p ON o.user_id = p.user_id OR p.price = u.cash
{code}
produces incorrect results when executed with parallelism greater than 1 and
MultiJoin enabled.
The MultiJoin operator treats this query as if it has a common join key
(user_id). Therefore, the inputs are distributed by that key. As a result,
records that satisfy *p.price = u.cash* but have
*p.user_id != o.user_id* are assigned to different subtasks and never joined,
leading to missing result rows.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)