nathanb9 opened a new pull request, #22873:
URL: https://github.com/apache/datafusion/pull/22873
## Problem
Per SQL:2016 ยง7.10, a `USING` / `NATURAL` join column is `COALESCE(left.k,
right.k)`. For a right-only row of a `RIGHT` / `FULL` outer join the left key
is NULL padded, so the merged key must take the right key's value. DataFusion
instead resolves the unqualified key to the left column only, so it comes out
`NULL` for right-only rows.
This is silent (the query plans and runs fine) and it corrupts downstream
`WHERE` and `GROUP BY` on the key. `LEFT JOIN USING` / `INNER` are unaffected:
the defect only appears when the left side is NULL padded (`RIGHT` / `FULL`).
The logical plan binds the merged key to the left column with no `COALESCE`:
```
> explain format indent select k from a full join b using (k);
Projection: a.k <-- unqualified k bound to a.k (not
coalesce(a.k, b.k))
Full Join: Using a.k = b.k
TableScan: a projection=[k]
TableScan: b projection=[k]
```
## Proof
```sql
create table a(k int, x int) as values (1, 10);
create table b(k int, y int) as values (4, 400);
-- right-only row reports k = NULL instead of 4
select k from a full join b using (k) order by k nulls last;
-- before: 1, NULL after: 1, 4
-- the unqualified `k` is bound to a.k, not coalesce(a.k, b.k)
select k as unqual_k, a.k as a_k, b.k as b_k
from a full join b using (k) order by unqual_k nulls last;
-- before: unqual_k row is NULL | NULL | 4 after: 4 | NULL | 4
-- same defect for RIGHT JOIN USING and NATURAL RIGHT/FULL JOIN
select k from a right join b using (k); -- before: NULL
after: 4
select k from a natural full join b order by k nulls last; -- before:
1,NULL after: 1,4
```
Downstream impact, also fixed:
```sql
-- before: 0 rows (silently drops a real row); after: 1 row
select * from a full join b using (k) where k = 4;
-- before: right-only rows bucket under k=NULL; after: under their real key
select k, count(*) from a full join b using (k) group by k;
```
The standard-equivalent explicit form was already correct, confirming the
divergence:
```sql
select coalesce(a.k, b.k) as k, a.x, b.y
from a full join b on a.k = b.k order by k nulls last;
-- 1, 10, NULL
-- 4, NULL, 400 <-- k = 4 here (correct)
```
## Solution
DataFusion keeps both per-side keys (`left.k` and `right.k`) in the join
output schema so they stay individually addressable, and `DFSchema` forbids an
unqualified field coexisting with a same-named qualified one. So rather than
materializing a merged key column, the merged key is exposed by resolving an
*unqualified* reference to the coalesced value, leaving the join schema
untouched.
When planning a `RIGHT` / `FULL` `USING` / `NATURAL` join, `plan_using_join`
registers each merged key in `PlannerContext` with its replacement expression:
`COALESCE(left.k, right.k)` for `FULL`, or `right.k` for `RIGHT` (the right
side is always present). An unqualified reference to the key is then rewritten
to that replacement in `sql_identifier_to_expr` (covering `SELECT` / `WHERE` /
`GROUP BY` / `HAVING` / `ORDER BY` at one chokepoint), guarded so it only fires
where both per-side keys are in scope. `SELECT *` substitutes the coalesced
value for the kept key column, and the key name is preserved as the output
column name. The registry is scoped per `SELECT` so merged keys do not leak to
sibling queries (e.g. the other arm of a `UNION`).
`INNER` / `LEFT` joins are deliberately left unchanged: their left key is
never NULL padded, so resolution to the left column is already correct and the
common path keeps its existing plan.
The result now matches PostgreSQL and the explicit `SELECT coalesce(a.k,
b.k) ... ON a.k = b.k` form for `RIGHT`, `FULL`, `NATURAL RIGHT`, and `NATURAL
FULL` joins:
```
> explain format indent select k from a full join b using (k);
Projection: coalesce(a.k, b.k) AS k
Full Join: Using a.k = b.k
TableScan: a projection=[k]
TableScan: b projection=[k]
```
There was previously **no sqllogictest coverage** for `RIGHT` / `FULL`
`USING` / `NATURAL`, which is why this went unnoticed. This PR adds that
coverage in `joins.slt` (data results, downstream `WHERE` / `GROUP BY`, `SELECT
*`, and the logical/physical plans, including the unchanged `INNER` plan) plus
planner unit tests in `sql_integration.rs`. The whole `sqllogictest` suite and
all `datafusion-sql` tests pass.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]