nathanb9 opened a new pull request, #22873:
URL: https://github.com/apache/datafusion/pull/22873

   ## Problem
   
   Per SQL:2016 ยง7.10, a `USING` / `NATURAL` join column is `COALESCE(left.k, 
right.k)`. For a right-only row of a `RIGHT` / `FULL` outer join the left key 
is NULL padded, so the merged key must take the right key's value. DataFusion 
instead resolves the unqualified key to the left column only, so it comes out 
`NULL` for right-only rows.
   
   This is silent (the query plans and runs fine) and it corrupts downstream 
`WHERE` and `GROUP BY` on the key. `LEFT JOIN USING` / `INNER` are unaffected: 
the defect only appears when the left side is NULL padded (`RIGHT` / `FULL`).
   
   The logical plan binds the merged key to the left column with no `COALESCE`:
   
   ```
   > explain format indent select k from a full join b using (k);
   
   Projection: a.k                       <-- unqualified k bound to a.k (not 
coalesce(a.k, b.k))
     Full Join: Using a.k = b.k
       TableScan: a projection=[k]
       TableScan: b projection=[k]
   ```
   
   ## Proof
   
   ```sql
   create table a(k int, x int) as values (1, 10);
   create table b(k int, y int) as values (4, 400);
   
   -- right-only row reports k = NULL instead of 4
   select k from a full join b using (k) order by k nulls last;
   -- before: 1, NULL          after: 1, 4
   
   -- the unqualified `k` is bound to a.k, not coalesce(a.k, b.k)
   select k as unqual_k, a.k as a_k, b.k as b_k
   from a full join b using (k) order by unqual_k nulls last;
   -- before: unqual_k row is NULL | NULL | 4     after: 4 | NULL | 4
   
   -- same defect for RIGHT JOIN USING and NATURAL RIGHT/FULL JOIN
   select k from a right join b using (k);                     -- before: NULL  
 after: 4
   select k from a natural full join b order by k nulls last;  -- before: 
1,NULL after: 1,4
   ```
   
   Downstream impact, also fixed:
   
   ```sql
   -- before: 0 rows (silently drops a real row); after: 1 row
   select * from a full join b using (k) where k = 4;
   
   -- before: right-only rows bucket under k=NULL; after: under their real key
   select k, count(*) from a full join b using (k) group by k;
   ```
   
   The standard-equivalent explicit form was already correct, confirming the 
divergence:
   
   ```sql
   select coalesce(a.k, b.k) as k, a.x, b.y
   from a full join b on a.k = b.k order by k nulls last;
   -- 1, 10, NULL
   -- 4, NULL, 400   <-- k = 4 here (correct)
   ```
   
   ## Solution
   
   DataFusion keeps both per-side keys (`left.k` and `right.k`) in the join 
output schema so they stay individually addressable, and `DFSchema` forbids an 
unqualified field coexisting with a same-named qualified one. So rather than 
materializing a merged key column, the merged key is exposed by resolving an 
*unqualified* reference to the coalesced value, leaving the join schema 
untouched.
   
   When planning a `RIGHT` / `FULL` `USING` / `NATURAL` join, `plan_using_join` 
registers each merged key in `PlannerContext` with its replacement expression: 
`COALESCE(left.k, right.k)` for `FULL`, or `right.k` for `RIGHT` (the right 
side is always present). An unqualified reference to the key is then rewritten 
to that replacement in `sql_identifier_to_expr` (covering `SELECT` / `WHERE` / 
`GROUP BY` / `HAVING` / `ORDER BY` at one chokepoint), guarded so it only fires 
where both per-side keys are in scope. `SELECT *` substitutes the coalesced 
value for the kept key column, and the key name is preserved as the output 
column name. The registry is scoped per `SELECT` so merged keys do not leak to 
sibling queries (e.g. the other arm of a `UNION`).
   
   `INNER` / `LEFT` joins are deliberately left unchanged: their left key is 
never NULL padded, so resolution to the left column is already correct and the 
common path keeps its existing plan.
   
   The result now matches PostgreSQL and the explicit `SELECT coalesce(a.k, 
b.k) ... ON a.k = b.k` form for `RIGHT`, `FULL`, `NATURAL RIGHT`, and `NATURAL 
FULL` joins:
   
   ```
   > explain format indent select k from a full join b using (k);
   
   Projection: coalesce(a.k, b.k) AS k
     Full Join: Using a.k = b.k
       TableScan: a projection=[k]
       TableScan: b projection=[k]
   ```
   
   There was previously **no sqllogictest coverage** for `RIGHT` / `FULL` 
`USING` / `NATURAL`, which is why this went unnoticed. This PR adds that 
coverage in `joins.slt` (data results, downstream `WHERE` / `GROUP BY`, `SELECT 
*`, and the logical/physical plans, including the unchanged `INNER` plan) plus 
planner unit tests in `sql_integration.rs`. The whole `sqllogictest` suite and 
all `datafusion-sql` tests pass.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to