kosiew opened a new pull request, #20745:
URL: https://github.com/apache/datafusion/pull/20745
## Which issue does this PR close?
* Closes #19950.
## Rationale for this change
`UPDATE ... FROM` was planned incorrectly and was effectively unusable in
DataFusion. The SQL layer rejected the syntax outright, and the underlying
planning/evaluation path also stripped qualifiers from assignment expressions.
That meant expressions such as `t2.b` could be rebound as target-table columns,
so joined values were not applied correctly.
This change enables the supported single-source `UPDATE ... FROM` flow and
fixes the core binding issue by preserving source-qualified expressions for
multi-table updates. It also gives table providers a dedicated execution path
for updates that depend on joined input rows, instead of forcing all updates
through the single-table assignment API.
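To make the binding problem concrete, here is a minimal sketch using plain Rust types rather than DataFusion's expression types. `ColumnRef`, `JoinedRow`, `resolve`, `strip_qualifier`, and `sample_row` are hypothetical names invented for illustration. It assumes unqualified names resolve against the target table, which is roughly what happened once the qualifier was stripped.

```rust
use std::collections::HashMap;

/// Hypothetical column reference with an optional table qualifier,
/// e.g. `t2.b` (qualified) or `b` (unqualified).
#[derive(Debug, Clone, PartialEq)]
struct ColumnRef {
    qualifier: Option<String>,
    name: String,
}

/// One joined row: values keyed by (table, column).
type JoinedRow = HashMap<(String, String), i64>;

/// Resolve a column against a joined row.
/// Assumption: unqualified names fall back to the target table.
fn resolve(row: &JoinedRow, col: &ColumnRef, target: &str) -> Option<i64> {
    let table = col.qualifier.as_deref().unwrap_or(target);
    row.get(&(table.to_string(), col.name.clone())).copied()
}

/// Drop the qualifier, mimicking the single-table assignment path.
fn strip_qualifier(col: &ColumnRef) -> ColumnRef {
    ColumnRef {
        qualifier: None,
        name: col.name.clone(),
    }
}

/// A joined row where the target `t1.b` is 10 and the source `t2.b` is 99.
fn sample_row() -> JoinedRow {
    let mut row = JoinedRow::new();
    row.insert(("t1".to_string(), "b".to_string()), 10);
    row.insert(("t2".to_string(), "b".to_string()), 99);
    row
}
```

With the qualifier preserved, `t2.b` resolves to the source value. Once stripped, the same expression silently reads the target's own `b`, so `SET b = t2.b` degenerates into an identity assignment.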
## What changes are included in this PR?
This PR adds end-to-end support for single-source `UPDATE ... FROM` and
wires it through planning, provider APIs, MemTable execution, and tests.
At a high level, the changes include:
* removing the SQL-layer `not_impl` guard that previously rejected `UPDATE
... FROM`;
* extending `TableProvider` with a new `update_from(...)` hook for
multi-table updates driven by a physical input plan;
* updating the physical planner to distinguish between:
  * single-table `UPDATE`, which still uses extracted assignment
    expressions; and
  * `UPDATE ... FROM`, which now passes an optimized physical input plan
    plus target-only filters to the provider;
* preserving qualified source references in assignment extraction for
multi-table updates, while keeping the existing qualifier-stripping behavior
for single-table updates;
* improving identity-assignment detection so aliased target references are
treated correctly;
* adding helper logic to detect joins and collect target-table aliases
during planning;
* implementing `MemTable::update_from(...)`, including:
  * collecting replacement rows from the physical input,
  * validating schema equivalence,
  * counting matched target rows,
  * rejecting plans where replacement row counts do not match the number of
    target rows to update,
  * merging replacement values back into target batches using the update
    mask;
* clearing MemTable sort-order metadata after mutation, consistent with
update behavior;
* updating custom provider DML tests to exercise the new provider path and
verify that only target-table predicates are forwarded as provider filters;
* adding planner/unit tests for alias handling and assignment extraction; and
* adding sqllogictest coverage for explain plans, alias variants, successful
execution, and mismatch/error behavior.
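The MemTable steps listed above (count matched rows, reject mismatched replacement counts, merge via the update mask) can be sketched on a single column of plain integers. This is an illustrative sketch, not the code in this PR: `merge_with_mask` is a hypothetical helper, it works on slices rather than Arrow record batches, and it assumes replacement values arrive in the same order as the matched target rows.

```rust
/// Merge replacement values into one target column using a boolean update mask.
///
/// Hypothetical helper for illustration only. Assumes `replacements` holds one
/// value per `true` entry in `mask`, in target-row order.
fn merge_with_mask(
    target: &mut [i64],
    mask: &[bool],
    replacements: &[i64],
) -> Result<usize, String> {
    if target.len() != mask.len() {
        return Err(format!(
            "mask length {} does not match target length {}",
            mask.len(),
            target.len()
        ));
    }

    // Count the target rows selected for update.
    let matched = mask.iter().filter(|&&m| m).count();

    // Reject the update before touching any data if the counts disagree.
    if matched != replacements.len() {
        return Err(format!(
            "expected {} replacement rows, got {}",
            matched,
            replacements.len()
        ));
    }

    // Walk the target once, consuming one replacement per selected row.
    let mut next = replacements.iter();
    for (value, &selected) in target.iter_mut().zip(mask) {
        if selected {
            if let Some(&new_value) = next.next() {
                *value = new_value;
            }
        }
    }
    Ok(matched)
}
```

Validating the counts before the merge loop means a mismatched plan leaves the target column untouched, which matches the intent of rejecting such plans rather than applying a partial update.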
This PR still keeps the existing limitation that `UPDATE ... FROM` supports
only a single source table. Queries with multiple tables in the `FROM` clause
continue to return a `not implemented` error.
## Are these changes tested?
Yes.
The patch adds and updates tests across several layers:
* physical planner unit tests for assignment extraction in both single-table
and `UPDATE ... FROM` cases;
* custom source DML planning tests to verify provider behavior, alias
handling, and target-filter forwarding;
* sqllogictests covering:
  * logical and physical plans for `UPDATE ... FROM`,
  * successful updates against actual data,
  * target/source alias permutations, and
  * row-count mismatch error handling for invalid joined replacement results.
These tests cover both the original reported bug and the new execution path
introduced for table providers and MemTable.
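As a rough illustration of what "target-filter forwarding" means, the sketch below splits a list of predicates into those that reference only the target table (under its name or any alias) and everything else. `Predicate` and `target_only_filters` are hypothetical, and each predicate carries a precomputed list of referenced qualifiers instead of a real expression tree, so this is a simplification of whatever the planner actually does.

```rust
/// Hypothetical, simplified predicate: the table qualifiers it references
/// plus its SQL text.
struct Predicate {
    tables: Vec<&'static str>,
    sql: &'static str,
}

/// Keep only predicates whose every referenced qualifier is the target table
/// or one of its aliases. Join predicates and source-only predicates are dropped.
///
/// Note: a predicate that references no table at all also passes this check.
fn target_only_filters<'a>(
    predicates: &'a [Predicate],
    target_names: &[&str],
) -> Vec<&'a str> {
    predicates
        .iter()
        .filter(|p| p.tables.iter().all(|t| target_names.contains(t)))
        .map(|p| p.sql)
        .collect()
}
```

For `UPDATE t1 AS dst ... FROM t2 AS src WHERE dst.a = src.a AND dst.c > 5 AND src.d < 3`, calling this with target names `["t1", "dst"]` keeps only `dst.c > 5`. The join condition and the source-only predicate stay in the input plan.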
## Are there any user-facing changes?
Yes.
DataFusion now supports single-source `UPDATE ... FROM` statements,
including target/source aliases and source-qualified assignment expressions
such as:
```sql
UPDATE t1 AS dst
SET b = src.b, d = src.d
FROM t2 AS src
WHERE dst.a = src.a;
```
Previously, this syntax was rejected or failed to apply joined source values
correctly. After this change, supported `UPDATE ... FROM` statements plan and
execute correctly for MemTable and for providers that implement
`update_from(...)`.
There is also a small public API change for table providers: `TableProvider`
now includes a new async `update_from(...)` method for multi-table update
execution. Providers that do not implement it will continue to return a `not
implemented` error for this operation.
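The opt-in behavior follows the usual default-method pattern. The sketch below is a simplified stand-in, not the real `TableProvider` trait or the actual `update_from(...)` signature: the `Provider` trait, its synchronous method, and the `(row index, new value)` argument are all assumptions made to keep the example self-contained.

```rust
use std::cell::RefCell;

/// Hypothetical, simplified provider trait (NOT DataFusion's `TableProvider`).
trait Provider {
    fn name(&self) -> &str;

    /// Default implementation: providers that do not override this method
    /// report the operation as not implemented.
    fn update_from(&self, _joined_rows: &[(usize, i64)]) -> Result<u64, String> {
        Err(format!("UPDATE ... FROM not implemented for {}", self.name()))
    }
}

/// A provider that relies on the default and therefore rejects the operation.
struct ReadOnlyProvider;

impl Provider for ReadOnlyProvider {
    fn name(&self) -> &str {
        "read_only"
    }
}

/// A provider that opts in by overriding `update_from`.
struct InMemoryProvider {
    values: RefCell<Vec<i64>>,
}

impl Provider for InMemoryProvider {
    fn name(&self) -> &str {
        "in_memory"
    }

    /// Apply `(row index, new value)` pairs and return the number of rows updated.
    fn update_from(&self, joined_rows: &[(usize, i64)]) -> Result<u64, String> {
        let mut values = self.values.borrow_mut();
        // Validate every index first so a failed update changes nothing.
        if joined_rows.iter().any(|&(idx, _)| idx >= values.len()) {
            return Err("row index out of bounds".to_string());
        }
        for &(idx, new_value) in joined_rows {
            values[idx] = new_value;
        }
        Ok(joined_rows.len() as u64)
    }
}
```

Because the default lives on the trait, existing providers compile unchanged and only fail at execution time if an `UPDATE ... FROM` statement actually targets them.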
## LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content
has been manually reviewed and tested.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]