raghav-reglobe commented on PR #22988:
URL: https://github.com/apache/datafusion/pull/22988#issuecomment-4878506556
Took this branch for a spin against a provider-side consumer (we're building
an Iceberg SCD2 upsert on this hook), and I think there's one blocking issue
the current tests don't reach: **any target-column reference in the `ON` or
`WHEN` expressions fails analysis**, so a realistic MERGE can't execute through
`SessionContext::sql()`.
Minimal repro (custom provider registered as `t`, MemTable as `batch`):
```sql
MERGE INTO t USING batch s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET val = s.val
```
```
Error: Context("type_coercion", SchemaError(FieldNotFound { field: Column {
relation: Some(Bare { table: "t" }), name: "id" }, valid_fields: [s.id, s.val,
…] }))
```
What's happening: the `Dml` node's only input is the planned USING source,
but `MergeIntoOp`'s expressions reference the combined target×source namespace.
Since this PR wires those expressions into the generic tree-node traversal
(`tree_node.rs`), the analyzer's `TypeCoercion` pass rewrites them against the
node's input schema — which is source-only — and errors on the first target
reference. As a probe, `ON s.id = s.id` (source-only) sails through analysis
and reaches `merge_into()` fine, which pins the root cause. Everything else
held up nicely in testing, for what it's worth — a window-function subquery as
the USING source (`ROW_NUMBER`/`LEAD`), multi-condition `ON`, clause
predicates, and the INSERT validation all plan exactly as expected.
Two directions I can see:
1. **Minimal**: don't expose `MergeIntoOp` exprs through the generic
traversal (keep `exprs()`/`with_new_exprs` as inherent methods only). The
analyzer/optimizer then leave the node alone, and providers receive un-coerced
expressions — acceptable for providers that do their own resolution, and it
ships a working v1.
2. **Fuller**: record the target's resolution qualifier on `MergeIntoOp`
(the alias when present, else the table reference — today the alias is dropped
after planning), and teach `TypeCoercion` to resolve MergeInto exprs against
target-schema ⋈ input-schema. Coercion on these expressions genuinely matters
(e.g. an `Int32` target key against an `Int64` window output in `ON`), and the
recorded qualifier also gives providers an explicit way to tell target
references from source references instead of inferring by absence.
I'd lean toward 2, but 1 seems fine to unblock if you'd rather keep this PR
small. Happy to share the test module I used — a `CaptureMergeProvider` in the
style of `dml_planning.rs` with basic + SCD2-shaped (window-subquery USING,
triple-condition ON, clause predicate) cases — if it's useful as a starting
point for end-to-end coverage here. Also happy to put up a patch for either
direction if that helps.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]