Pengfei Xu created SPARK-56942:
----------------------------------
Summary: Support nested column references as DSv2 row IDs
Key: SPARK-56942
URL: https://issues.apache.org/jira/browse/SPARK-56942
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.2.0
Reporter: Pengfei Xu
Connectors that implement `SupportsDelta` declare row identifiers via
`rowId()`, which returns `NamedReference[]`. A `NamedReference` may be
multi-segment (e.g. `["data", "pk"]` or `["_metadata", "row_index"]`), so the
API contract permits nested row IDs.
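For illustration, a connector-side declaration of nested row IDs might look roughly like the sketch below. To keep it self-contained, `NamedReference` and `SupportsDelta` here are minimal stand-ins for the Spark connector interfaces (the real ones have more members), and `ref`, `isNested` and `MetadataRowIdOperation` are hypothetical names introduced only for this example.

```java
public class RowIdSketch {
    // Stand-in for Spark's NamedReference: a column path of one or more segments.
    interface NamedReference {
        String[] fieldNames();
    }

    // Stand-in covering only the rowId() part of SupportsDelta.
    interface SupportsDelta {
        NamedReference[] rowId();
    }

    // Hypothetical helper that builds a reference from its path segments.
    static NamedReference ref(final String... segments) {
        return new NamedReference() {
            @Override public String[] fieldNames() { return segments.clone(); }
            @Override public String toString() { return String.join(".", segments); }
        };
    }

    // Hypothetical operation that identifies rows by nested file metadata fields.
    static class MetadataRowIdOperation implements SupportsDelta {
        @Override public NamedReference[] rowId() {
            return new NamedReference[] {
                ref("_metadata", "file_path"),
                ref("_metadata", "row_index")
            };
        }
    }

    // A reference is nested when its path has more than one segment.
    static boolean isNested(NamedReference r) {
        return r.fieldNames().length > 1;
    }

    public static void main(String[] args) {
        for (NamedReference r : new MetadataRowIdOperation().rowId()) {
            System.out.println(r + " nested=" + isNested(r));
        }
    }
}
```

Both references returned here have two segments, which is the case the rest of this issue is concerned with.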
During analysis, however, Spark calls
`V2ExpressionUtils.resolveRefs[AttributeReference](operation.rowId, relation)`
from both `RewriteRowLevelCommand.resolveRowIdAttrs` and
`WriteDelta.rowIdAttrsResolved`. For a multi-segment reference, the resolver
returns `Alias(GetStructField(...))` and the `asInstanceOf[AttributeReference]`
cast throws `ClassCastException` before any plan executes. DELETE / UPDATE /
MERGE against such a connector fails outright.
The proposed fix is to widen the resolver call to `resolveRefs[NamedExpression]`
and convert each result back to an attribute via `.toAttribute`. Both flat and
nested row-id columns then work; flat-column behavior is unchanged.
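The failure and the fix can be modelled outside Spark with a toy sketch. The classes below are simplified stand-ins, not Catalyst's real `NamedExpression`, `AttributeReference` or `Alias`; `resolve`, `resolveNarrow` and `resolveWide` are hypothetical names. The sketch assumes, as described above, that a multi-segment reference resolves to an alias over a struct-field access and that the alias takes the name of the last path segment.

```java
import java.util.ArrayList;
import java.util.List;

public class RowIdResolutionSketch {
    // Toy stand-in for a named Catalyst expression.
    interface NamedExpression {
        String name();
        AttributeReference toAttribute();
    }

    // Toy stand-in for a plain column attribute.
    static final class AttributeReference implements NamedExpression {
        final String name;
        AttributeReference(String name) { this.name = name; }
        @Override public String name() { return name; }
        @Override public AttributeReference toAttribute() { return this; }
    }

    // Toy model of Alias(GetStructField(...)): a named projection of a nested field.
    static final class Alias implements NamedExpression {
        final String[] path;
        Alias(String[] path) { this.path = path; }
        // Assumption: the alias is named after the last path segment.
        @Override public String name() { return path[path.length - 1]; }
        @Override public AttributeReference toAttribute() {
            return new AttributeReference(name());
        }
    }

    // Toy resolver: single-segment refs yield attributes, multi-segment refs yield aliases.
    static NamedExpression resolve(String[] ref) {
        return ref.length == 1 ? new AttributeReference(ref[0]) : new Alias(ref);
    }

    // Current behavior: a narrowing cast that fails for nested references.
    static List<AttributeReference> resolveNarrow(String[][] refs) {
        List<AttributeReference> out = new ArrayList<>();
        for (String[] ref : refs) {
            out.add((AttributeReference) resolve(ref)); // ClassCastException on Alias
        }
        return out;
    }

    // Proposed behavior: accept any named expression, then convert to an attribute.
    static List<AttributeReference> resolveWide(String[][] refs) {
        List<AttributeReference> out = new ArrayList<>();
        for (String[] ref : refs) {
            out.add(resolve(ref).toAttribute());
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] nested = { {"_metadata", "row_index"} };
        try {
            resolveNarrow(nested);
        } catch (ClassCastException e) {
            System.out.println("narrow cast failed for nested row ID");
        }
        System.out.println("wide resolution: " + resolveWide(nested).get(0).name());
    }
}
```

In this model a flat reference takes the same path through `resolveWide` as through `resolveNarrow`, since `toAttribute` on an attribute returns the attribute itself.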
This unblocks DSv2 connectors that identify rows by file-source metadata such
as `(_metadata.file_path, _metadata.row_index)` -- the natural identity for
position-delete / deletion-vector writes. Iceberg's
`SparkPositionDeltaOperation` uses an analogous `[_file, _pos]` pattern with
flat metadata columns; this lets DSv2 connectors that expose nested metadata
columns follow suit.