Pengfei Xu created SPARK-56942:
----------------------------------

             Summary: Support nested column references as DSv2 row IDs
                 Key: SPARK-56942
                 URL: https://issues.apache.org/jira/browse/SPARK-56942
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.2.0
            Reporter: Pengfei Xu


Connectors that implement `SupportsDelta` declare row identifiers via 
`rowId()`, which returns `NamedReference[]`. A `NamedReference` may be 
multi-segment (e.g. `["data", "pk"]` or `["_metadata", "row_index"]`), so the 
API contract permits nested row IDs.
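
As a rough illustration of what "multi-segment" means here, the sketch below models a reference as a path of field names. `Ref`, `RowIdSketch`, `isNested`, and `describe` are hypothetical stand-ins, not Spark classes; only the `fieldNames()` accessor mirrors the real `NamedReference` interface.

```java
import java.util.Arrays;

public class RowIdSketch {
    // Hypothetical stand-in for a NamedReference: a column path of one or more segments.
    static final class Ref {
        private final String[] fieldNames;

        Ref(String... fieldNames) {
            this.fieldNames = fieldNames;
        }

        String[] fieldNames() {
            return Arrays.copyOf(fieldNames, fieldNames.length);
        }

        // A path with more than one segment points inside a struct column.
        boolean isNested() {
            return fieldNames.length > 1;
        }

        String describe() {
            return String.join(".", fieldNames);
        }
    }

    // What a position-delete style connector might declare as its row ID (assumed example).
    static Ref[] rowId() {
        return new Ref[] {
            new Ref("_metadata", "file_path"),
            new Ref("_metadata", "row_index")
        };
    }

    public static void main(String[] args) {
        for (Ref ref : rowId()) {
            System.out.println(ref.describe() + " nested=" + ref.isNested());
        }
    }
}
```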

During analysis, however, Spark calls 
`V2ExpressionUtils.resolveRefs[AttributeReference](operation.rowId, relation)` 
from both `RewriteRowLevelCommand.resolveRowIdAttrs` and 
`WriteDelta.rowIdAttrsResolved`. For a multi-segment reference, the resolver 
returns `Alias(GetStructField(...))`, and the `asInstanceOf[AttributeReference]` 
cast throws a `ClassCastException` before any plan executes. As a result, 
DELETE, UPDATE, and MERGE against such a connector all fail outright.

Proposed fix: widen the resolver call to `resolveRefs[NamedExpression]` and 
convert each result back to an attribute via `.toAttribute`. Both flat and 
nested row-ID columns then resolve; flat-column behavior is unchanged.
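
The failure and the widening can be modelled with a toy sketch. Everything below is a simplified stand-in written for illustration, not Catalyst code: the class names only echo the real ones, `resolve` assumes that a one-segment path yields an attribute and a longer path yields an alias over a struct field, and an explicit `Class<T>` token takes the place of the Scala type parameter.

```java
import java.util.ArrayList;
import java.util.List;

public class ResolveRowIdSketch {
    // Toy stand-ins for Catalyst expression types (not the real classes).
    interface NamedExpression {
        String name();
        AttributeReference toAttribute();
    }

    static final class AttributeReference implements NamedExpression {
        private final String name;
        AttributeReference(String name) { this.name = name; }
        public String name() { return name; }
        public AttributeReference toAttribute() { return this; }
    }

    // Models Alias(GetStructField(...)): a named projection of a struct field.
    static final class Alias implements NamedExpression {
        private final String name;
        Alias(String name) { this.name = name; }
        public String name() { return name; }
        public AttributeReference toAttribute() { return new AttributeReference(name); }
    }

    // Assumed behavior: flat paths resolve to attributes, nested paths to aliases.
    static NamedExpression resolve(String[] path) {
        String last = path[path.length - 1];
        if (path.length == 1) {
            return new AttributeReference(last);
        }
        return new Alias(last);
    }

    // Casts every resolved expression to the requested type, like the typed resolver call.
    static <T extends NamedExpression> List<T> resolveRefs(Class<T> type, String[][] refs) {
        List<T> out = new ArrayList<>();
        for (String[] ref : refs) {
            out.add(type.cast(resolve(ref)));
        }
        return out;
    }

    // The widened variant: resolve as NamedExpression, then flatten with toAttribute().
    static List<AttributeReference> resolveRowIdAttrs(String[][] refs) {
        List<AttributeReference> attrs = new ArrayList<>();
        for (NamedExpression expr : resolveRefs(NamedExpression.class, refs)) {
            attrs.add(expr.toAttribute());
        }
        return attrs;
    }

    public static void main(String[] args) {
        String[][] rowId = { {"_metadata", "file_path"}, {"_metadata", "row_index"} };
        try {
            resolveRefs(AttributeReference.class, rowId); // narrow cast on a nested reference
        } catch (ClassCastException e) {
            System.out.println("narrow cast failed");
        }
        for (AttributeReference attr : resolveRowIdAttrs(rowId)) {
            System.out.println("resolved " + attr.name());
        }
    }
}
```

In this model a flat reference passes through `toAttribute()` unchanged, which is the sense in which flat-column behavior stays the same.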

This unblocks DSv2 connectors that identify rows by file-source metadata such 
as `(_metadata.file_path, _metadata.row_index)` -- the natural identity for 
position-delete / deletion-vector writes. Iceberg's 
`SparkPositionDeltaOperation` uses an analogous `[_file, _pos]` pattern; this 
change lets connectors with nested row-ID columns follow suit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

