[
https://issues.apache.org/jira/browse/SPARK-56942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-56942:
-----------------------------------
Labels: pull-request-available (was: )
> Support nested column references as DSv2 row IDs
> ------------------------------------------------
>
> Key: SPARK-56942
> URL: https://issues.apache.org/jira/browse/SPARK-56942
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Pengfei Xu
> Priority: Major
> Labels: pull-request-available
>
> Connectors that implement `SupportsDelta` declare row identifiers via
> `rowId()`, which returns `NamedReference[]`. A `NamedReference` may be
> multi-segment (e.g. `["data", "pk"]` or `["_metadata", "row_index"]`), so the
> API contract permits nested row IDs.
> During analysis, however, Spark calls
> `V2ExpressionUtils.resolveRefs[AttributeReference](operation.rowId,
> relation)` from both `RewriteRowLevelCommand.resolveRowIdAttrs` and
> `WriteDelta.rowIdAttrsResolved`. For a multi-segment reference, the resolver
> returns `Alias(GetStructField(...))`, so the
> `asInstanceOf[AttributeReference]` cast throws `ClassCastException` before
> any plan executes. As a result, DELETE / UPDATE / MERGE statements against
> such a connector fail outright during analysis.
> The proposed fix is to widen the resolver call to
> `resolveRefs[NamedExpression]` and flatten the results back to attributes
> via `.toAttribute`. Both flat and nested row-id columns then work;
> flat-column behavior is unchanged.
> This unblocks DSv2 connectors that identify rows by file-source metadata such
> as `(_metadata.file_path, _metadata.row_index)` -- the natural identity for
> position-delete / deletion-vector writes. Iceberg's DSv1
> `SparkPositionDeltaOperation` uses an analogous `[_file, _pos]` pattern; this
> lets DSv2 connectors follow suit.
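
The failure and the proposed fix described above can be sketched with a small self-contained model. This is only an illustration under stated assumptions: the types below are simplified stand-ins that borrow Catalyst's names, not Spark's real classes, and `resolve`, `resolveNarrow`, and `resolveWide` are hypothetical helpers rather than `V2ExpressionUtils` methods. Naming the alias after the last path segment is likewise an assumption made for the sketch.

```scala
// Simplified stand-ins for Catalyst expression types; NOT Spark's real classes.
object RowIdSketch {
  sealed trait NamedExpression {
    def name: String
    def toAttribute: AttributeReference
  }

  // A top-level column: already an attribute.
  final case class AttributeReference(name: String) extends NamedExpression {
    def toAttribute: AttributeReference = this
  }

  // Extraction of one field from a struct column, e.g. _metadata.row_index.
  final case class GetStructField(child: AttributeReference, field: String)

  // A named wrapper around a nested-field extraction.
  // Assumption: the alias is named after the last path segment.
  final case class Alias(child: GetStructField, name: String) extends NamedExpression {
    def toAttribute: AttributeReference = AttributeReference(name)
  }

  // Hypothetical resolver: one segment yields an attribute,
  // two segments yield an alias over a struct-field extraction.
  def resolve(ref: Seq[String]): NamedExpression = ref match {
    case Seq(col)           => AttributeReference(col)
    case Seq(parent, field) => Alias(GetStructField(AttributeReference(parent), field), field)
    case other              => throw new IllegalArgumentException(s"unsupported reference: $other")
  }

  // Narrow variant: casts every resolved reference to AttributeReference,
  // which fails for nested references.
  def resolveNarrow(refs: Seq[Seq[String]]): Seq[AttributeReference] =
    refs.map(r => resolve(r).asInstanceOf[AttributeReference])

  // Widened variant: accepts any NamedExpression, then flattens via toAttribute.
  def resolveWide(refs: Seq[Seq[String]]): Seq[AttributeReference] =
    refs.map(r => resolve(r).toAttribute)

  def main(args: Array[String]): Unit = {
    val rowId = Seq(Seq("_metadata", "file_path"), Seq("_metadata", "row_index"))

    // The narrow variant throws before anything downstream could run.
    val narrowFailed =
      try { resolveNarrow(rowId); false }
      catch { case _: ClassCastException => true }
    println(s"narrow cast failed: $narrowFailed")

    // The widened variant handles the same nested references.
    println(resolveWide(rowId).map(_.name).mkString(", "))
  }
}
```

In this model a single-segment reference passes through both variants unchanged, which mirrors the claim that flat-column behavior is unaffected; only the multi-segment case distinguishes the two.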