JingsongLi commented on PR #8028:
URL: https://github.com/apache/paimon/pull/8028#issuecomment-4592376016
I took another pass over the latest commit and I still think this should
stay unmerged for now because of two semantic issues in the new Ray merge path:
1. Insert-only merge does not carry the base snapshot into conflict
detection. The update path passes `base_snapshot_id` into
`distributed_update_apply`, and the update messages get `check_from_snapshot`
through `TableUpdateByRowId`. The insert path, however, collects messages from
`distributed_write_collect_msgs` without setting `check_from_snapshot`. Since
`FileStoreCommit` only enables row-id conflict detection when at least one
commit message has `check_from_snapshot != -1`, a `WHEN NOT MATCHED THEN INSERT
*` merge can anti-join against snapshot N, then commit after a concurrent
writer has inserted a matching key, and still append the stale unmatched source
row. Spark's data-evolution merge calls `rowIdCheckConflict(plan.snapshotId())`
for the whole commit; the Ray path should make insert-only and combined commits
participate in the same base-snapshot conflict check.
2. `UPDATE SET *` includes partition columns, but partial row-id writes are
emitted back to the original partition. `_prepare` builds
`settable_field_names` from all non-blob target fields,
`_normalize_set_spec("*")` maps every one of them to `s.<col>`, and then
`TableUpdateByRowId._write_group` writes the partial file using the original
split partition. If a matched source row changes a partition key, this produces
a partial column file under the old partition carrying the new partition value,
rather than moving the row or rejecting the change. That can make partition
pruning / row contents inconsistent. The new tests only cover unpartitioned
tables, so this case is not guarded.
I would fix these before exposing `merge_into` as a public Ray API: set the
base snapshot/conflict check for insert messages too, and either reject
partition-key changes in matched updates or implement a real cross-partition
rewrite, with a partitioned data-evolution merge test covering it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]