[PR] [spark] Merge into supports _ROW_ID shortcut [paimon]

via GitHub Thu, 04 Dec 2025 00:44:40 -0800


wayneli-vt opened a new pull request, #6745:
URL: https://github.com/apache/paimon/pull/6745


   <!-- Please specify the module before the PR name: [core] ... or [flink] ... 
-->
   
   ### Purpose
   <!-- What is the purpose of the change -->
   This PR enhances the `MERGE INTO` command by adding a specialized execution 
path for `_ROW_ID`-based joins.
   
   Currently, a `MERGE INTO` operation finds the `DataSplit`s that need to be 
modified by running a full join between the target and source tables. This PR 
adds a shortcut for the case where the merge condition is a simple equality on 
the target's `_ROW_ID` (e.g., `ON target._ROW_ID = source.col`). 
   
   When the condition has that form, the command scans the source table's `col` 
directly to collect the relevant `_ROW_ID`s and uses them to determine the 
affected splits, which avoids the full join. All other merge conditions still 
use the existing join-based strategy, so their behavior is unchanged.
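
For illustration only, here is a minimal Python sketch of the idea behind the shortcut. It is not the code in this PR. It assumes each split covers one contiguous, non-overlapping `_ROW_ID` range; the `Split` class, its fields, and `affected_splits` are hypothetical names made up for this example.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Hypothetical stand-in for a split covering a contiguous _ROW_ID range."""
    name: str
    first_row_id: int  # inclusive start of the range (assumed layout)
    row_count: int     # number of rows in the range


def affected_splits(splits, source_row_ids):
    """Return the sorted names of splits containing at least one source row ID.

    Only the source column is scanned; no join with the target is performed.
    """
    ordered = sorted(splits, key=lambda s: s.first_row_id)
    starts = [s.first_row_id for s in ordered]
    hit = set()
    for row_id in source_row_ids:
        if row_id is None:
            continue  # a NULL never satisfies an equality condition
        # Find the last split whose range starts at or before row_id.
        i = bisect_right(starts, row_id) - 1
        if i >= 0:
            s = ordered[i]
            if row_id < s.first_row_id + s.row_count:
                hit.add(s.name)
    return sorted(hit)


splits = [
    Split("split-0", 0, 100),
    Split("split-1", 100, 50),
    Split("split-2", 150, 25),
]
# Values of source.col; 999 falls outside every range and matches nothing.
print(affected_splits(splits, [3, 7, 160, None, 999]))
```

In this sketch only `split-0` and `split-2` are selected, so `split-1` would never need to be read or rewritten.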
   
   
   ### Tests
   
   <!-- List UT and IT cases to verify this change -->
   A new test case has been added to `RowTrackingTestBase` to verify this 
change:
   
   * `org.apache.paimon.spark.sql.RowTrackingTestBase#Data Evolution: merge 
into table with data-evolution with _ROW_ID shortcut`
   
   ### API and Format
   
   <!-- Does this change affect API or storage format -->
   No.
   
   ### Documentation
   
   <!-- Does this change introduce a new feature -->
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

