johanl-db opened a new pull request, #38400:
URL: https://github.com/apache/spark/pull/38400

   ### What changes were proposed in this pull request?
   This change adds a third type of WHEN clause to the MERGE INTO command that 
allows updating or deleting rows from the target table that have no match in 
the source table based on the merge condition.
   
   The following example updates all rows from the target table that have a 
match in the source table using the source value. For target rows that have no 
match in the source table, the 'state' column of rows that were created before 
'2022-10-26' is set to 'active', while rows created before that date are 
deleted from the target table.
   ```
   MERGE INTO target
   USING source
   ON target.key = source.key
   WHEN MATCHED THEN UPDATE SET *
   WHEN NOT MATCHED BY SOURCE AND target.create_at > '2022-10-26' THEN UPDATE 
SET target.status = 'active'
   WHEN NOT MATCHED BY SOURCE THEN DELETE 
   ```
   
   In addition, the existing WHEN NOT MATCHED clause can now also include an 
optional BY TARGET qualifier that has no effect on its semantic other than 
allowing a more consistent use of the clauses together:
   ```
   WHEN NOT MATCHED BY TARGET THEN INSERT * 
   ```
   is equivalent to:
   ```
   WHEN NOT MATCHED THEN INSERT *
   ```
   The updated SQL syntax for the MERGE INTO command is described more 
precisely in the user-facing change section below.
   
   The changes proposed in this pull request are two-fold:
   1. Update SQL parsing to handle the new clause introduced. This results in a 
new field `notMatchedBySourceActions` being populated in the logical plan node 
of the MERGE INTO command `MergeIntoTable`.
   2. Handle the newly added merge clause during analysis. In particular, 
resolve all attribute references used in WHEN NOT MATCHED BY SOURCE conditions 
and actions. The attributes used in a NOT MATCHED BY SOURCE clause may only 
refer to the target table.
   
   ### Why are the changes needed?
   The new clause opens up uses cases leveraging the merge command to sync a 
target from a source table by conditionally deleting or updating records that 
are not present in the source. As an example, the following command 
incrementally syncs the target table from the source table for the past 5 days:
   ```
   MERGE INTO target
   USING (SELECT `columns`  FROM source  WHERE created_at >= (current_date - 
INTERVAL ‘5’ DAY)  AS tmp_name
   ON FALSE
   WHEN NOT MATCHED BY SOURCE AND (current_date - INTERVAL ‘5’ DAY) THEN DELETE
   WHEN NOT MATCHED BY TARGET THEN INSERT `columns`
   ```
   After running this command, all rows older than 5 days in the target table 
are left unmodified while rows newer than 5 days that are either not already in 
the target table or not in the source table anymore are inserted and deleted, 
respectively.
   
   ### Does this PR introduce _any_ user-facing change?
   Two user-facing changes are introduced in the MERGE INTO syntax:
   - WHEN NOT MATCHED BY SOURCE clause.
   - Optional BY TARGET qualifier for WHEN NOT MATCHED clauses.
   
   The updated Spark SQL syntax is:
   ```
   MERGE INTO target_table_name [target_alias]
      USING source_table_reference [source_alias]
      ON merge_condition
      { WHEN MATCHED [ AND matched_condition ] THEN matched_action |
        WHEN NOT MATCHED [BY TARGET] [ AND not_matched_condition ] THEN 
not_matched_action
        WHEN NOT MATCHED BY SOURCE [ AND not_matched_by_source_condition ] THEN 
not_matched_by_source_action } [...]
   
   matched_action
    { DELETE |
      UPDATE SET * |
      UPDATE SET { column = [ expr | DEFAULT ] } [, ...] }
   
   not_matched_action
    { INSERT * |
      INSERT (column1 [, ...] ) VALUES ( expr | DEFAULT ] [, ...] )
   
   not_matched_by_source_action
    { DELETE |
      UPDATE SET { column = [ expr | DEFAULT ] } [, ...] }
   ```
   This syntax replicates the semantics used by other vendors, see:
   - (Microsoft T-SQL 
Merge)[https://learn.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql]
   - (Google BigQuery 
Merge)[https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_statement]
   
   
   ### How was this patch tested?
   Tests are extended or added to cover the following aspect of the change:
   - Parsing (in DDLParserSuite.scala):
     - Existing tests are extended to cover parsing of WHEN NOT MATCHED BY 
SOURCE clauses in a range of cases. This covers parsing the clause with 
optional conditions and a variety of UPDATE and DELETE actions.
     - New tests are added to cover NOT MATCHED BY TARGET and invalid action 
UPDATE SET * for WHEN NOT MATCHED BY SOURCE.
   - Analysis (in PlanResolutionSuite.scala):
     - Existing tests are extended to also cover attribute reference resolution 
for WHEN NOT MATCHED BY SOURCE conditions and actions together with other type 
of clauses.
     - New tests are added to cover reference resolution specific to WHEN NOT 
MATCHED BY SOURCE clauses:
       - Unqualified reference to a column present both in the target and 
source table is not ambiguous in WHEN NOT MATCHED BY SOURCE conditions or 
actions since it can only refer to the target table.
       - Reference to columns in the source table are invalid in WHEN NOT 
MATCHED BY SOURCE.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to