Carter Shanklin created HIVE-16258:
--------------------------------------

             Summary: Suggesting a non-standard extension to MERGE
                 Key: HIVE-16258
                 URL: https://issues.apache.org/jira/browse/HIVE-16258
             Project: Hive
          Issue Type: Improvement
            Reporter: Carter Shanklin


Some common data maintenance strategies, especially the Type 2 SCD update, 
would become substantially easier with a small extension to the SQL standard 
for MERGE, specifically the ability to say "when matched then insert". Per the 
standard, matched records can only be updated or deleted.

In the Type 2 SCD, when a new record comes in you update the old version of the 
record and insert the new version of the same record. If this extension were 
supported, sample Type 2 SCD code would look as follows:

{code}
merge into customer
using new_customer_stage stage
on stage.source_pk = customer.source_pk
when not matched then insert values    /* Insert a net new record */
  (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, 
true, null)
when matched then update set           /* Update an old record to mark it as 
out-of-date */
  is_current = false, end_date = current_date()
when matched then insert values        /* Insert a new current record */
  (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, 
true, null);
{code}

Without this support, the user needs to devise some sort of workaround. A 
common approach is to first left join the staging table against the table to be 
updated, then to join these results to a helper table that will spit out two 
records for each match and one record for each miss. One of the matching 
records needs to have a join key that can never occur in the source data so 
this requires precise knowledge of the source dataset.

An example of this:
{code}
merge into customer
using (
  select
    *,
    coalesce(invalid_key, source_pk) as join_key
  from (
    select
      stage.source_pk, stage.name, stage.state,
      case when customer.source_pk is null then 1
      when stage.name <> customer.name or stage.state <> customer.state then 2
      else 0 end as scd_row_type
    from
      new_customer_stage stage
    left join
      customer
    on (stage.source_pk = customer.source_pk and customer.is_current = true)
  ) updates
  join scd_types on scd_types.type = scd_row_type
) sub
on sub.join_key = customer.source_pk
when matched then update set
  is_current = false,
  end_date = current_date()
when not matched then insert values
  (sub.source_pk, upper(substr(sub.name, 0, 3)), sub.name, sub.state, true, 
null);
select * from customer order by source_pk;
{code}

This code is very complicated and will fail if the "invalid" key ever shows up 
in the source dataset. This simple extension provides a lot of value and likely 
very little maintenance overhead.

/cc [~ekoifman]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to