[
https://issues.apache.org/jira/browse/HIVE-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Carter Shanklin updated HIVE-16258:
-----------------------------------
Summary: Suggestion: simplify type 2 SCDs with this non-standard extension
to MERGE (was: Suggesting a non-standard extension to MERGE)
> Suggestion: simplify type 2 SCDs with this non-standard extension to MERGE
> --------------------------------------------------------------------------
>
> Key: HIVE-16258
> URL: https://issues.apache.org/jira/browse/HIVE-16258
> Project: Hive
> Issue Type: Improvement
> Components: Transactions
> Affects Versions: 2.2.0
> Reporter: Carter Shanklin
>
> Some common data maintenance strategies, especially the Type 2 SCD update,
> would become substantially easier with a small extension to the SQL standard
> for MERGE, specifically the ability to say "when matched then insert". Per
> the standard, matched records can only be updated or deleted.
> In the Type 2 SCD, when a new record comes in you update the old version of
> the record and insert the new version of the same record. If this extension
> were supported, sample Type 2 SCD code would look as follows:
> {code}
> merge into customer
> using new_customer_stage stage
> on stage.source_pk = customer.source_pk
> when not matched then insert values /* Insert a net new record */
> (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state,
> true, null)
> when matched then update set /* Update an old record to mark it as
> out-of-date */
> is_current = false, end_date = current_date()
> when matched then insert values /* Insert a new current record */
> (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state,
> true, null);
> {code}
> Without this support, the user needs to devise some sort of workaround. A
> common approach is to first left join the staging table against the table to
> be updated, then to join these results to a helper table that will spit out
> two records for each match and one record for each miss. One of the matching
> records needs to have a join key that can never occur in the source data so
> this requires precise knowledge of the source dataset.
> An example of this:
> {code}
> merge into customer
> using (
> select
> *,
> coalesce(invalid_key, source_pk) as join_key
> from (
> select
> stage.source_pk, stage.name, stage.state,
> case when customer.source_pk is null then 1
> when stage.name <> customer.name or stage.state <> customer.state then 2
> else 0 end as scd_row_type
> from
> new_customer_stage stage
> left join
> customer
> on (stage.source_pk = customer.source_pk and customer.is_current = true)
> ) updates
> join scd_types on scd_types.type = scd_row_type
> ) sub
> on sub.join_key = customer.source_pk
> when matched then update set
> is_current = false,
> end_date = current_date()
> when not matched then insert values
> (sub.source_pk, upper(substr(sub.name, 0, 3)), sub.name, sub.state, true,
> null);
> select * from customer order by source_pk;
> {code}
> This code is very complicated and will fail if the "invalid" key ever shows
> up in the source dataset. This simple extension provides a lot of value and
> likely very little maintenance overhead.
> /cc [~ekoifman]
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)