[
https://issues.apache.org/jira/browse/HUDI-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685757#comment-17685757
]
sivabalan narayanan commented on HUDI-5730:
-------------------------------------------
For deltastreamer transformer, here is how one can achieve this.
Pseudo code for the transformer:
// Input batch could have mix of inserts and updates. We need to split this
into 3 categories, namely, insert only records, records which are getting
updated and will be active, records that are overridden will belong to
inactive batch.
1. Read all records from hudi
2. Find inserts by doing left-anti join w/ target hudi table. (input df). lets
call this insertDf
3. Join (inner join) inputDf w/ target Df to find the set of records that are
being updated. Split this into two dataframes. a: to be updated df (end time is
inifinite/max) and activeIndex = 1. b: to be overwritten df (endtime = new
start time -1 and active index = 0).
4. Once we have all 3 dfs, we need to union them.
Notes:
* Record key should be complex and should have "active_index" as part of the
record key fields
I have a runbook which achieves SCD2.
[https://gist.github.com/nsivabalan/e0153f07f972d02e111761243190e7d1]
this leverages our quick start guide as an example. Feel free to fix the
schema, partitioning columns, etc as you see fit.
> Support for SCD2 with hudi
> --------------------------
>
> Key: HUDI-5730
> URL: https://issues.apache.org/jira/browse/HUDI-5730
> Project: Apache Hudi
> Issue Type: Improvement
> Components: writer-core
> Reporter: sivabalan narayanan
> Priority: Major
>
> Sometimes users have requirements to support SCD2 with hudi. Let's see if we
> need a custom payload or if we can write a transformer in deltastreamer to
> achieve this.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)