[ 
https://issues.apache.org/jira/browse/HUDI-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685757#comment-17685757
 ] 

sivabalan narayanan commented on HUDI-5730:
-------------------------------------------

Here is how one can achieve this with a deltastreamer transformer. 

Pseudo code for the transformer: 

// The input batch can contain a mix of inserts and updates. We need to split 
it into 3 categories: insert-only records, records that are being updated and 
will stay active, and the overridden versions of those records, which go to 
the inactive batch. 

1. Read all records from the target Hudi table (target df).

2. Find the inserts by doing a left-anti join of the input df (inputDf) with 
the target df. Let's call this insertDf.

3. Inner join inputDf with the target df to find the set of records that are 
being updated. Split this into two dataframes. a: the to-be-updated df (end 
time is infinite/max and active_index = 1). b: the to-be-overwritten df (end 
time = new start time - 1 and active_index = 0). 

4. Once we have all 3 dfs, we need to union them. 
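
The four steps above can be sketched in plain Python over lists of dicts. This is only an illustration of the split logic, not Spark or deltastreamer code. The function name scd2_transform, the column names (id, start_time, end_time), the MAX_END_TIME sentinel and the use of integer timestamps are all assumptions made for the example. It also assumes each key appears at most once per input batch and that only the currently active target row is matched.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
MAX_END_TIME = 99991231  # hypothetical sentinel for an "infinite" end time


def scd2_transform(input_rows: List[Row], target_rows: List[Row],
                   key: str = "id") -> List[Row]:
    # Step 1: rows read from the target table. Assumption: only the
    # currently active version of each key is matched against the input.
    active_target = {r[key]: r for r in target_rows if r["active_index"] == 1}

    # Step 2: left-anti join, i.e. input rows whose key is not in the target.
    insert_rows = [
        {**r, "end_time": MAX_END_TIME, "active_index": 1}
        for r in input_rows
        if r[key] not in active_target
    ]

    # Step 3: inner join, i.e. input rows whose key already exists.
    matched = [r for r in input_rows if r[key] in active_target]
    # 3a: the new versions stay open-ended and active.
    updated_rows = [
        {**r, "end_time": MAX_END_TIME, "active_index": 1} for r in matched
    ]
    # 3b: the old versions are closed just before the new start time.
    overwritten_rows = [
        {**active_target[r[key]],
         "end_time": r["start_time"] - 1,
         "active_index": 0}
        for r in matched
    ]

    # Step 4: union of the three groups.
    return insert_rows + updated_rows + overwritten_rows
```

In a real transformer the same shape would be expressed with dataframe joins and a union; the runbook linked below is the reference for that.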

 

Notes:
 * The record key should be complex (multi-field) and should have 
"active_index" as one of the record key fields
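
As a rough illustration of the note above, the write options might look like the following. The business key column "id" is hypothetical, and the option names should be checked against the Hudi version in use.

```python
# Hypothetical write options; "id" stands in for the real business key column.
hudi_key_options = {
    # Both the business key and active_index form the record key.
    "hoodie.datasource.write.recordkey.field": "id,active_index",
    # A complex key generator is needed for a multi-field record key.
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
}

record_key_fields = hudi_key_options[
    "hoodie.datasource.write.recordkey.field"
].split(",")
```

With active_index in the key, the active and inactive versions of the same "id" map to different record keys, so writing one does not overwrite the other.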

 

I have a runbook which achieves SCD2. 
[https://gist.github.com/nsivabalan/e0153f07f972d02e111761243190e7d1] 

This leverages our quick start guide as an example. Feel free to adjust the 
schema, partitioning columns, etc. as you see fit. 

 

> Support for SCD2 with hudi
> --------------------------
>
>                 Key: HUDI-5730
>                 URL: https://issues.apache.org/jira/browse/HUDI-5730
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: writer-core
>            Reporter: sivabalan narayanan
>            Priority: Major
>
> Sometimes users have requirements to support SCD2 with hudi. Let's see if we 
> need a custom payload or if we can write a transformer in deltastreamer to 
> achieve this. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
