hudi-bot opened a new issue, #15769:
URL: https://github.com/apache/hudi/issues/15769

   Users sometimes need to support SCD2 (slowly changing dimension type 2) 
with Hudi. Let's see whether this needs a custom payload or whether a 
transformer in deltastreamer is enough. 
   
    
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-5730
   - Type: Improvement
   
   
   ---
   
   
   ## Comments
   
   08/Feb/23 08:36;shivnarayan;Here is how one can achieve this with a 
deltastreamer transformer. 
   
   Pseudo code for the transformer: 
   
   // The input batch can contain a mix of inserts and updates. We need to 
split it into 3 categories: insert-only records, records that are being 
updated and will become active, and existing records that are overridden and 
will move to the inactive batch. 
   
   1. Read all records from the target Hudi table (targetDf).
   
   2. Find inserts by doing a left-anti join of the input df (inputDf) against 
the target Hudi table. Let's call this insertDf.
   
   3. Inner join inputDf with targetDf to find the set of records that are 
being updated. Split this into two dataframes: (a) the to-be-updated df, with 
end time = infinite/max and active_index = 1; (b) the to-be-overwritten df, 
with end time = new start time - 1 and active_index = 0. 
   
   4. Once we have all 3 dfs, union them and return the result from the 
transformer. 
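
   The four steps above can be sketched in plain Python, with lists of dicts 
standing in for dataframes. This is only an illustration of the split logic, 
not deltastreamer or Spark code. The field names (key, start_time, end_time, 
active_index), the END_OF_TIME sentinel and the function name are assumptions 
made for the example, as is the choice to match only against currently active 
target rows.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Assumed sentinel for an "infinite/max" end time.
END_OF_TIME = 2**63 - 1


def scd2_transform(input_rows: List[Row], target_rows: List[Row]) -> List[Row]:
    """Split an input batch into inserts, new active versions and closed-out old versions."""
    # Step 1: "read" the target and index its active rows by business key.
    # Matching only active rows is an assumption of this sketch.
    active_by_key = {r["key"]: r for r in target_rows if r["active_index"] == 1}

    inserts: List[Row] = []
    to_update: List[Row] = []
    to_overwrite: List[Row] = []

    for row in input_rows:
        new_row = {**row, "end_time": END_OF_TIME, "active_index": 1}
        current = active_by_key.get(row["key"])
        if current is None:
            # Step 2: left-anti join, the key is absent from the target.
            inserts.append(new_row)
        else:
            # Step 3a: the incoming version becomes the active row.
            to_update.append(new_row)
            # Step 3b: the existing version ends just before the new start time.
            to_overwrite.append(
                {**current, "end_time": row["start_time"] - 1, "active_index": 0}
            )

    # Step 4: the union of the three sets is what the transformer returns.
    return inserts + to_update + to_overwrite


target = [
    {"key": "a", "value": "old", "start_time": 100,
     "end_time": END_OF_TIME, "active_index": 1},
]
batch = [
    {"key": "a", "value": "new", "start_time": 200},
    {"key": "b", "value": "first", "start_time": 200},
]

for out_row in scd2_transform(batch, target):
    print(out_row)
```

   In this example, key "b" comes back as an insert, key "a" comes back twice: 
once as the new active version and once as the old version with end_time 199 
and active_index 0.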
   
   Notes:
    * The record key should be complex and should include "active_index" as 
one of the record key fields. Otherwise, older versions of records could be 
deleted. 
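
   A toy illustration of why the key needs "active_index": the upsert below 
is a hypothetical last-writer-wins stand-in, not Hudi's actual merge logic, 
and the "uuid" field name is an assumption borrowed from the quick start 
schema. In Hudi itself this would roughly correspond to listing both fields 
in hoodie.datasource.write.recordkey.field together with a complex key 
generator.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def upsert(table: Dict[Tuple, Row], rows: List[Row],
           key_fields: Tuple[str, ...]) -> Dict[Tuple, Row]:
    """Toy key-based upsert: a later row with the same key replaces the earlier one."""
    result = dict(table)
    for row in rows:
        result[tuple(row[f] for f in key_fields)] = row
    return result


# Output of the transformer for one updated record: both versions share "uuid".
scd2_batch = [
    {"uuid": "a", "value": "new", "active_index": 1},  # new active version
    {"uuid": "a", "value": "old", "active_index": 0},  # closed-out old version
]

simple_key = upsert({}, scd2_batch, ("uuid",))
complex_key = upsert({}, scd2_batch, ("uuid", "active_index"))

print(len(simple_key), len(complex_key))  # 1 2
```

   With "uuid" alone, the two versions collide and only one survives. With 
("uuid", "active_index"), both rows are kept.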
   
    
   
   I have a runbook that achieves SCD2: 
[https://gist.github.com/nsivabalan/e0153f07f972d02e111761243190e7d1] 
   
   It uses our quick start guide as an example. Feel free to adjust the 
schema, partitioning columns, etc. as you see fit. 
   
    ;;;


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

