hudi-bot opened a new issue, #15769: URL: https://github.com/apache/hudi/issues/15769
Sometimes users have requirements to support SCD2 with hudi. Let's see if we need a custom payload or if we can write a transformer in deltastreamer to achieve this. ## JIRA info - Link: https://issues.apache.org/jira/browse/HUDI-5730 - Type: Improvement --- ## Comments 08/Feb/23 08:36;shivnarayan;For deltastreamer transformer, here is how one can achieve this. Pseudo code for the transformer: // Input batch could have mix of inserts and updates. We need to split this into 3 categories, namely, insert only records, records which are getting updated and will be active, records that are overridden will belong to inactive batch. 1. Read all records from hudi 2. Find inserts by doing left-anti join w/ target hudi table. (input df). lets call this insertDf 3. Join (inner join) inputDf w/ target Df to find the set of records that are being updated. Split this into two dataframes. a: to be updated df (end time is inifinite/max) and activeIndex = 1. b: to be overwritten df (endtime = new start time -1 and active index = 0). 4. Once we have all 3 dfs, we need to union them and then return from the transformer. Notes: * Record key should be complex and should have "active_index" as part of the record key fields. If not, older version of records could be deleted. I have a runbook which achieves SCD2. [https://gist.github.com/nsivabalan/e0153f07f972d02e111761243190e7d1] this leverages our quick start guide as an example. Feel free to fix the schema, partitioning columns, etc as you see fit. ;;; -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
