Reposting a discussion from Slack as an FYI.

"Jaimin [3:10 AM]
Hi
We have a use case where we have a set of MOR base tables and flattened
entities based on them. For example, we have order, customer, and seller
tables and a flattened entity based on joining these 3 tables.
To create a flattened entity, I think we need to fetch changes from each of
these tables incrementally (incremental pull) and join them with the rest of
the complete tables. So there will be n joins (equal to the number of
tables involved in the flattened entity). Is there a more efficient way to
do this? Also, for the join, will we need our own Spark job, or does Hudi
provide this capability as well?
Our data can also have deletes; I am using the empty payload implementation
to delete data. I tried this out with sample data: deleting data from a base
table, compacting, and then using incremental pull to fetch changes, but I
didn't see the deletes as part of the incremental pull. Am I missing something?
Thanks "

and my response

"
You can pull the 3 tables and join them in a custom Spark job; that should be
fine. (Yes, you need your own Spark job. The DeltaStreamer tool supports
transforms, but limits itself to 1 table pulled incrementally.) What
Nishith is alluding to is being able to "safely" align windows between
the 3 tables, which needs more business context to determine. For example,
if you are joining the 3 tables based on order_id, then you need to be sure
that the order shows up in the customer/seller/order tables in the same time
range you are pulling for.

@Jaimin This is such an interesting topic. I will start a thread on the
mailing list. Please join and we can continue there, so others can also
jump in: https://hudi.apache.org/community.html
"
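
For anyone following along, here is a rough sketch of the join pattern
discussed above: take the changed keys from each of the 3 tables and
recompute the flattened rows against full snapshots of the other tables.
It uses plain Python dicts as stand-ins for the tables, not Spark or any
Hudi API; the function name, the column names (customer_id, seller_id,
name, amount), and the sample data are all made up for illustration. It
only produces upserts and does not try to handle deletes.

```python
from typing import Dict, Set

Row = Dict[str, object]
Table = Dict[int, Row]  # primary key -> row; stand-in for a table snapshot


def incremental_flatten(
    orders: Table,
    customers: Table,
    sellers: Table,
    changed_order_ids: Set[int],
    changed_customer_ids: Set[int],
    changed_seller_ids: Set[int],
) -> Table:
    """Return flattened rows to upsert, keyed by order_id.

    The changed_* sets stand in for the keys returned by an incremental
    pull on each table; the Table arguments are full snapshots.
    """
    # Orders that changed directly and still exist in the snapshot.
    affected = {oid for oid in changed_order_ids if oid in orders}

    # Orders that reference a changed customer or a changed seller.
    for oid, order in orders.items():
        if (order["customer_id"] in changed_customer_ids
                or order["seller_id"] in changed_seller_ids):
            affected.add(oid)

    upserts: Table = {}
    for oid in affected:
        order = orders[oid]
        customer = customers.get(order["customer_id"])
        seller = sellers.get(order["seller_id"])
        if customer is None or seller is None:
            # Inner-join semantics: skip orders whose customer or seller
            # is not present in the snapshot yet.
            continue
        upserts[oid] = {
            "order_id": oid,
            "customer_name": customer["name"],
            "seller_name": seller["name"],
            "amount": order["amount"],
        }
    return upserts


# Hypothetical sample data.
orders = {
    1: {"customer_id": 10, "seller_id": 100, "amount": 25.0},
    2: {"customer_id": 11, "seller_id": 100, "amount": 40.0},
    3: {"customer_id": 12, "seller_id": 100, "amount": 5.0},  # customer 12 missing
}
customers = {10: {"name": "Ann"}, 11: {"name": "Bob"}}
sellers = {100: {"name": "Acme"}}

# Order 2 changed directly; customer 10 changed, which touches order 1.
upserts = incremental_flatten(orders, customers, sellers, {2}, {10}, set())
```

In a real Spark job each of these lookups would be a DataFrame join, and
the skipped order 3 is the window-alignment concern in miniature: the
order was pulled, but its customer row had not shown up in the range
being joined.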
