Interesting.. you captured the pitfalls I was alluding to nicely. IIUC you are doing multiple incremental pulls, each joined against the full tables, to reconcile. It should work.
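To make the pattern concrete, here is a minimal plain-Python sketch of what is being discussed: each base table's incremental pull (rows changed since the last commit) is joined against full snapshots of the other tables, standing in for Hudi's read-optimized views. All table names, field names, and the `join_increment` helper are hypothetical; dicts stand in for Hudi/Spark, not the real APIs.

```python
# Sketch only: dicts stand in for Hudi tables; names are hypothetical.

def join_increment(increment, snapshots, key_fields):
    """Join a batch of changed rows against full snapshots of other tables."""
    flattened = []
    for row in increment:
        out = dict(row)
        for table, snapshot in snapshots.items():
            other = snapshot.get(row.get(key_fields[table]))
            if other is None:
                # The matching row isn't visible yet; that table's own
                # incremental pull in a later batch will produce the join.
                continue
            out.update({f"{table}_{k}": v for k, v in other.items()})
        flattened.append(out)
    return flattened

# Full snapshots keyed by primary key (stand-ins for read-optimized views).
customers = {"c1": {"name": "Ada"}}
sellers = {"s1": {"name": "Acme"}}

# Incremental pull of the orders table: rows changed since the last commit.
order_changes = [{"order_id": "o1", "customer_id": "c1", "seller_id": "s1"}]

rows = join_increment(
    order_changes,
    snapshots={"customer": customers, "seller": sellers},
    key_fields={"customer": "customer_id", "seller": "seller_id"},
)
```

The same `join_increment` call is then repeated with seller (and then customer) as the incrementally pulled side, which is where the "n joins, one per table" observation in the thread comes from.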
On Tue, May 7, 2019 at 12:06 AM Jaimin Shah <[email protected]> wrote:

> Hi
>
> Thanks for the quick response.
> As we discussed, we will pull changes incrementally and join with the MOR
> read-optimized view. For example, orders will be pulled incrementally and
> joined with the read-optimized views of seller and customer. Seller will
> be pulled incrementally and joined with order and customer, and the same
> process applies to customer as well.
>
> Regarding "safely" aligning windows, I don't think we need to bother with
> this, as the data will be corrected while processing a subsequent batch.
> For example, if an insert into seller is not yet reflected and the order
> arrives first, the seller data will be missed in the first batch; but in
> the next batch the insert into seller will be processed and joined with
> customer and orders, so it will be handled. We are fine with eventual
> consistency of the data. Please correct me if I am missing some points.
>
> On Mon, 6 May 2019 at 23:55, Vinoth Chandar <[email protected]> wrote:
>
> > Reposting a discussion from Slack, FYI.
> >
> > "Jaimin [3:10 AM]
> > Hi
> > We have a use case with a set of MOR base tables and flattened entities
> > based on them. For example, we have order, customer, and seller tables
> > and a flattened entity formed by joining these 3 tables.
> > To create the flattened entity, I think we need to fetch changes from
> > each of these tables incrementally (incremental pull) and join them with
> > the rest of the complete tables. So there will be n joins (equal to the
> > number of tables involved in the flattened entity). Is there a more
> > efficient way to do this? Also, for the join, will we need our own Spark
> > job, or does Hudi provide these capabilities as well?
> > Our data can also have deletes; I am using the empty payload
> > implementation to delete data. I tried this with sample data: deleting
> > data from the base table, compacting, and then using incremental pull to
> > fetch changes, but I didn't see the deletes as part of the incremental
> > pull.
> > Am I missing something?
> > Thanks"
> >
> > and my response:
> >
> > "
> > You can pull the 3 tables and join them in a custom Spark job; that
> > should be fine. (Yes, you need your own Spark job. The DeltaStreamer
> > tool supports transforms, but limits itself to 1 table pulled
> > incrementally.) What Nishith is alluding to is being able to "safely"
> > align the windows between the 3 tables, which needs more business
> > context to determine. For example, if you are joining the 3 tables on
> > order_id, then you need to be sure that the order shows up in the
> > customer/seller/order tables within the same time range you are pulling
> > for..
> >
> > @Jaimin This is such an interesting topic.. I will start a thread on the
> > mailing list. Please join and we can continue there, so others can also
> > jump in.. https://hudi.apache.org/community.html
> > "
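The "eventual consistency" argument in the reply above can be sketched in a few lines of plain Python: an order referencing a seller that has not landed yet is missed in batch 1, but the seller's own incremental pull in batch 2 regenerates the joined record. The tables, keys, and `upsert_flattened` helper are hypothetical stand-ins, not Hudi APIs.

```python
# Sketch only: dicts stand in for the Hudi tables; names are hypothetical.

flattened = {}  # order_id -> joined record, upserted batch by batch

def upsert_flattened(order, sellers):
    """Emit the flattened row only if the referenced seller is visible."""
    seller = sellers.get(order["seller_id"])
    if seller is not None:
        flattened[order["order_id"]] = {**order, "seller_name": seller["name"]}

orders = {"o1": {"order_id": "o1", "seller_id": "s9"}}
sellers = {}  # the insert into seller has not landed yet

# Batch 1: incremental pull of orders; the join finds no seller -> row missed.
for order in orders.values():
    upsert_flattened(order, sellers)
assert "o1" not in flattened  # missed in the first batch, as described

# Batch 2: seller s9 arrives; its incremental pull is joined back to orders,
# producing the flattened record that batch 1 could not.
sellers["s9"] = {"name": "Acme"}
for seller_id in ["s9"]:  # sellers changed in this batch
    for order in orders.values():
        if order["seller_id"] == seller_id:
            upsert_flattened(order, sellers)
```

This is why, as long as every table is pulled incrementally and upserts into the flattened entity, skipping strict window alignment only delays a joined record rather than losing it.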
