Interesting.. you've captured the pitfalls I was alluding to nicely.
IIUC you are doing multiple incremental pulls, each joined against the full
tables, to reconcile. That should work.
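
To make the reconciliation concrete, here is a toy, plain-Python sketch of why
a row missed in one batch gets picked up in the next. To be clear, none of
this is Hudi or Spark API; the table layout, column names, and the
`join_flat` helper are made-up illustrations of the approach described in
this thread.

```python
# Toy simulation (plain Python, not Hudi/Spark APIs) of pulling one
# table incrementally and joining it with the full snapshot ("read
# optimized view") of the other. Table/column names are illustrative.

# Full snapshots of each table after each commit batch.
orders = {}   # order_id -> {"order_id", "seller_id", "amount"}
sellers = {}  # seller_id -> {"seller_id", "name"}

def join_flat(changed_orders, changed_sellers):
    """Join a batch of changed rows against the *full* snapshot of the
    other table, mirroring: incremental pull of one table joined with
    the read-optimized view of the rest."""
    flat = {}
    # Changed orders joined with the full seller snapshot.
    for o in changed_orders:
        s = sellers.get(o["seller_id"])
        if s:
            flat[o["order_id"]] = {**o, "seller_name": s["name"]}
    # Changed sellers joined with the full order snapshot.
    for s in changed_sellers:
        for o in orders.values():
            if o["seller_id"] == s["seller_id"]:
                flat[o["order_id"]] = {**o, "seller_name": s["name"]}
    return flat

flattened = {}

# Batch 1: the order arrives before its seller row -> the join misses it.
o1 = {"order_id": 1, "seller_id": 10, "amount": 99}
orders[1] = o1
flattened.update(join_flat([o1], []))
assert 1 not in flattened  # missed in the first batch, as predicted

# Batch 2: the seller insert is pulled incrementally and joined with the
# full order table -> the row is reconciled (eventual consistency).
s1 = {"seller_id": 10, "name": "acme"}
sellers[10] = s1
flattened.update(join_flat([], [s1]))
assert flattened[1]["seller_name"] == "acme"
```

So the flattened entity converges once every participating table's change has
been pulled and joined at least once.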

On Tue, May 7, 2019 at 12:06 AM Jaimin Shah <[email protected]>
wrote:

> Hi
>
> Thanks for the quick response.
> As we discussed, we will pull changes incrementally and join with the MOR
> read-optimized view. For example, order will be pulled incrementally and
> joined with the read-optimized views of seller and customer. We will then
> incrementally pull seller and join it with order and customer, and apply
> the same process for customer as well.
>
> Regarding "safely" aligning windows, I don't think we need to bother with
> this, as the data will be corrected while processing a subsequent batch.
> For example, if an insert into seller is not yet reflected and the order
> arrives first, the seller's data will be missed in the first batch; but in
> the next batch the seller insert will be processed and joined with
> customer and orders, so it will be handled. We are fine with eventual
> consistency of the data. Please correct me if I am missing some points.
>
> On Mon, 6 May 2019 at 23:55, Vinoth Chandar <[email protected]> wrote:
>
> > Reposting a discussion on slack as FYI.
> >
> > "Jaimin [3:10 AM]
> > Hi
> > We have a use-case where we have a set of MOR base tables and flattened
> > entities based on them. For example, we have order, customer, and seller
> > tables, and a flattened entity based on joining these 3 tables.
> > To create a flattened entity, I think we need to fetch changes from each
> > of these tables incrementally (incremental pull) and join with the rest
> > of the complete tables. So there will be n joins (equal to the number of
> > tables involved in the flattened entity). Is there a more efficient way
> > to do this? Also, for the join, will we need our own Spark job, or does
> > Hudi provide these capabilities as well?
> > Our data can have deletes too; I am using the empty payload
> > implementation to delete data. I tried this out with sample data:
> > deleting data from a base table, compacting, and then using incremental
> > pull to fetch the changes, but I didn't see the deletes as part of the
> > incremental pull. Am I missing something?
> > Thanks "
> >
> > and my response
> >
> > "
> > you can pull the 3 tables and join them in a custom Spark job; that
> > should be fine. (Yes, you need your own Spark job. The DeltaStreamer
> > tool supports transforms, but limits itself to 1 table pulled
> > incrementally.) What Nishith is alluding to is being able to "safely"
> > align windows between the 3 tables, which needs more business context to
> > determine. For example, if you are joining the 3 tables on order_id,
> > then you need to be sure that the order shows up in the
> > customer/seller/order tables within the same time range you are pulling
> > for.
> >
> > @Jaimin This is such an interesting topic. I will start a thread on the
> > mailing list. Please join and we can continue there, so others can also
> > jump in: https://hudi.apache.org/community.html
> > "
> >
>
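
For anyone following along: the "safely" aligning windows point in the quoted
discussion can be sketched in a few lines. This is a toy illustration, not a
Hudi API; the `safe_pull_window` helper and the commit-time string format are
assumptions. The idea is to cap every table's incremental pull at the
minimum of the tables' latest commit times, so a row cannot fall inside one
table's window while its join partners' windows stop short of it.

```python
# Illustrative sketch (assumed helper, not a Hudi API) of aligning
# incremental-pull windows across several tables.
def safe_pull_window(begin_time, latest_commit_times):
    """begin_time: last checkpoint (exclusive lower bound).
    latest_commit_times: latest completed commit time per table,
    as fixed-width timestamp strings (lexicographically ordered).
    Returns the (begin, end) window to use for every table's pull,
    or None if no table has safely advanced past the checkpoint."""
    end_time = min(latest_commit_times.values())
    if end_time <= begin_time:
        return None  # nothing that is safe to pull yet
    return (begin_time, end_time)

window = safe_pull_window(
    "20190506120000",
    {"order": "20190507090000",
     "seller": "20190507060000",
     "customer": "20190507080000"},
)
# The seller table lags, so all three pulls stop at its commit time.
assert window == ("20190506120000", "20190507060000")
```

As Jaimin notes above, if the pipeline is happy with eventual consistency,
this alignment is optional: unaligned windows only delay a joined row until a
later batch rather than losing it.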
