Thanks Sivabalan. Exactly, that's what I meant. I can think of a use case for option 2: a Hudi dataset that manages people's info and is partitioned by birthday. In most cases where a person's info is updated, the birthday does not change (that's why we chose it as the partition field). But in some edge cases the birthday was entered wrongly, and we want to fix it manually or allow users to update it occasionally. In those cases, option 2 would help keep records in the expected partition, so that a query like "show me people who were born after 2000" would still work.
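
To make the difference between the two options concrete, here is a toy sketch in plain Python. It is not Hudi code: the `Record` class, the `plan_upsert` function, the `migrate_record_partition` flag, and the dict standing in for a global index lookup are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Record:
    key: str        # record key, e.g. a person id
    partition: str  # partition path, e.g. the birthday
    payload: str


def plan_upsert(
    index: Dict[str, str],
    incoming: Record,
    migrate_record_partition: bool,
) -> List[Tuple[str, str, str]]:
    """Return the (operation, partition, key) steps for one incoming record.

    `index` maps record key -> partition that currently holds the key; it is
    a stand-in for a global index lookup. The flag is hypothetical.
    """
    current = index.get(incoming.key)
    if current is None:
        # Key not seen before: plain insert into the incoming partition.
        return [("insert", incoming.partition, incoming.key)]
    if current == incoming.partition:
        # Same partition: ordinary update.
        return [("update", current, incoming.key)]
    if migrate_record_partition:
        # Option 2: delete from the old partition, insert into the new one.
        return [
            ("delete", current, incoming.key),
            ("insert", incoming.partition, incoming.key),
        ]
    # Option 1: ignore the new partition and update the record in place.
    return [("update", current, incoming.key)]


index = {"person-1": "1999-12-31"}
fixed = Record("person-1", "2000-01-01", "corrected birthday")
print(plan_upsert(index, fixed, migrate_record_partition=False))
print(plan_upsert(index, fixed, migrate_record_partition=True))
```

With the flag off, the corrected record stays in the 1999-12-31 partition, so the "born after 2000" query would miss it; with the flag on, it ends up in 2000-01-01 and the old copy is deleted.
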
I guess a configuration like "MIGRATE_RECORD_PARTITION=true" could help achieve both options.

On Wed, Dec 18, 2019 at 10:32 AM Sivabalan <[email protected]> wrote:

> Raymond,
> The patch <https://github.com/apache/incubator-hudi/pull/1091> which I
> have put up works differently. If initial record is in Partition1, and
> updates are sent to Partition2, we silently update the record in
> Partition1. Guess you are asking for opposite, i.e. insert in Partition2
> and delete record in Partition1. I am not sure about the usability of this
> in general. Let's ask our experts in our group.
>
> @vinoth, balaji and others:
> Do we support both functionality or just one. If we plan to support both,
> then it might incur api changes. or we could tackle with a config as well.
>
> Here is the use-case.
> - Insert record1 to partition1 with global bloom.
> - Update record1 with partition set to partition2(different partition
>   compared to where the record is present as of now).
>
> Option1:
> Update record1 to Partition1 and do nothing in Partition2.
> - Since with global bloom, the primary key is just the record key and
>   hence partition is ignored.
>
> Option2:
> Insert a new record, record1 to Partition2. and Delete record1 from
> Partition1.
>
> I have already put up a patch for Option1. but looks like Raymond is
> looking for Option2.
>
>
> On Wed, Dec 18, 2019 at 8:48 AM Shiyan Xu <[email protected]>
> wrote:
>
> > Hi Sivabalan,
> >
> > Sorry for the late reply. I now see that GLOBAL_BLOOM allows records to be
> > looked up in different partitions. This is indeed helpful in the situation
> > where the same record key gets updated on its partition path.
> >
> > Now I'm thinking when we "tagLocationBacktoRecords
> > <https://github.com/apache/incubator-hudi/blob/2745b7552f2f2ee7a61d3ea49139ef2af3ffe13f/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java#L112>",
> > we could potentially create a delete operation for the record in the old
> > partition while keeping the incoming insert operation for it in the new
> > partition. This is crucial for avoiding duplicate records (with the same
> > record keys) in the Hudi dataset. Is this some functionality already
> > implemented? I might have missed some part of the logic from the codebase.
> > Please kindly point out if I got any misunderstanding.
> >
> > Thank you.
> >
> > Best,
> > Raymond
> >
> > On Wed, Dec 11, 2019 at 11:16 AM Sivabalan <[email protected]> wrote:
> >
> > > Depends on whether you are using regular BLOOM or GLOBAL_BLOOM. May I know
> > > which one are you talking about?
> > >
> > >
> > > On Wed, Dec 11, 2019 at 9:12 AM Shiyan Xu <[email protected]>
> > > wrote:
> > >
> > > > Hi Hudi devs,
> > > >
> > > > Upon upsert operations, does Hudi detect record's partition path change? As
> > > > for the same record, the partition path field may get updated while the
> > > > record key (the primary id) stays the same, then the insert would result in
> > > > duplicate record (based on record key) in the dataset. Is there any
> > > > relevant logic of this kind of detection and/or clean-up in the codebase?
> > > >
> > > > Best,
> > > > Raymond
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>
