Re: [QUESTION] Handle record partition change

Sivabalan Wed, 18 Dec 2019 10:32:40 -0800

Raymond,
     The patch <https://github.com/apache/incubator-hudi/pull/1091> that I
have put up works differently. If the initial record is in Partition1 and
an update for it arrives with Partition2, we silently update the record in
Partition1. I guess you are asking for the opposite, i.e. insert the record
into Partition2 and delete it from Partition1. I am not sure how useful this
is in general. Let's ask the experts in our group.


@vinoth, balaji and others:
Do we support both behaviors or just one? If we plan to support both,
it might require API changes, or we could handle it with a config instead.

Here is the use-case.
- Insert record1 to partition1 with global bloom.
- Update record1 with its partition set to partition2 (a different partition
from the one where the record currently lives).

Option1:
Update record1 in Partition1 and do nothing in Partition2.
   - With global bloom, the primary key is just the record key, so the
incoming partition is ignored.

Option2:
Insert record1 as a new record into Partition2 and delete record1 from
Partition1.

I have already put up a patch for Option1, but it looks like Raymond is
looking for Option2.
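
To make the two options concrete, here is a minimal sketch in Python. It is
not Hudi code: the dataset is a plain dict of partition -> {record key: value},
global_lookup() only imitates what a global index does (find a key regardless
of partition), and the update_partition_path flag is a hypothetical name for
the config switch discussed above.

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-in for a dataset: partition -> {record_key: value}.
Dataset = Dict[str, Dict[str, str]]


def global_lookup(dataset: Dataset, record_key: str) -> Optional[str]:
    """Imitate a global index: return the partition that currently holds
    record_key, ignoring whatever partition the incoming record carries."""
    for partition, records in dataset.items():
        if record_key in records:
            return partition
    return None


def upsert(dataset: Dataset, record_key: str, partition: str, value: str,
           update_partition_path: bool) -> Dataset:
    """Upsert one record. update_partition_path is an assumed flag:
    False -> Option1 (update in the old partition),
    True  -> Option2 (delete from the old partition, insert into the new)."""
    existing = global_lookup(dataset, record_key)
    if existing is None:
        # Key not seen anywhere: plain insert into the incoming partition.
        dataset.setdefault(partition, {})[record_key] = value
    elif existing == partition or not update_partition_path:
        # Option1: keep the record where it is and ignore the new partition.
        dataset[existing][record_key] = value
    else:
        # Option2: remove the old copy, then insert into the new partition.
        del dataset[existing][record_key]
        dataset.setdefault(partition, {})[record_key] = value
    return dataset


# Option1: the update lands in partition1 even though it was sent to partition2.
option1: Dataset = {}
upsert(option1, "record1", "partition1", "v1", update_partition_path=False)
upsert(option1, "record1", "partition2", "v2", update_partition_path=False)
print(option1)  # {'partition1': {'record1': 'v2'}}

# Option2: the record moves, so each key still exists in exactly one partition.
option2: Dataset = {}
upsert(option2, "record1", "partition1", "v1", update_partition_path=True)
upsert(option2, "record1", "partition2", "v2", update_partition_path=True)
print(option2)  # {'partition1': {}, 'partition2': {'record1': 'v2'}}
```

Either way the record key stays unique across the dataset; the two options
differ only in which partition ends up holding the record.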





On Wed, Dec 18, 2019 at 8:48 AM Shiyan Xu <[email protected]>
wrote:

> Hi Sivabalan,
>
> Sorry for the late reply. I now see that GLOBAL_BLOOM allows records to be
> looked up across partitions. This is indeed helpful in the situation
> where a record keeps the same record key but its partition path changes.
>
> Now I'm thinking that in "tagLocationBacktoRecords"
> <https://github.com/apache/incubator-hudi/blob/2745b7552f2f2ee7a61d3ea49139ef2af3ffe13f/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java#L112>,
> we could potentially create a delete operation for the record in the old
> partition while keeping the incoming insert operation for it in the new
> partition. This is crucial for avoiding duplicate records (with the same
> record key) in the Hudi dataset. Is this functionality already
> implemented? I might have missed some part of the logic in the codebase.
> Please point out anything I have misunderstood.
>
> Thank you.
>
> Best,
> Raymond
>
> On Wed, Dec 11, 2019 at 11:16 AM Sivabalan <[email protected]> wrote:
>
> > Depends on whether you are using regular BLOOM or GLOBAL_BLOOM. May I
> > know which one you are talking about?
> >
> >
> > On Wed, Dec 11, 2019 at 9:12 AM Shiyan Xu <[email protected]>
> > wrote:
> >
> > > Hi Hudi devs,
> > >
> > > Upon upsert operations, does Hudi detect a change in a record's
> > > partition path? For the same record, the partition path field may get
> > > updated while the record key (the primary id) stays the same; the insert
> > > would then result in a duplicate record (based on record key) in the
> > > dataset. Is there any logic for this kind of detection and/or clean-up
> > > in the codebase?
> > >
> > > Best,
> > > Raymond
> > >
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>


-- 
Regards,
-Sivabalan
