Re: [DISCUSS] New RFC to support Lock-free concurrency control on Merge-on-read tables

2022-03-24 Thread Jian Feng
Sure, I'm working on it. I'll add you as a co-author when I create the PR.

On Fri, Mar 25, 2022 at 1:17 AM Vinoth Chandar wrote:

> +1. Love to be a co-author on the RFC, if you are open to it.


-- 
*Jian Feng,冯健*
Shopee | Engineer | Data Infrastructure


Re: [DISCUSS] New RFC to support Lock-free concurrency control on Merge-on-read tables

2022-03-24 Thread Vinoth Chandar
+1. I'd love to be a co-author on the RFC, if you are open to it.

On Mon, Mar 21, 2022 at 12:31 PM 冯健 wrote:

> Hi team,
>
> Optimistic concurrency control (OCC) has some limitations:
>
>    - When conflicts do occur, they may waste massive resources on every
>      attempt (see lakehouse-concurrency-control-are-we-too-optimistic
>      <https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic>).
>
>    - Multiple writers may produce duplicate data when records with the
>      same new record key arrive (see multi-writer-guarantees
>      <https://hudi.apache.org/docs/concurrency_control#multi-writer-guarantees>).
>
> Some background: with OCC, we assume that multiple writers rarely write
> to the same FileID. If a FileID-level conflict occurs, the commit is
> rolled back. However, FileID-level conflict detection cannot guarantee
> the absence of duplicates when two records with the same new record key
> arrive at different writers, because the key-to-bucket mapping is not
> consistent when using the bloom index.
>
> What I plan to do is support lock-free concurrency control with a
> no-duplicates guarantee in Hudi (for Merge-on-read tables only).
>
>    - With a canIndexLogfiles index, multiple writers ingesting data into
>      Merge-on-read tables can only append data to delta logs. This is a
>      lock-free process if we can make sure they don’t write data to the
>      same log file (I plan to create multiple marker files to achieve
>      this). And with the log merge API (the preCombine logic in the
>      Payload class), data in the log files can be read properly.
>
>    - Hudi already has an index type, the Bucket index, that maps keys to
>      buckets in a consistent way, so data duplicates can be eliminated.
>
> Thanks,
> Jian Feng
>


[DISCUSS] New RFC to support Lock-free concurrency control on Merge-on-read tables

2022-03-21 Thread 冯健
Hi team,

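Optimistic concurrency control (OCC) has some limitations:

   - When conflicts do occur, they may waste massive resources on every
     attempt (see lakehouse-concurrency-control-are-we-too-optimistic
     <https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic>).

   - Multiple writers may produce duplicate data when records with the
     same new record key arrive (see multi-writer-guarantees
     <https://hudi.apache.org/docs/concurrency_control#multi-writer-guarantees>).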

Some background: with OCC, we assume that multiple writers rarely write
to the same FileID. If a FileID-level conflict occurs, the commit is
rolled back. However, FileID-level conflict detection cannot guarantee
the absence of duplicates when two records with the same new record key
arrive at different writers, because the key-to-bucket mapping is not
consistent when using the bloom index.
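
To make the duplicate scenario above concrete, here is a minimal Python
sketch. It is not Hudi code: the writer names, the file-ID naming, the
bucket count, and both helper functions are hypothetical. It contrasts a
writer-local placement of a new key (a stand-in for what happens when
the index has no entry for that key yet) with a deterministic hash-based
placement.

```python
import hashlib

NUM_BUCKETS = 4  # assumed fixed bucket count, for illustration only


def writer_local_file_id(writer_id, record_key):
    # Hypothetical stand-in: a key unknown to the index lands in a file
    # that the writer picks on its own.
    return f"{writer_id}-new-file"


def hashed_bucket_id(writer_id, record_key):
    # Deterministic: the location depends only on the key, so every
    # writer computes the same bucket.
    digest = hashlib.md5(record_key.encode("utf-8")).hexdigest()
    return f"bucket-{int(digest, 16) % NUM_BUCKETS}"


def locations(assign, writers, record_key):
    # Set of distinct places where the same key ends up.
    return {assign(w, record_key) for w in writers}


writers = ["writer-a", "writer-b"]
local = locations(writer_local_file_id, writers, "key-42")
hashed = locations(hashed_bucket_id, writers, "key-42")
print(len(local), len(hashed))  # prints "2 1"
```

In the writer-local case the two writers touch different file IDs, so a
FileID-level check sees no conflict even though both files now hold
key-42.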

What I plan to do is support lock-free concurrency control with a
no-duplicates guarantee in Hudi (for Merge-on-read tables only).

   - With a canIndexLogfiles index, multiple writers ingesting data into
     Merge-on-read tables can only append data to delta logs. This is a
     lock-free process if we can make sure they don’t write data to the
     same log file (I plan to create multiple marker files to achieve
     this). And with the log merge API (the preCombine logic in the
     Payload class), data in the log files can be read properly.

   - Hudi already has an index type, the Bucket index, that maps keys to
     buckets in a consistent way, so data duplicates can be eliminated.
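
The two points above can be sketched roughly as follows. This is an
illustrative in-memory model in Python, not Hudi's API: the LogStore
class, the record layout with "key", "ts" and "value" fields, and the
pre_combine helper are all assumptions. Giving each writer its own log
file per bucket stands in for the marker-file idea, and pre_combine
assumes that the record with the larger ordering value wins.

```python
from collections import defaultdict


def pre_combine(current, incoming):
    # Assumed merge rule: keep the record with the larger ordering
    # value "ts"; on a tie, keep the incoming record.
    return current if current["ts"] > incoming["ts"] else incoming


class LogStore:
    """In-memory stand-in for delta log files, keyed by (bucket, writer)."""

    def __init__(self):
        self.log_files = defaultdict(list)

    def append(self, bucket, writer_id, record):
        # Each writer appends only to its own log file for a bucket, so
        # no two writers ever write to the same file.
        self.log_files[(bucket, writer_id)].append(record)

    def read_bucket(self, bucket):
        # Merge on read: combine all log files of the bucket per key.
        merged = {}
        for (file_bucket, _writer), records in self.log_files.items():
            if file_bucket != bucket:
                continue
            for record in records:
                current = merged.get(record["key"])
                if current is None:
                    merged[record["key"]] = record
                else:
                    merged[record["key"]] = pre_combine(current, record)
        return merged


store = LogStore()
store.append("bucket-1", "writer-a", {"key": "k1", "ts": 1, "value": "old"})
store.append("bucket-1", "writer-b", {"key": "k1", "ts": 2, "value": "new"})
result = store.read_bucket("bucket-1")
print(result["k1"]["value"])  # prints "new"
```

Both writers append without coordinating, and the reader still sees a
single record for k1, because the key maps to one bucket and the merge
keeps one version per key.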


Thanks,
Jian Feng