Hi team,

The situation: optimistic concurrency control (OCC) has some limitations:

   - When conflicts do occur, they may waste massive resources during every
     attempt (see lakehouse-concurrency-control-are-we-too-optimistic
     <https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic>).

   - Multiple writers may cause data duplicates when records with the same
     new record key arrive (see multi-writer-guarantees
     <https://hudi.apache.org/docs/concurrency_control#multi-writer-guarantees>).

Some background: with OCC, we assume multiple writers won't write data to
the same FileID most of the time, and if there is a FileID-level conflict,
the commit is rolled back. But FileID-level conflict detection cannot
guarantee no duplicates when two records with the same new record key
arrive via different writers, because with the bloom index the key-to-bucket
mapping is not consistent across writers.
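To make the duplicate scenario above concrete, here is a minimal, hypothetical sketch (the class, method, and file-group names are mine, not Hudi's): with a bloom-style index, a brand-new key is not found in any existing file, so each writer independently routes it to a file group based on its own snapshot of the table, and those choices can differ.

```java
import java.util.List;

public class DuplicateInsertDemo {
    // Sketch only: a new (unseen) key is routed to the first small file
    // group in the writer's own view of the table. Concurrent writers can
    // hold different snapshots, so their choices can differ.
    static String assignNewKey(String recordKey, List<String> writerLocalView) {
        return writerLocalView.get(0);
    }

    public static void main(String[] args) {
        // Writer A's snapshot lists fg-1 first; writer B's lists fg-2 first.
        String fileA = assignNewKey("key-42", List.of("fg-1", "fg-2"));
        String fileB = assignNewKey("key-42", List.of("fg-2", "fg-3"));
        // Different target FileIDs: OCC sees no FileID-level conflict,
        // yet the table now holds two copies of key-42.
        System.out.println(fileA + " vs " + fileB
                + " -> duplicate slips through: " + !fileA.equals(fileB));
    }
}
```

Because the two writes touch different FileIDs, OCC's conflict check passes both commits, which is exactly the duplicate case described above.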

What I plan to do is support lock-free concurrency control with a
no-duplicates guarantee in Hudi (only for Merge-On-Read tables).

   - With a canIndexLogfiles index, multiple writers ingesting data into
     Merge-On-Read tables only append data to delta log files. This is a
     lock-free process as long as we make sure they don't write to the same
     log file (I plan to create multiple marker files to achieve this). And
     with the log merge API (the preCombine logic in the Payload class),
     data in the log files can be read properly.

   - Since Hudi already has an index type, the bucket index, that maps keys
     to buckets in a consistent way, data duplicates can be eliminated.
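The consistent mapping in the second point can be sketched as follows (a minimal illustration in the spirit of a hash-based bucket index; the class and method names are mine, not Hudi's API): every writer hashes the record key the same way, so the same key always lands in the same bucket, regardless of which writer processes it.

```java
public class BucketIndexSketch {
    // Sketch of a consistent key-to-bucket mapping: a pure function of the
    // record key and the bucket count, identical on every writer.
    static int bucketFor(String recordKey, int numBuckets) {
        // Mask the sign bit so the modulo result is always non-negative.
        return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        // Two concurrent writers ingesting the same new record key:
        int writerA = bucketFor("key-42", numBuckets);
        int writerB = bucketFor("key-42", numBuckets);
        // Both writers route the record to the same bucket, so the second
        // write becomes a log append merged by preCombine at read time,
        // not a second copy of the key.
        System.out.println(writerA == writerB); // prints "true"
    }
}
```

Since the mapping does not depend on any writer-local table snapshot, two writers inserting the same new key always converge on one bucket, and the preCombine merge resolves them into a single record.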


Thanks,
Jian Feng
