Hi team,

The situation is that optimistic concurrency control (OCC) has some limitations:
- When conflicts do occur, they may waste massive resources on every attempt (lakehouse-concurrency-control-are-we-too-optimistic <https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic>).
- Multiple writers may cause data duplicates when records with the same new record key arrive concurrently (multi-writer-guarantees <https://hudi.apache.org/docs/concurrency_control#multi-writer-guarantees>).

Some background: with OCC, we assume multiple writers won't write data to the same FileID most of the time; if there is a FileID-level conflict, the conflicting commit will be rolled back. FileID-level conflict detection also cannot guarantee no duplicates when two records with the same new record key arrive at different writers, since the key-to-bucket mapping is not consistent with the Bloom index.

What I plan to do is support lock-free concurrency control with a no-duplicates guarantee in Hudi (only for Merge-On-Read tables):

- With a canIndexLogfiles index, multiple writers ingesting data into Merge-On-Read tables only append data to delta logs. This is a lock-free process as long as we make sure they don't write data to the same log file (I plan to create multiple marker files to achieve this). With the log merge API (the preCombine logic in the Payload class), data in the log files can then be read properly.
- Since Hudi already has an index type, the Bucket index, which maps keys to buckets in a consistent way, data duplicates can be eliminated.

Thanks,
Jian Feng
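P.S. To make the "consistent key-to-bucket mapping" point concrete, here is a minimal sketch of the idea, not Hudi's actual Bucket index code; the class name, bucket count, and hash scheme are all assumptions for illustration:

```java
// Minimal sketch (NOT Hudi's implementation): a bucket index routes a record
// key to a bucket purely by hashing, so every writer computes the same bucket
// for the same key, independently and without coordination. Two writers that
// each receive a record with the same new key therefore target the same file
// group, and the preCombine merge can deduplicate them on read. A Bloom-index
// lookup lacks this property for brand-new keys: different writers may route
// the same new key to different file groups.
public class BucketIndexSketch {
    static final int NUM_BUCKETS = 4; // assumed fixed bucket count per partition

    // Deterministic: depends only on the key, never on writer-local state.
    static int bucketFor(String recordKey) {
        return Math.floorMod(recordKey.hashCode(), NUM_BUCKETS);
    }

    public static void main(String[] args) {
        // Two "writers" computing the bucket independently always agree:
        int writer1 = bucketFor("order-42");
        int writer2 = bucketFor("order-42");
        System.out.println(writer1 == writer2); // prints true
    }
}
```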
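P.P.S. And here is one possible shape for the "multiple marker files keep writers off the same log file" idea. This is an assumed design sketch, not the mechanism I have implemented; it just shows how atomic file creation can serve as the lock-free claim:

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch (assumed design, not Hudi code): each writer atomically creates a
// marker file named after the log file version it wants to append to.
// Files.createFile is atomic and fails if the file already exists, so at most
// one writer claims a given log file; a losing writer simply rolls over to the
// next log version instead of taking a table-level lock.
public class LogFileMarkerSketch {
    // Returns the created marker path, or null if another writer already
    // claimed this log file version.
    static Path claimLogFile(Path markerDir, String fileId, int version) {
        try {
            return Files.createFile(markerDir.resolve(fileId + ".log." + version + ".marker"));
        } catch (FileAlreadyExistsException e) {
            return null; // claimed by someone else; caller tries version + 1
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("markers");
        System.out.println(claimLogFile(dir, "fg-1", 1) != null); // first claim -> true
        System.out.println(claimLogFile(dir, "fg-1", 1) != null); // second claim -> false
        System.out.println(claimLogFile(dir, "fg-1", 2) != null); // roll to next version -> true
    }
}
```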