Hi guys,

Motivation

Improve merge performance for copy-on-write (COW) tables during upsert by handling the merge operation with Spark's built-in operators.

Background

During an upsert, for each bucket, Hudi needs to put the new input records into an in-memory cache map, and it needs an external map that spills content to disk when there is not enough memory for the cache to grow.

There are several performance issues:

1. We may need an external disk map, which means serializing / deserializing records.
2. Only a single thread does the I/O operation during the check.
3. We can't take advantage of built-in Spark operators.

Based on the above, I reworked the merge logic and ran a draft test. If you are also interested in this, please take a look at this doc [1]. Any suggestions are welcome. :)

Thanks,
Lamber-Ken

[1] https://docs.google.com/document/d/1-EHHfemtwtX2rSySaPMjeOAUkg5xfqJCKLAETZHa7Qw/edit?usp=sharing
