Hi guys,


Motivation
Improve merge performance for COW (copy-on-write) tables during upsert by 
handling the merge operation with Spark built-in operators.


Background
When doing an upsert, for each bucket Hudi needs to put the new input 
records into an in-memory map, and it 
needs an external map that spills its content to disk when there is not 
enough memory for the map to grow. 
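
To make the flow described above concrete, here is a minimal sketch in plain 
Java. It is not Hudi's actual code: the class name MapBasedMergeSketch, the 
merge method and the (key, value) string records are all hypothetical, and the 
spillable map is replaced by an ordinary in-memory LinkedHashMap.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified illustration of a map-based upsert merge
// for a single bucket. Not taken from the Hudi code base.
class MapBasedMergeSketch {

    // Merge incoming (key, value) records into the records of one existing file.
    static List<Map.Entry<String, String>> merge(
            List<Map.Entry<String, String>> existingFile,
            List<Map.Entry<String, String>> incoming) {

        // Step 1: buffer all incoming records by key. This stands in for the
        // map that may spill to disk; here it always stays in memory.
        Map<String, String> pending = new LinkedHashMap<>();
        for (Map.Entry<String, String> rec : incoming) {
            pending.put(rec.getKey(), rec.getValue()); // last write wins
        }

        // Step 2: scan the old file in one thread, probing the map per record.
        List<Map.Entry<String, String>> merged = new ArrayList<>();
        for (Map.Entry<String, String> old : existingFile) {
            String updated = pending.remove(old.getKey());
            merged.add(updated != null ? Map.entry(old.getKey(), updated) : old);
        }

        // Step 3: records still in the map matched nothing, so they are inserts.
        for (Map.Entry<String, String> rest : pending.entrySet()) {
            merged.add(Map.entry(rest.getKey(), rest.getValue()));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> existing =
                List.of(Map.entry("k1", "old1"), Map.entry("k2", "old2"));
        List<Map.Entry<String, String>> incoming =
                List.of(Map.entry("k2", "new2"), Map.entry("k3", "new3"));
        System.out.println(merge(existing, incoming));
    }
}
```

In this sketch every incoming record is buffered before the old file is 
scanned, and the scan itself runs in a single loop.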


There are several performance issues:
1. We may need an external disk map, which means serializing / deserializing records 
2. Only a single thread does the I/O during the check 
3. We can't take advantage of built-in Spark operators 


Based on the above, I reworked the merge logic and ran a draft test.
If you are also interested in this, please take a look at this doc [1]. Any 
suggestions are welcome. :)




Thanks,
Lamber-Ken


[1] 
https://docs.google.com/document/d/1-EHHfemtwtX2rSySaPMjeOAUkg5xfqJCKLAETZHa7Qw/edit?usp=sharing
