n3nash commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-865418114


   @tandonraghav Let me explain in detail.
   
   1. `preCombine` -> This is used in the following code paths: a) In-memory 
deduping/merging of incoming records. The logic in `preCombine` decides how 2 
records with the same record key are deduped. b) On-disk deduping/merging of 
incoming records. The same logic in `preCombine` decides how 2 records with the 
same record key in log files are merged/deduped. 
   2. `combineAndGetUpdateValue` -> This is used to merge the in-memory record 
with the one on disk. 
   
   Ideally, you want to keep the merging logic for in-memory and on-disk 
records the same. Take the following use case: you are ingesting 100 records 
per batch, and 2 of those 100 records have the same record key. If both 
records are part of the same batch, you would probably apply `preCombine` to 
dedup them, whether in memory or in log files. But what if the 2 records 
arrive in different batches? Then you will apply `combineAndGetUpdateValue` 
to merge them. If the two implementations do not behave the same way, you can 
get different results depending on how the records were batched.
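
   To make the batching scenario concrete, here is a minimal, self-contained 
sketch in plain Java. It does not use the Hudi classes or their real method 
signatures: `Rec`, `combineConsistent`, `combineOverwrite`, `sameBatch` and 
`separateBatches` are hypothetical names made up for illustration, and the 
"latest timestamp wins" rule is only an assumed merge policy.

```java
public class MergeConsistencySketch {

    // Hypothetical stand-in for a record payload: record key, ordering value, data.
    static final class Rec {
        final String key;
        final long ts;
        final String value;

        Rec(String key, long ts, String value) {
            this.key = key;
            this.ts = ts;
            this.value = value;
        }
    }

    // Stand-in for preCombine: choose between two incoming records with the
    // same key. Assumed policy: the larger timestamp wins, ties go to b.
    static Rec preCombine(Rec a, Rec b) {
        return b.ts >= a.ts ? b : a;
    }

    // Stand-in for combineAndGetUpdateValue that applies the same policy
    // as preCombine when merging an incoming record with the one on disk.
    static Rec combineConsistent(Rec onDisk, Rec incoming) {
        return preCombine(onDisk, incoming);
    }

    // Stand-in for combineAndGetUpdateValue with a different policy:
    // the incoming record always overwrites the one on disk.
    static Rec combineOverwrite(Rec onDisk, Rec incoming) {
        return incoming;
    }

    // Both records arrive in one batch, so only preCombine is involved.
    static String sameBatch(Rec first, Rec second) {
        return preCombine(first, second).value;
    }

    // The records arrive in two batches: the first is already on disk
    // when the second is merged against it.
    static String separateBatches(Rec first, Rec second, boolean consistent) {
        Rec onDisk = first;
        Rec merged = consistent
                ? combineConsistent(onDisk, second)
                : combineOverwrite(onDisk, second);
        return merged.value;
    }

    public static void main(String[] args) {
        Rec newer = new Rec("k1", 200L, "newer");
        Rec older = new Rec("k1", 100L, "older"); // late-arriving record

        System.out.println(sameBatch(newer, older));              // newer
        System.out.println(separateBatches(newer, older, true));  // newer
        System.out.println(separateBatches(newer, older, false)); // older
    }
}
```

   With the consistent policy the final value does not depend on batching. 
With the overwrite policy, a late-arriving older record replaces the newer one 
only when the two land in different batches.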
   
   The reason for keeping these two APIs separate was to provide more 
flexibility. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

