n3nash commented on issue #3078: URL: https://github.com/apache/hudi/issues/3078#issuecomment-865418114
@tandonraghav Let me explain in detail.

1. `preCombine` -> This is used in the following code paths:
   a) In-memory deduping/merging of incoming records. The logic in `preCombine` decides how 2 records with the same record key will be deduped.
   b) On-disk deduping/merging of incoming records. The same logic in `preCombine` decides how 2 records with the same record key in log files will be merged/deduped.
2. `combineAndGetUpdateValue` -> This is used to merge the in-memory record with the one on disk.

Ideally, you want to keep the merging logic of in-memory vs on-disk the same. Let's take the following use case: you are ingesting 100 records per batch, and out of those 100 records, 2 have the same record key. If all 100 records are part of the same batch, `preCombine` is applied to dedup them, whether in memory or in log files. But what if the 2 records arrive in different batches? Then `combineAndGetUpdateValue` is applied to merge the incoming record with the one already on disk. If the behavior is not the same in both implementations, you can get different results depending on how the records happened to be batched.

The reason to keep these two APIs separate was to provide more flexibility.
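
To make the batching scenario above concrete, here is a minimal sketch in plain Java. It does not use the real Hudi classes: `PayloadMergeSketch`, its nested `Payload` class, the `orderingVal` field, and the `dedupBatch`/`upsert` helpers are all hypothetical stand-ins, and the two method signatures are simplified compared to the actual `HoodieRecordPayload` interface. It assumes a "larger ordering value wins" rule in both methods and shows that the same two records then resolve to the same result whether they arrive in one batch or two.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; NOT the real Hudi API. Signatures are simplified.
final class PayloadMergeSketch {

    // Stand-in for a record payload with a record key and an ordering value.
    static final class Payload {
        final String key;        // record key
        final long orderingVal;  // assumed ordering field, e.g. an event timestamp
        final String value;

        Payload(String key, long orderingVal, String value) {
            this.key = key;
            this.orderingVal = orderingVal;
            this.value = value;
        }

        // Same-batch (or log-file) dedup: keep the larger ordering value.
        Payload preCombine(Payload other) {
            return other.orderingVal > this.orderingVal ? other : this;
        }

        // Cross-batch merge of the incoming record (this) with the one on disk.
        // Uses the same rule as preCombine so both paths agree.
        Payload combineAndGetUpdateValue(Payload onDisk) {
            return onDisk.orderingVal > this.orderingVal ? onDisk : this;
        }
    }

    // Dedup a single batch by record key using preCombine.
    static Map<String, Payload> dedupBatch(List<Payload> batch) {
        Map<String, Payload> out = new LinkedHashMap<>();
        for (Payload p : batch) {
            out.merge(p.key, p, (existing, incoming) -> incoming.preCombine(existing));
        }
        return out;
    }

    // Apply a batch to the "on disk" state using combineAndGetUpdateValue.
    static void upsert(Map<String, Payload> disk, List<Payload> batch) {
        for (Payload p : dedupBatch(batch).values()) {
            Payload current = disk.get(p.key);
            disk.put(p.key, current == null ? p : p.combineAndGetUpdateValue(current));
        }
    }

    public static void main(String[] args) {
        Payload a = new Payload("k1", 2L, "newer");
        Payload b = new Payload("k1", 1L, "older");

        // Case 1: both records in the same batch, resolved by preCombine.
        Map<String, Payload> sameBatch = new LinkedHashMap<>();
        upsert(sameBatch, List.of(a, b));

        // Case 2: records in different batches, resolved by combineAndGetUpdateValue.
        Map<String, Payload> twoBatches = new LinkedHashMap<>();
        upsert(twoBatches, List.of(a));
        upsert(twoBatches, List.of(b));

        System.out.println(sameBatch.get("k1").value);   // prints "newer"
        System.out.println(twoBatches.get("k1").value);  // prints "newer"
    }
}
```

If `combineAndGetUpdateValue` in this sketch were changed to always return the incoming record, the second case would end up with "older" while the first would still produce "newer", which is the kind of divergence described above.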