nsivabalan commented on issue #2992:
URL: https://github.com/apache/hudi/issues/2992#issuecomment-930123547


   I could not reproduce this on the latest master. 
   https://gist.github.com/nsivabalan/23caa2f57c41bc9356ed7fa29590c147
   
   Here is my understanding. 
   INSERT_DROP_DUPES removes from the incoming df any records that already 
exist in the Hudi table. When it is used together with the INSERT_OVERWRITE 
operation, insert_drop_dupes kicks in first, so some records from the incoming 
batch may be dropped. INSERT_OVERWRITE is then performed on what remains, and 
any matching partitions are overwritten. In my gist link, I did not use 
insert_drop_dupes for INSERT_OVERWRITE, just to show that it works. To drop 
duplicates within the incoming batch itself, you need to set 
combine.before.insert/upsert to true. 
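
   The ordering described above can be sketched as a toy model in plain 
Python. This is not Hudi code and calls no Hudi API: the function names, 
field names, and sample records are all hypothetical, and it assumes duplicates 
are matched on (partition, record key). It only illustrates why a record 
that already exists in the table can vanish entirely when drop-dupes runs 
before the overwrite.

```python
# Toy model only: plain lists of dicts stand in for the table and the batch.
# Names and data are hypothetical; matching on (partition, key) is an assumption.

def drop_dupes(incoming, table):
    """Step 1: drop incoming records that already exist in the table."""
    existing = {(rec["partition"], rec["key"]) for rec in table}
    return [rec for rec in incoming
            if (rec["partition"], rec["key"]) not in existing]


def insert_overwrite(incoming, table):
    """Step 2: replace every partition touched by the (filtered) batch."""
    touched = {rec["partition"] for rec in incoming}
    kept = [rec for rec in table if rec["partition"] not in touched]
    return kept + incoming


table = [
    {"partition": 1, "key": "k1", "val": "old1"},
    {"partition": 1, "key": "k2", "val": "old2"},
    {"partition": 2, "key": "k3", "val": "old3"},
]
incoming = [
    {"partition": 1, "key": "k2", "val": "new2"},  # already in the table
    {"partition": 1, "key": "k4", "val": "new4"},
    {"partition": 3, "key": "k5", "val": "new5"},
]

survivors = drop_dupes(incoming, table)        # k2 is removed from the batch
result = insert_overwrite(survivors, table)    # partitions 1 and 3 are replaced

for rec in sorted(result, key=lambda r: r["key"]):
    print(rec["partition"], rec["key"], rec["val"])
```

   In this model k2 ends up in neither place: the incoming copy is dropped 
in step 1, and the old copy is wiped in step 2 because partition 1 is still 
overwritten by k4. 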
   
   Here is the output if I use insert_drop_dupes with insert_overwrite: 
   
   +------+---------+---+
   |typeId|recordKey|str|
   +------+---------+---+
   |2     |key4     |mno|
   |1     |key1     |def|
   |3     |key5     |pqr|
   +------+---------+---+
   
   As you can see, key2 is not present here because it was dropped: it 
already existed in the Hudi table. 
   
   
   



