nsivabalan commented on issue #2992: URL: https://github.com/apache/hudi/issues/2992#issuecomment-930123547
I could not reproduce this on the latest master: https://gist.github.com/nsivabalan/23caa2f57c41bc9356ed7fa29590c147

Here is my understanding. INSERT_DROP_DUPES removes from the incoming df any records that match records already in the existing Hudi table. When it is used together with the INSERT_OVERWRITE operation, insert_drop_dupes kicks in first, so some records from the incoming batch may be dropped. INSERT_OVERWRITE is then performed with the remaining records, and any matching partitions are overwritten.

In my gist, I did not use insert_drop_dupes for INSERT_OVERWRITE, just to show that it works. You need to set combine.before.insert/upsert to true to drop duplicates within the incoming batch.

Here is the output if I use insert_drop_dupes with INSERT_OVERWRITE:

+------+---------+---+
|typeId|recordKey|str|
+------+---------+---+
|2     |key4     |mno|
|1     |key1     |def|
|3     |key5     |pqr|
+------+---------+---+

As you can see, key2 is not present here because it was dropped, since it was already in the Hudi table.
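To make the ordering concrete, here is a minimal pure-Python sketch of the two steps described above: first drop incoming records whose key already exists in the table, then overwrite every partition that the remaining batch touches. This is a toy model, not Hudi code. The function names, the dict-of-partitions layout, and the input rows are all hypothetical; the rows were chosen only so that the result resembles the output shown above and are not taken from the gist. It also assumes matching on record key alone, whereas in Hudi the actual matching depends on the index in use.

```python
from collections import defaultdict


def drop_dupes(existing, incoming):
    """Drop incoming records whose record key is already in the table.

    Simplifying assumption: a record matches on recordKey alone.
    """
    existing_keys = {r["recordKey"] for rows in existing.values() for r in rows}
    return [r for r in incoming if r["recordKey"] not in existing_keys]


def insert_overwrite(existing, batch, partition_field="typeId"):
    """Replace every partition present in the batch; keep the others as-is."""
    grouped = defaultdict(list)
    for r in batch:
        grouped[r[partition_field]].append(r)
    result = {p: list(rows) for p, rows in existing.items()}
    result.update(grouped)  # touched partitions are fully replaced
    return result


def write(existing, incoming, drop_duplicates):
    """Apply the optional drop-dupes step first, then the overwrite."""
    batch = drop_dupes(existing, incoming) if drop_duplicates else incoming
    return insert_overwrite(existing, batch)


def keys(table):
    """Sorted record keys across all partitions, for easy comparison."""
    return sorted(r["recordKey"] for rows in table.values() for r in rows)


# Hypothetical table state, keyed by partition value (typeId).
existing = {
    1: [{"typeId": 1, "recordKey": "key2", "str": "abc"}],
    2: [{"typeId": 2, "recordKey": "key3", "str": "jkl"}],
}

# Hypothetical incoming batch; key2 collides with the existing table.
incoming = [
    {"typeId": 1, "recordKey": "key1", "str": "def"},
    {"typeId": 1, "recordKey": "key2", "str": "ghi"},
    {"typeId": 2, "recordKey": "key4", "str": "mno"},
    {"typeId": 3, "recordKey": "key5", "str": "pqr"},
]

with_drop = write(existing, incoming, drop_duplicates=True)
without_drop = write(existing, incoming, drop_duplicates=False)

print(keys(with_drop))
print(keys(without_drop))
```

In this toy run, key2 is removed from the batch before the overwrite, and the old key2 row disappears as well because key1 still causes its partition to be replaced. Without the drop step, the incoming key2 row survives into the overwritten partition.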
