Thank you Sagar! Here is the issue - https://github.com/apache/hudi/issues/9859



    On Friday, October 13, 2023 at 01:52:24 AM EDT, sagar sumit <cod...@apache.org> wrote:

 Hi Himabindu,
I am assuming your total data on storage is 700 GB and not the incoming batch.
INSERT_DROP_DUPS does work with large data. However, it is more time-consuming,
as it needs to tag the incoming records against existing data in order to dedupe them.
I would suggest creating a GitHub issue with Spark UI screenshots and the
datasource write configs. Also, it would be helpful if you could provide your
use case for INSERT_DROP_DUPS. Maybe there is a better alternative.
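For reference, here is a minimal sketch (Spark Scala) of what a COW datasource write with insert-drop-duplicates enabled typically looks like; the table name, record key/precombine/partition fields, and GCS path below are placeholders, not your actual configs:

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Hypothetical example: COW write with INSERT_DROP_DUPS semantics enabled.
    // Table name, field names, and the GCS path are placeholders.
    def writeWithDropDups(inputDf: DataFrame): Unit = {
      inputDf.write
        .format("hudi")
        .option("hoodie.table.name", "my_table")
        .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
        .option("hoodie.datasource.write.operation", "insert")
        // This option is what enables INSERT_DROP_DUPS: incoming records are
        // tagged against existing data and duplicates are dropped before writing.
        .option("hoodie.datasource.write.insert.drop.duplicates", "true")
        .option("hoodie.datasource.write.recordkey.field", "record_key")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.datasource.write.partitionpath.field", "partition_col")
        .mode(SaveMode.Append)
        .save("gs://your-bucket/path/to/table")
    }

Sharing something along these lines, with your real option values, in the GitHub issue would help.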
Regards,
Sagar

On Thu, Oct 12, 2023 at 3:42 AM Himabindu Kosuru <hkos...@yahoo.com.invalid> 
wrote:

Hi All,
We are using COW tables, and INSERT_DROP_DUPS fails with HoodieUpsertException
even on 700 GB of data. The data is partitioned and stored in GCS.
Executors: 150, Executor memory: 40g, Executor cores: 8
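In spark-submit terms (assuming that is how the job is launched), this corresponds roughly to:

    --num-executors 150 --executor-memory 40g --executor-cores 8

though the actual submission command may of course differ.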

Does INSERT_DROP_DUPS work with large data? Any recommendations to make it work,
such as Spark config settings?

Thanks,
Bindu