ArpitAdhikari opened a new issue, #11836:
URL: https://github.com/apache/hudi/issues/11836

   **Describe the problem you faced**
   
   We are using Hudi in our CDC pipeline, which consumes from MSK topics, and we 
are getting duplicate entries in our tables.
   To summarize the scenario: we are using Hudi v0.12.2 on emr-6.10.1. Our data 
is captured from MySQL events, which can be of insert or upsert type, and the 
Spark application itself looks fine. The tables in which we see duplicates are 
mostly MOR tables with the BLOOM index.
   
   **Expected behavior**
   
   Any update coming into the pipeline should update the already existing 
record.
   
   **Environment Description**
   
   The attachment here contains:
   1. Query being used to detect the duplicates
   2. Metafields
   3. Sample output of the query showing duplicates
   4. Hudi configs
   
[Attachment.txt](https://github.com/user-attachments/files/16760403/Attachment.txt)
   
   * Hudi version : 0.12.2
   
   * Spark version : Spark 3.3.1
   
   * Hive version : Hive 3.1.3
   
   * Storage (HDFS/S3/GCS..) : S3 
   
   * Running on Docker? (yes/no) : Yes
   
   
   **Additional context**
   
   * EMR Version: emr-6.10.1
   
   We have also tried removing the duplicates manually with a Python script and 
reran the job, but after running for a few days it again started inserting 
duplicate records.
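   
   For illustration only, a duplicate check and manual cleanup of this kind can be sketched as below. This is not the query or script from the attachment. It assumes rows are available as plain dictionaries carrying the standard Hudi metafields (`_hoodie_record_key`, `_hoodie_partition_path`, `_hoodie_commit_time`, `_hoodie_file_name`), that a duplicate means the same record key appearing more than once within one partition, and that commit times are equal-length timestamp strings so string comparison orders them. The function names and sample rows are hypothetical.

```python
from collections import defaultdict


def find_duplicate_keys(rows):
    """Return {(partition_path, record_key): [rows]} for keys seen more than once."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["_hoodie_partition_path"], row["_hoodie_record_key"])
        groups[key].append(row)
    return {key: group for key, group in groups.items() if len(group) > 1}


def keep_latest(rows):
    """Keep one row per (partition_path, record_key): the one with the latest commit time."""
    latest = {}
    for row in rows:
        key = (row["_hoodie_partition_path"], row["_hoodie_record_key"])
        current = latest.get(key)
        # Assumption: commit times are same-length strings, so ">" orders them by time.
        if current is None or row["_hoodie_commit_time"] > current["_hoodie_commit_time"]:
            latest[key] = row
    return list(latest.values())


# Hypothetical sample rows; values are made up for the example.
sample_rows = [
    {"_hoodie_commit_time": "20240820101500000", "_hoodie_record_key": "id:1",
     "_hoodie_partition_path": "2024/08", "_hoodie_file_name": "file-a.parquet"},
    {"_hoodie_commit_time": "20240825093000000", "_hoodie_record_key": "id:1",
     "_hoodie_partition_path": "2024/08", "_hoodie_file_name": "file-b.parquet"},
    {"_hoodie_commit_time": "20240825093000000", "_hoodie_record_key": "id:2",
     "_hoodie_partition_path": "2024/08", "_hoodie_file_name": "file-b.parquet"},
]

duplicates = find_duplicate_keys(sample_rows)
deduplicated = keep_latest(sample_rows)
```

   Grouping on both the partition path and the record key is a choice made for this sketch; a check across partitions would group on the record key alone.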
   
   Screenshot of the latest timeline :
   
   <img width="1347" alt="Screenshot 2024-08-27 at 3 33 43 PM" 
src="https://github.com/user-attachments/assets/67e8c1d0-e82a-4467-9718-fdee6f90377f">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
