mtami opened a new issue #3266:
URL: https://github.com/apache/hudi/issues/3266


   I am using the AWS DMS change data capture (CDC) service to get change data from my 
database, and then Apache Hudi in an AWS Glue ETL job to process the change 
data and create a table in Hive. As the pre-combine field I am using the timestamp 
AWS DMS attaches when the data was committed (update_ts_dms).
   
   I have a few use cases where inserts/updates and deletes for the same primary 
key carry the same timestamp from DMS. After change data processing, Apache 
Hudi does not give the latest updated row for that primary key in the table; 
instead it keeps an arbitrary insert or update, presumably because the records 
share both the same primary key and the same pre-combine field value.
   
   Is there any suggested solution for such a case?
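   One workaround I have been considering is to derive a finer-grained ordering 
column before writing to Hudi and point `hoodie.datasource.write.precombine.field` 
at it, so ties on `update_ts_dms` are broken by operation precedence. A minimal 
sketch of the derivation (field names are from the sample below; the 
`OP_PRECEDENCE` ranking D > U > I and the helper `combined_precombine` are my own 
assumptions, not anything Hudi provides):

   ```python
   from datetime import datetime, timezone

   # Assumed precedence for ties: a delete should win over an update,
   # and an update over an insert, when timestamps are identical.
   OP_PRECEDENCE = {"I": 0, "U": 1, "D": 2}

   def combined_precombine(update_ts_dms: str, op: str) -> int:
       """Encode the DMS commit timestamp and the Op precedence into a
       single integer: epoch seconds * 10 + operation rank."""
       ts = datetime.strptime(update_ts_dms, "%Y-%m-%d %H:%M:%S")
       epoch = int(ts.replace(tzinfo=timezone.utc).timestamp())
       return epoch * 10 + OP_PRECEDENCE[op]

   insert_rec = {"Op": "I", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412}
   delete_rec = {"Op": "D", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412}

   # Despite identical timestamps, the delete now orders after the insert.
   assert combined_precombine(delete_rec["update_ts_dms"], delete_rec["Op"]) > \
          combined_precombine(insert_rec["update_ts_dms"], insert_rec["Op"])
   ```

   In the Glue job this would be a derived column (e.g. via `withColumn`) computed 
before the Hudi write, but I am not sure this is the intended approach.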
   
   **Sample Data**:
   ```json
   {
     "Op": "I",
     "update_ts_dms": "2021-07-08 10:47:53",
     "id": 10125412,
     "brand_id": 9722520,
     "type": "EXPLICIT",
     "created": "2021-07-08 10:47:53",
     "updated": "2021-07-08 10:47:53"
   }
   {
     "Op": "D",
     "update_ts_dms": "2021-07-08 10:47:53",
     "id": 10125412,
     "brand_id": 9722520,
     "type": "EXPLICIT",
     "created": "2021-07-08 10:47:53",
     "updated": "2021-07-08 10:47:53"
   }
   ```
   
   
   **Environment Description**
   
   * Hudi version (jar): hudi-spark-bundle_2.11-0.9.0-SNAPSHOT.jar 
   
   * Spark version : 2.4
   
   * Hadoop version : 2.8
   
   * Storage : S3
   
   * Running on Docker?: no
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

