mtami opened a new issue #3266:
URL: https://github.com/apache/hudi/issues/3266
I am using the AWS DMS change data capture (CDC) service to stream change data from my database, and an AWS Glue ETL job with Apache Hudi to process those changes and create a table in Hive. As the pre-combine field I use the timestamp AWS DMS attaches to each record for when the change was committed (`update_ts_dms`).
I have a few cases where inserts/updates and deletes for the same primary key arrive with the same DMS timestamp. After change-data processing, Hudi does not keep the latest row for that primary key; it appears to pick an arbitrary insert or update, presumably because both the record key and the pre-combine field are identical across the records.
Is there a suggested solution for this case?
**Sample Data**:
```json
{
  "Op": "I",
  "update_ts_dms": "2021-07-08 10:47:53",
  "id": 10125412,
  "brand_id": 9722520,
  "type": "EXPLICIT",
  "created": "2021-07-08 10:47:53",
  "updated": "2021-07-08 10:47:53"
}
{
  "Op": "D",
  "update_ts_dms": "2021-07-08 10:47:53",
  "id": 10125412,
  "brand_id": 9722520,
  "type": "EXPLICIT",
  "created": "2021-07-08 10:47:53",
  "updated": "2021-07-08 10:47:53"
}
```
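To illustrate the ambiguity, here is a minimal plain-Python sketch (not Hudi internals) of one possible workaround I have been considering: break ties on the pre-combine value with a secondary ordering component derived from the DMS `Op` column, so that at an identical timestamp a delete ranks after an insert/update. The `OP_RANK` mapping and helper names are hypothetical, purely for illustration:

```python
# Assumed tie-break ordering: at the same timestamp, D should win over U over I.
OP_RANK = {"I": 0, "U": 1, "D": 2}

def combine_key(record):
    # Composite pre-combine key: primary ordering by timestamp,
    # secondary ordering by operation rank to resolve ties.
    return (record["update_ts_dms"], OP_RANK[record["Op"]])

def latest(records):
    # Pick the record that should survive de-duplication.
    return max(records, key=combine_key)

records = [
    {"Op": "I", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412},
    {"Op": "D", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412},
]
print(latest(records)["Op"])  # → D (the delete now deterministically wins the tie)
```

In the Glue job this would correspond to deriving a composite column (e.g. timestamp concatenated with the operation rank) and pointing Hudi's pre-combine field at that derived column instead of the raw `update_ts_dms`.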
**Environment Description**
* Hudi version (jar): hudi-spark-bundle_2.11-0.9.0-SNAPSHOT.jar
* Spark version : 2.4
* Hadoop version : 2.8
* Storage : S3
* Running on Docker?: no