Christopher Weaver created HUDI-802:
---------------------------------------
Summary: AWSDmsTransformer does not handle insert -> delete of a
row in a single batch correctly
Key: HUDI-802
URL: https://issues.apache.org/jira/browse/HUDI-802
Project: Apache Hudi (incubating)
Issue Type: Bug
Components: DeltaStreamer
Reporter: Christopher Weaver
The provided AWSDmsAvroPayload class
([https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/payload/AWSDmsAvroPayload.java])
currently handles records whose "Op" column is "D" when they arrive as
updates to an existing row, and successfully removes the row from the
resulting table.
However, when an insert is quickly followed by a delete of the same row (e.g.
DMS processes them together and writes both change records to the same
parquet file), the row incorrectly appears in the resulting table. In this
case the record is not yet in the table, so getInsertValue is called rather
than combineAndGetUpdateValue. Since the check for a delete lives only in
combineAndGetUpdateValue, it is skipped and the delete is missed. Something
like this could fix the issue:
[https://github.com/Weves/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/java/org/apache/hudi/payload/CustomAWSDmsAvroPayload.java].
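To illustrate the idea, here is a minimal, self-contained sketch of what such
a fix might look like. It does not use the Hudi API: DmsPayloadSketch,
handleDelete, and the Map-based record are hypothetical stand-ins for the real
payload class and Avro records, and the method signatures are simplified. The
only point shown is that the "Op" == "D" check is shared by both the insert
path and the update path, so a delete is honored even when the row was never
written to the table.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified model of a DMS payload. This is NOT the Hudi
// AWSDmsAvroPayload class; records are plain maps instead of Avro records.
class DmsPayloadSketch {
    static final String OP_FIELD = "Op";   // column written by DMS
    static final String DELETE_OP = "D";   // value marking a delete

    private final Map<String, String> record;

    DmsPayloadSketch(Map<String, String> record) {
        this.record = record;
    }

    // Shared delete check: an empty Optional means "do not write this row".
    private Optional<Map<String, String>> handleDelete(Map<String, String> incoming) {
        if (DELETE_OP.equals(incoming.get(OP_FIELD))) {
            return Optional.empty();
        }
        return Optional.of(incoming);
    }

    // Insert path: the key is not in the table yet. Without the delete check
    // here, an insert -> delete pair in one batch would leave the row behind.
    Optional<Map<String, String>> getInsertValue() {
        return handleDelete(record);
    }

    // Update path: the key already exists in the table. The stored value is
    // ignored in this sketch; the incoming record always wins.
    Optional<Map<String, String>> combineAndGetUpdateValue(Map<String, String> currentValue) {
        return handleDelete(record);
    }

    public static void main(String[] args) {
        Map<String, String> deleted = new HashMap<>();
        deleted.put("id", "1");
        deleted.put(OP_FIELD, DELETE_OP);

        // The row was inserted and deleted in the same batch, so only the
        // insert path is exercised; the result should be empty.
        System.out.println(new DmsPayloadSketch(deleted).getInsertValue().isPresent());
    }
}
```

In the real payload the same effect would presumably come from overriding
getInsertValue so that it applies the existing delete handling, but the exact
signatures depend on the Hudi version in use.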
--
This message was sent by Atlassian Jira
(v8.3.4#803005)