[
https://issues.apache.org/jira/browse/HUDI-5686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Raymond Xu closed HUDI-5686.
----------------------------
Fix Version/s: 0.12.0
(was: 0.13.1)
Resolution: Duplicate
> Missing records when HoodieDeltaStreamer run in continuous
> ----------------------------------------------------------
>
> Key: HUDI-5686
> URL: https://issues.apache.org/jira/browse/HUDI-5686
> Project: Apache Hudi
> Issue Type: Bug
> Components: deltastreamer
> Reporter: Sagar Sumit
> Assignee: Purushotham Pushpavanthar
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.12.0
>
>
> See issue [https://github.com/apache/hudi/issues/7757] for more details.
> Description of the issue:
> If the HoodieDeltaStreamer is forcefully terminated before the commit
> instant's state is `COMPLETED`, the commit is left in either the `REQUESTED`
> or `INFLIGHT` state. When the HoodieDeltaStreamer is rerun, the first
> successful commit writes the first batch of records into the Hudi table.
> However, in the subsequent commit, the changes committed by the previous
> commit disappear. This causes *loss of the entire batch* of data written by
> the first commit after restart.
> I observed this problem when HoodieDeltaStreamer is run in continuous mode
> and the job gets resubmitted after the AM container is killed, e.g. due to
> loss of nodes or a node going into an unhealthy state. The issue is not
> limited to continuous mode; it can happen any time a Hudi write is
> terminated before the instant is marked `COMPLETE`.
> How to reproduce the issue:
> # Run HoodieDeltaStreamer and yarn-kill the job before the commit instant
> reaches the `COMPLETE` state. Note the number of records after the last
> successful commit (say 100).
> # Upon re-submission of HoodieDeltaStreamer, 2 new instants are created
> (1 commit complete and 1 rollback complete). Note the number of delta
> changes consumed (say 10 new record keys) in this run and the total number
> of records in the Hudi table (110 unique records).
> # On the next run, wait until Hudi completes the commit, assuming it
> received 5 records, and check the count of unique records in the Hudi table
> (observed to be 105). The delta records consumed in step 2 are entirely
> lost.
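As a quick sanity check of the counts in the reproduction steps above (the numbers 100/10/5 are the illustrative values from the steps, not measured data), the observed total of 105 is exactly the expected total minus the batch written by the first commit after restart:

```python
# Record counts from the reproduction steps (illustrative values).
after_kill = 100        # records after the last successful commit (step 1)
first_rerun_batch = 10  # new record keys consumed after re-submission (step 2)
second_batch = 5        # records in the next commit (step 3)

expected = after_kill + first_rerun_batch + second_batch  # 115 if no loss
observed = after_kill + second_batch                      # 105 as observed

# The shortfall is exactly the batch written right after the restart.
print(expected, observed, expected - observed)
```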
> Reason:
> Suppose Hudi is running and its timeline looks like the one below, and you
> kill the job:
> # C1.commit.requested
> # C1.inflight
> # C1.commit
> # C2.commit.requested
> # C2.inflight
> # C2.commit
> # C3.commit.requested
> Upon re-submission, after 1 commit cycle the timeline looks like
> # C1.commit.requested
> # C1.inflight
> # C1.commit
> # C2.commit.requested
> # C2.inflight
> # C2.commit
> # R1.rollback.requested
> # R1.rollback.inflight
> # R1.rollback
> # C4.commit.requested
> # C4.inflight
> # C4.commit
> The next commit cycle loads R1.rollback as the most recent instant in the
> timeline, due to which the new incoming records get UPSERTed onto the
> C2.commit instant rather than C4.commit. This is because the rollback's
> timestamp is chronologically greater than that of the commit that triggered
> it (i.e., in the above example R1 > C4). This creates a cascading data-loss
> effect while the Kafka consumer offsets keep moving ahead.
> Refer to the commit timeline snapshot tagged in the github issue
> [7757|https://github.com/apache/hudi/issues/7757].
--
This message was sent by Atlassian Jira
(v8.20.10#820010)