[ 
https://issues.apache.org/jira/browse/HUDI-5686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-5686.
----------------------------
    Fix Version/s: 0.12.0
                       (was: 0.13.1)
       Resolution: Duplicate

> Missing records when HoodieDeltaStreamer run in continuous
> ----------------------------------------------------------
>
>                 Key: HUDI-5686
>                 URL: https://issues.apache.org/jira/browse/HUDI-5686
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: deltastreamer
>            Reporter: Sagar Sumit
>            Assignee: Purushotham Pushpavanthar
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.12.0
>
>
> See issue [https://github.com/apache/hudi/issues/7757] for more details.
> Description of the issue:
> If HoodieDeltaStreamer is forcefully terminated before the commit instant's 
> state is `COMPLETED`, it leaves the commit in either the `REQUESTED` or 
> `INFLIGHT` state. When HoodieDeltaStreamer is rerun, the first successful 
> commit writes the first batch of records into the Hudi table. However, in the 
> subsequent commit, the changes committed by the previous commit disappear. This 
> causes *loss of the entire batch* of data written by the first commit after 
> restart.
> I observed this problem when HoodieDeltaStreamer is run in continuous mode 
> and the job gets resubmitted after the AM container is killed for reasons 
> such as loss of nodes or a node becoming unhealthy. The issue is not 
> limited to continuous mode; it can happen any time a Hudi write is 
> terminated before the instant is marked `COMPLETE`.
> How to reproduce the issue:
>  # Run HoodieDeltaStreamer and `yarn kill` the job before the commit instant 
> reaches the `COMPLETE` state. Note the number of records after the last 
> successful commit (say 100).
>  # Upon re-submission of HoodieDeltaStreamer, two new instants are 
> created (one completed commit and one completed rollback). Note the number of 
> delta changes consumed (say 10 new record keys) in this run and the total 
> number of records in the Hudi table (110 unique records).
>  # On the next run, wait until Hudi completes the commit, assuming it received 
> 5 records, and check the count of unique records in the Hudi table (it was 
> observed to be 105). The delta records consumed in step 2 are entirely lost.
> Reason:
> Suppose Hudi is running and its timeline looks like the one below, and you 
> kill the job:
>  # C1.commit.requested
>  # C1.inflight
>  # C1.commit
>  # C2.commit.requested
>  # C2.inflight
>  # C2.commit
>  # C3.commit.requested
> Upon re-submission, after 1 commit cycle the timeline looks like
>  # C1.commit.requested
>  # C1.inflight
>  # C1.commit
>  # C2.commit.requested
>  # C2.inflight
>  # C2.commit
>  # R1.rollback.requested
>  # R1.rollback.inflight
>  # R1.rollback
>  # C4.commit.requested
>  # C4.inflight
>  # C4.commit
> The next commit cycle loads R1.rollback as the latest instant in the 
> timeline, due to which the new incoming records get UPSERTed on the C2.commit 
> instant rather than C4.commit. This is because the timestamp of a rollback is 
> chronologically greater than that of the commit that triggered it (i.e. in 
> the above example R1 > C4). This creates a cascading data-loss effect 
> while the Kafka consumer offsets keep moving ahead.
> Refer to the commit timeline snapshot tagged in the github issue 
> [7757|https://github.com/apache/hudi/issues/7757].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
