[
https://issues.apache.org/jira/browse/HUDI-5686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Purushotham Pushpavanthar updated HUDI-5686:
--------------------------------------------
Description:
See issue [https://github.com/apache/hudi/issues/7757] for more details.
Description of the issue:
If the HoodieDeltaStreamer is forcefully terminated before the commit instant's
state is `COMPLETED`, the commit is left in either the `REQUESTED` or
`INFLIGHT` state. When the HoodieDeltaStreamer is rerun, the first successful
commit writes the first batch of records into the Hudi table. However, in the
subsequent commit, the changes committed by the previous commit disappear. This
causes *loss of the entire batch* of data written by the first commit after
restart.
I observed this problem when HoodieDeltaStreamer is run in continuous mode and
the job gets resubmitted after the AM container is killed due to reasons such as
loss of nodes or a node becoming unhealthy. This issue is not limited to
continuous mode alone; it can happen any time a Hudi write is terminated before
the instant is marked `COMPLETE`.
How to reproduce the issue:
# Run HoodieDeltaStreamer and `yarn kill` the job before the commit instant
reaches the `COMPLETE` state. Note the number of records after the last
successful commit (say 100).
# Upon re-submission of HoodieDeltaStreamer, there will be 2 new instants
created (1 commit completed and 1 rollback completed). Note the number of delta
changes consumed (say 10 new record keys) in this run and the total number of
records in the Hudi table (110 unique records).
# On the next run, wait till Hudi completes the commit, assuming it received 5
records, and check the count of unique records in the Hudi table (it was
observed to be 105). The delta records consumed in step 2 are entirely lost.
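As a sanity check on the numbers in the steps above, here is a toy bookkeeping sketch (plain Python; the key ranges are purely illustrative placeholders, not Hudi data):

```python
# Toy bookkeeping for the repro steps above; the counts (100, 10, 5) are the
# report's example numbers, and the integer key ranges are illustrative only.
table = set(range(100))        # step 1: 100 records after the last good commit
table |= set(range(100, 110))  # step 2: first commit after restart adds 10 keys
assert len(table) == 110       # 110 unique records, as observed in step 2

# Step 3: 5 more records arrive, but because the writer upserts onto the
# pre-restart base, the step-2 batch silently disappears from the table.
table -= set(range(100, 110))
table |= set(range(110, 115))
print(len(table))              # 105, matching the observed count
```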
Reason:
Suppose Hudi is running and its timeline looks like the one below when you kill
the job:
# C1.commit.requested
# C1.inflight
# C1.commit
# C2.commit.requested
# C2.inflight
# C2.commit
# C3.commit.requested
Upon re-submission, after 1 commit cycle the timeline looks like
# C1.commit.requested
# C1.inflight
# C1.commit
# C2.commit.requested
# C2.inflight
# C2.commit
# R1.rollback.requested
# R1.rollback.inflight
# R1.rollback
# C4.commit.requested
# C4.inflight
# C4.commit
The next commit cycle loads R1.rollback as the latest instant in the timeline,
due to which the new incoming records get UPSERTed on the C2.commit instant
rather than C4.commit. This is because the rollback's timestamp is
chronologically greater than that of the commit that triggered it (i.e., in the
above example, R1 > C4). This creates a cascading data-loss effect while the
Kafka consumer offsets keep moving ahead.
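A toy model of this ordering problem may make it concrete. This is not Hudi's actual timeline API: the instant times, the rolled-back instant C3's timestamp, and the `buggy_base_commit` resolution logic are all hypothetical stand-ins chosen only to mirror the R1 > C4 ordering stated above:

```python
# Toy model of the failure described above; NOT Hudi's actual API. Instant
# times are illustrative strings chosen so that R1 sorts after C4, matching
# the report's claim that R1 > C4 chronologically.
TIMELINE = [
    # (instant_time, action)
    ("0001", "commit"),    # C1
    ("0002", "commit"),    # C2
    ("0005", "commit"),    # C4, the first commit after the restart
    ("0006", "rollback"),  # R1, rolling back the failed C3; note 0006 > 0005
]

def latest_instant(timeline):
    # The timeline is ordered by instant time, so the "latest" instant is
    # the rollback R1, not the data commit C4.
    return max(timeline, key=lambda i: i[0])

def buggy_base_commit(timeline, rollback_target="0003"):
    # Hypothetical sketch of the faulty resolution: every commit at or after
    # the rollback's target (C3, whose files were removed) is treated as
    # rolled back, so the writer falls back to C2 even though C4 completed.
    commits = [t for t, a in timeline if a == "commit" and t < rollback_target]
    return max(commits)

print(latest_instant(TIMELINE))     # ('0006', 'rollback')
print(buggy_base_commit(TIMELINE))  # '0002' -> new records upsert onto C2
```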
Refer to the commit timeline snapshot tagged in the github issue
[7757|https://github.com/apache/hudi/issues/7757].
> Missing records when HoodieDeltaStreamer run in continuous
> ----------------------------------------------------------
>
> Key: HUDI-5686
> URL: https://issues.apache.org/jira/browse/HUDI-5686
> Project: Apache Hudi
> Issue Type: Bug
> Components: deltastreamer
> Reporter: Sagar Sumit
> Assignee: Purushotham Pushpavanthar
> Priority: Critical
> Fix For: 0.13.1
--
This message was sent by Atlassian Jira
(v8.20.10#820010)