[
https://issues.apache.org/jira/browse/HUDI-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shizhi Chen updated HUDI-4287:
------------------------------
Description:
*Problem review*
CkpMetadata was introduced into the flink module to reduce the burden on the
timeline, but its mechanism currently lacks a status that corresponds to
rollback instants. This may result in the deletion of commit/delta commit
instants, so that StreamWriteOperatorCoordinator (meta end) and the write
function (data end) are no longer coordinated correctly.
In the end, data files are deleted by mistake.
This situation is easy to reproduce, especially when
StreamWriteOperatorCoordinator spends a long time scheduling table services
between the commit and init instants after restoring from a checkpoint.
*Stable Reproduction Procedure*
* a. Before starting a job, modify
StreamWriteOperatorCoordinator#notifyCheckpointComplete as follows:
!image-2022-06-27-19-42-14-676.png|width=629,height=385!
The change does nothing but mock potentially long-running table services, so
that the problem reproduces quickly.
* b. Start a simple flink hudi job such as append, and kill it while the 2nd
checkpoint is INFLIGHT.
* c. Restart the job from the checkpoint. It is sure to hit the case after
another 2 checkpoints, possibly accompanied by a FileNotFoundException:
!image-2022-06-27-20-29-49-897.png|width=580,height=445!
More importantly, we can observe that the two ends are out of sync:
!image-2022-06-27-20-07-55-984.png|width=593,height=125!
The screenshot above shows that at 2022-05-31 16:36 the instant should be
20220531163135119, which is the one committed by
StreamWriteOperatorCoordinator as the meta end.
!image-2022-06-27-20-11-47-939.png|width=590,height=177!
At the same time, the data files are written with the wrong base commit
instant, 20220531161923191. That instant was deleted during the rollback in
step c. because it was incomplete, and it should also have been evicted from
ckp_meta.
*Solution*
The solution is to extend the mechanism with a CANCELLED CkpMessage state that
has the highest priority and corresponds to the DELETE of an instant during the
rollback action.
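To illustrate the idea, here is a minimal Java sketch of a "highest priority
wins" state lookup. It is not the actual Hudi implementation: the class
CkpStateSketch, the enum ordering, and the helpers resolve and
lastPendingInstant are hypothetical names chosen for this example. It only
assumes that several state markers may exist for one instant and that a
CANCELLED marker must override all others, so a rolled-back instant is never
reported as pending to the write function.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class CkpStateSketch {
    /** Hypothetical states, declared in ascending priority order. */
    enum State { INFLIGHT, ABORTED, COMPLETED, CANCELLED }

    /** Hypothetical message: one state marker written for one instant. */
    static final class Message {
        final String instant;
        final State state;

        Message(String instant, State state) {
            this.instant = instant;
            this.state = state;
        }
    }

    /** Effective state of an instant: the highest-priority marker wins. */
    static Optional<State> resolve(List<Message> messages, String instant) {
        return messages.stream()
            .filter(m -> m.instant.equals(instant))
            .map(m -> m.state)
            .max(Comparator.naturalOrder()); // enum order = priority order
    }

    /** Latest instant whose effective state is still INFLIGHT. */
    static Optional<String> lastPendingInstant(List<Message> messages) {
        return messages.stream()
            .map(m -> m.instant)
            .distinct()
            .filter(i -> resolve(messages, i).orElse(null) == State.INFLIGHT)
            .max(Comparator.naturalOrder()); // fixed-width timestamps sort as strings
    }

    public static void main(String[] args) {
        List<Message> messages = Arrays.asList(
            new Message("20220531161923191", State.INFLIGHT),
            new Message("20220531161923191", State.CANCELLED), // rolled back
            new Message("20220531163135119", State.INFLIGHT));
        System.out.println(lastPendingInstant(messages).orElse("none"));
    }
}
```

In this sketch the rolled-back instant 20220531161923191 resolves to CANCELLED
even though an INFLIGHT marker still exists for it, so only
20220531163135119 remains as a candidate pending instant.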
> Optimize Flink checkpoint meta mechanism to fix mistaken pending instants
> -------------------------------------------------------------------------
>
> Key: HUDI-4287
> URL: https://issues.apache.org/jira/browse/HUDI-4287
> Project: Apache Hudi
> Issue Type: Bug
> Components: flink
> Reporter: Shizhi Chen
> Assignee: Shizhi Chen
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.12.0
>
> Attachments: image-2022-06-27-19-42-14-676.png,
> image-2022-06-27-19-55-20-210.png, image-2022-06-27-20-07-55-984.png,
> image-2022-06-27-20-11-47-939.png, image-2022-06-27-20-29-49-897.png
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)