[jira] [Updated] (HUDI-4287) Optimize Flink checkpoint meta mechanism to fix mistaken pending instants

Shizhi Chen (Jira) Mon, 27 Jun 2022 05:47:05 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Shizhi Chen updated HUDI-4287:
------------------------------
    Description: 
*Problem review*

CkpMetadata was introduced into the Flink module to reduce the burden on the 
timeline, but its mechanism currently lacks a status corresponding to rollback 
instants. This may result in the deletion of commit/delta commit instants, so 
that StreamWriteOperatorCoordinator (meta end) and the write function (data 
end) are not coordinated correctly.

As a result, data files are deleted by mistake.

This situation is easy to reproduce, especially when 
StreamWriteOperatorCoordinator spends a long time scheduling table services 
between committing and initializing instants after restoring from a checkpoint.

 

*Stable Reproduction Procedure*
 * a. Before starting a job, modify 
StreamWriteOperatorCoordinator#notifyCheckpointComplete as follows:
!image-2022-06-27-19-42-14-676.png|width=629,height=385! 
The change does nothing but mock potentially long-running table services, to 
make the reproduction fast.
 * b. Start a simple Flink Hudi job, such as an append job, and kill it while 
the 2nd checkpoint is INFLIGHT.
 * c. Restart the job from the checkpoint. It is sure to hit the case after 
another 2 checkpoints, possibly accompanied by a FileNotFoundException:
!image-2022-06-27-20-29-49-897.png|width=580,height=445! 
More importantly, we can observe the lack of coordination:
!image-2022-06-27-20-07-55-984.png|width=593,height=125! 
The screenshot above shows that the instant should be 20220531163135119 at 
2022-05-31 16:36, which is committed by StreamWriteOperatorCoordinator as the 
meta end.
!image-2022-06-27-20-11-47-939.png|width=590,height=177! 
At the same time, the data files are written with the wrong base commit 
instant, 20220531161923191, which was deleted during the rollbacks in step c 
because it was incomplete, and which should also have been evicted from 
ckp_meta.
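
The change in step a is only available as a screenshot. The following is a 
minimal, self-contained sketch of the idea, not the actual patch: it assumes 
the mock is a plain delay placed between the commit and the initialization of 
the next instant. The class SlowCoordinatorSketch, its event log, and the 
delay parameter are all hypothetical and do not exist in Hudi.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a coordinator; not Hudi's StreamWriteOperatorCoordinator. */
final class SlowCoordinatorSketch {
  private final long tableServiceDelayMs;
  private final List<String> events = new ArrayList<>();

  SlowCoordinatorSketch(long tableServiceDelayMs) {
    this.tableServiceDelayMs = tableServiceDelayMs;
  }

  /** Commits, then waits (mocked table services), then initializes the next instant. */
  void notifyCheckpointComplete(long checkpointId) {
    events.add("commit:" + checkpointId);
    try {
      // Assumption: a sleep stands in for long-running table service scheduling.
      Thread.sleep(tableServiceDelayMs);
    } catch (InterruptedException e) {
      // Restore the interrupt flag and continue, so the sketch stays simple.
      Thread.currentThread().interrupt();
    }
    events.add("init-instant:" + checkpointId);
  }

  /** Returns the recorded events in the order they happened. */
  List<String> events() {
    return events;
  }

  public static void main(String[] args) {
    SlowCoordinatorSketch coordinator = new SlowCoordinatorSketch(100L);
    coordinator.notifyCheckpointComplete(1L);
    System.out.println(coordinator.events());
  }
}
```

The longer the delay, the wider the window in which a job killed in step b 
leaves a committed instant without a newly initialized one.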

 

*Solution*
The solution is to extend the mechanism with a CANCELLED CkpMessage state. It 
has the highest priority and corresponds to the DELETE of an instant during a 
rollback action.
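
To illustrate what "highest priority" could mean here, the following is a 
minimal sketch, not Hudi's actual CkpMetadata or CkpMessage code. It assumes 
that several messages may exist for one instant and that the state with the 
highest priority wins. The class CkpStateSketch, the COMPLETED state, and the 
methods resolve and isPending are hypothetical; only INFLIGHT and CANCELLED 
are named in this issue.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; not Hudi's actual CkpMetadata/CkpMessage classes. */
final class CkpStateSketch {

  /** Declared in ascending priority, so ordinal() doubles as the priority. */
  enum State { INFLIGHT, COMPLETED, CANCELLED }

  /** One checkpoint message: an instant time plus the state recorded for it. */
  static final class Message {
    final String instant;
    final State state;

    Message(String instant, State state) {
      this.instant = instant;
      this.state = state;
    }
  }

  /** Returns the highest-priority state recorded for the instant, or null if none. */
  static State resolve(List<Message> messages, String instant) {
    return messages.stream()
        .filter(m -> m.instant.equals(instant))
        .map(m -> m.state)
        .max(Comparator.comparingInt(State::ordinal))
        .orElse(null);
  }

  /** An instant counts as pending only if its resolved state is INFLIGHT. */
  static boolean isPending(List<Message> messages, String instant) {
    return resolve(messages, instant) == State.INFLIGHT;
  }

  public static void main(String[] args) {
    // Instant times are taken from the report; the states are illustrative.
    List<Message> messages = Arrays.asList(
        new Message("20220531161923191", State.INFLIGHT),
        new Message("20220531161923191", State.CANCELLED),
        new Message("20220531163135119", State.INFLIGHT));
    System.out.println(isPending(messages, "20220531161923191")); // false
    System.out.println(isPending(messages, "20220531163135119")); // true
  }
}
```

Under this assumption, a rolled-back instant can no longer be mistaken for a 
pending one by the write function, because its CANCELLED message overrides the 
earlier INFLIGHT message.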



> Optimize Flink checkpoint meta mechanism to fix mistaken pending instants
> -------------------------------------------------------------------------
>
>                 Key: HUDI-4287
>                 URL: https://issues.apache.org/jira/browse/HUDI-4287
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: flink
>            Reporter: Shizhi Chen
>            Assignee: Shizhi Chen
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.12.0
>
>         Attachments: image-2022-06-27-19-42-14-676.png, 
> image-2022-06-27-19-55-20-210.png, image-2022-06-27-20-07-55-984.png, 
> image-2022-06-27-20-11-47-939.png, image-2022-06-27-20-29-49-897.png
>
>



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
