[
https://issues.apache.org/jira/browse/HUDI-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shizhi Chen updated HUDI-4287:
------------------------------
Status: In Progress (was: Open)
> Optimize Flink checkpoint meta mechanism to fix mistaken pending instants
> -------------------------------------------------------------------------
>
> Key: HUDI-4287
> URL: https://issues.apache.org/jira/browse/HUDI-4287
> Project: Apache Hudi
> Issue Type: Bug
> Components: flink
> Reporter: Shizhi Chen
> Assignee: Shizhi Chen
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.12.0
>
> Attachments: image-2022-06-27-19-42-14-676.png,
> image-2022-06-27-19-55-20-210.png, image-2022-06-27-20-07-55-984.png,
> image-2022-06-27-20-11-47-939.png, image-2022-06-27-20-29-49-897.png
>
>
> *Problem review*
> CkpMetadata was introduced into the flink module to reduce the burden on the
> timeline, but its mechanism currently lacks a status corresponding to
> rollback instants. Rollbacks may delete commit/delta commit instants, and
> without that status StreamWriteOperatorCoordinator (meta end) and the write
> function (data end) will not be coordinated correctly.
> As a result, data files will be deleted by mistake.
> This situation is especially easy to reproduce when
> StreamWriteOperatorCoordinator spends a long time scheduling table services
> between the commit and init of instants after restoring from a checkpoint.
>
> *Stable Reproduction Procedure*
> * a. Before starting a job, modify
> StreamWriteOperatorCoordinator#notifyCheckpointComplete as follows:
> !image-2022-06-27-19-42-14-676.png|width=479,height=293!
> The change does nothing except mock potentially long-running table services
> so that the issue reproduces quickly.
> * b. Start a simple flink hudi job, such as append, and kill it while the
> 2nd checkpoint is INFLIGHT.
> * c. Restart the job from the checkpoint. It is sure to hit the case after
> another 2 checkpoints, possibly accompanied by a
> FileNotFoundException:
> !image-2022-06-27-20-29-49-897.png|width=503,height=386!
> More importantly, we can observe the lack of coordination:
> !image-2022-06-27-20-07-55-984.png|width=517,height=109!
> The screenshot above shows that the instant should be 20220531163135119 at
> 2022-05-31 16:36, which is committed by StreamWriteOperatorCoordinator as the
> meta end.
> !image-2022-06-27-20-11-47-939.png|width=517,height=155!
> At the same time, the data files are written with the wrong base commit
> instant, 20220531161923191, which was deleted during the rollback in step c.
> because it was incomplete, and which should also have been evicted from
> ckp_meta.
>
> *Solution*
> The solution is to extend the mechanism with a CANCELLED CkpMessage state
> that has the highest priority and corresponds to the DELETE instant during
> the rollback action.
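
A minimal sketch of how a highest-priority CANCELLED state could keep a rolled-back instant from being treated as pending. This is an illustration under assumptions, not the Hudi implementation: the class `CkpStateSketch`, the `Msg` type, and the methods `resolve` and `lastPendingInstant` are hypothetical names, and the ordering of the other states is assumed for the example. Only the idea that CANCELLED outranks every other state comes from the description above.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical sketch; none of these names are taken from the Hudi code base.
class CkpStateSketch {
    // Declaration order doubles as priority: a later constant wins.
    // CANCELLED is last, so it overrides any other state for the same instant.
    enum State { INFLIGHT, ABORTED, COMPLETED, CANCELLED }

    // One checkpoint message: an instant timestamp plus a state.
    static final class Msg {
        final String instant;
        final State state;

        Msg(String instant, State state) {
            this.instant = instant;
            this.state = state;
        }
    }

    // Collapse all messages of an instant into its highest-priority state.
    // TreeMap keeps instants sorted; fixed-width timestamps sort chronologically.
    static Map<String, State> resolve(List<Msg> msgs) {
        Map<String, State> resolved = new TreeMap<>();
        for (Msg m : msgs) {
            resolved.merge(m.instant, m.state,
                (a, b) -> a.compareTo(b) >= 0 ? a : b);
        }
        return resolved;
    }

    // Latest instant whose resolved state is still INFLIGHT.
    // An instant that was rolled back (CANCELLED) can never be returned.
    static Optional<String> lastPendingInstant(List<Msg> msgs) {
        String last = null;
        for (Map.Entry<String, State> e : resolve(msgs).entrySet()) {
            if (e.getValue() == State.INFLIGHT) {
                last = e.getKey();
            }
        }
        return Optional.ofNullable(last);
    }
}
```

With the two instants from the screenshots, an INFLIGHT message for 20220531161923191 followed by a CANCELLED message for the same instant resolves to CANCELLED, so the data end would only see 20220531163135119 as pending.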
--
This message was sent by Atlassian Jira
(v8.20.7#820007)