[ https://issues.apache.org/jira/browse/FLINK-9043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470565#comment-16470565 ]
ASF GitHub Bot commented on FLINK-9043:
---------------------------------------

GitHub user sihuazhou opened a pull request:

    https://github.com/apache/flink/pull/5987

    [FLINK-9043][CLI]Automatically search for the last successful checkpoint when recover the job from externalized checkpoint

    ## What is the purpose of the change

    *Automatically search for the last successful checkpoint starting from the given checkpoint pointer. This makes it easier for users to recover a job from an externalized checkpoint.*

    ## Brief change log

    - *Automatically search for the last successful checkpoint when recovering the job from an externalized checkpoint*

    ## Verifying this change

    This change extends `AbstractFileCheckpointStorageTestBase#testPointerPathResolution()`.

    ## Does this pull request potentially affect one of the following parts:

    - Dependencies (does it add or upgrade a dependency): (no)
    - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
    - The serializers: (yes / no / don't know)
    - The runtime per-record code paths (performance sensitive): (no)
    - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
    - The S3 file system connector: (no)

    ## Documentation

    - Does this pull request introduce a new feature? (yes)
    - I will add the documentation once the review is almost complete.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sihuazhou/flink FLINK-9043

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5987.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5987

----
commit d95f76eceb851a1e707f78bce5085b8b09d8464b
Author: sihuazhou <summerleafs@...>
Date:   2018-05-10T05:17:09Z

    generate the meta file only when the writing is truly successful.

commit 478e746c7980beeca700c9c492a47e5a380c7efb
Author: sihuazhou <summerleafs@...>
Date:   2018-05-10T15:30:17Z

    support searching for the last successful checkpoint from the sub-directory of the given checkpoint pointer.

----

> Introduce a friendly way to resume the job from externalized checkpoints
> automatically
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-9043
>                 URL: https://issues.apache.org/jira/browse/FLINK-9043
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: godfrey johnson
>            Assignee: Sihua Zhou
>            Priority: Major
>
> I know a Flink job can recover from a checkpoint with a restart strategy,
> but it cannot recover at job startup the way Spark Streaming jobs can.
> Every time, the submitted Flink job is regarded as a new job, whereas a
> Spark Streaming job first detects the checkpoint directory and then
> recovers from the latest successful checkpoint. Flink, however, can only
> recover after the job has failed, and then retries according to the
> restart strategy.
>
> So, would Flink support recovering from a checkpoint directly in a new job?
> h2. New description by [~sihuazhou]
> Currently, it is not very friendly for users to recover a job from an
> externalized checkpoint: the user needs to find the dedicated directory
> for the job, which is not easy when there are many jobs. This ticket
> intends to introduce a friendlier way to let the user recover from an
> externalized checkpoint.
> The implementation steps are copied from the comments of [~StephanEwen]:
>  - We could make this an option where you pass a flag (-r) to automatically
> look for the latest checkpoint in a given directory.
>  - If more than one job checkpointed there before, this operation would fail.
>  - We might also need a way to have jobs not create the UUID subdirectory,
> otherwise the scanning for the latest checkpoint would not easily work.
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
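
The scanning described in the implementation steps above can be sketched roughly as follows. This is an illustration only, not the code in the pull request. It assumes a layout of `<base>/<job-id>/chk-<n>/_metadata` and treats a checkpoint as successful only if its metadata file exists, in line with the first commit. The names `find_latest_checkpoint` and `completed_checkpoints` are hypothetical.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Assumed layout (not taken from the PR): <base>/<job-id>/chk-<n>/_metadata
CHECKPOINT_DIR = re.compile(r"^chk-(\d+)$")
METADATA_FILE = "_metadata"


def completed_checkpoints(job_dir: Path) -> List[Tuple[int, Path]]:
    """Return (number, path) for each chk-<n> directory that has a metadata file."""
    found = []
    for child in job_dir.iterdir():
        match = CHECKPOINT_DIR.match(child.name)
        if match and child.is_dir() and (child / METADATA_FILE).is_file():
            found.append((int(match.group(1)), child))
    return found


def find_latest_checkpoint(base: Path) -> Path:
    """Find the newest completed checkpoint under base, failing on ambiguity."""
    # The pointer may be a job directory itself, or a parent directory
    # that holds one subdirectory per job.
    candidates = [base] + [d for d in base.iterdir() if d.is_dir()]
    per_job = {}
    for directory in candidates:
        checkpoints = completed_checkpoints(directory)
        if checkpoints:
            per_job[directory] = checkpoints

    if not per_job:
        raise FileNotFoundError(f"no completed checkpoint under {base}")
    if len(per_job) > 1:
        # Mirrors the second step: more than one job checkpointed here.
        raise ValueError(f"checkpoints of {len(per_job)} jobs found under {base}")

    (checkpoints,) = per_job.values()
    # Compare checkpoint numbers as integers so that chk-10 beats chk-9.
    return max(checkpoints, key=lambda pair: pair[0])[1]
```

Calling `find_latest_checkpoint(Path("/checkpoints"))` would return the highest-numbered `chk-<n>` directory that contains a metadata file, and raise an error if the directory holds completed checkpoints from more than one job. The path here is a placeholder.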