[ 
https://issues.apache.org/jira/browse/FLINK-9043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409254#comment-16409254
 ] 

Stephan Ewen commented on FLINK-9043:
-------------------------------------

To elaborate a bit on why it currently is the way it is:
  - By default, Flink adds the Job ID (a UUID) to the path for checkpoints, to 
ensure that two jobs that have the same checkpoint directory configured can 
never collide. That has proven very important for resilience against 
misconfiguration.
  - Automatic resumption still happens (with HA enabled) as long as you do not 
terminate the Flink job (cancel it, let it fail terminally, or let it finish) 
but instead just kill all processes or containers. As soon as you start some 
processes again, the same job will recover as you described.
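To make the collision-avoidance point concrete, here is a minimal sketch of the default layout: each job writes its checkpoints under a per-job subdirectory keyed by its job ID (a UUID), so two jobs configured with the same root can never write to the same path. The helper name and path format here are illustrative, not Flink's actual API.

```python
import uuid

# Illustrative sketch (not Flink's API): checkpoints land under
# <root>/<job-id>/chk-<n>, where the job ID is a fresh UUID per submission.
def checkpoint_path(root, job_id, checkpoint_n):
    return f"{root}/{job_id}/chk-{checkpoint_n}"

job_a = uuid.uuid4().hex
job_b = uuid.uuid4().hex

# Two jobs sharing the same configured root get disjoint paths,
# which is the misconfiguration resilience described above.
path_a = checkpoint_path("hdfs:///flink/checkpoints", job_a, 1)
path_b = checkpoint_path("hdfs:///flink/checkpoints", job_b, 1)
assert path_a != path_b
```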

Nonetheless, this could be an interesting feature, definitely.

The fact that different jobs can share the same checkpoint root directory, each 
with its own subdirectory (named by a UUID), gets in the way of scanning the 
checkpoint directory when you re-submit a job: the resubmitted job will have a 
different UUID.

  - We could make this an option where you pass a flag (-r) to automatically 
look for the latest checkpoint in a given directory.
  - If more than one job has checkpointed there before, this operation would fail.
  - We might also need a way to have jobs not create the UUID subdirectory, 
otherwise the scanning for the latest checkpoint would not easily work.
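As a rough sketch of what the proposed -r lookup could do, under the assumptions in the bullets above: treat any non-chk-<n> subdirectory as a per-job (UUID) directory, fail if more than one job has checkpointed under the root, and otherwise pick the checkpoint with the highest counter. Function name and layout are hypothetical, not Flink's actual implementation.

```python
import os
import re

CHK = re.compile(r"^chk-(\d+)$")

def find_latest_checkpoint(root):
    """Illustrative sketch of the proposed -r behaviour (not Flink's API).

    Treats every subdirectory of `root` that is not a chk-<n> directory as
    a per-job (UUID) directory; fails if more than one job has checkpointed
    under this root, as the second bullet above suggests.
    """
    entries = os.listdir(root)
    job_dirs = [e for e in entries
                if os.path.isdir(os.path.join(root, e)) and not CHK.match(e)]
    if len(job_dirs) > 1:
        raise RuntimeError(f"multiple jobs checkpointed under {root}: {job_dirs}")
    # If the job wrote no UUID subdirectory (third bullet), scan the root itself.
    scan_dir = os.path.join(root, job_dirs[0]) if job_dirs else root
    chks = [(int(CHK.match(e).group(1)), e)
            for e in os.listdir(scan_dir) if CHK.match(e)]
    if not chks:
        raise RuntimeError(f"no completed checkpoint found under {scan_dir}")
    _, latest = max(chks)
    return os.path.join(scan_dir, latest)
```

Note this sketch only looks at directory names; a real implementation would also have to verify that the chosen checkpoint actually completed, since a chk-<n> directory may belong to an in-progress or aborted checkpoint.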

> Flink recover from checkpoint like Spark Streaming 
> ---------------------------------------------------
>
>                 Key: FLINK-9043
>                 URL: https://issues.apache.org/jira/browse/FLINK-9043
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: godfrey johnson
>            Priority: Major
>
> I know a Flink job can recover from a checkpoint with a restart strategy, but it 
> cannot recover the way Spark Streaming jobs do when the job is starting.
> Every time, the submitted Flink job is regarded as a new job, while a Spark 
> Streaming job can detect the checkpoint directory first and then recover from 
> the latest successful checkpoint. Flink, however, can only recover after the 
> job has failed first, and then retries according to the strategy.
>  
> So, would flink support to recover from the checkpoint directly in a new job?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
