[ 
https://issues.apache.org/jira/browse/FLINK-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374083#comment-16374083
 ] 

vinoyang commented on FLINK-7641:
---------------------------------

Hi [~StephanEwen], regarding this issue, we have implemented a feature: in 
v1.3.2 and v1.4.0, for standalone cluster deployments in HA mode, a standby 
JobManager (JM) can act as a hot standby for running jobs.

 

More details: Unlike Flink's default behavior (a JM leader election causes all 
jobs to fail and then recover), in our internal version all jobs in the RUNNING 
state keep running during a JM leader election. Jobs that are neither RUNNING 
nor in a terminal state still follow the default handling. We have verified 
this feature in our production environment. In our opinion, failing and 
recovering all RUNNING jobs whenever the active JM fails is very expensive, 
especially since most RUNNING jobs communicate with the JM only for 
checkpointing, for updating accumulator snapshots and execution state, and for 
other non-critical purposes.
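
As an illustration only, the rule described above could be sketched in Java as 
follows. This is not the actual patch: LeaderFailoverSketch, State, Action and 
onNewLeader are hypothetical names, the state list is a simplified subset, and 
ignoring jobs that are already terminal is an assumption, since the description 
above only covers RUNNING and non-terminal jobs.

```java
// Hypothetical sketch; none of these names are Flink classes.
final class LeaderFailoverSketch {

    // Simplified job states (assumption: Flink's real status enum has more values).
    enum State { CREATED, RUNNING, RESTARTING, FINISHED, FAILED, CANCELED }

    // What the newly elected JM does with a job it finds.
    enum Action { KEEP_RUNNING, FAIL_AND_RECOVER, IGNORE }

    static boolean isTerminal(State s) {
        return s == State.FINISHED || s == State.FAILED || s == State.CANCELED;
    }

    static Action onNewLeader(State s) {
        if (s == State.RUNNING) {
            return Action.KEEP_RUNNING;      // the behavior proposed in this comment
        }
        if (isTerminal(s)) {
            return Action.IGNORE;            // assumption: nothing left to recover
        }
        return Action.FAIL_AND_RECOVER;      // default handling for everything else
    }

    public static void main(String[] args) {
        for (State s : State.values()) {
            System.out.println(s + " -> " + onNewLeader(s));
        }
    }
}
```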

 

Implementation details: We snapshot the execution graph of each running job 
(only some key information) and persist it to ZooKeeper. The snapshot is 
synchronized to all standby JMs at runtime. If the active JM fails, the new JM 
leader can index all running jobs using their execution graph snapshots. 
We also snapshot slot resource information.
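
A minimal sketch of the snapshot flow described above, in Java, for 
illustration only. It is not the actual implementation: SnapshotStoreSketch, 
JobSnapshot, SharedStore, publish and recover are hypothetical names, the 
snapshot fields are guesses at what "key information" might contain, and an 
in-memory map stands in for ZooKeeper so that the example is self-contained.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these types are Flink or ZooKeeper APIs.
final class SnapshotStoreSketch {

    // "Key information" for one job; the fields are illustrative assumptions.
    static final class JobSnapshot {
        final String jobId;
        final String state;
        final List<String> slotIds; // slot resource information

        JobSnapshot(String jobId, String state, List<String> slotIds) {
            this.jobId = jobId;
            this.state = state;
            this.slotIds = Collections.unmodifiableList(slotIds);
        }
    }

    // In-memory stand-in for the shared store (ZooKeeper in the text above).
    static final class SharedStore {
        private final Map<String, JobSnapshot> byJobId = new ConcurrentHashMap<>();

        void put(JobSnapshot s) { byJobId.put(s.jobId, s); }
        void remove(String jobId) { byJobId.remove(jobId); }
        Map<String, JobSnapshot> readAll() { return new HashMap<>(byJobId); }
    }

    // Active JM: keep the store in sync with the set of RUNNING jobs.
    static void publish(SharedStore store, JobSnapshot s) {
        if ("RUNNING".equals(s.state)) {
            store.put(s);
        } else {
            store.remove(s.jobId);
        }
    }

    // New leader: rebuild its index of running jobs from the store.
    static Map<String, JobSnapshot> recover(SharedStore store) {
        return store.readAll();
    }

    public static void main(String[] args) {
        SharedStore store = new SharedStore();
        publish(store, new JobSnapshot("job-a", "RUNNING", Arrays.asList("slot-1", "slot-2")));
        publish(store, new JobSnapshot("job-b", "RUNNING", Arrays.asList("slot-3")));
        publish(store, new JobSnapshot("job-b", "FINISHED", Collections.<String>emptyList()));

        // Simulated failover: the new leader reads back only the jobs still running.
        Map<String, JobSnapshot> recovered = recover(store);
        System.out.println("Recovered running jobs: " + recovered.keySet());
    }
}
```

A real store would also need serialization, versioning and fencing against a 
stale leader, which this sketch leaves out.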

 

However, this puts us in an awkward position: Flink is moving towards FLIP-6, 
where the job and the JobManager will be one-to-one in session (standalone) 
mode. The cost of an active JM failure will therefore be lower than in the old 
architecture. Even so, for each RUNNING job this feature would still reduce the 
cost of failing and recovering.

 

Do you think this feature has value for Flink's future plans? I look forward 
to your comments. Thanks.

 

> Loss of JobManager in HA mode should not cause jobs to fail
> -----------------------------------------------------------
>
>                 Key: FLINK-7641
>                 URL: https://issues.apache.org/jira/browse/FLINK-7641
>             Project: Flink
>          Issue Type: Improvement
>          Components: JobManager
>    Affects Versions: 1.3.2
>            Reporter: Elias Levy
>            Assignee: vinoyang
>            Priority: Major
>
> Currently if a standalone cluster of JobManagers is configured in 
> high-availability mode and the master JM is lost, the job executing in the 
> cluster will be restarted.  This is less than ideal.  It would be best if the 
> jobs could continue to execute without restarting while one of the spare JMs 
> becomes the new master, or in the worst case, the jobs are paused while the 
> JM election takes place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
