[ https://issues.apache.org/jira/browse/FLINK-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374083#comment-16374083 ]
vinoyang commented on FLINK-7641:
---------------------------------

Hi [~StephanEwen],

Regarding this issue, we have implemented a feature: in v1.3.2 and v1.4.0, for standalone cluster (HA mode) deployments, the JobManager can act as a hot standby for running jobs.

More details: Flink's default behavior is that a JM leader election causes all jobs to fail and then recover. In our internal version, all jobs in the RUNNING state stay RUNNING across a JM leader election (jobs in other non-RUNNING, non-terminal states still follow the default handling). We have verified this feature in our production environment.

In our opinion, having all RUNNING jobs fail and recover when the active JM fails is very expensive, especially since most RUNNING jobs communicate with the JM only for checkpointing, updating accumulator snapshots and execution state, and other non-critical purposes.

Implementation details: we take a snapshot of each running job's execution graph (only some key information) and persist it to ZooKeeper. The snapshot is synchronized to all standby JMs at runtime. If the active JM fails, the new JM leader can still index all running jobs through their execution graph snapshots. We also snapshot slot resource information.

One awkward point: Flink is moving towards FLIP-6, where the job and JobManager are one-to-one in session (standalone) mode, so the cost of an active JM failure will be lower than in the old version. Even so, for each RUNNING job this feature can still reduce the cost of failing and recovering.

Do you think this feature has value in Flink's future plans? We look forward to your comments.

Thanks.

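The takeover rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual patch: the class, enum, and method names are hypothetical, the snapshot store is a plain in-memory list rather than ZooKeeper, no Flink API is used, and treating terminal jobs as needing no action is an assumption the comment does not spell out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HotStandbySketch {

    // Hypothetical, simplified job states; not Flink's own JobStatus enum.
    enum JobState { CREATED, RUNNING, RESTARTING, FINISHED, FAILED, CANCELED }

    // What the new leader does with a job it finds in the snapshot store.
    enum Action { KEEP_RUNNING, FAIL_AND_RECOVER, NOTHING }

    // Hypothetical "key information" persisted per job; the real fields are not listed.
    static final class JobSnapshot {
        final String jobId;
        final JobState state;

        JobSnapshot(String jobId, JobState state) {
            this.jobId = jobId;
            this.state = state;
        }
    }

    static boolean isTerminal(JobState state) {
        return state == JobState.FINISHED
                || state == JobState.FAILED
                || state == JobState.CANCELED;
    }

    // Decide per job: RUNNING jobs are kept, other non-terminal jobs
    // follow the default fail-and-recover path.
    static Action onLeaderChange(JobSnapshot snapshot) {
        if (snapshot.state == JobState.RUNNING) {
            return Action.KEEP_RUNNING;
        }
        if (isTerminal(snapshot.state)) {
            return Action.NOTHING; // assumption: nothing left to recover
        }
        return Action.FAIL_AND_RECOVER;
    }

    // The new leader indexes every snapshotted job and records its decision.
    static Map<String, Action> takeOver(List<JobSnapshot> snapshots) {
        Map<String, Action> plan = new LinkedHashMap<>();
        for (JobSnapshot snapshot : snapshots) {
            plan.put(snapshot.jobId, onLeaderChange(snapshot));
        }
        return plan;
    }

    public static void main(String[] args) {
        List<JobSnapshot> store = Arrays.asList(
                new JobSnapshot("job-a", JobState.RUNNING),
                new JobSnapshot("job-b", JobState.RESTARTING),
                new JobSnapshot("job-c", JobState.FINISHED));
        // prints {job-a=KEEP_RUNNING, job-b=FAIL_AND_RECOVER, job-c=NOTHING}
        System.out.println(takeOver(store));
    }
}
```

Keeping the decision in one pure function makes the difference from the default behavior easy to see: only the RUNNING branch departs from fail-and-recover.
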
> Loss of JobManager in HA mode should not cause jobs to fail
> -----------------------------------------------------------
>
>                 Key: FLINK-7641
>                 URL: https://issues.apache.org/jira/browse/FLINK-7641
>             Project: Flink
>          Issue Type: Improvement
>          Components: JobManager
>    Affects Versions: 1.3.2
>            Reporter: Elias Levy
>            Assignee: vinoyang
>            Priority: Major
>
> Currently if a standalone cluster of JobManagers is configured in
> high-availability mode and the master JM is lost, the job executing in the
> cluster will be restarted. This is less than ideal. It would be best if the
> jobs could continue to execute without restarting while one of the spare JMs
> becomes the new master, or in the worst case, the jobs are paused while the
> JM election takes place.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)