[ https://issues.apache.org/jira/browse/FLINK-6174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15938416#comment-15938416 ]

ASF GitHub Bot commented on FLINK-6174:
---------------------------------------

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/3599

    -1 sorry. This needs to go to the drawing board (FLIP or detailed JIRA discussion) before we consider a change that affects the guarantees and failure modes so heavily.

    Some initial comments:

      - In proper HA, you need some service that "locks" the leader, otherwise you are vulnerable to the "split brain" problem, where a network partition makes multiple JobManagers work as leaders, each with some TaskManagers.

      - In FLIP-6, we are introducing the `HighAvailabilityServices` to allow for multiple levels of guarantees with different implementations. I can see that introducing a highly available but not split-brain-protected mode is interesting, but it should not replace any existing mode; it should be a new mode.


> Introduce a leader election service in yarn mode to make JobManager always available
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-6174
>                 URL: https://issues.apache.org/jira/browse/FLINK-6174
>             Project: Flink
>          Issue Type: Improvement
>          Components: JobManager
>            Reporter: Tao Wang
>            Assignee: Tao Wang
>
> Currently, in yarn mode, if we use ZooKeeper as the high availability choice, it will
> create an election service that determines the leader through ZooKeeper election.
> When the ZooKeeper leader crashes or the connection between the JobManager and the
> ZooKeeper instance is broken, the JobManager's leadership is revoked and it
> sends a Disconnect message to the TaskManagers, which cancel all running tasks
> and wait for the connection between JM and ZK to be re-established.
> In yarn mode, we have one and only one JobManager (AM) at a time, and it should
> always be the leader instead of being elected through ZooKeeper. We can introduce a new
> leader election service in yarn mode to achieve that.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
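
The split-brain concern raised in the review comment above can be illustrated with a minimal Java sketch. All names here (`SplitBrainSketch`, `LeaderLock`, `Election`, `AlwaysLeaderElection`, `LockedElection`) are hypothetical and are not Flink's `HighAvailabilityServices` or any Flink API. The in-memory `LeaderLock` only stands in for an external lock service. The sketch assumes two JobManager instances that both believe they are active, for example on opposite sides of a network partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class SplitBrainSketch {

    /** Hypothetical stand-in for an external service that "locks" the leader. */
    static final class LeaderLock {
        private final AtomicReference<String> holder = new AtomicReference<>(null);

        /** Succeeds only for the first candidate that claims the lock. */
        boolean tryAcquire(String candidateId) {
            return holder.compareAndSet(null, candidateId);
        }
    }

    /** Hypothetical election strategy; not Flink's interface. */
    interface Election {
        boolean isLeader(String candidateId);
    }

    /** Every candidate is told it is the leader: available, but not split-brain protected. */
    static final class AlwaysLeaderElection implements Election {
        @Override
        public boolean isLeader(String candidateId) {
            return true;
        }
    }

    /** Leadership is granted only to the candidate that holds the shared lock. */
    static final class LockedElection implements Election {
        private final LeaderLock lock;

        LockedElection(LeaderLock lock) {
            this.lock = lock;
        }

        @Override
        public boolean isLeader(String candidateId) {
            return lock.tryAcquire(candidateId);
        }
    }

    /** Returns the candidates that consider themselves leader under the given strategy. */
    static List<String> leaders(Election election, String... candidates) {
        List<String> result = new ArrayList<>();
        for (String candidate : candidates) {
            if (election.isLeader(candidate)) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two JobManagers without a lock: both act as leader.
        System.out.println(leaders(new AlwaysLeaderElection(), "jm-1", "jm-2"));
        // Two JobManagers with a shared lock: only the first one becomes leader.
        System.out.println(leaders(new LockedElection(new LeaderLock()), "jm-1", "jm-2"));
    }
}
```

With the always-leader strategy both candidates report leadership, which is the failure mode the comment warns about. With the lock only one does, at the cost of depending on the lock service being reachable.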