[ https://issues.apache.org/jira/browse/FLINK-24021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aitozi updated FLINK-24021:
---------------------------
    Description: 
We currently use ZooKeeper for leader election and retrieval in HA mode. We register a fatalError handler in leaderElectionService and leaderRetrievalService so that the jobManager or taskManager process exits when an unexpected error occurs. However, we do not do this when calling curatorFrameworkClient#start in ZookeeperUtils. This can lead to a failure sequence like the following:

# ZookeeperUtils starts the curator client, but the start fails because of a network loss. No exception is thrown, because no error handler is registered.
# The network recovers by the time the master begins leader election, so the election succeeds.
# The leaderRetrieval begins to work via get_data, but the call is never executed, because the curator client failed to start in step 1.

So I think we should register an error handler in step 1, so that we can fail fast.

  was:
We currently use ZooKeeper for leader election and retrieval in HA mode. We register a fatalError handler in leaderElectionService and leaderRetrievalService so that the jobManager or taskManager process exits when an unexpected error occurs. However, we do not do this when calling curatorFrameworkClient#start in ZookeeperUtils. This can lead to a failure sequence like the following:

# ZookeeperUtils starts the curator client, but the start fails because of a network loss. No exception is thrown, because no error handler is registered.
# The network recovers by the time the master begins leader election, so the election succeeds.
# The leaderRetrieval begins to work via get_data periodically, but the call is never executed, because the curator client failed to start in step 1.

So I think we should register an error handler in step 1, so that we can fail fast.
> Potential job unrecoverable due to Network failure
> --------------------------------------------------
>
>                 Key: FLINK-24021
>                 URL: https://issues.apache.org/jira/browse/FLINK-24021
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Aitozi
>            Priority: Critical
>
> We currently use ZooKeeper for leader election and retrieval in HA mode. We
> register a fatalError handler in leaderElectionService and
> leaderRetrievalService so that the jobManager or taskManager process exits
> when an unexpected error occurs. However, we do not do this when calling
> curatorFrameworkClient#start in ZookeeperUtils. This can lead to a failure
> sequence like the following:
>
> # ZookeeperUtils starts the curator client, but the start fails because of a
> network loss. No exception is thrown, because no error handler is registered.
> # The network recovers by the time the master begins leader election, so the
> election succeeds.
> # The leaderRetrieval begins to work via get_data, but the call is never
> executed, because the curator client failed to start in step 1.
>
> So I think we should register an error handler in step 1, so that we can fail
> fast.
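
The fail-fast idea in the description can be illustrated with a small, dependency-free sketch. Everything here is hypothetical: StubClient stands in for the curator client and RecordingFatalErrorHandler for Flink's fatal error handler; neither is the real Curator or Flink API. The sketch assumes, as the description reports, that a failed start does not throw on its own. In Curator itself, the comparable hook would be CuratorFramework#getUnhandledErrorListenable().addListener(...), registered before start() is called.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

public class FailFastStartSketch {

    /** Hypothetical stand-in for a ZooKeeper client; NOT the Curator API. */
    static final class StubClient {
        private final boolean networkUp;
        private final List<BiConsumer<String, Throwable>> listeners =
                new CopyOnWriteArrayList<>();
        private boolean started;

        StubClient(boolean networkUp) {
            this.networkUp = networkUp;
        }

        void addUnhandledErrorListener(BiConsumer<String, Throwable> listener) {
            listeners.add(listener);
        }

        /** Assumption from the issue: a failed start does not throw. */
        void start() {
            if (networkUp) {
                started = true;
                return;
            }
            Throwable cause = new IOException("connection loss during start");
            // Without any registered listener, the failure is silently dropped.
            for (BiConsumer<String, Throwable> listener : listeners) {
                listener.accept("client start failed", cause);
            }
        }

        boolean isStarted() {
            return started;
        }
    }

    /** Hypothetical fatal error handler; a real process would exit here. */
    static final class RecordingFatalErrorHandler {
        final List<Throwable> errors = new ArrayList<>();

        void onFatalError(Throwable error) {
            errors.add(error);
        }
    }

    /** Registers the handler (if any) BEFORE start, then starts the client. */
    static StubClient startClient(boolean networkUp, RecordingFatalErrorHandler handler) {
        StubClient client = new StubClient(networkUp);
        if (handler != null) {
            client.addUnhandledErrorListener(
                    (message, cause) ->
                            handler.onFatalError(new IllegalStateException(message, cause)));
        }
        client.start();
        return client;
    }

    public static void main(String[] args) {
        RecordingFatalErrorHandler handler = new RecordingFatalErrorHandler();
        StubClient client = startClient(false, handler);
        System.out.println("started=" + client.isStarted()
                + ", fatalErrors=" + handler.errors.size());
    }
}
```

Calling startClient(false, null) models the current behaviour from step 1: the client is not started and nothing reports it. With a handler registered before start, the same failure reaches the handler immediately, which is the point at which the process could exit.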