[ https://issues.apache.org/jira/browse/FLINK-24021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aitozi updated FLINK-24021:
---------------------------
Description:
Currently we use ZooKeeper for leader election and leader retrieval in HA mode.
We register a fatal error handler in leaderElectionService and
leaderRetrievalService so that the JobManager or TaskManager process exits when
an unexpected error occurs. However, we do not do this when calling
curatorFrameworkClient#start in ZookeeperUtils. This can lead to the following
sequence:
# ZookeeperUtils starts the Curator client, but the start fails because of a
network loss. No exception is thrown, because no error handler is registered.
# The network recovers by the time the master begins leader election, so the
election succeeds.
# The leaderRetrieval starts working via get_data, but the request is never
executed, because the Curator client failed to start in step 1.
So I think we should register an error handler in step 1, so that we can fail
fast.
was:
Currently we use ZooKeeper for leader election and leader retrieval in HA mode.
We register a fatal error handler in leaderElectionService and
leaderRetrievalService so that the JobManager or TaskManager process exits when
an unexpected error occurs. However, we do not do this when calling
curatorFrameworkClient#start in ZookeeperUtils. This can lead to the following
sequence:
# ZookeeperUtils starts the Curator client, but the start fails because of a
network loss. No exception is thrown, because no error handler is registered.
# The network recovers by the time the master begins leader election, so the
election succeeds.
# The leaderRetrieval starts working via periodic get_data calls, but they are
never executed, because the Curator client failed to start in step 1.
So I think we should register an error handler in step 1, so that we can fail
fast.
> Potential job unrecoverable due to Network failure
> --------------------------------------------------
>
> Key: FLINK-24021
> URL: https://issues.apache.org/jira/browse/FLINK-24021
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Reporter: Aitozi
> Priority: Critical
>
> Currently we use ZooKeeper for leader election and leader retrieval in HA mode.
> We register a fatal error handler in leaderElectionService and
> leaderRetrievalService so that the JobManager or TaskManager process exits when
> an unexpected error occurs. However, we do not do this when calling
> curatorFrameworkClient#start in ZookeeperUtils. This can lead to the following
> sequence:
>
> # ZookeeperUtils starts the Curator client, but the start fails because of a
> network loss. No exception is thrown, because no error handler is registered.
> # The network recovers by the time the master begins leader election, so the
> election succeeds.
> # The leaderRetrieval starts working via get_data, but the request is never
> executed, because the Curator client failed to start in step 1.
>
> So I think we should register an error handler in step 1, so that we can fail
> fast.
>
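The proposal in the description can be illustrated with a minimal, self-contained sketch. It does not use Flink or Curator classes: SimulatedClient, addErrorListener and startClient are hypothetical names invented for illustration, and FatalErrorHandler here is a local stand-in for the Flink interface of the same name. With the real Curator client, the hook would presumably be the listener container returned by CuratorFramework#getUnhandledErrorListenable(), but treat that mapping as an assumption rather than the actual fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class FailFastClientStart {

    /** Local stand-in for a fatal error handler (hypothetical, for illustration only). */
    interface FatalErrorHandler {
        void onFatalError(Throwable cause);
    }

    /**
     * Hypothetical client that mimics the behaviour described in the issue:
     * a failed start() does not throw, it only notifies registered listeners.
     */
    static class SimulatedClient {
        private final List<FatalErrorHandler> listeners = new ArrayList<>();
        private final boolean networkUp;
        private boolean started;

        SimulatedClient(boolean networkUp) {
            this.networkUp = networkUp;
        }

        void addErrorListener(FatalErrorHandler listener) {
            listeners.add(listener);
        }

        void start() {
            if (!networkUp) {
                Throwable cause = new IllegalStateException("connection lost during start");
                // Without any listener this failure would go unnoticed.
                for (FatalErrorHandler listener : listeners) {
                    listener.onFatalError(cause);
                }
                return;
            }
            started = true;
        }

        boolean isStarted() {
            return started;
        }
    }

    /** Registers the handler BEFORE start(), so a failed start is reported immediately. */
    static SimulatedClient startClient(boolean networkUp, FatalErrorHandler handler) {
        SimulatedClient client = new SimulatedClient(networkUp);
        client.addErrorListener(handler);
        client.start();
        return client;
    }

    public static void main(String[] args) {
        AtomicReference<Throwable> fatal = new AtomicReference<>();
        // A real handler would terminate the process; here we only record the error.
        SimulatedClient client = startClient(false, fatal::set);
        System.out.println("started=" + client.isStarted()
                + ", fatal=" + fatal.get().getMessage());
    }
}
```

The point of the sketch is only the ordering: because the handler is attached before start() is called, the failure in step 1 surfaces right away instead of showing up later as a leader retrieval that never runs.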