[jira] [Updated] (FLINK-24021) Potential job unrecoverable due to Network failure

Aitozi (Jira) Mon, 30 Aug 2021 07:26:06 -0700


     [ https://issues.apache.org/jira/browse/FLINK-24021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Aitozi updated FLINK-24021:
---------------------------
    Description: 
Currently we use ZooKeeper for leader election and leader retrieval in HA 
mode. We register a fatalError handler in leaderElectionService and 
leaderRetrievalService so that the jobManager or taskManager process exits 
when an unexpected error occurs.

However, we do not do this when calling curatorFrameworkClient#start in 
ZookeeperUtils. This can lead to a failure sequence like the following:

 
 # ZookeeperUtils starts the Curator client, but the start fails because of a 
network loss. No exception is thrown, because no error handler is registered.
 # The network recovers by the time the master begins leader election, so the 
election succeeds.
 # The leaderRetrieval starts working via get_data, but the request is never 
executed, because the Curator client failed to start in step 1.

 

So I think we should register an error handler in step 1, so that we can fail 
fast.
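
The failure sequence above can be illustrated with a minimal, self-contained 
sketch. It does not use Curator or Flink classes: SketchClient, 
FatalErrorHandler and startAndDetectFailure are hypothetical names introduced 
only for illustration, and the assumption that a failed start is reported 
solely to registered listeners is a simplification of the behavior described 
in this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class FailFastSketch {

    /** Hypothetical stand-in for a client whose start() reports failures only to listeners. */
    static final class SketchClient {
        private final List<Consumer<Throwable>> errorListeners = new ArrayList<>();
        private final boolean networkUp;
        private boolean started;

        SketchClient(boolean networkUp) {
            this.networkUp = networkUp;
        }

        void addErrorListener(Consumer<Throwable> listener) {
            errorListeners.add(listener);
        }

        void start() {
            if (networkUp) {
                started = true;
                return;
            }
            Throwable cause = new IllegalStateException("connection lost during start");
            // With no listener registered, the failure is silently dropped here.
            for (Consumer<Throwable> listener : errorListeners) {
                listener.accept(cause);
            }
        }

        boolean isStarted() {
            return started;
        }
    }

    /** Hypothetical fatal error handler; a real one would terminate the process. */
    static final class FatalErrorHandler implements Consumer<Throwable> {
        private Throwable fatal;

        @Override
        public void accept(Throwable t) {
            fatal = t;
        }

        boolean failed() {
            return fatal != null;
        }
    }

    /** Returns true if a start failure was surfaced to the fatal error handler. */
    public static boolean startAndDetectFailure(boolean networkUp, boolean registerHandler) {
        SketchClient client = new SketchClient(networkUp);
        FatalErrorHandler handler = new FatalErrorHandler();
        if (registerHandler) {
            // The proposed fix: register the handler before start() is called.
            client.addErrorListener(handler);
        }
        client.start();
        return handler.failed();
    }

    public static void main(String[] args) {
        System.out.println("no handler, failure detected: " + startAndDetectFailure(false, false));
        System.out.println("handler, failure detected: " + startAndDetectFailure(false, true));
    }
}
```

Without the handler the failed start goes unnoticed and later requests would 
never run; with the handler registered before start, the failure is surfaced 
immediately and the process can exit.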

 

  was:
Currently we use ZooKeeper for leader election and leader retrieval in HA 
mode. We register a fatalError handler in leaderElectionService and 
leaderRetrievalService so that the jobManager or taskManager process exits 
when an unexpected error occurs.

However, we do not do this when calling curatorFrameworkClient#start in 
ZookeeperUtils. This can lead to a failure sequence like the following:

 
 # ZookeeperUtils starts the Curator client, but the start fails because of a 
network loss. No exception is thrown, because no error handler is registered.
 # The network recovers by the time the master begins leader election, so the 
election succeeds.
 # The leaderRetrieval starts working via periodic get_data calls, but these 
are never executed, because the Curator client failed to start in step 1.

 

So I think we should register an error handler in step 1, so that we can fail 
fast.

 


> Potential job unrecoverable due to Network failure
> --------------------------------------------------
>
>                 Key: FLINK-24021
>                 URL: https://issues.apache.org/jira/browse/FLINK-24021
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Aitozi
>            Priority: Critical
>
> Currently we use ZooKeeper for leader election and leader retrieval in HA 
> mode. We register a fatalError handler in leaderElectionService and 
> leaderRetrievalService so that the jobManager or taskManager process exits 
> when an unexpected error occurs.
> However, we do not do this when calling curatorFrameworkClient#start in 
> ZookeeperUtils. This can lead to a failure sequence like the following:
>  
>  # ZookeeperUtils starts the Curator client, but the start fails because of a 
> network loss. No exception is thrown, because no error handler is registered.
>  # The network recovers by the time the master begins leader election, so the 
> election succeeds.
>  # The leaderRetrieval starts working via get_data, but the request is never 
> executed, because the Curator client failed to start in step 1.
>  
> So I think we should register an error handler in step 1, so that we can fail 
> fast.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
