Anand Mazumdar created MESOS-7057:
-------------------------------------
Summary: Consider using the relink in the executor driver.
Key: MESOS-7057
URL: https://issues.apache.org/jira/browse/MESOS-7057
Project: Mesos
Issue Type: Bug
Affects Versions: 1.1.0, 1.0.2
Reporter: Anand Mazumdar
Assignee: Anand Mazumdar
As outlined in the root cause analysis for MESOS-5332, an iptables firewall can
terminate an idle connection after a timeout (the default is 5 days). When this
happens, the executor driver is not notified of the disconnection and keeps
assuming that it is still connected to the agent.
When the agent process is restarted, the executor tries to reuse the old,
broken connection to send the re-register message to the agent. Only then does
it discover that the connection is broken (due to the nature of TCP). It calls
the {{exited}} callback and terminates itself 15 minutes later, when the
recovery timeout expires.
To avoid this, the executor should always {{relink}} to the agent when it
receives a reconnect request from the agent.
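
The difference between reusing the cached connection and relinking can be
illustrated with a small toy model. This is only a sketch under stated
assumptions: {{Agent}}, {{Link}}, {{Executor}}, {{on_reconnect}} and
{{run_scenario}} are hypothetical names made up for illustration, not the
libprocess or executor driver API, and the firewall drop is modeled as a
simple flag that the executor never inspects.

```python
class Agent:
    """Toy agent (hypothetical): records every message that reaches it."""

    def __init__(self):
        self.received = []


class Link:
    """Toy persistent connection (hypothetical) from executor to agent."""

    def __init__(self, agent):
        self.agent = agent
        # Set to True when a firewall drops the idle connection. The sender
        # is not notified; it only finds out on its next send.
        self.silently_dropped = False

    def send(self, message):
        if self.silently_dropped:
            raise ConnectionError("link was dropped while idle")
        self.agent.received.append(message)


class Executor:
    """Toy executor (hypothetical) that caches one link to the agent."""

    def __init__(self, agent, relink_on_reconnect):
        self.agent = agent
        self.relink_on_reconnect = relink_on_reconnect
        self.link = Link(agent)
        self.exited = False

    def on_reconnect(self):
        # Proposed behavior: always open a fresh link before replying,
        # instead of trusting the cached one.
        if self.relink_on_reconnect:
            self.link = Link(self.agent)
        try:
            self.link.send("re-register")
        except ConnectionError:
            # Stands in for the exited callback firing on the broken link.
            self.exited = True


def run_scenario(relink_on_reconnect):
    agent = Agent()
    executor = Executor(agent, relink_on_reconnect)
    # The firewall drops the idle connection without notifying the executor.
    executor.link.silently_dropped = True
    # The restarted agent asks the executor to reconnect.
    executor.on_reconnect()
    return executor.exited, agent.received


if __name__ == "__main__":
    print("reuse old link:", run_scenario(relink_on_reconnect=False))
    print("relink first:  ", run_scenario(relink_on_reconnect=True))
```

In this model, reusing the cached link fails and the re-register message never
reaches the agent, while relinking first delivers it. The real driver would
additionally have to deal with the recovery timeout and with in-flight
messages, which the sketch leaves out.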