[
https://issues.apache.org/jira/browse/MESOS-7057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander Rukletsov updated MESOS-7057:
---------------------------------------
Fix Version/s: 1.1.2
> Consider using the relink functionality of libprocess in the executor driver.
> -----------------------------------------------------------------------------
>
> Key: MESOS-7057
> URL: https://issues.apache.org/jira/browse/MESOS-7057
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 1.0.2, 1.1.0
> Reporter: Anand Mazumdar
> Assignee: Anand Mazumdar
> Labels: mesosphere
> Fix For: 1.1.2, 1.2.0
>
>
> As outlined in the root cause analysis for MESOS-5332, it is possible for an
> iptables firewall to terminate an idle connection after a timeout (the
> default is 5 days). Once this happens, the executor driver is not notified of
> the disconnection and keeps assuming that it is still connected to the
> agent.
> When the agent process is restarted, the executor still tries to reuse the
> old broken connection to send the re-register message to the agent. Only
> then does it realize that the connection is broken (due to the nature
> of TCP). It calls the {{exited}} callback and shuts itself down 15 minutes
> later, when the recovery timeout expires.
> To mitigate this, an executor should always {{relink}} when it receives a
> reconnect request from the agent.
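
The failure mode and the proposed fix described above can be illustrated with a small toy model. This is only a sketch: the `Connection`, `Executor`, and `simulate` names are hypothetical and are not part of libprocess or Mesos. It assumes that a silently dropped connection is only detected when something is sent on it, and it models "relink" as discarding the old connection and opening a new one.

```python
# Toy model only: these classes are hypothetical and are NOT the
# libprocess or Mesos API.


class Connection:
    """Stand-in for a TCP connection that a firewall may drop silently."""

    def __init__(self):
        self.alive = True

    def send(self, message):
        # A silently dropped connection only reveals itself on use.
        if not self.alive:
            raise ConnectionError("connection was dropped while idle")
        return message


class Executor:
    """Stand-in for an executor that holds one connection to the agent."""

    def __init__(self, relink_on_reconnect):
        self.relink_on_reconnect = relink_on_reconnect
        self.connection = Connection()
        self.exited = False

    def on_reconnect_request(self):
        """Handle a reconnect request from a restarted agent."""
        if self.relink_on_reconnect:
            # Discard the possibly stale connection and open a fresh one.
            self.connection = Connection()
        try:
            self.connection.send("re-register")
            return True
        except ConnectionError:
            # Corresponds to the `exited` callback mentioned in the ticket.
            self.exited = True
            return False


def simulate(relink_on_reconnect):
    """Return True if the re-register message gets through."""
    executor = Executor(relink_on_reconnect)
    executor.connection.alive = False  # firewall drops the idle connection
    return executor.on_reconnect_request()
```

In this model, `simulate(False)` returns `False` because the re-register message is sent on the stale connection, while `simulate(True)` returns `True` because the executor replaces the connection before sending.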
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)