[
https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benjamin Mahler updated MESOS-7569:
-----------------------------------
Fix Version/s: 1.1.3
> Allow "old" executors with half-open connections to be preserved during agent
> upgrade / restart.
> ------------------------------------------------------------------------------------------------
>
> Key: MESOS-7569
> URL: https://issues.apache.org/jira/browse/MESOS-7569
> Project: Mesos
> Issue Type: Bug
> Components: agent
> Reporter: Benjamin Mahler
> Assignee: Benjamin Mahler
> Fix For: 1.2.2, 1.3.1, 1.4.0, 1.1.3
>
>
> Users who have executors in their cluster without the fix to MESOS-7057 will
> experience these executors potentially being destroyed whenever the agent
> restarts (or is upgraded).
> This occurs when these old executors have connections that have been idle for
> more than 5 days (the default conntrack TCP timeout). At this point, the
> connection has timed out and is no longer tracked by conntrack. From what
> we've seen, if the agent stays
> up, the packets still flow between the executor and agent. However, once the
> agent restarts, in some cases (presence of a DROP rule, or some flavors of
> NATing), the executor does not receive the RST/FIN from the kernel and will
> hold a half-open TCP connection. At this point, when the executor responds to
> the reconnect message from the restarted agent, its half-open TCP connection
> closes, and the executor will be destroyed by the agent.
> In order to allow users to preserve the tasks running in these "old"
> executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying
> of the reconnect message in the agent. This allows the old executor to
> correctly establish a link to the agent when the second reconnect message is
> handled.
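
To make the retry idea above concrete, here is a minimal toy model in Python. It is not Mesos code: the class `OldExecutor`, the function `reconnect_with_retries`, and the `max_attempts` parameter are hypothetical names invented for illustration. The model assumes, as described in the issue, that the first reply from an old executor only closes its half-open connection, and that a later reply can then open a fresh link.

```python
class OldExecutor:
    """Toy model of an executor without the MESOS-7057 fix (hypothetical)."""

    def __init__(self):
        # Half-open connection left over from before the agent restart.
        self.link_stale = True
        self.registered = False

    def handle_reconnect(self):
        """Handle one reconnect message; return True if re-registration succeeds."""
        if self.link_stale:
            # The reply is attempted over the half-open connection. The
            # write fails and the stale socket is closed; no reply arrives.
            self.link_stale = False
            return False
        # The stale socket is gone, so this reply opens a fresh link.
        self.registered = True
        return True


def reconnect_with_retries(executor, max_attempts):
    """Send the reconnect message up to max_attempts times.

    Returns the attempt number that succeeded, or None if the executor
    never re-registered (the case in which the agent would destroy it).
    """
    for attempt in range(1, max_attempts + 1):
        if executor.handle_reconnect():
            return attempt
    return None
```

In this model, a single reconnect message (`max_attempts=1`) never succeeds for an old executor, while allowing one retry (`max_attempts=2`) does, which mirrors the "second reconnect message" behavior described above.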
> Longer term, heartbeating or TCP keepalives will prevent the connections from
> reaching the conntrack timeout (see MESOS-7568).
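
As a rough sketch of the TCP keepalive option mentioned above, the following Python snippet enables keepalive probes on a socket using the standard `socket` module. It is not the Mesos implementation; the helper name `enable_keepalive` and the default timing values are assumptions chosen only so that probes start well before the 5 day conntrack timeout. The per-socket timing options are platform dependent, so they are applied only where the constants exist.

```python
import socket


def enable_keepalive(sock, idle_secs=600, interval_secs=60, probes=5):
    """Turn on TCP keepalive for sock (hypothetical helper, illustrative defaults)."""
    # Ask the kernel to send keepalive probes on an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Fine-grained timing knobs are not available on every platform,
    # so each one is set only if the constant is defined.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_secs)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between successive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_secs)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes allowed before the connection is dropped.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

Because the probes count as traffic on the connection, a connection configured this way should not sit idle long enough to reach the conntrack timeout.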
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)