[ 
https://issues.apache.org/jira/browse/MESOS-7057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7057:
----------------------------------
    Target Version/s: 1.1.2  (was: 1.1.2, 1.3.0, 1.2.1)
       Fix Version/s:     (was: 1.3.0)
                      1.2.0

1.2.x backport
{noformat}
commit 056978a0ff49c4c4bc436fdf461e1d987d0ffe6a
Author: Anand Mazumdar <[email protected]>
Date:   Fri Feb 10 17:10:49 2017 -0800

    Modified the executor driver to always relink on agent failover.

    A relink is needed in cases where a netfilter module like iptables
    can terminate the connection without notifying the executor. This
    results in the executor still trying to reuse the stale "half-open"
    connection upon receiving the reconnect message from the agent
    leading to the erroneous behavior.

    Review: https://reviews.apache.org/r/56568/
{noformat}

> Consider using the relink functionality of libprocess in the executor driver.
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7057
>                 URL: https://issues.apache.org/jira/browse/MESOS-7057
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.2, 1.1.0
>            Reporter: Anand Mazumdar
>            Assignee: Anand Mazumdar
>              Labels: mesosphere
>             Fix For: 1.2.0
>
>
> As outlined in the root cause analysis for MESOS-5332, an iptables firewall 
> can terminate an idle connection after a timeout (the default is 5 days). 
> Once this happens, the executor driver is not notified of the disconnection 
> and continues to assume that it is still connected to the agent.
> When the agent process is restarted, the executor still tries to reuse the 
> old broken connection to send the re-register message to the agent. Only 
> then does it discover that the connection is broken (TCP reports the failure 
> only once data is sent on the dead connection), call the {{exited}} 
> callback, and terminate itself 15 minutes later when the recovery timeout 
> expires.
> To avoid this, an executor should always {{relink}} when it receives a 
> reconnect request from the agent.
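
The failure mode described above can be sketched with a small toy model. This is not the Mesos executor driver or the libprocess API: `Connection`, `LinkMode`, `ToyExecutorDriver`, and `reregisterAfterDrop` are hypothetical names invented for illustration, and the "firewall" is just a flag flipped on the connection. The sketch only contrasts reusing a silently dropped connection with always opening a fresh one when a reconnect request arrives.

```cpp
#include <cassert>
#include <memory>

// Toy stand-in for a TCP connection. `dropped` models a firewall that has
// silently discarded the connection; the owner is not told about it.
struct Connection {
  bool dropped = false;
};

// Hypothetical link modes for this sketch: keep the existing connection,
// or throw it away and open a new one.
enum class LinkMode { REUSE, RECONNECT };

class ToyExecutorDriver {
public:
  // Establish the link to the agent. With REUSE an existing connection is
  // kept as-is, even if it is stale; with RECONNECT it is always replaced.
  void link(LinkMode mode) {
    if (mode == LinkMode::RECONNECT || !connection) {
      connection = std::make_unique<Connection>();
    }
  }

  // Model of the firewall timing out the idle connection.
  void simulateFirewallDrop() {
    if (connection) {
      connection->dropped = true;
    }
  }

  // Handle the agent's reconnect request and try to send the re-register
  // message. Returns true if the message would reach the agent.
  bool onReconnect(bool alwaysRelink) {
    link(alwaysRelink ? LinkMode::RECONNECT : LinkMode::REUSE);
    return connection && !connection->dropped;
  }

private:
  std::unique_ptr<Connection> connection;
};

// Runs the scenario from the issue: link, idle until the firewall drops the
// connection, then receive a reconnect request after an agent restart.
bool reregisterAfterDrop(bool alwaysRelink) {
  ToyExecutorDriver driver;
  driver.link(LinkMode::REUSE);   // initial registration
  driver.simulateFirewallDrop();  // idle timeout, no notification
  return driver.onReconnect(alwaysRelink);
}
```

In this model, `reregisterAfterDrop(false)` returns `false` because the stale connection is reused, while `reregisterAfterDrop(true)` returns `true` because the relink replaces it before the re-register message is sent.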



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
