[ https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721800#comment-16721800 ]

Greg Mann commented on MESOS-7564:
----------------------------------

{code}
commit d82075bef6bb52e135427ff4916f684af4f9226b
Author: Joseph Wu <jos...@mesosphere.io>
Date:   Thu Dec 13 16:34:21 2018 -0800

    Added HEARTBEAT events and calls for the executor HTTP API.

    These new messages are meant to be backwards compatible, in that
    they won't cause crashes when new executors send heartbeats to old
    agents, or new agents send heartbeats to old executors.  All recipients
    of these heartbeats are currently expected to ignore them, as their
    only purpose is to keep certain connections from being marked "stale"
    by network intermediaries.

    Review: https://reviews.apache.org/r/69463/
{code}
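
For readers unfamiliar with the pattern, the "recipients ignore heartbeats" behavior described in the commit above can be sketched roughly as follows. This is an illustrative Python model, not Mesos code: the `handle_event` function, the dict-based events, and the `handlers` table are all hypothetical names chosen for the example. The point is only that a `HEARTBEAT` is dropped silently and that an unrecognized event type produces a log line rather than a crash.

```python
# Hypothetical sketch, not the Mesos implementation.
# Events are modeled as plain dicts with a "type" field.

def handle_event(event, handlers, log):
    """Dispatch one event. Return True only if a handler actually ran."""
    event_type = event.get("type", "UNKNOWN")

    if event_type == "HEARTBEAT":
        # Heartbeats carry no work; they exist only to generate traffic
        # so that intermediaries do not consider the connection stale.
        return False

    handler = handlers.get(event_type)
    if handler is None:
        # A recipient that predates this event type warns and moves on.
        log.append("ignoring unrecognized event type: " + event_type)
        return False

    handler(event)
    return True


launched = []
log = []
# "LAUNCH" is used here purely as an example of an event that does work.
handlers = {"LAUNCH": lambda e: launched.append(e["task"])}

handle_event({"type": "LAUNCH", "task": "t1"}, handlers, log)
handle_event({"type": "HEARTBEAT"}, handlers, log)
handle_event({"type": "SOME_FUTURE_EVENT"}, handlers, log)
```

After these three calls, only the first event has reached a handler, and the log holds a single warning for the unknown type.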

{code}
commit ba46deb2ba31bd7f3d9bff3db979a1e850eedf0c
Author: Joseph Wu <jos...@mesosphere.io>
Date:   Thu Dec 13 16:34:24 2018 -0800

    Refactored master and agent streaming connections.

    This moves the very similar `HttpConnection` classes inside the
    master and agent into a common header.  The refactored
    `StreamingHttpConnection<Event>` is more explicitly named to avoid
    potentially clashing with the libprocess HTTP helpers.

    This also moves the master's heartbeater helper into a new header
    and transforms it into an RAII libprocess actor wrapper.  The
    heartbeater depends on this `StreamingHttpConnection` and is currently
    used by the master for heartbeating the operator event stream
    and HTTP framework connection.  A later patch will use this heartbeater
    for agent->executor heartbeats.

    Review: https://reviews.apache.org/r/69472/
{code}

{code}
commit a47c7dea6ccb3558464219f3c6edf376b2f55086
Author: Joseph Wu <jos...@mesosphere.io>
Date:   Thu Dec 13 16:34:36 2018 -0800

    Added heartbeaters for agent and HTTP executors.

    This implements two separate heartbeaters for Executor Events (agent
    to executor) and Executor Calls (executor to agent).  Both are set to
    non-configurable intervals of 30 minutes, which should be sufficient
    to keep the connections alive while not flooding logs with warnings
    if the executor/agent does not have this patch.

    Review: https://reviews.apache.org/r/69473/
{code}
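
As a rough illustration of a fixed-interval heartbeater like the one the commit above describes, here is a minimal Python sketch. It is not the libprocess actor used in Mesos: the `Heartbeater` class, its `advance_to` method, and the injected clock are assumptions made so that the example runs deterministically without threads or timers. The 30-minute interval comes from the commit message, and the 5-day figure comes from the issue description.

```python
# Hypothetical sketch, not the Mesos heartbeater.
# Time is passed in explicitly (in seconds) instead of using a real timer.

HEARTBEAT_INTERVAL_SECS = 30 * 60          # 30 minutes, per the commit message
CONNTRACK_TIMEOUT_SECS = 5 * 24 * 60 * 60  # 5 days, per the issue description


class Heartbeater:
    def __init__(self, send, interval_secs):
        self._send = send                 # callable that writes one message
        self._interval = interval_secs
        self._next_due = interval_secs    # first heartbeat after one interval

    def advance_to(self, now_secs):
        """Send every heartbeat that has become due by `now_secs`.

        Returns the number of heartbeats sent by this call.
        """
        sent = 0
        while self._next_due <= now_secs:
            self._send({"type": "HEARTBEAT"})
            self._next_due += self._interval
            sent += 1
        return sent


outbox = []
heartbeater = Heartbeater(outbox.append, HEARTBEAT_INTERVAL_SECS)

sent_early = heartbeater.advance_to(29 * 60)      # nothing is due yet
sent_later = heartbeater.advance_to(2 * 60 * 60)  # due at 30, 60, 90, 120 min

# The longest idle gap on the connection is one interval, which is far
# below the connection-tracking timeout mentioned in the issue.
stays_tracked = HEARTBEAT_INTERVAL_SECS < CONNTRACK_TIMEOUT_SECS
```

Driving the heartbeater from an explicit clock keeps the sketch testable; a real implementation would schedule the sends from a timer or event loop instead.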

{code}
commit 828a28cec699e11d16006a6596b9f88ff75c55c0
Author: Joseph Wu <jos...@mesosphere.io>
Date:   Thu Dec 13 16:34:43 2018 -0800

    Added tests for agent/executor heartbeating.

    This adds two separate tests which check if the Agent sends heartbeats
    to HTTP executors, and if the HTTP executor driver sends heartbeats
    to the agent.

    Review: https://reviews.apache.org/r/69474/
{code}

> Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7564
>                 URL: https://issues.apache.org/jira/browse/MESOS-7564
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, executor
>            Reporter: Anand Mazumdar
>            Assignee: Joseph Wu
>            Priority: Critical
>              Labels: api, foundations, mesosphere, v1_api
>
> Currently, we do not have heartbeats for executor <-> agent communication.
> This is especially problematic in scenarios where IPFilters are enabled, since
> the default conntrack keep-alive timeout is 5 days. Once that timeout
> elapses on an idle connection, the executor is no longer notified via a socket
> disconnection when the agent process restarts. The executor is then killed if
> it does not re-register once the agent recovery process completes.
> Enabling application-level heartbeats or TCP keepalives is one possible
> way to fix this issue.
> We should also update the executor API documentation to explain the new behavior.
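
The description mentions TCP keepalives as an alternative to application-level heartbeats. As a hedged illustration of what that option involves, the Python sketch below turns on keepalive probes for a socket using the standard `socket` module. The `enable_tcp_keepalive` helper and its default values are assumptions for this example and are not taken from Mesos. The `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` options are not defined on every platform, so the sketch checks for them before use.

```python
import socket


def enable_tcp_keepalive(sock, idle_secs=600, interval_secs=60, probes=5):
    """Enable TCP keepalive probes on `sock`.

    The default values are illustrative assumptions, not recommendations.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket tuning knobs; only set the ones this platform exposes.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_secs)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between successive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_secs)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is considered dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_tcp_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Keepalive probes operate below the application, so unlike the `HEARTBEAT` messages added for this issue they need no protocol change, but they are configured per socket by whichever side owns it.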



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
