[
https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721800#comment-16721800
]
Greg Mann commented on MESOS-7564:
----------------------------------
{code}
commit d82075bef6bb52e135427ff4916f684af4f9226b
Author: Joseph Wu <[email protected]>
Date: Thu Dec 13 16:34:21 2018 -0800

Added HEARTBEAT events and calls for the executor HTTP API.

These new messages are meant to be backwards compatible, in that
they won't cause crashes when new executors send heartbeats to old
agents, or new agents send heartbeats to old executors. All recipients
of these heartbeats are currently expected to ignore them, as their
only purpose is to keep certain connections from being marked "stale"
by network intermediaries.

Review: https://reviews.apache.org/r/69463/
{code}
{code}
commit ba46deb2ba31bd7f3d9bff3db979a1e850eedf0c
Author: Joseph Wu <[email protected]>
Date: Thu Dec 13 16:34:24 2018 -0800

Refactored master and agent streaming connections.

This moves the very similar `HttpConnection` classes inside the
master and agent into a common header. The refactored
`StreamingHttpConnection<Event>` is more explicitly named to avoid
potentially clashing with the libprocess HTTP helpers.

This also moves the master's heartbeater helper into a new header
and transforms it into an RAII libprocess actor wrapper. The
heartbeater depends on this `StreamingHttpConnection` and is currently
used by the master for heartbeating the operator event stream
and HTTP framework connection. A later patch will use this heartbeater
for agent->executor heartbeats.

Review: https://reviews.apache.org/r/69472/
{code}
{code}
commit a47c7dea6ccb3558464219f3c6edf376b2f55086
Author: Joseph Wu <[email protected]>
Date: Thu Dec 13 16:34:36 2018 -0800

Added heartbeaters for agent and HTTP executors.

This implements two separate heartbeaters for Executor Events (agent
to executor) and Executor Calls (executor to agent). Both are set to
non-configurable intervals of 30 minutes, which should be sufficient
to keep the connections alive while not flooding logs with warnings
if the executor/agent does not have this patch.

Review: https://reviews.apache.org/r/69473/
{code}
{code}
commit 828a28cec699e11d16006a6596b9f88ff75c55c0
Author: Joseph Wu <[email protected]>
Date: Thu Dec 13 16:34:43 2018 -0800

Added tests for agent/executor heartbeating.

This adds two separate tests which check if the Agent sends heartbeats
to HTTP executors, and if the HTTP executor driver sends heartbeats
to the agent.

Review: https://reviews.apache.org/r/69474/
{code}
> Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
> -----------------------------------------------------------------------------
>
> Key: MESOS-7564
> URL: https://issues.apache.org/jira/browse/MESOS-7564
> Project: Mesos
> Issue Type: Bug
> Components: agent, executor
> Reporter: Anand Mazumdar
> Assignee: Joseph Wu
> Priority: Critical
> Labels: api, foundations, mesosphere, v1_api
>
> Currently, there are no heartbeats for executor <-> agent communication.
> This is especially problematic when IPFilters are enabled, since
> the default conntrack keepalive timeout is 5 days. Once that timeout
> elapses, the executor is not notified via a socket disconnection when
> the agent process restarts. The executor is then killed if it does not
> re-register by the time the agent's recovery process completes.
> Application-level heartbeats or TCP keepalives are possible
> ways to fix this issue.
> We should also update the executor API documentation to explain the new behavior.