[ https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721800#comment-16721800 ]

Greg Mann commented on MESOS-7564:
----------------------------------

{code}
commit d82075bef6bb52e135427ff4916f684af4f9226b
Author: Joseph Wu <jos...@mesosphere.io>
Date:   Thu Dec 13 16:34:21 2018 -0800

    Added HEARTBEAT events and calls for the executor HTTP API.

    These new messages are meant to be backwards compatible, in that
    they won't cause crashes when new executors send heartbeats to old
    agents, or new agents send heartbeats to old executors.

    All recipients of these heartbeats are currently expected to ignore
    them, as their only purpose is to keep certain connections from
    being marked "stale" by network intermediaries.

    Review: https://reviews.apache.org/r/69463/
{code}
{code}
commit ba46deb2ba31bd7f3d9bff3db979a1e850eedf0c
Author: Joseph Wu <jos...@mesosphere.io>
Date:   Thu Dec 13 16:34:24 2018 -0800

    Refactored master and agent streaming connections.

    This moves the very similar `HttpConnection` classes inside the
    master and agent into a common header. The refactored
    `StreamingHttpConnection<Event>` is more explicitly named to avoid
    potentially clashing with the libprocess HTTP helpers.

    This also moves the master's heartbeater helper into a new header
    and transforms it into an RAII libprocess actor wrapper. The
    heartbeater depends on this `StreamingHttpConnection` and is
    currently used by the master for heartbeating the operator event
    stream and HTTP framework connection. A later patch will use this
    heartbeater for agent->executor heartbeats.

    Review: https://reviews.apache.org/r/69472/
{code}
{code}
commit a47c7dea6ccb3558464219f3c6edf376b2f55086
Author: Joseph Wu <jos...@mesosphere.io>
Date:   Thu Dec 13 16:34:36 2018 -0800

    Added heartbeaters for agent and HTTP executors.

    This implements two separate heartbeaters for Executor Events
    (agent to executor) and Executor Calls (executor to agent).

    Both are set to non-configurable intervals of 30 minutes, which
    should be sufficient to keep the connections alive while not
    flooding logs with warnings if the executor/agent does not have
    this patch.

    Review: https://reviews.apache.org/r/69473/
{code}
{code}
commit 828a28cec699e11d16006a6596b9f88ff75c55c0
Author: Joseph Wu <jos...@mesosphere.io>
Date:   Thu Dec 13 16:34:43 2018 -0800

    Added tests for agent/executor heartbeating.

    This adds two separate tests which check if the Agent sends
    heartbeats to HTTP executors, and if the HTTP executor driver
    sends heartbeats to the agent.

    Review: https://reviews.apache.org/r/69474/
{code}

> Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7564
>                 URL: https://issues.apache.org/jira/browse/MESOS-7564
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, executor
>            Reporter: Anand Mazumdar
>            Assignee: Joseph Wu
>            Priority: Critical
>              Labels: api, foundations, mesosphere, v1_api
>
> Currently, we do not have heartbeats for executor <-> agent communication.
> This is especially problematic in scenarios when IPFilters are enabled since
> the default conntrack keep alive timeout is 5 days. When that timeout
> elapses, the executor doesn't get notified via a socket disconnection when
> the agent process restarts. The executor would then get killed if it doesn't
> re-register when the agent recovery process is completed.
> Enabling application level heartbeats or TCP KeepAlive's can be a possible
> way for fixing this issue.
> We should also update executor API documentation to explain the new behavior.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
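
For readers unfamiliar with the pattern, the heartbeating described in the commits above can be sketched roughly as follows. This is an illustrative Python sketch only, not the Mesos implementation (which is C++ on libprocess). The `Heartbeater` class, the `handle_event` function, and the dict representation of events are all hypothetical names chosen for this example. The context manager loosely mirrors the RAII wrapper mentioned in the second commit, and the recipient simply ignores heartbeats, as the first commit says recipients are expected to do.

```python
import threading
from typing import Callable, List

# Hypothetical stand-in for a HEARTBEAT event/call; the real messages are
# protobufs defined by the executor HTTP API.
HEARTBEAT = {"type": "HEARTBEAT"}


class Heartbeater:
    """Calls `send` with a heartbeat every `interval_secs` until stopped."""

    def __init__(self, send: Callable[[dict], None], interval_secs: float) -> None:
        self._send = send
        self._interval_secs = interval_secs
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait() returns False on timeout and True once the event is
        # set, so the loop sends one heartbeat per interval until stopped.
        while not self._stopped.wait(self._interval_secs):
            self._send(dict(HEARTBEAT))

    def __enter__(self) -> "Heartbeater":
        self._thread.start()
        return self

    def __exit__(self, *exc_info) -> None:
        # Stop heartbeating when the owning scope ends (RAII-like cleanup).
        self._stopped.set()
        self._thread.join()


def handle_event(event: dict, handled: List[dict]) -> None:
    """Recipient side: heartbeats are dropped, other events are processed."""
    if event.get("type") == "HEARTBEAT":
        return
    handled.append(event)
```

With the 30 minute interval from the third commit, usage would look like `with Heartbeater(connection_send, 30 * 60): ...`, where `connection_send` is whatever (assumed) function writes to the streaming connection.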