[
https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687389#comment-16687389
]
Joseph Wu commented on MESOS-7564:
----------------------------------
Historically, we've considered the agent<->executor connection to be reliable.
This is evident when you look at the agent's lack of handling for executor
disconnections. Currently, if an HTTP executor successfully registers and
then closes its connection, the agent will still consider the executor "RUNNING".
The agent will then merrily send all sorts of messages over the broken
connection (and onto the floor), including LaunchTask messages. The agent
might log warnings, but it does not attempt to reconnect (it can't). (The PID
executor does not have this problem, because libprocess will make transient
connections to send messages if the persistent connection breaks.)
If we are considering the agent<->executor connection to be unreliable, we
first need to add/test logic to handle executor disconnections. I believe it
may be sufficient to detect (even belatedly) disconnections on the agent, and
transition the agent's view of the executor from RUNNING to REGISTERING and
start the registration timeout. This would only be necessary for HTTP
executors.
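
A rough sketch of that transition, in Python for brevity rather than the agent's C++. The names here (ExecutorView, on_disconnected, tick, the TERMINATED state) are hypothetical and are not the agent's actual code; the sketch only illustrates the RUNNING -> REGISTERING move and the restarted registration timeout.

```python
import enum
import time


class ExecutorState(enum.Enum):
    REGISTERING = "REGISTERING"
    RUNNING = "RUNNING"
    TERMINATED = "TERMINATED"  # assumed outcome when the timeout expires


class ExecutorView:
    """Hypothetical agent-side view of one HTTP executor (not Mesos code)."""

    def __init__(self, registration_timeout, clock=time.monotonic):
        self.registration_timeout = registration_timeout
        self.clock = clock
        self.state = ExecutorState.REGISTERING
        self.deadline = clock() + registration_timeout

    def on_subscribed(self):
        # The executor (re)subscribed in time: cancel the timeout.
        if self.state is ExecutorState.REGISTERING:
            self.state = ExecutorState.RUNNING
            self.deadline = None

    def on_disconnected(self):
        # Disconnection detected, possibly belatedly: fall back to
        # REGISTERING and restart the registration timeout.
        if self.state is ExecutorState.RUNNING:
            self.state = ExecutorState.REGISTERING
            self.deadline = self.clock() + self.registration_timeout

    def tick(self):
        # Called periodically: give up on the executor once the
        # registration timeout has elapsed without a re-subscribe.
        if (self.state is ExecutorState.REGISTERING
                and self.clock() >= self.deadline):
            self.state = ExecutorState.TERMINATED
            self.deadline = None
```

While the view is in REGISTERING, the agent would presumably hold or drop outgoing messages instead of writing them to the dead connection.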
-----
Next, to handle cases where the connection is "connected" but dropping
packets, we will probably want to add heartbeats in both directions.
Just on the HTTP executor library, we have two connections to consider:
1) The SUBSCRIBE Call is one persistent connection where the executor sends one
Call, and receives a stream of Events. There is currently no Executor->Agent
traffic except the first request. This connection could probably use
heartbeating in both directions. Agent->Executor heartbeats may come in the
form of Events. Executor->Agent heartbeats will need to be something else
(like the heartbeating suggested here: https://reviews.apache.org/r/69183/ ).
2) Other calls go through a secondary connection. This persistent connection
is used to send any number of Calls and to receive their responses (202
Accepted). When the executor discovers a disconnection here, it remakes
both connections. This connection does not need heartbeating or monitoring.
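
For the SUBSCRIBE connection in 1), each side could track the peer's heartbeats with something like the sketch below (Python for brevity). The class name, the interval, and the "three missed heartbeats" threshold are assumptions for illustration only; neither the agent nor the executor library is known to use these values.

```python
import time


class HeartbeatMonitor:
    """Hypothetical tracker for heartbeats received from the peer."""

    def __init__(self, interval, max_missed=3, clock=time.monotonic):
        self.interval = interval      # expected seconds between heartbeats
        self.max_missed = max_missed  # assumed threshold, not a Mesos default
        self.clock = clock
        self.last_seen = clock()

    def on_heartbeat(self):
        # Record the arrival time of a heartbeat (or any other message).
        self.last_seen = self.clock()

    def missed(self):
        # Number of whole heartbeat intervals elapsed since the last one.
        return int((self.clock() - self.last_seen) // self.interval)

    def is_stale(self):
        # True once enough intervals have passed that the connection
        # should be treated as broken even though the socket looks open.
        return self.missed() >= self.max_missed
```

The executor would call on_heartbeat() for each heartbeat Event and resubscribe when is_stale() turns true; the agent would do the same for whatever Executor->Agent heartbeat is chosen and then treat the executor as disconnected.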
> Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
> -----------------------------------------------------------------------------
>
> Key: MESOS-7564
> URL: https://issues.apache.org/jira/browse/MESOS-7564
> Project: Mesos
> Issue Type: Bug
> Components: agent, executor
> Reporter: Anand Mazumdar
> Assignee: Joseph Wu
> Priority: Critical
> Labels: api, mesosphere, v1_api
>
> Currently, we do not have heartbeats for executor <-> agent communication.
> This is especially problematic in scenarios when IPFilters are enabled since
> the default conntrack keep alive timeout is 5 days. When that timeout
> elapses, the executor doesn't get notified via a socket disconnection when
> the agent process restarts. The executor would then get killed if it doesn't
> re-register when the agent recovery process is completed.
> Enabling application-level heartbeats or TCP keepalives is a possible
> way to fix this issue.