[ 
https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687389#comment-16687389
 ] 

Joseph Wu commented on MESOS-7564:
----------------------------------

Historically, we've considered the agent<->executor connection to be reliable.  
This is evident when you look at the agent's lack of handling for executor 
disconnections.  Currently, if an HTTP executor successfully registers, and 
then closes its connection, the agent will consider the executor "RUNNING".  
The agent will then merrily send all sorts of messages over the broken 
connection (and onto the floor), including LaunchTask messages.  The agent 
might log warnings, but it does not attempt to reconnect (it can't).  (The PID 
executor does not have this problem, because libprocess will make transient 
connections to send messages if the persistent connection breaks.)

If we are considering the agent<->executor connection to be unreliable, we 
first need to add/test logic to handle executor disconnections.  I believe it 
may be sufficient to detect (even belatedly) disconnections on the agent, and 
transition the agent's view of the executor from RUNNING to REGISTERING and 
start the registration timeout.  This would only be necessary for HTTP 
executors.
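
To make the proposed transition concrete, here is a minimal sketch in Python. It is not Mesos code: `ExecutorView`, `on_disconnected`, `on_tick`, and the timeout parameter are hypothetical names chosen for illustration, and time is passed in explicitly so the logic is easy to test. It assumes that an executor which fails to re-register before the registration timeout is treated as terminated.

```python
import enum


class ExecutorState(enum.Enum):
    REGISTERING = "REGISTERING"
    RUNNING = "RUNNING"
    TERMINATED = "TERMINATED"


class ExecutorView:
    """Hypothetical agent-side record of one HTTP executor (not Mesos code)."""

    def __init__(self, registration_timeout_secs, launched_at):
        self.registration_timeout_secs = registration_timeout_secs
        self.state = ExecutorState.REGISTERING
        # The executor must subscribe before this deadline.
        self.registration_deadline = launched_at + registration_timeout_secs

    def on_subscribed(self):
        # A successful (re-)subscription cancels the registration timeout.
        if self.state is ExecutorState.TERMINATED:
            return
        self.state = ExecutorState.RUNNING
        self.registration_deadline = None

    def on_disconnected(self, now):
        # Detected (possibly belatedly) that the persistent connection broke:
        # go back to REGISTERING and restart the registration timeout.
        if self.state is ExecutorState.RUNNING:
            self.state = ExecutorState.REGISTERING
            self.registration_deadline = now + self.registration_timeout_secs

    def on_tick(self, now):
        # Assumption: an executor that misses the deadline is given up on.
        if (self.state is ExecutorState.REGISTERING
                and self.registration_deadline is not None
                and now >= self.registration_deadline):
            self.state = ExecutorState.TERMINATED
```

With this shape, the agent would stop sending messages such as LaunchTask to an executor whose state is no longer RUNNING, rather than writing them to a broken connection.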

-----

Next, to handle cases where the connection is "connected" but dropping 
packets, we will probably want to add heartbeats in both directions.

Just on the HTTP executor library, we have two connections to consider:
1) The SUBSCRIBE Call is one persistent connection where the executor sends one 
Call, and receives a stream of Events.  There is currently no Executor->Agent 
traffic except the first request.  This connection could probably use 
heartbeating in both directions.  Agent->Executor heartbeats may come in the 
form of Events.  Executor->Agent heartbeats will need to be something else 
(like the heartbeating suggested here: https://reviews.apache.org/r/69183/ ).

2) Other calls go through a secondary connection.  This persistent connection 
is used to send any number of Calls and to receive their responses (202 
Accepted).  When the executor discovers a disconnection here, it remakes 
both connections.  This connection does not need heartbeating or monitoring.


> Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7564
>                 URL: https://issues.apache.org/jira/browse/MESOS-7564
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, executor
>            Reporter: Anand Mazumdar
>            Assignee: Joseph Wu
>            Priority: Critical
>              Labels: api, mesosphere, v1_api
>
> Currently, we do not have heartbeats for executor <-> agent communication. 
> This is especially problematic in scenarios where IPFilters are enabled, since 
> the default conntrack keep alive timeout is 5 days. When that timeout 
> elapses, the executor doesn't get notified via a socket disconnection when 
> the agent process restarts. The executor would then get killed if it doesn't 
> re-register when the agent recovery process is completed.
> Enabling application-level heartbeats or TCP keepalives is one possible 
> way to fix this issue.


