[
https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022346#comment-17022346
]
Greg Mann commented on MESOS-10068:
-----------------------------------
Yeah, we should definitely be sending AGENT_REMOVED when agents are marked
gone; that sounds like a bug to me. I created a ticket to track this:
MESOS-10089
Regarding the unreachable agents, we may want to add an AGENT_UNREACHABLE
event to indicate this.
[~daltonmatos], we have a ticket here to track the design of the full agent
state diagram: MESOS-9556
That would be a great place to continue discussion, feel free to ping us there.
Unfortunately, I'm not sure when we might find time to work on that, but it's
definitely something we've been wanting to do for a while now.
> Mesos Master doesn't send AGENT_REMOVED when removing agent from internal
> state
> -------------------------------------------------------------------------------
>
> Key: MESOS-10068
> URL: https://issues.apache.org/jira/browse/MESOS-10068
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.7.3, 1.8.2, 1.9.1
> Reporter: Dalton Matos Coelho Barreto
> Priority: Major
> Attachments: master-full-logs.log
>
>
> Hello,
>
> Looking at the documentation of the master {{/api/v1}} endpoint, the
> {{SUBSCRIBE}} message says that only {{TASK_ADDED}} and {{TASK_UPDATED}} are
> supported for this endpoint, but when a new agent joins the cluster an
> {{AGENT_ADDED}} event is received.
> The problem is that when this agent is stopped, the {{AGENT_REMOVED}} event
> is not received by clients subscribed to the master API.
>
> I tested this behavior with versions {{1.7.3}}, {{1.8.2}} and {{1.9.1}}, all
> using the docker image {{mesos/mesos-centos}}.
> The only time I saw an {{AGENT_REMOVED}} event was when a new agent joined
> the cluster but the master couldn't communicate with it. In that specific
> test a firewall was blocking port {{5051}} on the slave, that is, nobody
> was able to talk to the slave on port {{5051}}.
>
> h2. Here are the steps to reproduce the problem
> * Start a new Mesos master;
> * Connect to the {{/api/v1}} endpoint, sending a {{SUBSCRIBE}} message:
> {noformat}
> curl --no-buffer -Ld '{"type": "SUBSCRIBE"}' -H "Content-Type: application/json" http://MASTER_IP:5050/api/v1
> {noformat}
> * Start a new slave and confirm the {{AGENT_ADDED}} event is delivered;
> * Stop this slave;
> * Check that {{/slaves?slave_id=AGENT_ID}} returns a JSON response with the
> field {{active=false}};
> * Wait for the Mesos master to stop listing this slave, that is, until
> {{/slaves?slave_id=AGENT_ID}} returns an empty response.
>
> Even after the empty response, the {{AGENT_REMOVED}} event never reaches the
> subscriber.
>
> The Mesos master logs show this:
> {noformat}
> I1213 15:03:10.338935 13 master.cpp:1297] Agent
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051
> (86813ca2a964) disconnected
> I1213 15:03:10.339089 13 master.cpp:3399] Disconnecting agent
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051
> (86813ca2a964)
> I1213 15:03:10.339207 13 master.cpp:3418] Deactivating agent
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051
> (86813ca2a964)
> {noformat}
> And then:
> {noformat}
> W1213 15:04:40.726670 15 process.cpp:1917] Failed to send
> 'mesos.internal.PingSlaveMessage' to '172.18.0.51:5051', connect: Failed to
> connect to 172.18.0.51:5051: No route to host
> {noformat}
> And some time after this:
> {noformat}
> I1213 15:04:37.685007 7 hierarchical.cpp:900] Removed agent
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1
> {noformat}
>
> Even after this removal, the {{AGENT_REMOVED}} event is not delivered.
>
> I will attach the full master logs also.
>
> Do you think this could be a bug?
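
For anyone who wants to repeat the check in the quoted report without reading
raw curl output, here is a minimal Python sketch of a subscriber. It assumes
the master streams events in RecordIO framing (a decimal byte length, a
newline, then that many bytes of JSON), which is how the v1 operator API
delivers streaming responses. The names decode_recordio, agent_events and
subscribe are made up for this example, MASTER_IP is the same placeholder as
in the curl command, and unlike curl -L this sketch does not follow a redirect
to the leading master, so point it at the leader directly.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def decode_recordio(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield JSON events from a RecordIO stream: b"<length>\\n<record>" repeated."""
    buf = b""
    for chunk in chunks:
        buf += chunk
        while True:
            newline = buf.find(b"\n")
            if newline < 0:
                break  # the length prefix has not fully arrived yet
            length = int(buf[:newline].decode("ascii"))
            start = newline + 1
            if len(buf) - start < length:
                break  # the record body has not fully arrived yet
            yield json.loads(buf[start:start + length])
            buf = buf[start + length:]


def agent_events(events: Iterable[dict]) -> Iterator[str]:
    """Keep only the event types this ticket is about."""
    wanted = {"AGENT_ADDED", "AGENT_REMOVED"}
    for event in events:
        if event.get("type") in wanted:
            yield event["type"]


def subscribe(master_url: str) -> Iterator[dict]:
    """POST a SUBSCRIBE call and yield decoded events until the stream ends.

    master_url is a placeholder such as "http://MASTER_IP:5050/api/v1".
    """
    request = urllib.request.Request(
        master_url,
        data=json.dumps({"type": "SUBSCRIBE"}).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # read1() returns whatever bytes are available instead of
        # waiting for a full buffer, so events show up promptly.
        chunks = iter(lambda: response.read1(4096), b"")
        yield from decode_recordio(chunks)
```

Usage would look like "for kind in agent_events(subscribe(url)): print(kind)".
With the behavior described in this ticket, stopping an agent would be expected
to print AGENT_ADDED but never AGENT_REMOVED.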