[ https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Greg Mann updated MESOS-5635:
-----------------------------
Attachment: master-log.txt, agent-log.txt
> Agent repeatedly reregisters, possible one-way disconnection
> ------------------------------------------------------------
>
> Key: MESOS-5635
> URL: https://issues.apache.org/jira/browse/MESOS-5635
> Project: Mesos
> Issue Type: Bug
> Reporter: Greg Mann
> Labels: agent, mesosphere
> Attachments: agent-log.txt, master-log.txt
>
>
> This issue was observed recently on an internal test cluster. Due to a bug in
> the agent code (MESOS-5629), regular segfaults were occurring on an agent.
> After one such failure, the agent recovered and about a minute later the
> following was observed in the master logs:
> {code}
> I0617 22:23:41.663557 2014 master.cpp:4795] Re-registering agent
> 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051
> (10.10.0.179)
> {code}
> However, we see nothing about registration in the agent logs at this time.
> Subsequently, in the master logs, we see the agent continuing to reregister
> every couple of seconds:
> {code}
> I0617 22:23:43.528590 2014 master.cpp:4795] Re-registering agent
> 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051
> (10.10.0.179)
> {code}
> After about four minutes of this, we see:
> {code}
> I0617 22:27:43.994493 2014 master.cpp:6750] Removed agent
> 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 (10.10.0.179): health check timed out
> {code}
> And after this point, we see repeated reregistration attempts from that agent
> in the master logs:
> {code}
> W0617 22:29:09.514423 2010 master.cpp:4773] Agent
> 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051
> (10.10.0.179) attempted to re-register after removal;
> {code}
> During all of this, however, the agent logs indicate nothing about
> registration. All we see are requests coming in to {{/state}}:
> {code}
> Jun 17 22:26:37 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:37.870980 873
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38792 with
> User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
> Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.158476 879
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
> Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.884507 873
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
> Jun 17 22:26:39 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:39.604486 876
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
> Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.018326 875
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38803 with
> User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
> Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.329465 873
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
> {code}
> The lack of logging on the agent side, together with the health check
> timeout, suggests a one-way disconnection: the agent can send messages to the
> master, but the master cannot send messages to the agent. This behavior has
> been observed several times on this test cluster in the past couple of days.
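
To check the "every couple of seconds" observation against the attached master log, the gaps between consecutive re-registration entries can be measured. The following is only a rough sketch, not part of Mesos: the names `GLOG_RE` and `reregistration_gaps` are hypothetical, it assumes each log entry sits on a single line in the log file (unlike the wrapped excerpts above), and it uses a placeholder year because the log prefix carries only month and day.

```python
import re
from datetime import datetime

# Log prefix as seen in the excerpts above:
#   <severity><MMDD> <HH:MM:SS.micros> <thread id> master.cpp:<line>] <message>
GLOG_RE = re.compile(
    r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ master\.cpp:\d+\] (.*)$"
)


def reregistration_gaps(lines, agent_id, year=2000):
    """Return the seconds between consecutive 'Re-registering agent' entries.

    `year` is a placeholder (assumption): the log prefix has no year, and the
    gaps within a single day do not depend on it.
    """
    stamps = []
    for line in lines:
        match = GLOG_RE.match(line.strip())
        if not match:
            continue  # not a master.cpp entry in the expected format
        mmdd, clock, message = match.groups()
        if message.startswith("Re-registering agent") and agent_id in message:
            stamps.append(
                datetime.strptime(f"{year}{mmdd} {clock}", "%Y%m%d %H:%M:%S.%f")
            )
    # Pair each timestamp with its successor and take the difference.
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
```

Feeding it the lines of master-log.txt together with the agent ID 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 would yield one gap per repeated re-registration, which can then be compared with the interval described above.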