[ 
https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-5635:
-----------------------------
    Attachment: master-log.txt
                agent-log.txt

> Agent repeatedly reregisters, possible one-way disconnection
> ------------------------------------------------------------
>
>                 Key: MESOS-5635
>                 URL: https://issues.apache.org/jira/browse/MESOS-5635
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Greg Mann
>              Labels: agent, mesosphere
>         Attachments: agent-log.txt, master-log.txt
>
>
> This issue was observed recently on an internal test cluster. Due to a bug in 
> the agent code (MESOS-5629), regular segfaults were occurring on an agent. 
> After one such failure, the agent recovered and about a minute later the 
> following was observed in the master logs:
> {code}
> I0617 22:23:41.663557  2014 master.cpp:4795] Re-registering agent 
> 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 
> (10.10.0.179)
> {code}
> However, we see nothing about registration in the agent logs at this time. 
> Subsequently, in the master logs, we see the agent continuing to reregister 
> every couple of seconds:
> {code}
> I0617 22:23:43.528590  2014 master.cpp:4795] Re-registering agent 
> 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 
> (10.10.0.179)
> {code}
> After about four minutes of this, we see:
> {code}
> I0617 22:27:43.994493  2014 master.cpp:6750] Removed agent 
> 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 (10.10.0.179): health check timed out
> {code}
> And after this point, we see repeated reregistration attempts from that agent 
> in the master logs:
> {code}
> W0617 22:29:09.514423  2010 master.cpp:4773] Agent 
> 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 
> (10.10.0.179) attempted to re-register after removal;
> {code}
> During all of this, however, the agent logs indicate nothing about 
> registration. All we see are requests coming in to {{/state}}:
> {code}
> Jun 17 22:26:37 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:37.870980   873 
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38792 with 
> User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
> Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.158476   879 
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
> Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.884507   873 
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
> Jun 17 22:26:39 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:39.604486   876 
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
> Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.018326   875 
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38803 with 
> User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
> Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.329465   873 
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
> {code}
> The lack of logging on the agent side, together with the health check 
> timeout, suggests a one-way disconnection: the agent can send messages to the 
> master, but the master cannot send messages to the agent. This behavior has 
> been observed several times on this test cluster in the past couple of days.
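
One way to check the reregistration cadence described above against the attached logs is to pull the timestamps of the master's "Re-registering agent" lines and compute the gaps between them. The following is only a rough sketch and is not part of Mesos: the function names (reregistration_times, gaps_seconds, mentions_registration) are made up for illustration, the sample lines are the excerpts quoted in the description with the wrapped lines joined, and the regular expression assumes the glog-style prefix seen in those excerpts.

```python
import re
from datetime import datetime

# Assumed glog-style prefix, as seen in the quoted excerpts:
# severity letter, MMDD, HH:MM:SS.micros, thread id, file:line], message.
GLOG = re.compile(
    r"(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ "
    r"(?P<src>[\w.]+:\d+)\] (?P<msg>.*)"
)


def reregistration_times(lines, agent_id):
    """Timestamps of master log lines that re-register the given agent."""
    times = []
    for line in lines:
        m = GLOG.search(line)
        if m is None:
            continue
        if m["msg"].startswith("Re-registering agent " + agent_id):
            # The log prefix carries no year; 2000 is a placeholder, since
            # only the differences between timestamps matter here.
            stamp = "2000" + m["mmdd"] + " " + m["time"]
            times.append(datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f"))
    return times


def gaps_seconds(times):
    """Seconds elapsed between consecutive timestamps."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def mentions_registration(lines):
    """True if any parsed log message mentions (re)registration."""
    for line in lines:
        m = GLOG.search(line)
        if m is not None and "regist" in m["msg"].lower():
            return True
    return False


AGENT = "6d4248cd-2832-4152-b5d0-defbf36f6759-S3"

# Excerpts from the description, with wrapped lines joined.
master_log = [
    "I0617 22:23:41.663557  2014 master.cpp:4795] Re-registering agent "
    + AGENT + " at slave(1)@10.10.0.179:5051 (10.10.0.179)",
    "I0617 22:23:43.528590  2014 master.cpp:4795] Re-registering agent "
    + AGENT + " at slave(1)@10.10.0.179:5051 (10.10.0.179)",
    "I0617 22:27:43.994493  2014 master.cpp:6750] Removed agent "
    + AGENT + " (10.10.0.179): health check timed out",
]
agent_log = [
    "Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.158476   879 "
    "http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009",
]

times = reregistration_times(master_log, AGENT)
print("gaps between reregistrations (s):", gaps_seconds(times))
print("agent log mentions registration:", mentions_registration(agent_log))
```

Running the same functions over the full master-log.txt and agent-log.txt attachments (read with open(...).readlines()) would show whether the cadence holds for the whole four-minute window and whether the agent log is silent about registration throughout.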



