[
https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Greg Mann updated MESOS-5635:
-----------------------------
Description:
This issue was observed recently on an internal test cluster. Due to a bug in
the agent code (MESOS-5629), regular segfaults were occurring on an agent.
After one such failure, the agent recovered and about a minute later the
following was observed in the master logs:
{code}
I0617 22:23:41.663557 2014 master.cpp:4795] Re-registering agent
6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051
(10.10.0.179)
{code}
However, we see nothing about registration in the agent logs at this time.
Subsequently, in the master logs, we see the agent continuing to reregister
every couple of seconds:
{code}
I0617 22:23:43.528590 2014 master.cpp:4795] Re-registering agent
6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051
(10.10.0.179)
{code}
After about four minutes of this, we see:
{code}
I0617 22:27:43.994493 2014 master.cpp:6750] Removed agent
6d4248cd-2832-4152-b5d0-defbf36f6759-S3 (10.10.0.179): health check timed out
{code}
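The intervals quoted above can be checked directly from the log timestamps. The following is a minimal Python sketch, assuming the glog-style prefix seen in the excerpts (severity letter plus MMDD, then HH:MM:SS.microseconds); the helper name {{parse_glog_time}} is hypothetical and not part of Mesos:

```python
from datetime import datetime


def parse_glog_time(line: str) -> datetime:
    """Parse the time of day from a prefix such as 'I0617 22:23:41.663557'.

    Assumption: the first token is severity + MMDD and the second token is
    HH:MM:SS.microseconds. The date is ignored, so this only works for
    entries logged on the same day.
    """
    _, clock = line.split()[:2]
    return datetime.strptime(clock, "%H:%M:%S.%f")


# Prefixes copied from the master log excerpts above.
first_reregister = parse_glog_time(
    "I0617 22:23:41.663557 2014 master.cpp:4795] Re-registering agent")
second_reregister = parse_glog_time(
    "I0617 22:23:43.528590 2014 master.cpp:4795] Re-registering agent")
removal = parse_glog_time(
    "I0617 22:27:43.994493 2014 master.cpp:6750] Removed agent")

# Gap between two consecutive re-registrations, in seconds.
retry_gap = (second_reregister - first_reregister).total_seconds()
# Time from the first observed re-registration to the removal, in seconds.
time_to_removal = (removal - first_reregister).total_seconds()

print(f"retry gap: {retry_gap:.3f} s")
print(f"time to removal: {time_to_removal:.3f} s")
```

For these three entries the gap between re-registrations is just under two seconds, and the removal follows the first re-registration by a little over four minutes, which matches the description.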
And after this point, we see repeated reregistration attempts from that agent
in the master logs:
{code}
W0617 22:29:09.514423 2010 master.cpp:4773] Agent
6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051
(10.10.0.179) attempted to re-register after removal;
{code}
During all of this, however, the agent logs indicate nothing about
registration. All we see are requests coming in to {{/state}}:
{code}
Jun 17 22:26:37 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:37.870980 873
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38792 with
User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.158476 879
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.884507 873
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:39 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:39.604486 876
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.018326 875
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38803 with
User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.329465 873
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
{code}
The lack of registration-related logging on the agent side, together with the
health check timeout, suggests a one-way disconnection: the master cannot send
messages to the agent, but the agent can still send messages to the master.
This behavior has been observed several times on this test cluster in the past
couple of days. Full master and agent logs from the relevant time period are
attached.
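To look for the same pattern in the attached logs, something like the following Python sketch could be used. It is only a heuristic under stated assumptions: each log entry is on a single line (the excerpts above are wrapped), the master-side substrings are the ones quoted in this description, and the agent-side check simply looks for the substring "regist". The function names are hypothetical and not part of Mesos.

```python
from collections import Counter

# Substrings copied from the master log excerpts in this description.
MASTER_EVENTS = {
    "Re-registering agent": "reregister",
    "attempted to re-register after removal": "reregister_after_removal",
    "Removed agent": "removed",
}


def classify_master_lines(lines, agent_id):
    """Count the master-side events above for one agent ID."""
    counts = Counter()
    for line in lines:
        if agent_id not in line:
            continue
        for needle, label in MASTER_EVENTS.items():
            if needle in line:
                counts[label] += 1
    return counts


def agent_mentions_registration(lines):
    """Heuristic (assumption): any agent log line containing 'regist'."""
    return any("regist" in line.lower() for line in lines)


def looks_one_way(master_counts, agent_lines):
    """True if the master saw re-registrations but the agent logged none."""
    saw_attempts = (master_counts["reregister"]
                    + master_counts["reregister_after_removal"]) > 0
    return saw_attempts and not agent_mentions_registration(agent_lines)


agent_id = "6d4248cd-2832-4152-b5d0-defbf36f6759-S3"
suffix = f"{agent_id} at slave(1)@10.10.0.179:5051 (10.10.0.179)"
master_lines = [
    f"I0617 22:23:41.663557 2014 master.cpp:4795] Re-registering agent {suffix}",
    f"I0617 22:23:43.528590 2014 master.cpp:4795] Re-registering agent {suffix}",
    f"I0617 22:27:43.994493 2014 master.cpp:6750] Removed agent {agent_id} "
    "(10.10.0.179): health check timed out",
    f"W0617 22:29:09.514423 2010 master.cpp:4773] Agent {suffix} "
    "attempted to re-register after removal;",
]
agent_lines = [
    "I0617 22:26:38.158476 879 http.cpp:192] HTTP GET for /slave(1)/state "
    "from 10.10.0.179:41009",
]

counts = classify_master_lines(master_lines, agent_id)
print(dict(counts))
print(looks_one_way(counts, agent_lines))
```

A positive result only shows that the logs are consistent with the one-way disconnection hypothesis; it does not confirm the cause.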
> Agent repeatedly reregisters, possible one-way disconnection
> ------------------------------------------------------------
>
> Key: MESOS-5635
> URL: https://issues.apache.org/jira/browse/MESOS-5635
> Project: Mesos
> Issue Type: Bug
> Reporter: Greg Mann
> Labels: agent, mesosphere
> Attachments: agent-log.txt, master-log.txt
>
>