[ https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam B updated MESOS-5635:
--------------------------
    Description: 
This issue was observed recently on an internal test cluster. Due to a bug in 
the agent code (MESOS-5629), regular segfaults were occurring on an agent. 
While the agent was recovering from one of these failures, it segfaulted again. 
After that second failure, the agent began recovery but never printed 
{{Finished recovery}}, and its logs showed no sign of re-registering with the 
master. The master's logs, however, showed the following line repeatedly, at 
intervals on the order of seconds:
{code}
W0617 21:27:07.010679  2016 master.cpp:4773] Agent 
2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 
(10.10.0.87) attempted to re-register after removal; shutting it down
{code}
These re-registration attempts had no corresponding lines in the agent log.
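
A minimal sketch for spotting this pattern when triaging a master log, assuming each log message sits on a single unwrapped line shaped like the warning above. The regular expression and the function name {{count_rejected_reregistrations}} are illustrative assumptions modeled only on that one quoted line; they are not part of Mesos.

```python
import re
from collections import Counter

# Hypothetical pattern, modeled only on the warning quoted in this issue.
# Assumes the master log holds one message per line (no mail wrapping).
REREGISTER_AFTER_REMOVAL = re.compile(
    r"Agent\s+(?P<agent_id>\S+)\s+at\s+\S+\s+\(\S+\)\s+"
    r"attempted to re-register after removal"
)


def count_rejected_reregistrations(log_lines):
    """Count 're-register after removal' warnings per agent ID."""
    counts = Counter()
    for line in log_lines:
        match = REREGISTER_AFTER_REMOVAL.search(line)
        if match:
            counts[match.group("agent_id")] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from this issue, joined onto single lines.
    sample = [
        "W0617 21:27:07.010679  2016 master.cpp:4773] Agent "
        "2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 "
        "(10.10.0.87) attempted to re-register after removal; shutting it down",
        "I0617 21:29:01.246119  2011 master.cpp:4635] Registering agent at "
        "slave(1)@10.10.0.87:5051 (10.10.0.87) with id "
        "2b899dd3-3b1f-4520-a6b2-98e32196f723-S5",
    ]
    for agent_id, n in count_rejected_reregistrations(sample).items():
        print(agent_id, n)
```

An agent ID whose count keeps growing while the agent's own log shows no registration activity would match the symptom described here.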

Subsequently deleting the contents of the agent's {{work_dir}} and restarting 
it led to a successful registration with a new agent ID:
{code}
I0617 21:29:01.246119  2011 master.cpp:4635] Registering agent at 
slave(1)@10.10.0.87:5051 (10.10.0.87) with id 
2b899dd3-3b1f-4520-a6b2-98e32196f723-S5
{code}

  was:
This issue was observed recently on an internal test cluster. Due to a bug in 
the agent code (MESOS-5629), regular segfaults were occurring on an agent. 
While the agent was recovering from one of these failures, it segfaulted again. 
After this time, we noticed that after recovery, the agent did not print 
{{Finished recovery}}, and its logs did not show any indication of 
reregistering with the master. Looking at the master's logs, however, the 
following line was observed repeatedly, at intervals on the order of seconds:
{code}
W0617 21:27:07.010679  2016 master.cpp:4773] Agent 
2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 
(10.10.0.87) attempted to re-register after removal; shutting it down
{code}
These re-registration attempts had no corresponding lines in the agent log.

Subsequently deleting the contents of the agent's {{work_dir}} and restarting 
it led to a successful registration with a new agent ID:
{code}
I0617 21:29:01.246119  2011 master.cpp:4635] Registering agent at 
slave(1)@10.10.0.87:5051 (10.10.0.87) with id 
2b899dd3-3b1f-4520-a6b2-98e32196f723-S5
{code}


> Agent failure during recovery prevents reregistration
> -----------------------------------------------------
>
>                 Key: MESOS-5635
>                 URL: https://issues.apache.org/jira/browse/MESOS-5635
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Greg Mann
>              Labels: agent, mesosphere
>
> This issue was observed recently on an internal test cluster. Due to a bug in 
> the agent code (MESOS-5629), regular segfaults were occurring on an agent. 
> While the agent was recovering from one of these failures, it segfaulted 
> again. After that second failure, the agent began recovery but never printed 
> {{Finished recovery}}, and its logs showed no sign of re-registering with the 
> master. The master's logs, however, showed the following line repeatedly, at 
> intervals on the order of seconds:
> {code}
> W0617 21:27:07.010679  2016 master.cpp:4773] Agent 
> 2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 
> (10.10.0.87) attempted to re-register after removal; shutting it down
> {code}
> These re-registration attempts had no corresponding lines in the agent log.
> Subsequently deleting the contents of the agent's {{work_dir}} and restarting 
> it led to a successful registration with a new agent ID:
> {code}
> I0617 21:29:01.246119  2011 master.cpp:4635] Registering agent at 
> slave(1)@10.10.0.87:5051 (10.10.0.87) with id 
> 2b899dd3-3b1f-4520-a6b2-98e32196f723-S5
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
