[
https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Greg Mann updated MESOS-5635:
-----------------------------
Summary: Agent repeatedly reregisters, possible one-way partition (was:
Agent failure during recovery prevents reregistration)
> Agent repeatedly reregisters, possible one-way partition
> --------------------------------------------------------
>
> Key: MESOS-5635
> URL: https://issues.apache.org/jira/browse/MESOS-5635
> Project: Mesos
> Issue Type: Bug
> Reporter: Greg Mann
> Labels: agent, mesosphere
>
> This issue was observed recently on an internal test cluster. Due to a bug in
> the agent code (MESOS-5629), regular segfaults were occurring on an agent.
> While the agent was recovering from one of these failures, it segfaulted
> again. After this second segfault, the agent began recovery but never
> printed {{Finished recovery}}, and its logs showed no indication of
> reregistering with the master. The master's logs, however, showed the
> following line repeatedly, at intervals on the order of seconds:
> {code}
> W0617 21:27:07.010679 2016 master.cpp:4773] Agent
> 2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051
> (10.10.0.87) attempted to re-register after removal; shutting it down
> {code}
> These re-registration attempts had no corresponding lines in the agent log.
> Subsequently deleting the contents of the agent's {{work_dir}} and restarting
> it led to a successful registration with a new agent ID:
> {code}
> I0617 21:29:01.246119 2011 master.cpp:4635] Registering agent at
> slave(1)@10.10.0.87:5051 (10.10.0.87) with id
> 2b899dd3-3b1f-4520-a6b2-98e32196f723-S5
> {code}
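
To check whether other agents are hitting the same pattern, the master log can be scanned for the warning quoted above. The following is only a rough sketch: the function name {{rereg_attempts}} and the regular expression are hypothetical, derived solely from the two log lines in this report, and they assume each log entry sits on a single line in the real master log (the lines above are wrapped by the mail client).

```python
import re
from collections import defaultdict

# Pattern for the master warning quoted in this report. This is an assumption
# based on that single sample line; other Mesos versions may word it differently.
REREG_AFTER_REMOVAL = re.compile(
    r"^W(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ "
    r"master\.cpp:\d+\] Agent (?P<agent_id>\S+) at (?P<pid>\S+) "
    r"\((?P<host>[^)]*)\) attempted to re-register after removal"
)


def rereg_attempts(lines):
    """Group timestamps of 're-register after removal' warnings by agent ID."""
    attempts = defaultdict(list)
    for line in lines:
        match = REREG_AFTER_REMOVAL.match(line)
        if match:
            attempts[match.group("agent_id")].append(match.group("time"))
    return dict(attempts)


# The two master log lines from this report, unwrapped onto single lines.
sample = [
    "W0617 21:27:07.010679 2016 master.cpp:4773] Agent "
    "2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 "
    "(10.10.0.87) attempted to re-register after removal; shutting it down",
    "I0617 21:29:01.246119 2011 master.cpp:4635] Registering agent at "
    "slave(1)@10.10.0.87:5051 (10.10.0.87) with id "
    "2b899dd3-3b1f-4520-a6b2-98e32196f723-S5",
]

result = rereg_attempts(sample)
for agent_id, times in result.items():
    print(agent_id, len(times), times)
```

An agent ID that shows many timestamps only seconds apart, with no matching lines in that agent's own log, would look like the behavior described in this issue.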