[
https://issues.apache.org/jira/browse/MESOS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018908#comment-14018908
]
Yuval Pavel Zholkover edited comment on MESOS-1219 at 6/5/14 5:28 PM:
----------------------------------------------------------------------
Hi,
Attached a log excerpt from a mesos-master 0.16.0 after a ZooKeeper cluster
hiccup:
{noformat}
I0605 00:05:16.398090 24908 master.cpp:872] Re-registering framework
background_0 at scheduler(1)@xxx.xxx.xxx.xxx:48561
I0605 00:05:16.398108 24908 master.cpp:910] Allowing the Framework background_0
to re-register with an already used id
W0605 00:05:16.774268 24908 master.cpp:1393] Slave at
slave(1)@xxx.xxx.xxx.xxx:5044 (deb015) is being allowed to re-register with an
already in use id (201406030648-1821603594-5043-25313-1)
W0605 00:05:16.774646 24908 master.cpp:2384] Slave
201406030648-1821603594-5043-25313-1 (deb015) re-registered with completed
framework background_0. Shutting down the framework on the slave
I0605 00:05:17.522913 24915 master.cpp:1592] Executor executor-background_0 of
framework background_0 on slave 201406030648-1821603594-5043-25313-1 (deb015)
has terminated with signal Real-time signal 9
{noformat}
We are re-registering a scheduler with the same frameworkId as a previously
failed one (background_0); this is a mistake on our part.
The master forces the re-registered slave to kill the background_0 executor,
because the background_0 frameworkId is already in the completedFrameworks
circular_buffer. No TASK_LOST/TASK_KILLED status updates are sent to the
re-registered scheduler.
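For reference, the reuse happens because our scheduler restores a saved
frameworkId into its FrameworkInfo before registering. A minimal sketch
against the Python bindings; stored_framework_id is a placeholder, not our
actual code:
{noformat}
import mesos_pb2

framework = mesos_pb2.FrameworkInfo()
framework.user = ""  # empty string lets Mesos fill in the current user
framework.name = "background"
# The mistake: restoring an id whose framework already failed over and now
# sits in the master's completedFrameworks buffer.
framework.id.value = stored_framework_id  # e.g. "background_0"
{noformat}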
Also, I'm not sure if there's another issue tracking this, but the executorLost
callback never gets called as it is.
The workaround is to restart all the masters to clear their in-memory
completedFrameworks state, and to stop re-using failed frameworkIds.
Alternatively, don't set the failover_timeout (default 0.0). Thanks
[~adam-mesos] (on IRC).
Update:
Not setting the failover_timeout just makes the mesos-master fail the
framework immediately when the scheduler disconnects (adding it to the
completedFrameworks ring buffer). It should instead be set to a very large
value, for example sys.maxint / 10e9 in Python.
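To make that concrete, here is a sketch of registering with an effectively
infinite failover_timeout using the Python bindings; MyScheduler and
master_uri are placeholders, and the 10e9 divisor presumably keeps the value
from overflowing once the master converts seconds to its internal nanosecond
resolution:
{noformat}
import sys
import mesos, mesos_pb2

framework = mesos_pb2.FrameworkInfo()
framework.user = ""
framework.name = "background"
# failover_timeout is in seconds; a raw sys.maxint would overflow when
# converted to nanoseconds, so scale it down to a still-huge value:
framework.failover_timeout = sys.maxint / 10e9

driver = mesos.MesosSchedulerDriver(MyScheduler(), framework, master_uri)
driver.run()
{noformat}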
> Master should generate new id for frameworks that reconnect after failover
> timeout
> ----------------------------------------------------------------------------------
>
> Key: MESOS-1219
> URL: https://issues.apache.org/jira/browse/MESOS-1219
> Project: Mesos
> Issue Type: Bug
> Components: master, webui
> Reporter: Robert Lacroix
>
> When a scheduler reconnects after the failover timeout has been exceeded, the
> framework id is usually reused, because the scheduler doesn't know that the
> timeout was exceeded and that the master actually handles it as a new
> framework.
> The /framework/:framework_id route of the Web UI doesn't handle those cases
> very well because the key is reused: it only shows the terminated framework.
> Would it make sense to ignore the provided framework id when a scheduler
> reconnects to a terminated framework and generate a new id to make sure it's
> unique?