[
https://issues.apache.org/jira/browse/MESOS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018908#comment-14018908
]
Yuval Pavel Zholkover edited comment on MESOS-1219 at 6/5/14 5:28 PM:
----------------------------------------------------------------------
Hi,
Attached a log excerpt from a mesos-master 0.16.0 after a ZooKeeper cluster
hiccup:
{noformat}
I0605 00:05:16.398090 24908 master.cpp:872] Re-registering framework
background_0 at scheduler(1)@xxx.xxx.xxx.xxx:48561
I0605 00:05:16.398108 24908 master.cpp:910] Allowing the Framework background_0
to re-register with an already used id
W0605 00:05:16.774268 24908 master.cpp:1393] Slave at
slave(1)@xxx.xxx.xxx.xxx:5044 (deb015) is being allowed to re-register with an
already in use id (201406030648-1821603594-5043-25313-1)
W0605 00:05:16.774646 24908 master.cpp:2384] Slave
201406030648-1821603594-5043-25313-1 (deb015) re-registered with completed
framework background_0. Shutting down the framework on the slave
I0605 00:05:17.522913 24915 master.cpp:1592] Executor executor-background_0 of
framework background_0 on slave 201406030648-1821603594-5043-25313-1 (deb015)
has terminated with signal Real-time signal 9
{noformat}
We are re-registering a scheduler with the same frameworkId as a previously
failed one (background_0); this is a mistake on our part.
The master forces the re-registered slave to kill the background_0 executor,
because the background_0 frameworkId is already in the completedFrameworks
circular_buffer. No TASK_LOST/TASK_KILLED status updates are sent to the
re-registered scheduler.
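For reference, the reuse happens because our scheduler restores a saved
frameworkId into its FrameworkInfo before registering. A minimal sketch
against the Python bindings; stored_framework_id is a placeholder, not our
actual code:
{noformat}
import mesos_pb2

framework = mesos_pb2.FrameworkInfo()
framework.user = ""  # empty string lets Mesos fill in the current user
framework.name = "background"
# The mistake: restoring an id whose framework already failed over and now
# sits in the master's completedFrameworks buffer.
framework.id.value = stored_framework_id  # e.g. "background_0"
{noformat}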
Also, I'm not sure if there's another issue tracking this, but the executorLost
callback never gets called as it is.
The workaround is to restart all the masters to clear their in-memory
completedFrameworks state, and to stop re-using failed frameworkIds.
Alternatively, don't set the failover_timeout (default 0.0). Thanks
[~adam-mesos] (on IRC).
Update:
Not setting the failover_timeout just makes the mesos-master fail the
framework immediately when the scheduler disconnects (adding it to the
completedFrameworks ring buffer). It should instead be set to a very large
value, for example sys.maxint / 10e9 in Python.
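To make that concrete, here is a sketch of registering with an effectively
infinite failover_timeout using the Python bindings; MyScheduler and
master_uri are placeholders, and the 10e9 divisor presumably keeps the value
from overflowing once the master converts seconds to its internal nanosecond
resolution:
{noformat}
import sys
import mesos, mesos_pb2

framework = mesos_pb2.FrameworkInfo()
framework.user = ""
framework.name = "background"
# failover_timeout is in seconds; a raw sys.maxint would overflow when
# converted to nanoseconds, so scale it down to a still-huge value:
framework.failover_timeout = sys.maxint / 10e9

driver = mesos.MesosSchedulerDriver(MyScheduler(), framework, master_uri)
driver.run()
{noformat}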
> Master should generate new id for frameworks that reconnect after failover
> timeout
> ----------------------------------------------------------------------------------
>
> Key: MESOS-1219
> URL: https://issues.apache.org/jira/browse/MESOS-1219
> Project: Mesos
> Issue Type: Bug
> Components: master, webui
> Reporter: Robert Lacroix
>
> When a scheduler reconnects after the failover timeout has been exceeded, the
> framework id is usually reused, because the scheduler doesn't know that the
> timeout was exceeded and that the master actually handles it as a new
> framework.
> The /framework/:framework_id route of the Web UI doesn't handle those cases
> very well because the key is reused: it only shows the terminated framework.
> Would it make sense to ignore the provided framework id when a scheduler
> reconnects to a terminated framework and generate a new id to make sure it's
> unique?