[ https://issues.apache.org/jira/browse/MESOS-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757264#comment-13757264 ]
Benjamin Mahler commented on MESOS-675:
---------------------------------------

There is a definite race here in that the master repeatedly calls {{link()}} on the same slave whenever it gets the duplicate re-registration messages.

The CHECK failure above can occur if:
-> Slave run #1 terminates
-> delayed re-registration messages from Slave run #1 are enqueued on the Master
-> delayed exited event from Slave run #1 is enqueued on the Master (now possible to re-link the slave)
-> Slave starts again (run #2)
-> Master::reregisterSlave() calls link() (this unintentionally links against Slave run #2 since the UPID is the same)
-> Slave run #2 crashes because it received a re-registered message while recovering
-> An additional exited event is now enqueued on the Master
-> Master goes through its queue, eventually processes both exited events

We can do either or both of the following:
1. Only call {{link()}} on the slave when re-registering if it was disconnected, rather than always calling {{link()}}.
2. Remove the CHECK(!slave->disconnected) and instead ignore duplicate exited events.

> CHECK failure in the Master.
> ----------------------------
>
>                 Key: MESOS-675
>                 URL: https://issues.apache.org/jira/browse/MESOS-675
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>            Priority: Blocker
>             Fix For: 0.14.0
>
>
> Observed this failure in a staging cluster running 0.14.0-rc2.
> {noformat}
> F0902 06:01:11.105391 11876 master.cpp:564] Check failed:
> !slave->disconnected Slave 201308270033-1937777162-5050-50911-137 (<scrub>)
> already disconnected!
> *** Check failure stack trace: ***
>     @     0x7fb470894d8d  google::LogMessage::Fail()
>     @     0x7fb470898d77  google::LogMessage::SendToLog()
>     @     0x7fb470897674  google::LogMessage::Flush()
>     @     0x7fb4708978a6  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fb4704aaea4  mesos::internal::master::Master::exited()
>     @     0x7fb470786af4  process::ProcessManager::resume()
>     @     0x7fb47078754f  process::schedule()
>     @     0x7fb46fef483d  start_thread
>     @     0x7fb46e8d6f8d  clone
> {noformat}
> Grepping for this slave in the logs:
> {noformat}
> $ grep 201308270033-1937777162-5050-50911-137 /var/log/mesos/mesos-master.log
> W0902 06:01:10.607168 11876 master.cpp:1317] Ignoring unknown exited executor
> thermos-1377831261464-mesos-slave-recovery-spinner-60-f0bcfda6-4f8d-4df4-bd74-0b15f32d0502
> on slave 201308270033-1937777162-5050-50911-137 (<scrub>)
> ...
> W0902 06:01:10.646383 11876 master.cpp:1317] Ignoring unknown exited executor
> thermos-1377964938274-mesos-slave-recovery-spinner-184-3a25b824-5d73-4be0-984d-606230c5e8ac
> on slave 201308270033-1937777162-5050-50911-137 (<scrub>)
> W0902 06:01:10.699635 11876 master.cpp:1123] Slave at
> slave(1)@10.34.110.125:5051 (<scrub>) is being allowed to re-register with an
> already in use id (201308270033-1937777162-5050-50911-137)
> I0902 06:01:10.700628 11868 hierarchical_allocator_process.hpp:434] Added
> slave 201308270033-1937777162-5050-50911-137 (<scrub>) with cpus(*):14;
> mem(*):21913; ports(*):[31000-32000]; disk(*):400000 (and cpus(*):10.96;
> mem(*):19866; ports(*):[31000-31003, 31005-31449, 31451-31580, 31582-31801,
> 31803-31927, 31929-32000]; disk(*):397809 available)
> W0902 06:01:10.866525 11876 master.cpp:1123] Slave at
> slave(1)@10.34.110.125:5051 (<scrub>) is being allowed to re-register with an
> already in use id (201308270033-1937777162-5050-50911-137)
> W0902 06:01:10.919178 11876 master.cpp:1123] Slave at
> slave(1)@10.34.110.125:5051 (<scrub>) is being allowed to re-register with an
> already in use id (201308270033-1937777162-5050-50911-137)
> W0902 06:01:11.070862 11876 master.cpp:1123] Slave at
> slave(1)@10.34.110.125:5051 (<scrub>) is being allowed to re-register with an
> already in use id (201308270033-1937777162-5050-50911-137)
> I0902 06:01:11.085773 11876 master.cpp:553] Slave
> 201308270033-1937777162-5050-50911-137 (<scrub>) disconnected
> W0902 06:01:11.086096 11876 master.cpp:1404] Master returning resources
> offered because slave 201308270033-1937777162-5050-50911-137 is disconnected
> I0902 06:01:11.086145 11867 hierarchical_allocator_process.hpp:459] Removed
> slave 201308270033-1937777162-5050-50911-137
> I0902 06:01:11.104651 11876 master.cpp:553] Slave
> 201308270033-1937777162-5050-50911-137 (<scrub>) disconnected
> F0902 06:01:11.105391 11876 master.cpp:564] Check failed:
> !slave->disconnected Slave 201308270033-1937777162-5050-50911-137 (<scrub>)
> already disconnected!
> {noformat}
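
The two options proposed in the comment can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the actual master.cpp code: {{ToyMaster}}, {{Slave}}, {{addSlave}}, {{isDisconnected}}, {{links}} and {{ignoredExits}} are hypothetical names, and {{link()}} here is just a counter standing in for the real libprocess call.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the master's per-slave bookkeeping.
struct Slave {
  std::string pid;    // UPID; identical across slave runs, as in the report
  bool disconnected;  // set once an exited event has been processed
};

// Toy model of the master. Not mesos::internal::master::Master.
class ToyMaster {
public:
  int links = 0;         // how many times link() was issued
  int ignoredExits = 0;  // duplicate exited events that were dropped

  void addSlave(const std::string& pid) {
    slaves[pid] = Slave{pid, false};
  }

  // Option 1: only link when the slave was marked disconnected, so
  // duplicate re-registration messages do not create extra links.
  void reregisterSlave(const std::string& pid) {
    auto it = slaves.find(pid);
    if (it == slaves.end()) {
      return;  // unknown slave; out of scope for this sketch
    }
    if (it->second.disconnected) {
      it->second.disconnected = false;
      link(pid);
    }
  }

  // Option 2: instead of CHECK(!slave->disconnected), treat a second
  // exited event for an already disconnected slave as a no-op.
  void exited(const std::string& pid) {
    auto it = slaves.find(pid);
    if (it == slaves.end()) {
      return;
    }
    if (it->second.disconnected) {
      ++ignoredExits;  // duplicate event: ignore rather than abort
      return;
    }
    it->second.disconnected = true;
  }

  bool isDisconnected(const std::string& pid) const {
    auto it = slaves.find(pid);
    return it != slaves.end() && it->second.disconnected;
  }

private:
  // Stand-in for the real link(); it only counts invocations.
  void link(const std::string&) { ++links; }

  std::unordered_map<std::string, Slave> slaves;
};
```

Replaying the sequence from the comment against this model (several duplicate {{reregisterSlave()}} calls followed by two {{exited()}} events for the same UPID) issues no extra links and drops the second exited event instead of aborting. Either guard alone would avoid the crash in this toy model; whether both are wanted in the real master is a design decision left to the fix.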