[ https://issues.apache.org/jira/browse/MESOS-218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinod Kone resolved MESOS-218.
------------------------------
Resolution: Fixed
Fixed in trunk.
> Master throws exception on removeTask() if Framework is not connected
> ---------------------------------------------------------------------
>
> Key: MESOS-218
> URL: https://issues.apache.org/jira/browse/MESOS-218
> Project: Mesos
> Issue Type: Bug
> Reporter: Vinod Kone
>
> When a slave is disconnected from the master, the master removes all tasks
> belonging to that slave.
> If a framework is disconnected during this period, removeTask() hits a fatal
> check failure (framework != NULL, see the log below), which aborts the master.
> This can result in LOST tasks not being reported to the scheduler.
> This is bad because the framework now thinks the task is running, but the
> executor doesn't think so. Worse, the TASK_KILLED messages from the executor
> are dropped by the slave, because the (restarted) slave has no idea about the
> task.
> I0623 00:58:36.758640 28346 master.cpp:1694] Adding slave
> 201206230058-1937777162-5050-28332-0 at smf1-afg-23-sr3.prod.twitter.com with
> cpus=14; mem=22528; ports=[31000-32000]; disk=400000
> I0623 00:58:36.758826 28346 simple_allocator.cpp:69] Added slave
> 201206230058-1937777162-5050-28332-0 with cpus=14; mem=22528;
> ports=[31000-32000]; disk=400000
> I0623 00:58:36.761170 28344 master.cpp:941] Attempting to register slave on
> smf1-aff-31-sr4.prod.twitter.com at slave(1)@10.34.135.131:5051
> I0623 00:58:36.761245 28344 master.cpp:1158] Master now considering a slave
> at smf1-aff-31-sr4.prod.twitter.com:5051 as active
> I0623 00:58:36.761275 28344 master.cpp:1694] Adding slave
> 201206230058-1937777162-5050-28332-1 at smf1-aff-31-sr4.prod.twitter.com with
> cpus=14; mem=22528;
> ports=[31000-32000]; disk=400000
> I0623 00:58:36.761489 28344 simple_allocator.cpp:69] Added slave
> 201206230058-1937777162-5050-28332-1 with cpus=14; mem=22528;
> ports=[31000-32000]; disk=400000
> 2012-06-23 00:58:39,871:28332(0x4955b940):ZOO_DEBUG@zookeeper_process@1983:
> Got ping response in 0 ms
> I0623 00:58:39.910228 28342 master.cpp:70] Watching path
> file:///usr/local/mesos/conf/whitelist.txt
> I0623 00:58:39.910339 28342 master.cpp:98] Whitelisting slave
> smf1-afg-23-sr3.prod.twitter.com
> I0623 00:58:39.910395 28342 master.cpp:98] Whitelisting slave
> smf1-aff-31-sr4.prod.twitter.com
> 2012-06-23 00:58:43,208:28332(0x4955b940):ZOO_DEBUG@zookeeper_process@1983:
> Got ping response in 0 ms
> I0623 00:58:44.911403 28346 master.cpp:70] Watching path
> file:///usr/local/mesos/conf/whitelist.txt
> I0623 00:58:44.911511 28346 master.cpp:98] Whitelisting slave
> smf1-afg-23-sr3.prod.twitter.com
> I0623 00:58:44.911541 28346 master.cpp:98] Whitelisting slave
> smf1-aff-31-sr4.prod.twitter.com
> 2012-06-23 00:58:46,545:28332(0x4955b940):ZOO_DEBUG@zookeeper_process@1983:
> Got ping response in 0 ms
> I0623 00:58:49.738129 28345 master.cpp:548] Slave
> 201206160031-1937777162-5050-11967-3 disconnected
> F0623 00:58:49.738231 28345 master.cpp:1880] Check failed: framework != NULL
> *** Check failure stack trace: ***
> @ 0x7f032d18e3fd google::LogMessage::Fail()
> @ 0x7f032d194067 google::LogMessage::SendToLog()
> @ 0x7f032d18fcac google::LogMessage::Flush()
> @ 0x7f032d18ff16 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f032cedb462 mesos::internal::master::Master::removeTask()
> @ 0x7f032cee58d6 mesos::internal::master::Master::removeSlave()
> @ 0x7f032cee7b6e mesos::internal::master::Master::exited()
> @ 0x7f032d0ac3f2 process::ProcessBase::visit()
> @ 0x7f032d0be4f6 process::ExitedEvent::visit()
> @ 0x7f032d0b7054 process::ProcessManager::resume()
> @ 0x7f032d0b78a7 process::schedule()
> @ 0x7f032c5f573d start_thread
> @ 0x7f032bbdff6d clone
> Bottle server starting up (using WSGIRefServer())...
> Listening on http://0.0.0.0:8080/
> Use Ctrl-C to quit.
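
For readers following the stack trace (exited() -> removeSlave() -> removeTask()), here is a minimal sketch of how a task-removal path might tolerate a disconnected framework instead of failing a fatal check. It is not the change that was committed to trunk. The types Task, Framework, Slave and Master below are simplified, hypothetical stand-ins, and the `undelivered` list is an assumption made purely for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the master's bookkeeping.
// These are NOT the real Mesos types.
struct Task {
  std::string id;
  std::string frameworkId;
};

struct Framework {
  std::vector<std::string> lostTasks;  // TASK_LOST updates delivered so far
};

struct Slave {
  std::map<std::string, Task> tasks;
};

struct Master {
  std::map<std::string, Framework> frameworks;  // connected frameworks only
  std::map<std::string, Slave> slaves;
  // Assumption for this sketch: updates that could not be delivered because
  // the owning framework was not connected are kept here.
  std::vector<Task> undelivered;

  // Removes one task from a slave. A missing framework is handled as an
  // expected condition rather than asserted against.
  void removeTask(Slave& slave, const Task& task) {
    auto it = frameworks.find(task.frameworkId);
    if (it != frameworks.end()) {
      it->second.lostTasks.push_back(task.id);
    } else {
      undelivered.push_back(task);
    }
    slave.tasks.erase(task.id);
  }

  // Removes a slave and every task that was running on it.
  void removeSlave(const std::string& slaveId) {
    auto it = slaves.find(slaveId);
    if (it == slaves.end()) {
      return;
    }
    Slave& slave = it->second;
    // Iterate over a copy, because removeTask() mutates slave.tasks.
    const std::map<std::string, Task> snapshot = slave.tasks;
    for (const auto& entry : snapshot) {
      removeTask(slave, entry.second);
    }
    assert(slave.tasks.empty());
    slaves.erase(it);
  }
};
```

In this sketch, removing a slave whose tasks belong to both a connected and a disconnected framework completes normally: the first framework receives its lost-task update, and the second task is recorded in `undelivered` rather than bringing the master down.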