[ 
https://issues.apache.org/jira/browse/MESOS-218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone resolved MESOS-218.
------------------------------

    Resolution: Fixed

Fixed in trunk.
                
> Master throws exception on removeTask() if Framework is not connected
> ---------------------------------------------------------------------
>
>                 Key: MESOS-218
>                 URL: https://issues.apache.org/jira/browse/MESOS-218
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Vinod Kone
>
> When a slave disconnects from the master, the master removes all tasks 
> belonging to that slave.
> If a framework is disconnected during this period, removeTask() fails the 
> "framework != NULL" check and the master aborts (see the log below). As a 
> result, LOST tasks may never be reported to the scheduler. 
> This is bad because the framework still thinks the task is running, but the 
> executor doesn't. The TASK_KILLED messages from the executor are in turn 
> dropped by the slave, because the (restarted) slave has no idea about the task.
> I0623 00:58:36.758640 28346 master.cpp:1694] Adding slave 
> 201206230058-1937777162-5050-28332-0 at smf1-afg-23-sr3.prod.twitter.com with 
> cpus=14; mem=22528; ports=[31000-32000]; disk=400000
> I0623 00:58:36.758826 28346 simple_allocator.cpp:69] Added slave 
> 201206230058-1937777162-5050-28332-0 with cpus=14; mem=22528; 
> ports=[31000-32000]; disk=400000
> I0623 00:58:36.761170 28344 master.cpp:941] Attempting to register slave on 
> smf1-aff-31-sr4.prod.twitter.com at slave(1)@10.34.135.131:5051
> I0623 00:58:36.761245 28344 master.cpp:1158] Master now considering a slave 
> at smf1-aff-31-sr4.prod.twitter.com:5051 as active
> I0623 00:58:36.761275 28344 master.cpp:1694] Adding slave 
> 201206230058-1937777162-5050-28332-1 at smf1-aff-31-sr4.prod.twitter.com with 
> cpus=14; mem=22528; 
> ports=[31000-32000]; disk=400000
> I0623 00:58:36.761489 28344 simple_allocator.cpp:69] Added slave 
> 201206230058-1937777162-5050-28332-1 with cpus=14; mem=22528; 
> ports=[31000-32000]; disk=400000
> 2012-06-23 00:58:39,871:28332(0x4955b940):ZOO_DEBUG@zookeeper_process@1983: 
> Got ping response in 0 ms
> I0623 00:58:39.910228 28342 master.cpp:70] Watching path 
> file:///usr/local/mesos/conf/whitelist.txt
> I0623 00:58:39.910339 28342 master.cpp:98] Whitelisting slave 
> smf1-afg-23-sr3.prod.twitter.com
> I0623 00:58:39.910395 28342 master.cpp:98] Whitelisting slave 
> smf1-aff-31-sr4.prod.twitter.com
> 2012-06-23 00:58:43,208:28332(0x4955b940):ZOO_DEBUG@zookeeper_process@1983: 
> Got ping response in 0 ms
> I0623 00:58:44.911403 28346 master.cpp:70] Watching path 
> file:///usr/local/mesos/conf/whitelist.txt
> I0623 00:58:44.911511 28346 master.cpp:98] Whitelisting slave 
> smf1-afg-23-sr3.prod.twitter.com
> I0623 00:58:44.911541 28346 master.cpp:98] Whitelisting slave 
> smf1-aff-31-sr4.prod.twitter.com
> 2012-06-23 00:58:46,545:28332(0x4955b940):ZOO_DEBUG@zookeeper_process@1983: 
> Got ping response in 0 ms
> I0623 00:58:49.738129 28345 master.cpp:548] Slave 
> 201206160031-1937777162-5050-11967-3 disconnected
> F0623 00:58:49.738231 28345 master.cpp:1880] Check failed: framework != NULL
> *** Check failure stack trace: ***
>     @     0x7f032d18e3fd  google::LogMessage::Fail()
>     @     0x7f032d194067  google::LogMessage::SendToLog()
>     @     0x7f032d18fcac  google::LogMessage::Flush()
>     @     0x7f032d18ff16  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f032cedb462  mesos::internal::master::Master::removeTask()
>     @     0x7f032cee58d6  mesos::internal::master::Master::removeSlave()
>     @     0x7f032cee7b6e  mesos::internal::master::Master::exited()
>     @     0x7f032d0ac3f2  process::ProcessBase::visit()
>     @     0x7f032d0be4f6  process::ExitedEvent::visit()
>     @     0x7f032d0b7054  process::ProcessManager::resume()
>     @     0x7f032d0b78a7  process::schedule()
>     @     0x7f032c5f573d  start_thread
>     @     0x7f032bbdff6d  clone
> Bottle server starting up (using WSGIRefServer())...
> Listening on http://0.0.0.0:8080/
> Use Ctrl-C to quit.
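
The failure mode described above (removeSlave() calling removeTask() for a task whose framework is no longer connected) can be illustrated with a minimal sketch. This is not the actual change committed to trunk, which this message does not show. Apart from the removeTask() and removeSlave() names taken from the stack trace, every type, member, and identifier below (Task, Framework, Master, getFramework, undelivered) is a hypothetical, simplified stand-in. The sketch assumes one possible handling: tolerate a missing framework instead of asserting on it, and remember the task so it could be reported later.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the master's bookkeeping.
struct Task {
  std::string id;
  std::string frameworkId;
};

struct Framework {
  std::map<std::string, Task> tasks;     // tasks the framework believes are running
  std::vector<std::string> lostUpdates;  // ids of tasks reported as LOST to the scheduler
};

struct Master {
  std::map<std::string, Framework> frameworks;          // connected frameworks only
  std::map<std::string, std::vector<Task>> slaveTasks;  // tasks keyed by slave id
  std::vector<Task> undelivered;  // LOST tasks whose framework was not connected

  // Returns nullptr when the framework is not (or no longer) connected.
  Framework* getFramework(const std::string& id) {
    auto it = frameworks.find(id);
    return it == frameworks.end() ? nullptr : &it->second;
  }

  // Assumed handling: do not abort on a missing framework; keep the task
  // aside so the LOST status is not silently dropped.
  void removeTask(const Task& task) {
    Framework* framework = getFramework(task.frameworkId);
    if (framework == nullptr) {
      undelivered.push_back(task);
      return;
    }
    framework->tasks.erase(task.id);
    assert(framework->tasks.count(task.id) == 0);
    framework->lostUpdates.push_back(task.id);
  }

  // Called when a slave disconnects: every task on that slave is removed.
  void removeSlave(const std::string& slaveId) {
    auto it = slaveTasks.find(slaveId);
    if (it == slaveTasks.end()) {
      return;
    }
    for (const Task& task : it->second) {
      removeTask(task);
    }
    slaveTasks.erase(it);
  }
};
```

With this shape, removing a slave that holds tasks for both a connected and a disconnected framework completes without a fatal check; the connected framework gets its LOST update, and the other task ends up in the hypothetical undelivered list rather than being forgotten.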

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
