[
https://issues.apache.org/jira/browse/MESOS-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benjamin Mahler updated MESOS-1821:
-----------------------------------
Description:
Looks like the recent CHECKs I added have exposed a bug in the framework
re-registration logic: we don't keep the executors consistent between the
Slave and Framework structs:
{noformat: title=Master Log}
I0919 18:05:06.915204 28914 master.cpp:3291] Executor aurora.gc of framework
201103282247-0000000019-0000 on slave 20140905-173231-1890854154-5050-31333-0
at slave(1)@IP:5051 (HOSTNAME) exited with status 0
I0919 18:05:06.915271 28914 master.cpp:4430] Removing executor 'aurora.gc' with
resources cpus(*):0.19; disk(*):15; mem(*):127 of framework
201103282247-0000000019-0000 on slave 20140905-173231-1890854154-5050-31333-0
at slave(1)@IP:5051 (HOSTNAME)
F0919 18:05:06.915375 28914 master.hpp:1061] Check failed: hasExecutor(slaveId,
executorId) Unknown executor aurora.gc of framework
201103282247-0000000019-0000 of slave 20140905-173231-1890854154-5050-31333-0
*** Check failure stack trace: ***
@ 0x7fd16c81737d google::LogMessage::Fail()
@ 0x7fd16c8191c4 google::LogMessage::SendToLog()
@ 0x7fd16c816f6c google::LogMessage::Flush()
@ 0x7fd16c819ab9 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fd16c34e09b mesos::internal::master::Framework::removeExecutor()
@ 0x7fd16c2da2e4 mesos::internal::master::Master::removeExecutor()
@ 0x7fd16c2e6255 mesos::internal::master::Master::exitedExecutor()
@ 0x7fd16c348269 ProtobufProcess<>::handler4<>()
@ 0x7fd16c2fc18e std::_Function_handler<>::_M_invoke()
@ 0x7fd16c322132 ProtobufProcess<>::visit()
@ 0x7fd16c2cef7a mesos::internal::master::Master::_visit()
@ 0x7fd16c2dc3d8 mesos::internal::master::Master::visit()
@ 0x7fd16c7c2502 process::ProcessManager::resume()
@ 0x7fd16c7c280c process::schedule()
@ 0x7fd16b9c683d start_thread
@ 0x7fd16a2b626d clone
{noformat}
This occurs sometime after a failover and indicates that the Slave and
Framework structs are not kept in sync. The problem seems to be here: when
re-registering a framework on a failed-over master, we only add executors for
which the master has tasks stored:
{code}
void Master::_reregisterFramework(
    const UPID& from,
    const FrameworkInfo& frameworkInfo,
    bool failover,
    const Future<Option<Error> >& validationError)
{
  ...
  if (frameworks.registered.count(frameworkInfo.id()) > 0) {
    ...
  } else {
    // We don't have a framework with this ID, so we must be a newly
    // elected Mesos master to which either an existing scheduler or a
    // failed-over one is connecting. Create a Framework object and add
    // any tasks it has that have been reported by reconnecting slaves.
    Framework* framework =
      new Framework(frameworkInfo, frameworkInfo.id(), from, Clock::now());
    framework->reregisteredTime = Clock::now();

    // TODO(benh): Check for root submissions like above!

    // Add any running tasks reported by slaves for this framework.
    foreachvalue (Slave* slave, slaves.registered) {
      foreachkey (const FrameworkID& frameworkId, slave->tasks) {
        foreachvalue (Task* task, slave->tasks[frameworkId]) {
          if (framework->id == task->framework_id()) {
            framework->addTask(task);

            // Also add the task's executor for resource accounting
            // if it's still alive on the slave and we've not yet
            // added it to the framework.
            if (task->has_executor_id() &&
                slave->hasExecutor(framework->id, task->executor_id()) &&
                !framework->hasExecutor(slave->id, task->executor_id())) {
              // XXX: If an executor has no tasks, the executor will not
              // XXX: be added to the Framework struct!
              const ExecutorInfo& executorInfo =
                slave->executors[framework->id][task->executor_id()];
              framework->addExecutor(slave->id, executorInfo);
            }
          }
        }
      }
    }

    // N.B. Need to add the framework _after_ we add its tasks
    // (above) so that we can properly determine the resources it's
    // currently using!
    addFramework(framework);
  }
}
{code}
> CHECK failure in master.
> ------------------------
>
> Key: MESOS-1821
> URL: https://issues.apache.org/jira/browse/MESOS-1821
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 0.21.0
> Reporter: Benjamin Mahler
> Assignee: Benjamin Mahler
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)