[
https://issues.apache.org/jira/browse/MESOS-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224776#comment-17224776
]
Jerome Soussens edited comment on MESOS-10194 at 11/2/20, 4:10 PM:
-------------------------------------------------------------------
Hi [~asekretenko],
I don't know if it's related, but today we had this failure on the master with Mesos
1.10.0:
{code:java}
F1102 11:40:04.209203 6522 hierarchical.cpp:233] Check failed:
scalars.at(slaveID) does not contain cpus(allocated: xxxxx):1; mem(allocated:
xxxxx):15360
*** Check failure stack trace: ***
e06184040 with resources mem(allocated: stable.main):256 of framework
b98761e9-2e84-4971-b678-13b6619b18e1 on agent
4bfe1c0d-aabc-45f4-98fe-3a5480058440-S0 at slave(1)@192.168.250.63:5051
(leta.sophiagenetics.com)
@ 0x7fd65a6fb94d google::LogMessage::SendToLog()
@ 0x7fd65a6f91fb google::LogMessage::Flush()
@ 0x7fd65a6fc3a9 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fd659114217
mesos::internal::master::allocator::internal::ScalarResourceTotals::subtract()
@ 0x7fd659118202
mesos::internal::master::allocator::internal::RoleTree::untrackAllocated()
@ 0x7fd659129afa
mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::recoverResources()
@ 0x7fd65a637081 process::ProcessBase::consume()
@ 0x7fd65a65ceb7 process::ProcessManager::resume()
@ 0x7fd65a660a76
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
@ 0x7fd65a916d80 execute_native_thread_routine
@ 0x7fd6569cee25 start_thread
@ 0x7fd6561dbbad __clone
mesos-master.service: main process exited, code=killed, status=6/ABRT
{code}
A more complete log: [^mesos_scalars_at_slaveId_crash.log]
> Mesos master failure "Check failed: 'get_(role)' Must be SOME"
> --------------------------------------------------------------
>
> Key: MESOS-10194
> URL: https://issues.apache.org/jira/browse/MESOS-10194
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 1.10.0, 1.11.0
> Reporter: Jerome Soussens
> Assignee: Andrei Sekretenko
> Priority: Critical
> Attachments: log_mesos_crash_role_13102020.txt,
> mesos_scalars_at_slaveId_crash.log
>
>
>
> *Impact*: mesos-master crashes with this log:
> {code:java}
> hierarchical.cpp:460] Check failed: 'get_(role)' Must be SOME
> {code}
> *Possible scenario:*
> A framework using a specific role is stopped. At roughly the same time, some
> remaining task status for this framework reaches the master from the
> executor, but the role is no longer listed.
>