[
https://issues.apache.org/jira/browse/MESOS-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047464#comment-15047464
]
Joris Van Remoortere commented on MESOS-4071:
---------------------------------------------
My main fear here is that this wouldn't catch scenarios where the delta
gradually gets larger as operations are performed.
[~jamespeach] Would you be up for writing a simple test case where we apply the
arithmetic resource operations (e.g. add, then subtract) iteratively to see if
there are conditions under which the delta grows?
If the delta can grow then an `almostEquals` approach will just make the
problem rarer, and not solve it. In this case we need to fix the math itself.
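To make the request concrete, here is a rough sketch of the kind of check I have
in mind. It uses plain doubles rather than the real `Resources` arithmetic, and
both `maxDriftAfterRounds` and this `almostEquals` are hypothetical helpers
written for illustration, not existing Mesos functions. It only shows the shape
of the test; it does not tell us whether the real code drifts.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not part of Mesos): repeatedly adds `step` to a value
// and subtracts it again, returning the largest absolute deviation from
// `start` observed after any completed add/subtract round.
double maxDriftAfterRounds(double start, double step, std::size_t rounds)
{
  assert(rounds > 0);

  double value = start;
  double maxDelta = 0.0;

  for (std::size_t i = 0; i < rounds; ++i) {
    value += step; // e.g. resources being allocated
    value -= step; // e.g. the same resources being recovered
    maxDelta = std::max(maxDelta, std::fabs(value - start));
  }

  return maxDelta;
}

// Hypothetical stand-in for an `almostEquals` comparison with a fixed
// tolerance. If the drift can exceed `epsilon`, this check starts failing.
bool almostEquals(double a, double b, double epsilon)
{
  return std::fabs(a - b) <= epsilon;
}
```

A test could sweep over start values, step sizes and round counts and compare
the result of `maxDriftAfterRounds` for small and large round counts. If the
drift stays bounded regardless of the number of rounds, a tolerance-based
comparison may be enough; if it keeps growing, no fixed tolerance will hold.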
I want to make sure that we do not "push the problem down the road", especially
if there are logical branches that depend on this math. There are likely even
more such branches in the schedulers that we communicate with than the ones
pointed out in the Mesos code base.
> Master crash during framework teardown ( Check failed:
> total.resources.contains(slaveId))
> -----------------------------------------------------------------------------------------
>
> Key: MESOS-4071
> URL: https://issues.apache.org/jira/browse/MESOS-4071
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 0.25.0
> Reporter: Mandeep Chadha
>
> Stack Trace :
> NOTE : Replaced IP address with XX.XX.XX.XX
> {code}
> I1204 10:31:03.391127 2588810 master.cpp:5564] Processing TEARDOWN call for
> framework 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014
> (mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST) at
> [email protected]:35237
> I1204 10:31:03.391177 2588810 master.cpp:5576] Removing framework
> 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014
> (mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST)) at
> [email protected]:35237
> I1204 10:31:03.391337 2588805 hierarchical.hpp:605] Deactivated framework
> 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014
> F1204 10:31:03.395500 2588810 sorter.cpp:233] Check failed:
> total.resources.contains(slaveId)
> *** Check failure stack trace: ***
> @ 0x7f2b3dda53d8 google::LogMessage::Fail()
> @ 0x7f2b3dda5327 google::LogMessage::SendToLog()
> @ 0x7f2b3dda4d38 google::LogMessage::Flush()
> @ 0x7f2b3dda7a6c google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f2b3d3351a1
> mesos::internal::master::allocator::DRFSorter::remove()
> @ 0x7f2b3d0b8c29
> mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
> @ 0x7f2b3d0ca823
> _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES6_EEvRKNS_3PIDIT_EEMSA_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESJ_
> @ 0x7f2b3d0dc8dc
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_11FrameworkIDESA_EEvRKNS0_3PIDIT_EEMSE_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f2b3dd2cc35 std::function<>::operator()()
> @ 0x7f2b3dd15ae5 process::ProcessBase::visit()
> @ 0x7f2b3dd188e2 process::DispatchEvent::visit()
> @ 0x472366 process::ProcessBase::serve()
> @ 0x7f2b3dd1203f process::ProcessManager::resume()
> @ 0x7f2b3dd061b2 process::internal::schedule()
> @ 0x7f2b3dd63efd
> _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x7f2b3dd63e4d std::_Bind_simple<>::operator()()
> @ 0x7f2b3dd63de6 std::thread::_Impl<>::_M_run()
> @ 0x318c2b6470 (unknown)
> @ 0x318b2079d1 (unknown)
> @ 0x318aae8b5d (unknown)
> @ (nil) (unknown)
> Aborted (core dumped)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)