[ 
https://issues.apache.org/jira/browse/MESOS-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047464#comment-15047464
 ] 

Joris Van Remoortere commented on MESOS-4071:
---------------------------------------------

My main concern is that this would not catch scenarios where the delta 
gradually grows as operations are performed.
[~jamespeach] Would you be up for writing a simple test case that applies the 
arithmetic resource operations (e.g. add, then subtract) iteratively to see 
whether there are conditions under which the delta grows?
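
A minimal sketch of such a test is shown here. It uses plain doubles in place 
of Mesos scalar resources, so it does not exercise the real `Resources` 
arithmetic; the helper name `maxDelta` and its parameters are hypothetical 
and exist only for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not part of Mesos): starting from `start`, repeatedly
// add and then subtract `step`, and return the largest absolute difference
// between the running value and `start` observed over all iterations.
double maxDelta(double start, double step, std::size_t iterations)
{
  double value = start;
  double worst = 0.0;

  for (std::size_t i = 0; i < iterations; ++i) {
    value += step;  // Stands in for adding a resource quantity.
    value -= step;  // Stands in for subtracting the same quantity.

    // Track the worst deviation seen so far, not just the final one.
    worst = std::max(worst, std::fabs(value - start));
  }

  return worst;
}
```

A step such as 0.5 is exactly representable in binary, so the delta is 
expected to stay at zero. A step such as 0.1 is not, and comparing 
`maxDelta(1.0, 0.1, n)` for increasing `n` would show whether the error stays 
bounded or grows.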

If the delta can grow, then an `almostEquals` approach will only make the 
problem rarer, not solve it. In that case we need to fix the math itself.
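
To illustrate the point, here is a rough sketch of a tolerance-based 
comparison. The signature and the `epsilon` parameter are assumptions made 
for this example, not the API proposed on this ticket.

```cpp
#include <cmath>

// Hypothetical tolerance-based comparison (assumed signature): two values
// are treated as equal when they differ by at most `epsilon`.
bool almostEquals(double a, double b, double epsilon)
{
  return std::fabs(a - b) <= epsilon;
}
```

With `epsilon = 1e-9`, a delta of `1e-12` compares as equal, but a delta that 
has grown to `1e-6` does not. A fixed tolerance therefore only hides the error 
for as long as it stays below the threshold.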

I want to make sure that we do not just push the problem down the road, 
especially if there are logical branches that depend on this math. There are 
likely even more such branches in the schedulers we communicate with than the 
ones already pointed out in the Mesos code base.

> Master crash during framework teardown ( Check failed: 
> total.resources.contains(slaveId))
> -----------------------------------------------------------------------------------------
>
>                 Key: MESOS-4071
>                 URL: https://issues.apache.org/jira/browse/MESOS-4071
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.25.0
>            Reporter: Mandeep Chadha
>
> Stack Trace :
> NOTE : Replaced IP address with XX.XX.XX.XX 
> {code}
> I1204 10:31:03.391127 2588810 master.cpp:5564] Processing TEARDOWN call for 
> framework 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
> (mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST) at 
> [email protected]:35237
> I1204 10:31:03.391177 2588810 master.cpp:5576] Removing framework 
> 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
> (mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST)) at 
> [email protected]:35237
> I1204 10:31:03.391337 2588805 hierarchical.hpp:605] Deactivated framework 
> 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014
> F1204 10:31:03.395500 2588810 sorter.cpp:233] Check failed: 
> total.resources.contains(slaveId)
> *** Check failure stack trace: ***
>     @     0x7f2b3dda53d8  google::LogMessage::Fail()
>     @     0x7f2b3dda5327  google::LogMessage::SendToLog()
>     @     0x7f2b3dda4d38  google::LogMessage::Flush()
>     @     0x7f2b3dda7a6c  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f2b3d3351a1  
> mesos::internal::master::allocator::DRFSorter::remove()
>     @     0x7f2b3d0b8c29  
> mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
>     @     0x7f2b3d0ca823 
> _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES6_EEvRKNS_3PIDIT_EEMSA_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESJ_
>     @     0x7f2b3d0dc8dc  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_11FrameworkIDESA_EEvRKNS0_3PIDIT_EEMSE_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
>     @     0x7f2b3dd2cc35  std::function<>::operator()()
>     @     0x7f2b3dd15ae5  process::ProcessBase::visit()
>     @     0x7f2b3dd188e2  process::DispatchEvent::visit()
>     @           0x472366  process::ProcessBase::serve()
>     @     0x7f2b3dd1203f  process::ProcessManager::resume()
>     @     0x7f2b3dd061b2  process::internal::schedule()
>     @     0x7f2b3dd63efd  
> _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
>     @     0x7f2b3dd63e4d  std::_Bind_simple<>::operator()()
>     @     0x7f2b3dd63de6  std::thread::_Impl<>::_M_run()
>     @       0x318c2b6470  (unknown)
>     @       0x318b2079d1  (unknown)
>     @       0x318aae8b5d  (unknown)
>     @              (nil)  (unknown)
> Aborted (core dumped)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
