[ 
https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998287#comment-14998287
 ] 

Bernd Mathiske commented on MESOS-3851:
---------------------------------------

[~marco-mesos]: Guess what I have been looking at since yesterday :-)
[~anandmazumdar] has analyzed this well. There is highly likely some sort of 
race between executor registration and task launching. That would completely 
explain the failing CHECK. There is also a less likely explanation: faulty 
data, marshaling, transmission, or unmarshaling. My focus is on understanding 
how the code allows for said race. Once I understand it, I will try to cause 
the race by inserting sleep(someSeconds) somewhere suitable. Without that, 
there is no reliable way of reproducing the bug. It never happens when I run 
this, not even on CentOS 7.1.

> Investigate recent crashes in Command Executor
> ----------------------------------------------
>
>                 Key: MESOS-3851
>                 URL: https://issues.apache.org/jira/browse/MESOS-3851
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Anand Mazumdar
>            Priority: Blocker
>              Labels: mesosphere
>
> Since https://reviews.apache.org/r/38900 (updating CommandExecutor to 
> support rootfs), some tests seem to crash frequently due to assert 
> violations.
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log:
> {code}
> I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to 
> executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6-0000 at 
> executor(1)@172.17.5.200:33871'
> I1107 19:36:46.363682  1236 exec.cpp:297] 
> I1107 19:36:46.373569  1245 exec.cpp:210] Executor registered on slave 
> 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0
>     @     0x7f9f5a7db3fa  google::LogMessage::Fail()
> I1107 19:36:46.394081  1245 exec.cpp:222] Executor::registered took 395411ns
>     @     0x7f9f5a7db359  google::LogMessage::SendToLog()
>     @     0x7f9f5a7dad6a  google::LogMessage::Flush()
>     @     0x7f9f5a7dda9e  google::LogMessageFatal::~LogMessageFatal()
>     @           0x48d00a  _CheckFatal::~_CheckFatal()
>     @           0x49c99d  
> mesos::internal::CommandExecutorProcess::launchTask()
>     @           0x4b3dd7  
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
>     @           0x4c470c  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
>     @     0x7f9f5a761b1b  std::function<>::operator()()
>     @     0x7f9f5a749935  process::ProcessBase::visit()
>     @     0x7f9f5a74d700  process::DispatchEvent::visit()
>     @           0x48e004  process::ProcessBase::serve()
>     @     0x7f9f5a745d21  process::ProcessManager::resume()
>     @     0x7f9f5a742f52  
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
>     @     0x7f9f5a74cf2c  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>     @     0x7f9f5a74cedc  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
>     @     0x7f9f5a74ce6e  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
>     @     0x7f9f5a74cdc5  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
>     @     0x7f9f5a74cd5e  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
>     @     0x7f9f5624f1e0  (unknown)
>     @     0x7f9f564a8df5  start_thread
>     @     0x7f9f559b71ad  __clone
> I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container 
> '6553a617-6b4a-418d-9759-5681f45ff854' has exited
> I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container 
> '6553a617-6b4a-418d-9759-5681f45ff854'
> I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container 
> 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited
> {code}
> The reason seems to be a race in which the executor receives a 
> {{RunTaskMessage}} before the {{ExecutorRegisteredMessage}}, leading to the 
> {{CHECK_SOME(executorInfo)}} failure.
> Link to complete log: 
> https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535
> Another related failure from {{ExamplesTest.PersistentVolumeFramework}}
> {code}
>     @     0x7f4f71529cbd  google::LogMessage::SendToLog()
> I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager 
> successfully handled status update acknowledgement (UUID: 
> 721c7316-5580-4636-a83a-098e3bd4ed1f) for task 
> ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework 
> ac4ea54a-7d19-4e41-9ee3-1a761f8e5b0f-0000
>     @     0x7f4f715296ce  google::LogMessage::Flush()
>     @     0x7f4f7152c402  google::LogMessageFatal::~LogMessageFatal()
>     @           0x48d00a  _CheckFatal::~_CheckFatal()
>     @           0x49c99d  
> mesos::internal::CommandExecutorProcess::launchTask()
>     @           0x4b3dd7  
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
>     @           0x4c470c  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
>     @     0x7f4f714b047f  std::function<>::operator()()
>     @     0x7f4f71498299  process::ProcessBase::visit()
>     @     0x7f4f7149c064  process::DispatchEvent::visit()
>     @           0x48e004  process::ProcessBase::serve()
>     @     0x7f4f71494685  process::ProcessManager::resume()
> {code}
> Full logs at:
> https://builds.apache.org/job/Mesos/1191/COMPILER=gcc,CONFIGURATION=--verbose,OS=centos:7,label_exp=docker%7C%7CHadoop/consoleFull



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
