[
https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998287#comment-14998287
]
Bernd Mathiske commented on MESOS-3851:
---------------------------------------
[~marco-mesos]: Guess what I have been looking at as of yesterday :-)
[~anandmazumdar] has analyzed this well. There is highly likely be some race of
sorts between executor registration and task launching. That would completely
explain the CHECK that fails. There is another explanation that is less likely
also: faulty data or faulty marshaling or faulty transmission or faulty
unmarshaling. My focus is on understanding how the code allows for said race
and once I understand it, I will try to cause the race by inserting
sleep(someSeconds) somewhere suitable. Without that, there is no reliable way
of reproducing the bug. It never happens when I run this, not even on CentOS
7.1.
> Investigate recent crashes in Command Executor
> ----------------------------------------------
>
> Key: MESOS-3851
> URL: https://issues.apache.org/jira/browse/MESOS-3851
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Anand Mazumdar
> Priority: Blocker
> Labels: mesosphere
>
> Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to
> support rootfs. There seem to be some tests showing frequent crashes due to
> assert violations.
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log:
> {code}
> I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to
> executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6-0000 at
> executor(1)@172.17.5.200:33871'
> I1107 19:36:46.363682 1236 exec.cpp:297]
> I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave
> 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0
> @ 0x7f9f5a7db3fa google::LogMessage::Fail()
> I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns
> @ 0x7f9f5a7db359 google::LogMessage::SendToLog()
> @ 0x7f9f5a7dad6a google::LogMessage::Flush()
> @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal()
> @ 0x48d00a _CheckFatal::~_CheckFatal()
> @ 0x49c99d
> mesos::internal::CommandExecutorProcess::launchTask()
> @ 0x4b3dd7
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
> @ 0x4c470c
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f9f5a761b1b std::function<>::operator()()
> @ 0x7f9f5a749935 process::ProcessBase::visit()
> @ 0x7f9f5a74d700 process::DispatchEvent::visit()
> @ 0x48e004 process::ProcessBase::serve()
> @ 0x7f9f5a745d21 process::ProcessManager::resume()
> @ 0x7f9f5a742f52
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x7f9f5a74cf2c
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7f9f5a74cedc
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x7f9f5a74ce6e
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
> @ 0x7f9f5a74cdc5
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
> @ 0x7f9f5a74cd5e
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
> @ 0x7f9f5624f1e0 (unknown)
> @ 0x7f9f564a8df5 start_thread
> @ 0x7f9f559b71ad __clone
> I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container
> '6553a617-6b4a-418d-9759-5681f45ff854' has exited
> I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container
> '6553a617-6b4a-418d-9759-5681f45ff854'
> I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container
> 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited
> {code}
> The reason seems to be a race between the executor receiving a
> {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the
> {{CHECK_SOME(executorInfo)}} failure.
> Link to complete log:
> https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535
> Another related failure from {{ExamplesTest.PersistentVolumeFramework}}
> {code}
> @ 0x7f4f71529cbd google::LogMessage::SendToLog()
> I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager
> successfully handled status update acknowledgement (UUID:
> 721c7316-5580-4636-a83a-098e3bd4ed1f) for task
> ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework
> ac4ea54a-7d19-4e41-9ee3-1a761f8e5b0f-0000
> @ 0x7f4f715296ce google::LogMessage::Flush()
> @ 0x7f4f7152c402 google::LogMessageFatal::~LogMessageFatal()
> @ 0x48d00a _CheckFatal::~_CheckFatal()
> @ 0x49c99d
> mesos::internal::CommandExecutorProcess::launchTask()
> @ 0x4b3dd7
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
> @ 0x4c470c
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f4f714b047f std::function<>::operator()()
> @ 0x7f4f71498299 process::ProcessBase::visit()
> @ 0x7f4f7149c064 process::DispatchEvent::visit()
> @ 0x48e004 process::ProcessBase::serve()
> @ 0x7f4f71494685 process::ProcessManager::resume()
> {code}
> Full logs at:
> https://builds.apache.org/job/Mesos/1191/COMPILER=gcc,CONFIGURATION=--verbose,OS=centos:7,label_exp=docker%7C%7CHadoop/consoleFull
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)