[
https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999494#comment-14999494
]
Vinod Kone commented on MESOS-3851:
-----------------------------------
Doesn't look like this is related to the new HTTP executor logic as this race
seem to happen even in non-http-executor based tests. Also the changes in slave
doesn't seem related. Either this race has always existed but only now got
exposed due to the CHECK in the command executor or there are some recent
libprocess related changes that are the cause.
> Investigate recent crashes in Command Executor
> ----------------------------------------------
>
> Key: MESOS-3851
> URL: https://issues.apache.org/jira/browse/MESOS-3851
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Anand Mazumdar
> Priority: Blocker
> Labels: mesosphere
>
> Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to
> support rootfs. There seem to be some tests showing frequent crashes due to
> assert violations.
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log:
> {code}
> I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to
> executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6-0000 at
> executor(1)@172.17.5.200:33871'
> I1107 19:36:46.363682 1236 exec.cpp:297]
> I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave
> 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0
> @ 0x7f9f5a7db3fa google::LogMessage::Fail()
> I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns
> @ 0x7f9f5a7db359 google::LogMessage::SendToLog()
> @ 0x7f9f5a7dad6a google::LogMessage::Flush()
> @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal()
> @ 0x48d00a _CheckFatal::~_CheckFatal()
> @ 0x49c99d
> mesos::internal::CommandExecutorProcess::launchTask()
> @ 0x4b3dd7
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
> @ 0x4c470c
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f9f5a761b1b std::function<>::operator()()
> @ 0x7f9f5a749935 process::ProcessBase::visit()
> @ 0x7f9f5a74d700 process::DispatchEvent::visit()
> @ 0x48e004 process::ProcessBase::serve()
> @ 0x7f9f5a745d21 process::ProcessManager::resume()
> @ 0x7f9f5a742f52
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x7f9f5a74cf2c
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7f9f5a74cedc
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x7f9f5a74ce6e
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
> @ 0x7f9f5a74cdc5
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
> @ 0x7f9f5a74cd5e
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
> @ 0x7f9f5624f1e0 (unknown)
> @ 0x7f9f564a8df5 start_thread
> @ 0x7f9f559b71ad __clone
> I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container
> '6553a617-6b4a-418d-9759-5681f45ff854' has exited
> I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container
> '6553a617-6b4a-418d-9759-5681f45ff854'
> I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container
> 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited
> {code}
> The reason seems to be a race between the executor receiving a
> {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the
> {{CHECK_SOME(executorInfo)}} failure.
> Link to complete log:
> https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535
> Another related failure from {{ExamplesTest.PersistentVolumeFramework}}
> {code}
> @ 0x7f4f71529cbd google::LogMessage::SendToLog()
> I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager
> successfully handled status update acknowledgement (UUID:
> 721c7316-5580-4636-a83a-098e3bd4ed1f) for task
> ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework
> ac4ea54a-7d19-4e41-9ee3-1a761f8e5b0f-0000
> @ 0x7f4f715296ce google::LogMessage::Flush()
> @ 0x7f4f7152c402 google::LogMessageFatal::~LogMessageFatal()
> @ 0x48d00a _CheckFatal::~_CheckFatal()
> @ 0x49c99d
> mesos::internal::CommandExecutorProcess::launchTask()
> @ 0x4b3dd7
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
> @ 0x4c470c
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f4f714b047f std::function<>::operator()()
> @ 0x7f4f71498299 process::ProcessBase::visit()
> @ 0x7f4f7149c064 process::DispatchEvent::visit()
> @ 0x48e004 process::ProcessBase::serve()
> @ 0x7f4f71494685 process::ProcessManager::resume()
> {code}
> Full logs at:
> https://builds.apache.org/job/Mesos/1191/COMPILER=gcc,CONFIGURATION=--verbose,OS=centos:7,label_exp=docker%7C%7CHadoop/consoleFull
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)