[
https://issues.apache.org/jira/browse/MESOS-10211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365087#comment-17365087
]
Charles Natali commented on MESOS-10211:
----------------------------------------
[~ggmmggmm2] so could you give more details?
> mesos agent crashes every time when launched tensorboard in a horovod image
> with mesos container
> ------------------------------------------------------------------------------------------------
>
> Key: MESOS-10211
> URL: https://issues.apache.org/jira/browse/MESOS-10211
> Project: Mesos
> Issue Type: Bug
> Components: agent
> Affects Versions: 1.11.0
> Environment: agent:ubuntu18.04
> Reporter: YZ sun
> Priority: Critical
>
> When launch a task using image
> "horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1",
> if tensorboard in this image is started,
> the agent node will immediately crash every time.
> if tensorboard is not started by command, mesos will just work as expected.
> agent log looks like below:
> {code:java}
> //agent crash
> I0127 16:07:21.860065 30960 slave.cpp:3181] Launching task
> 'baseEnvSingle_gpunode1' for framework baseDevEnv_root_1611734806
> F0127 16:07:21.860143 30960 slave.cpp:3194] Check failed: executor == nullptr
> *** Check failure stack trace: ***
> @ 0x7f2bcc4221fc google::LogMessage::Fail()
> @ 0x7f2bcc422145 google::LogMessage::SendToLog()
> @ 0x7f2bcc421ad1 google::LogMessage::Flush()
> @ 0x7f2bcc4251e8 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f2bca4cb10b mesos::internal::slave::Slave::__run()
> @ 0x7f2bca570ac6
> _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSB_INS1_13TaskGroupInfoEERKSt6vectorINS2_19ResourceVersionUUIDESaISL_EERKSB_IbEbS7_SA_SF_SJ_SP_SS_bEEvRKNS_3PIDIT_EEMSU_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_ENKUlOS5_OS8_OSD_OSH_OSN_OSQ_ObPNS_11ProcessBaseEE_clES1L_S1M_S1N_S1O_S1P_S1Q_S1R_S1T_
> @ 0x7f2bca663b01
> _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS3_13FrameworkInfoERKNS3_12ExecutorInfoERK6OptionINS3_8TaskInfoEERKSD_INS3_13TaskGroupInfoEERKSt6vectorINS4_19ResourceVersionUUIDESaISN_EERKSD_IbEbS9_SC_SH_SL_SR_SU_bEEvRKNS1_3PIDIT_EEMSW_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS7_OSA_OSF_OSJ_OSP_OSS_ObPNS1_11ProcessBaseEE_JS7_SA_SF_SJ_SP_SS_bS1V_EEEDTclcl7forwardISW_Efp_Espcl7forwardIT0_Efp0_EEEOSW_DpOS1X_
> @ 0x7f2bca6555dc
> _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS4_13FrameworkInfoERKNS4_12ExecutorInfoERK6OptionINS4_8TaskInfoEERKSE_INS4_13TaskGroupInfoEERKSt6vectorINS5_19ResourceVersionUUIDESaISO_EERKSE_IbEbSA_SD_SI_SM_SS_SV_bEEvRKNS2_3PIDIT_EEMSX_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS8_OSB_OSG_OSK_OSQ_OST_ObPNS2_11ProcessBaseEE_JS8_SB_SG_SK_SQ_ST_bSt12_PlaceholderILi1EEEE13invoke_expandIS1X_St5tupleIJS8_SB_SG_SK_SQ_ST_bS1Z_EES22_IJOS1W_EEJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7EEEEDTcl6invokecl7forwardISX_Efp_Espcl6expandcl3getIXT2_EEcl7forwardIS11_Efp0_EEcl7forwardIS12_Efp2_EEEEOSX_OS11_N5cpp1416integer_sequenceImJXspT2_EEEEOS12_
> @ 0x7f2bca64da94
> _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS4_13FrameworkInfoERKNS4_12ExecutorInfoERK6OptionINS4_8TaskInfoEERKSE_INS4_13TaskGroupInfoEERKSt6vectorINS5_19ResourceVersionUUIDESaISO_EERKSE_IbEbSA_SD_SI_SM_SS_SV_bEEvRKNS2_3PIDIT_EEMSX_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS8_OSB_OSG_OSK_OSQ_OST_ObPNS2_11ProcessBaseEE_JS8_SB_SG_SK_SQ_ST_bSt12_PlaceholderILi1EEEEclIJS1W_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7EEEE_Ecl16forward_as_tuplespcl7forwardIT_Efp_EEEEDpOS25_
> @ 0x7f2bca647e56
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS6_13FrameworkInfoERKNS6_12ExecutorInfoERK6OptionINS6_8TaskInfoEERKSG_INS6_13TaskGroupInfoEERKSt6vectorINS7_19ResourceVersionUUIDESaISQ_EERKSG_IbEbSC_SF_SK_SO_SU_SX_bEEvRKNS4_3PIDIT_EEMSZ_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSA_OSD_OSI_OSM_OSS_OSV_ObPNS4_11ProcessBaseEE_JSA_SD_SI_SM_SS_SV_bSt12_PlaceholderILi1EEEEEJS1Y_EEEDTclcl7forwardISZ_Efp_Espcl7forwardIT0_Efp0_EEEOSZ_DpOS23_
> @ 0x7f2bca645145
> _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS7_13FrameworkInfoERKNS7_12ExecutorInfoERK6OptionINS7_8TaskInfoEERKSH_INS7_13TaskGroupInfoEERKSt6vectorINS8_19ResourceVersionUUIDESaISR_EERKSH_IbEbSD_SG_SL_SP_SV_SY_bEEvRKNS5_3PIDIT_EEMS10_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSB_OSE_OSJ_OSN_OST_OSW_ObPNS5_11ProcessBaseEE_JSB_SE_SJ_SN_ST_SW_bSt12_PlaceholderILi1EEEEEJS1Z_EEEvOS10_DpOT0_
> @ 0x7f2bca641d60
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSK_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaISU_EERKSK_IbEbSG_SJ_SO_SS_SY_S11_bEEvRKNS1_3PIDIT_EEMS13_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSE_OSH_OSM_OSQ_OSW_OSZ_ObS3_E_JSE_SH_SM_SQ_SW_SZ_bSt12_PlaceholderILi1EEEEEEclEOS3_
> @ 0x7f2bcc2f7a59
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
> @ 0x7f2bcc2baae8 process::ProcessBase::consume()
> @ 0x7f2bcc2e475c
> _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
> @ 0x55bb81f997ae process::ProcessBase::serve()
> @ 0x7f2bcc2b7486 process::ProcessManager::resume()
> @ 0x7f2bcc2b3878
> _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
> @ 0x7f2bcc2c2c1d
> _ZSt13__invoke_implIvZN7process14ProcessManager12init_threadsEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
> @ 0x7f2bcc2bf84c
> _ZSt8__invokeIZN7process14ProcessManager12init_threadsEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS4_DpOS5_
> @ 0x7f2bcc2dddca
> _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEE9_M_invokeIJLm0EEEEDTcl8__invokespcl10_S_declvalIXT_EEEEESt12_Index_tupleIJXspT_EEE
> @ 0x7f2bcc2dce3e
> _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEclEv
> @ 0x7f2bcc2dbc7e
> _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEEE6_M_runEv
> @ 0x7f2bbda376df (unknown)
> @ 0x7f2bbd54a6db start_thread
> @ 0x7f2bbd27371f clone
> Aborted (core dumped)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)