[ 
https://issues.apache.org/jira/browse/MESOS-10211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365087#comment-17365087
 ] 

Charles Natali commented on MESOS-10211:
----------------------------------------

[~ggmmggmm2] so could you give more details?

> mesos agent crashes every time when launched tensorboard in a horovod image 
> with mesos container
> ------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-10211
>                 URL: https://issues.apache.org/jira/browse/MESOS-10211
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.11.0
>         Environment: agent:ubuntu18.04
>            Reporter: YZ sun
>            Priority: Critical
>
> When launching a task using the image
> "horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1",
> if TensorBoard in this image is started,
> the agent node crashes immediately, every time.
> If TensorBoard is not started by the command, Mesos works as expected.
> The agent log looks like this:
> {code:java}
> //agent crash
> I0127 16:07:21.860065 30960 slave.cpp:3181] Launching task 'baseEnvSingle_gpunode1' for framework baseDevEnv_root_1611734806
> F0127 16:07:21.860143 30960 slave.cpp:3194] Check failed: executor == nullptr
> *** Check failure stack trace: ***
>     @     0x7f2bcc4221fc  google::LogMessage::Fail()
>     @     0x7f2bcc422145  google::LogMessage::SendToLog()
>     @     0x7f2bcc421ad1  google::LogMessage::Flush()
>     @     0x7f2bcc4251e8  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f2bca4cb10b  mesos::internal::slave::Slave::__run()
>     @     0x7f2bca570ac6  
> _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSB_INS1_13TaskGroupInfoEERKSt6vectorINS2_19ResourceVersionUUIDESaISL_EERKSB_IbEbS7_SA_SF_SJ_SP_SS_bEEvRKNS_3PIDIT_EEMSU_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_ENKUlOS5_OS8_OSD_OSH_OSN_OSQ_ObPNS_11ProcessBaseEE_clES1L_S1M_S1N_S1O_S1P_S1Q_S1R_S1T_
>     @     0x7f2bca663b01  
> _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS3_13FrameworkInfoERKNS3_12ExecutorInfoERK6OptionINS3_8TaskInfoEERKSD_INS3_13TaskGroupInfoEERKSt6vectorINS4_19ResourceVersionUUIDESaISN_EERKSD_IbEbS9_SC_SH_SL_SR_SU_bEEvRKNS1_3PIDIT_EEMSW_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS7_OSA_OSF_OSJ_OSP_OSS_ObPNS1_11ProcessBaseEE_JS7_SA_SF_SJ_SP_SS_bS1V_EEEDTclcl7forwardISW_Efp_Espcl7forwardIT0_Efp0_EEEOSW_DpOS1X_
>     @     0x7f2bca6555dc  
> _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS4_13FrameworkInfoERKNS4_12ExecutorInfoERK6OptionINS4_8TaskInfoEERKSE_INS4_13TaskGroupInfoEERKSt6vectorINS5_19ResourceVersionUUIDESaISO_EERKSE_IbEbSA_SD_SI_SM_SS_SV_bEEvRKNS2_3PIDIT_EEMSX_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS8_OSB_OSG_OSK_OSQ_OST_ObPNS2_11ProcessBaseEE_JS8_SB_SG_SK_SQ_ST_bSt12_PlaceholderILi1EEEE13invoke_expandIS1X_St5tupleIJS8_SB_SG_SK_SQ_ST_bS1Z_EES22_IJOS1W_EEJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7EEEEDTcl6invokecl7forwardISX_Efp_Espcl6expandcl3getIXT2_EEcl7forwardIS11_Efp0_EEcl7forwardIS12_Efp2_EEEEOSX_OS11_N5cpp1416integer_sequenceImJXspT2_EEEEOS12_
>     @     0x7f2bca64da94  
> _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS4_13FrameworkInfoERKNS4_12ExecutorInfoERK6OptionINS4_8TaskInfoEERKSE_INS4_13TaskGroupInfoEERKSt6vectorINS5_19ResourceVersionUUIDESaISO_EERKSE_IbEbSA_SD_SI_SM_SS_SV_bEEvRKNS2_3PIDIT_EEMSX_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS8_OSB_OSG_OSK_OSQ_OST_ObPNS2_11ProcessBaseEE_JS8_SB_SG_SK_SQ_ST_bSt12_PlaceholderILi1EEEEclIJS1W_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7EEEE_Ecl16forward_as_tuplespcl7forwardIT_Efp_EEEEDpOS25_
>     @     0x7f2bca647e56  
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS6_13FrameworkInfoERKNS6_12ExecutorInfoERK6OptionINS6_8TaskInfoEERKSG_INS6_13TaskGroupInfoEERKSt6vectorINS7_19ResourceVersionUUIDESaISQ_EERKSG_IbEbSC_SF_SK_SO_SU_SX_bEEvRKNS4_3PIDIT_EEMSZ_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSA_OSD_OSI_OSM_OSS_OSV_ObPNS4_11ProcessBaseEE_JSA_SD_SI_SM_SS_SV_bSt12_PlaceholderILi1EEEEEJS1Y_EEEDTclcl7forwardISZ_Efp_Espcl7forwardIT0_Efp0_EEEOSZ_DpOS23_
>     @     0x7f2bca645145  
> _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS7_13FrameworkInfoERKNS7_12ExecutorInfoERK6OptionINS7_8TaskInfoEERKSH_INS7_13TaskGroupInfoEERKSt6vectorINS8_19ResourceVersionUUIDESaISR_EERKSH_IbEbSD_SG_SL_SP_SV_SY_bEEvRKNS5_3PIDIT_EEMS10_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSB_OSE_OSJ_OSN_OST_OSW_ObPNS5_11ProcessBaseEE_JSB_SE_SJ_SN_ST_SW_bSt12_PlaceholderILi1EEEEEJS1Z_EEEvOS10_DpOT0_
>     @     0x7f2bca641d60  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSK_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaISU_EERKSK_IbEbSG_SJ_SO_SS_SY_S11_bEEvRKNS1_3PIDIT_EEMS13_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSE_OSH_OSM_OSQ_OSW_OSZ_ObS3_E_JSE_SH_SM_SQ_SW_SZ_bSt12_PlaceholderILi1EEEEEEclEOS3_
>     @     0x7f2bcc2f7a59  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
>     @     0x7f2bcc2baae8  process::ProcessBase::consume()
>     @     0x7f2bcc2e475c  
> _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
>     @     0x55bb81f997ae  process::ProcessBase::serve()
>     @     0x7f2bcc2b7486  process::ProcessManager::resume()
>     @     0x7f2bcc2b3878  
> _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
>     @     0x7f2bcc2c2c1d  
> _ZSt13__invoke_implIvZN7process14ProcessManager12init_threadsEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
>     @     0x7f2bcc2bf84c  
> _ZSt8__invokeIZN7process14ProcessManager12init_threadsEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS4_DpOS5_
>     @     0x7f2bcc2dddca  
> _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEE9_M_invokeIJLm0EEEEDTcl8__invokespcl10_S_declvalIXT_EEEEESt12_Index_tupleIJXspT_EEE
>     @     0x7f2bcc2dce3e  
> _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEclEv
>     @     0x7f2bcc2dbc7e  
> _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEEE6_M_runEv
>     @     0x7f2bbda376df  (unknown)
>     @     0x7f2bbd54a6db  start_thread
>     @     0x7f2bbd27371f  clone
> Aborted (core dumped)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
