[ 
https://issues.apache.org/jira/browse/MESOS-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-9909:
---------------------------------

    Shepherd: Gilbert Song
    Assignee: Qian Zhang

RR: [https://reviews.apache.org/r/71174/]

> Mesos agent crashes after recovery when a nested container joins a CNI 
> network
> -------------------------------------------------------------------------------------
>
>                 Key: MESOS-9909
>                 URL: https://issues.apache.org/jira/browse/MESOS-9909
>             Project: Mesos
>          Issue Type: Bug
>          Components: cni, containerization
>    Affects Versions: 1.8.0, 1.7.2, 1.7.1, 1.7.0, 1.6.2, 1.6.1, 1.6.0, 1.8.1
>            Reporter: Qian Zhang
>            Assignee: Qian Zhang
>            Priority: Major
>              Labels: cni, containerization
>
> Steps to reproduce:
> 1. Use `mesos-execute` to launch a task group with checkpointing enabled. The 
> task in the task group joins a CNI network `net1` and has a health check 
> enabled. The health check succeeds the first time, fails the second time, 
> succeeds the third time, and so on. We alternate the health check result 
> this way so that the task keeps generating status updates after the agent 
> recovers.
> {code:java}
> $ mesos-execute --master=<masterIP>:5050 --task_group=file:///tmp/task_group.json --checkpoint
> $ cat /tmp/task_group.json
> {
>   "tasks":[
>     {
>       "name" : "test",
>       "task_id" : {"value" : "test"},
>       "agent_id": {"value" : ""},
>       "resources": [
>         {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
>         {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
>       ],
>       "command": {
>         "value": "ip a && sleep 55555"
>       },
>       "container": {
>         "type": "MESOS",
>         "network_infos": [
>           {
>             "name": "net1"
>           }
>         ]
>       },
>       "health_check": {
>         "type": "COMMAND",
>         "command": {
>           "value": "if test -f file; then rm -rf file && exit 1; else touch file && exit 0; fi"
>         }
>       }
>     }
>   ]
> }
> {code}
>  2. Restart the Mesos agent. The agent then crashes when it handles the 
> `TASK_RUNNING` status update triggered by the health check.
> {code:java}
> I0728 16:44:34.485939 3513 slave.cpp:5702] Handling status update TASK_RUNNING (Status UUID: 81fa5c56-4d79-4da4-846a-05e94591728b) for task test in health state healthy of framework 990a6379-5727-4490-9abe-7869ff8a1cf2-0000
> F0728 16:44:34.528841 3510 cni.cpp:1462] CHECK_SOME(containerNetwork.networkInfo): is NONE
> *** Check failure stack trace: ***
> @ 0x7ffff5000e12 google::LogMessage::Fail()
> @ 0x7ffff5000d5b google::LogMessage::SendToLog()
> @ 0x7ffff50006e7 google::LogMessage::Flush()
> @ 0x7ffff5003dfe google::LogMessageFatal::~LogMessageFatal()
> @ 0x5555555f90b0 _CheckFatal::~_CheckFatal()
> @ 0x7ffff372f994 mesos::internal::slave::NetworkCniIsolatorProcess::status()
> @ 0x7ffff2e16a90 
> _ZZN7process8dispatchIN5mesos15ContainerStatusENS1_8internal5slave20MesosIsolatorProcessERKNS1_11ContainerIDES8_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSD_FSB_T1_EOT2_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteISO_EEOS6_PNS_11ProcessBaseEE_clESR_SS_SU_
> @ 0x7ffff2e20d57 
> _ZN5cpp176invokeIZN7process8dispatchIN5mesos15ContainerStatusENS3_8internal5slave20MesosIsolatorProcessERKNS3_11ContainerIDESA_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSF_FSD_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseIS4_EESt14default_deleteISQ_EEOS8_PNS1_11ProcessBaseEE_JST_S8_SW_EEEDTclcl7forwardISC_Efp_Espcl7forwardIT0_Efp0_EEEOSC_DpOSY_
> @ 0x7ffff2e1ff2f 
> _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS4_8internal5slave20MesosIsolatorProcessERKNS4_11ContainerIDESB_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSG_FSE_T1_EOT2_EUlSt10unique_ptrINS2_7PromiseIS5_EESt14default_deleteISR_EEOS9_PNS2_11ProcessBaseEE_JSU_S9_St12_PlaceholderILi1EEEE13invoke_expandISY_St5tupleIJSU_S9_S10_EES13_IJOSX_EEJLm0ELm1ELm2EEEEDTcl6invokecl7forwardISD_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISG_Efp0_EEcl7forwardISK_Efp2_EEEEOSD_OSG_N5cpp1416integer_sequenceImJXspT2_EEEEOSK_
> @ 0x7ffff2e1f75e 
> _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS4_8internal5slave20MesosIsolatorProcessERKNS4_11ContainerIDESB_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSG_FSE_T1_EOT2_EUlSt10unique_ptrINS2_7PromiseIS5_EESt14default_deleteISR_EEOS9_PNS2_11ProcessBaseEE_JSU_S9_St12_PlaceholderILi1EEEEclIJSX_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2EEEE_Ecl16forward_as_tuplespcl7forwardIT_Efp_EEEEDpOS16_
> @ 0x7ffff2e1f20e 
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS6_8internal5slave20MesosIsolatorProcessERKNS6_11ContainerIDESD_EENS4_6FutureIT_EERKNS4_3PIDIT0_EEMSI_FSG_T1_EOT2_EUlSt10unique_ptrINS4_7PromiseIS7_EESt14default_deleteIST_EEOSB_PNS4_11ProcessBaseEE_JSW_SB_St12_PlaceholderILi1EEEEEJSZ_EEEDTclcl7forwardISF_Efp_Espcl7forwardIT0_Efp0_EEEOSF_DpOS14_
> @ 0x7ffff2e1ef11 
> _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS7_8internal5slave20MesosIsolatorProcessERKNS7_11ContainerIDESE_EENS5_6FutureIT_EERKNS5_3PIDIT0_EEMSJ_FSH_T1_EOT2_EUlSt10unique_ptrINS5_7PromiseIS8_EESt14default_deleteISU_EEOSC_PNS5_11ProcessBaseEE_JSX_SC_St12_PlaceholderILi1EEEEEJS10_EEEvOSG_DpOT0_
> @ 0x7ffff2e1ead6 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos15ContainerStatusENSA_8internal5slave20MesosIsolatorProcessERKNSA_11ContainerIDESH_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSM_FSK_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteISX_EEOSF_S3_E_JS10_SF_St12_PlaceholderILi1EEEEEEclEOS3_
> @ 0x7ffff4f0ad6b _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
> @ 0x7ffff4ecdb4a process::ProcessBase::consume()
> @ 0x7ffff4ef79d0 _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
> @ 0x5555555f9c1e process::ProcessBase::serve()
> @ 0x7ffff4eca4e8 process::ProcessManager::resume()
> @ 0x7ffff4ec695e _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
> @ 0x7ffff4ed5c7f 
> _ZSt13__invoke_implIvZN7process14ProcessManager12init_threadsEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
> @ 0x7ffff4ed28ae 
> _ZSt8__invokeIZN7process14ProcessManager12init_threadsEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS4_DpOS5_
> @ 0x7ffff4ef0e2c 
> _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEE9_M_invokeIJLm0EEEEDTcl8__invokespcl10_S_declvalIXT_EEEEESt12_Index_tupleIJXspT_EEE
> @ 0x7ffff4eefea0 
> _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEclEv
> @ 0x7ffff4eeece0 
> _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEEE6_M_runEv
> @ 0x7fffe6eb957f (unknown)
> @ 0x7fffe69cc6db start_thread
> @ 0x7fffe66f588f clone
> {code}
>  
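
The alternating health check from step 1 can be tried outside Mesos to confirm that it really flips between success and failure on each run. The following is only a minimal sketch: the scratch directory and the names `workdir`, `health_check`, and `results` are illustrative and not part of the report, and the command runs in a local temporary directory rather than inside the task's container.

```shell
#!/bin/sh
# Scratch directory standing in for the task's working directory (assumption).
workdir=$(mktemp -d)

health_check() {
  # Same command as the "health_check" value in task_group.json. It runs in a
  # subshell so that `exit` ends only the check and not this script.
  (cd "$workdir" && if test -f file; then rm -rf file && exit 1; else touch file && exit 0; fi)
}

# Run the check four times and collect the exit statuses.
results=""
for i in 1 2 3 4; do
  health_check
  results="$results$?"
done

echo "$results"   # prints 0101
rm -rf "$workdir"
```

Each change in the exit status corresponds to a health state change, which is what makes the task keep sending status updates after the agent restarts.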



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
