Qian Zhang created MESOS-9909:
---------------------------------
Summary: Mesos agent crashes after recovery when a nested container
joins a CNI network
Key: MESOS-9909
URL: https://issues.apache.org/jira/browse/MESOS-9909
Project: Mesos
Issue Type: Bug
Components: cni, containerization
Affects Versions: 1.8.0, 1.7.2, 1.7.1, 1.7.0, 1.6.2, 1.6.1, 1.6.0, 1.8.1
Reporter: Qian Zhang
Steps to reproduce:
1. Use `mesos-execute` to launch a task group with checkpointing enabled. The
task in the task group joins a CNI network `net1` and has a health check
enabled. The health check alternates between success and failure: it succeeds
the first time, fails the second time, succeeds the third time, and so on. We
alternate the result so that the task keeps generating status updates after
recovery.
{code:java}
$ mesos-execute --master=<masterIP>:5050 \
    --task_group=file:///tmp/task_group.json --checkpoint
$ cat /tmp/task_group.json
{
  "tasks": [
    {
      "name": "test",
      "task_id": {"value": "test"},
      "agent_id": {"value": ""},
      "resources": [
        {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
        {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
      ],
      "command": {
        "value": "ip a && sleep 55555"
      },
      "container": {
        "type": "MESOS",
        "network_infos": [
          {
            "name": "net1"
          }
        ]
      },
      "health_check": {
        "type": "COMMAND",
        "command": {
          "value": "if test -f file; then rm -rf file && exit 1; else touch file && exit 0; fi"
        }
      }
    }
  ]
}
{code}
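The alternating behavior of the health check command can be checked outside Mesos. The following is a minimal sketch, assuming a POSIX `sh` and `mktemp` are available. The scratch directory and the `codes` variable are illustrative and not part of the original report. It runs the same command three times and records each exit code.

```shell
# Health check command copied from the task group definition.
check='if test -f file; then rm -rf file && exit 1; else touch file && exit 0; fi'

# Scratch directory (assumption: stands in for the task sandbox).
dir=$(mktemp -d)

codes=""
for i in 1 2 3; do
  # Run each check in a subshell so `cd` and `exit` do not affect this script.
  (cd "$dir" && sh -c "$check")
  rc=$?
  echo "run $i: exit code $rc"
  # Collect exit codes as a space-separated list.
  codes="${codes:+$codes }$rc"
done

rm -rf "$dir"
```

Each run flips the presence of `file`, so consecutive checks alternate between exit code 0 and exit code 1. This is the alternation the report relies on to keep status updates coming after the agent restarts.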
2. Restart the Mesos agent. The agent then crashes when it handles the
`TASK_RUNNING` status update triggered by the health check.
{code:java}
I0728 16:44:34.485939 3513 slave.cpp:5702] Handling status update TASK_RUNNING (Status UUID: 81fa5c56-4d79-4da4-846a-05e94591728b) for task test in health state healthy of framework 990a6379-5727-4490-9abe-7869ff8a1cf2-0000
F0728 16:44:34.528841 3510 cni.cpp:1462] CHECK_SOME(containerNetwork.networkInfo): is NONE
*** Check failure stack trace: ***
@ 0x7ffff5000e12 google::LogMessage::Fail()
@ 0x7ffff5000d5b google::LogMessage::SendToLog()
@ 0x7ffff50006e7 google::LogMessage::Flush()
@ 0x7ffff5003dfe google::LogMessageFatal::~LogMessageFatal()
@ 0x5555555f90b0 _CheckFatal::~_CheckFatal()
@ 0x7ffff372f994 mesos::internal::slave::NetworkCniIsolatorProcess::status()
@ 0x7ffff2e16a90 _ZZN7process8dispatchIN5mesos15ContainerStatusENS1_8internal5slave20MesosIsolatorProcessERKNS1_11ContainerIDES8_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSD_FSB_T1_EOT2_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteISO_EEOS6_PNS_11ProcessBaseEE_clESR_SS_SU_
@ 0x7ffff2e20d57 _ZN5cpp176invokeIZN7process8dispatchIN5mesos15ContainerStatusENS3_8internal5slave20MesosIsolatorProcessERKNS3_11ContainerIDESA_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSF_FSD_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseIS4_EESt14default_deleteISQ_EEOS8_PNS1_11ProcessBaseEE_JST_S8_SW_EEEDTclcl7forwardISC_Efp_Espcl7forwardIT0_Efp0_EEEOSC_DpOSY_
@ 0x7ffff2e1ff2f _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS4_8internal5slave20MesosIsolatorProcessERKNS4_11ContainerIDESB_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSG_FSE_T1_EOT2_EUlSt10unique_ptrINS2_7PromiseIS5_EESt14default_deleteISR_EEOS9_PNS2_11ProcessBaseEE_JSU_S9_St12_PlaceholderILi1EEEE13invoke_expandISY_St5tupleIJSU_S9_S10_EES13_IJOSX_EEJLm0ELm1ELm2EEEEDTcl6invokecl7forwardISD_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISG_Efp0_EEcl7forwardISK_Efp2_EEEEOSD_OSG_N5cpp1416integer_sequenceImJXspT2_EEEEOSK_
@ 0x7ffff2e1f75e _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS4_8internal5slave20MesosIsolatorProcessERKNS4_11ContainerIDESB_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSG_FSE_T1_EOT2_EUlSt10unique_ptrINS2_7PromiseIS5_EESt14default_deleteISR_EEOS9_PNS2_11ProcessBaseEE_JSU_S9_St12_PlaceholderILi1EEEEclIJSX_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2EEEE_Ecl16forward_as_tuplespcl7forwardIT_Efp_EEEEDpOS16_
@ 0x7ffff2e1f20e _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS6_8internal5slave20MesosIsolatorProcessERKNS6_11ContainerIDESD_EENS4_6FutureIT_EERKNS4_3PIDIT0_EEMSI_FSG_T1_EOT2_EUlSt10unique_ptrINS4_7PromiseIS7_EESt14default_deleteIST_EEOSB_PNS4_11ProcessBaseEE_JSW_SB_St12_PlaceholderILi1EEEEEJSZ_EEEDTclcl7forwardISF_Efp_Espcl7forwardIT0_Efp0_EEEOSF_DpOS14_
@ 0x7ffff2e1ef11 _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS7_8internal5slave20MesosIsolatorProcessERKNS7_11ContainerIDESE_EENS5_6FutureIT_EERKNS5_3PIDIT0_EEMSJ_FSH_T1_EOT2_EUlSt10unique_ptrINS5_7PromiseIS8_EESt14default_deleteISU_EEOSC_PNS5_11ProcessBaseEE_JSX_SC_St12_PlaceholderILi1EEEEEJS10_EEEvOSG_DpOT0_
@ 0x7ffff2e1ead6 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos15ContainerStatusENSA_8internal5slave20MesosIsolatorProcessERKNSA_11ContainerIDESH_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSM_FSK_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteISX_EEOSF_S3_E_JS10_SF_St12_PlaceholderILi1EEEEEEclEOS3_
@ 0x7ffff4f0ad6b _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
@ 0x7ffff4ecdb4a process::ProcessBase::consume()
@ 0x7ffff4ef79d0 _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
@ 0x5555555f9c1e process::ProcessBase::serve()
@ 0x7ffff4eca4e8 process::ProcessManager::resume()
@ 0x7ffff4ec695e _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
@ 0x7ffff4ed5c7f _ZSt13__invoke_implIvZN7process14ProcessManager12init_threadsEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
@ 0x7ffff4ed28ae _ZSt8__invokeIZN7process14ProcessManager12init_threadsEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS4_DpOS5_
@ 0x7ffff4ef0e2c _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEE9_M_invokeIJLm0EEEEDTcl8__invokespcl10_S_declvalIXT_EEEEESt12_Index_tupleIJXspT_EEE
@ 0x7ffff4eefea0 _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEclEv
@ 0x7ffff4eeece0 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEEE6_M_runEv
@ 0x7fffe6eb957f (unknown)
@ 0x7fffe69cc6db start_thread
@ 0x7fffe66f588f clone
{code}
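To confirm the same crash on another agent, one can search the agent log for the fatal check message. The following is a minimal sketch, assuming glog's convention that fatal lines start with `F` followed by the month and day, with the message on the same line as the prefix. The log file here is a temporary sample built from abbreviated lines of the trace above, and `log` and `hits` are illustrative names. The real agent log path depends on the deployment and is not given in this report.

```shell
# Sample log (assumption: abbreviated copies of the lines in the report).
log=$(mktemp)
cat > "$log" <<'EOF'
I0728 16:44:34.485939 3513 slave.cpp:5702] Handling status update TASK_RUNNING
F0728 16:44:34.528841 3510 cni.cpp:1462] CHECK_SOME(containerNetwork.networkInfo): is NONE
EOF

# Count fatal lines ("F" + MMDD) that carry the failed CHECK_SOME message.
# In a basic regular expression, "(" and ")" match literally.
hits=$(grep -c '^F[0-9][0-9][0-9][0-9] .*CHECK_SOME(containerNetwork.networkInfo)' "$log")
echo "fatal CHECK_SOME lines: $hits"

rm -f "$log"
```

A nonzero count suggests the agent hit the same check failure in `cni.cpp`. The line number (1462 in this report) may differ between Mesos versions, so the pattern matches on the message rather than the source location.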