[
https://issues.apache.org/jira/browse/MESOS-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894661#comment-16894661
]
Qian Zhang commented on MESOS-9909:
-----------------------------------
The Mesos agent crashes at [this
line|https://github.com/apache/mesos/blob/1.8.1/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L1462].
The root cause is that, during recovery, the CNI isolator does NOT recover the
network info of a nested container from the executor info, because nested
containers have no executor info. See
[here|https://github.com/apache/mesos/blob/1.8.1/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L541:L549]
for details. As a result, `containerNetwork.networkInfo` is NONE when the
status of the nested container is queried, and the `CHECK_SOME` fails.
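
To make the failure mode concrete, here is a minimal, self-contained sketch of it. It is NOT the Mesos code: `NetworkInfo`, `ContainerNetwork`, `RecoveredContainer`, `recover`, `statusWouldAbort` and `networkNameForStatus` are hypothetical stand-ins, `std::optional` stands in for the `Option` type, and the guard at the end is only one possible way to avoid the abort, not necessarily the fix that should go into the isolator.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical, simplified stand-ins for the isolator's bookkeeping.
// These are NOT the real Mesos types.
struct NetworkInfo {
  std::string name;
};

struct ContainerNetwork {
  std::string networkName;
  // Restored from the executor info during recovery. Nested containers
  // have no executor info, so this stays empty for them.
  std::optional<NetworkInfo> networkInfo;
};

struct RecoveredContainer {
  std::string id;
  bool nested;
  std::string networkName;
};

// Models recovery: only top-level containers get `networkInfo` restored.
ContainerNetwork recover(const RecoveredContainer& container) {
  ContainerNetwork network{container.networkName, std::nullopt};
  if (!container.nested) {
    network.networkInfo = NetworkInfo{container.networkName};
  }
  return network;
}

// Models the unguarded status path: returns true when a check equivalent
// to CHECK_SOME(containerNetwork.networkInfo) would abort the process.
bool statusWouldAbort(const ContainerNetwork& network) {
  return !network.networkInfo.has_value();
}

// One possible guard (an assumption, not the upstream fix): report the
// network name only when the network info is actually available.
std::optional<std::string> networkNameForStatus(
    const ContainerNetwork& network) {
  if (!network.networkInfo.has_value()) {
    return std::nullopt;
  }
  return network.networkInfo->name;
}
```

In this model, a recovered top-level container on `net1` passes the check, while a recovered nested container on the same network trips it, which matches the `is NONE` failure in the log.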
> Mesos agent crashes after recovery when there is nested container joins a CNI
> network
> -------------------------------------------------------------------------------------
>
> Key: MESOS-9909
> URL: https://issues.apache.org/jira/browse/MESOS-9909
> Project: Mesos
> Issue Type: Bug
> Components: cni, containerization
> Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.7.0, 1.7.1, 1.7.2, 1.8.0, 1.8.1
> Reporter: Qian Zhang
> Priority: Major
> Labels: cni, containerization
>
> Reproduce steps:
> 1. Use `mesos-execute` to launch a task group with checkpointing enabled. The
> task in the task group joins the CNI network `net1` and has a health check
> enabled. The health check succeeds the first time, fails the second time,
> succeeds the third time, and so on. It alternates this way so that the task
> keeps generating status updates after recovery.
> {code:java}
> $ mesos-execute --master=<masterIP>:5050 \
>     --task_group=file:///tmp/task_group.json --checkpoint
> $ cat /tmp/task_group.json
> {
>   "tasks": [
>     {
>       "name": "test",
>       "task_id": {"value": "test"},
>       "agent_id": {"value": ""},
>       "resources": [
>         {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
>         {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
>       ],
>       "command": {
>         "value": "ip a && sleep 55555"
>       },
>       "container": {
>         "type": "MESOS",
>         "network_infos": [
>           {
>             "name": "net1"
>           }
>         ]
>       },
>       "health_check": {
>         "type": "COMMAND",
>         "command": {
>           "value": "if test -f file; then rm -rf file && exit 1; else touch file && exit 0; fi"
>         }
>       }
>     }
>   ]
> }
> {code}
> 2. Restart the Mesos agent. The agent then crashes when it handles the
> `TASK_RUNNING` status update triggered by the health check.
> {code:java}
> I0728 16:44:34.485939 3513 slave.cpp:5702] Handling status update TASK_RUNNING (Status UUID: 81fa5c56-4d79-4da4-846a-05e94591728b) for task test in health state healthy of framework 990a6379-5727-4490-9abe-7869ff8a1cf2-0000
> F0728 16:44:34.528841 3510 cni.cpp:1462] CHECK_SOME(containerNetwork.networkInfo): is NONE
> *** Check failure stack trace: ***
> @ 0x7ffff5000e12 google::LogMessage::Fail()
> @ 0x7ffff5000d5b google::LogMessage::SendToLog()
> @ 0x7ffff50006e7 google::LogMessage::Flush()
> @ 0x7ffff5003dfe google::LogMessageFatal::~LogMessageFatal()
> @ 0x5555555f90b0 _CheckFatal::~_CheckFatal()
> @ 0x7ffff372f994 mesos::internal::slave::NetworkCniIsolatorProcess::status()
> @ 0x7ffff2e16a90
> _ZZN7process8dispatchIN5mesos15ContainerStatusENS1_8internal5slave20MesosIsolatorProcessERKNS1_11ContainerIDES8_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSD_FSB_T1_EOT2_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteISO_EEOS6_PNS_11ProcessBaseEE_clESR_SS_SU_
> @ 0x7ffff2e20d57
> _ZN5cpp176invokeIZN7process8dispatchIN5mesos15ContainerStatusENS3_8internal5slave20MesosIsolatorProcessERKNS3_11ContainerIDESA_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSF_FSD_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseIS4_EESt14default_deleteISQ_EEOS8_PNS1_11ProcessBaseEE_JST_S8_SW_EEEDTclcl7forwardISC_Efp_Espcl7forwardIT0_Efp0_EEEOSC_DpOSY_
> @ 0x7ffff2e1ff2f
> _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS4_8internal5slave20MesosIsolatorProcessERKNS4_11ContainerIDESB_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSG_FSE_T1_EOT2_EUlSt10unique_ptrINS2_7PromiseIS5_EESt14default_deleteISR_EEOS9_PNS2_11ProcessBaseEE_JSU_S9_St12_PlaceholderILi1EEEE13invoke_expandISY_St5tupleIJSU_S9_S10_EES13_IJOSX_EEJLm0ELm1ELm2EEEEDTcl6invokecl7forwardISD_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISG_Efp0_EEcl7forwardISK_Efp2_EEEEOSD_OSG_N5cpp1416integer_sequenceImJXspT2_EEEEOSK_
> @ 0x7ffff2e1f75e
> _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS4_8internal5slave20MesosIsolatorProcessERKNS4_11ContainerIDESB_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSG_FSE_T1_EOT2_EUlSt10unique_ptrINS2_7PromiseIS5_EESt14default_deleteISR_EEOS9_PNS2_11ProcessBaseEE_JSU_S9_St12_PlaceholderILi1EEEEclIJSX_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2EEEE_Ecl16forward_as_tuplespcl7forwardIT_Efp_EEEEDpOS16_
> @ 0x7ffff2e1f20e
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS6_8internal5slave20MesosIsolatorProcessERKNS6_11ContainerIDESD_EENS4_6FutureIT_EERKNS4_3PIDIT0_EEMSI_FSG_T1_EOT2_EUlSt10unique_ptrINS4_7PromiseIS7_EESt14default_deleteIST_EEOSB_PNS4_11ProcessBaseEE_JSW_SB_St12_PlaceholderILi1EEEEEJSZ_EEEDTclcl7forwardISF_Efp_Espcl7forwardIT0_Efp0_EEEOSF_DpOS14_
> @ 0x7ffff2e1ef11
> _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS7_8internal5slave20MesosIsolatorProcessERKNS7_11ContainerIDESE_EENS5_6FutureIT_EERKNS5_3PIDIT0_EEMSJ_FSH_T1_EOT2_EUlSt10unique_ptrINS5_7PromiseIS8_EESt14default_deleteISU_EEOSC_PNS5_11ProcessBaseEE_JSX_SC_St12_PlaceholderILi1EEEEEJS10_EEEvOSG_DpOT0_
> @ 0x7ffff2e1ead6
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos15ContainerStatusENSA_8internal5slave20MesosIsolatorProcessERKNSA_11ContainerIDESH_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSM_FSK_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteISX_EEOSF_S3_E_JS10_SF_St12_PlaceholderILi1EEEEEEclEOS3_
> @ 0x7ffff4f0ad6b _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
> @ 0x7ffff4ecdb4a process::ProcessBase::consume()
> @ 0x7ffff4ef79d0 _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
> @ 0x5555555f9c1e process::ProcessBase::serve()
> @ 0x7ffff4eca4e8 process::ProcessManager::resume()
> @ 0x7ffff4ec695e _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
> @ 0x7ffff4ed5c7f
> _ZSt13__invoke_implIvZN7process14ProcessManager12init_threadsEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
> @ 0x7ffff4ed28ae
> _ZSt8__invokeIZN7process14ProcessManager12init_threadsEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS4_DpOS5_
> @ 0x7ffff4ef0e2c
> _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEE9_M_invokeIJLm0EEEEDTcl8__invokespcl10_S_declvalIXT_EEEEESt12_Index_tupleIJXspT_EEE
> @ 0x7ffff4eefea0
> _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEclEv
> @ 0x7ffff4eeece0
> _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEEE6_M_runEv
> @ 0x7fffe6eb957f (unknown)
> @ 0x7fffe69cc6db start_thread
> @ 0x7fffe66f588f clone
> {code}
>