Qian Zhang created MESOS-9909:
---------------------------------

             Summary: Mesos agent crashes after recovery when there is nested 
container joins a CNI network
                 Key: MESOS-9909
                 URL: https://issues.apache.org/jira/browse/MESOS-9909
             Project: Mesos
          Issue Type: Bug
          Components: cni, containerization
    Affects Versions: 1.8.0, 1.7.2, 1.7.1, 1.7.0, 1.6.2, 1.6.1, 1.6.0, 1.8.1
            Reporter: Qian Zhang


Steps to reproduce:

1. Use `mesos-execute` to launch a task group with checkpointing enabled. The 
task in the task group joins the CNI network `net1` and has a health check that 
alternates between results: it succeeds the first time, fails the second time, 
succeeds the third time, and so on. The health check is written this way so 
that the task keeps generating status updates after the agent recovers.
{code:java}
$ mesos-execute --master=<masterIP>:5050 --task_group=file:///tmp/task_group.json --checkpoint
$ cat /tmp/task_group.json
{
  "tasks":[
    {
      "name" : "test",
      "task_id" : {"value" : "test"},
      "agent_id": {"value" : ""},
      "resources": [
        {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
        {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
      ],
      "command": {
        "value": "ip a && sleep 55555"
      },
      "container": {
        "type": "MESOS",
        "network_infos": [
          {
            "name": "net1"
          }
        ]
      },
      "health_check": {
        "type": "COMMAND",
        "command": {
          "value": "if test -f file; then rm -rf file && exit 1; else touch file && exit 0; fi"
        }
      }
    }
  ]
}
{code}
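The alternating behaviour of the health check command can be seen by running it a few times outside Mesos. The following is only a minimal sketch: the `health_check` function name, the scratch directory, and the loop are illustrative assumptions, and the real check runs inside the task's container rather than in a temporary directory.

```shell
#!/bin/sh
# The health check command from task_group.json, wrapped in a function.
# The body runs in a subshell so `exit` ends only the check, not this script.
health_check() (
  if test -f file; then rm -rf file && exit 1; else touch file && exit 0; fi
)

# Scratch directory (an assumption for this sketch; Mesos runs the check
# in the task's own environment).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Run the check four times and record each exit status.
results=""
for attempt in 1 2 3 4; do
  health_check
  results="$results$?"
done

echo "$results"   # prints 0101: success, failure, success, failure
cd / && rm -rf "$workdir"
```

Because the result flips on every run, the task's health state keeps changing, which is what keeps status updates flowing after the agent restarts.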
2. Restart the Mesos agent. After recovery, the agent crashes when it handles 
the `TASK_RUNNING` status update triggered by the health check.
{code:java}
I0728 16:44:34.485939 3513 slave.cpp:5702] Handling status update TASK_RUNNING (Status UUID: 81fa5c56-4d79-4da4-846a-05e94591728b) for task test in health state healthy of framework 990a6379-5727-4490-9abe-7869ff8a1cf2-0000
F0728 16:44:34.528841 3510 cni.cpp:1462] CHECK_SOME(containerNetwork.networkInfo): is NONE
*** Check failure stack trace: ***
@ 0x7ffff5000e12 google::LogMessage::Fail()
@ 0x7ffff5000d5b google::LogMessage::SendToLog()
@ 0x7ffff50006e7 google::LogMessage::Flush()
@ 0x7ffff5003dfe google::LogMessageFatal::~LogMessageFatal()
@ 0x5555555f90b0 _CheckFatal::~_CheckFatal()
@ 0x7ffff372f994 mesos::internal::slave::NetworkCniIsolatorProcess::status()
@ 0x7ffff2e16a90 
_ZZN7process8dispatchIN5mesos15ContainerStatusENS1_8internal5slave20MesosIsolatorProcessERKNS1_11ContainerIDES8_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSD_FSB_T1_EOT2_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteISO_EEOS6_PNS_11ProcessBaseEE_clESR_SS_SU_
@ 0x7ffff2e20d57 
_ZN5cpp176invokeIZN7process8dispatchIN5mesos15ContainerStatusENS3_8internal5slave20MesosIsolatorProcessERKNS3_11ContainerIDESA_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSF_FSD_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseIS4_EESt14default_deleteISQ_EEOS8_PNS1_11ProcessBaseEE_JST_S8_SW_EEEDTclcl7forwardISC_Efp_Espcl7forwardIT0_Efp0_EEEOSC_DpOSY_
@ 0x7ffff2e1ff2f 
_ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS4_8internal5slave20MesosIsolatorProcessERKNS4_11ContainerIDESB_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSG_FSE_T1_EOT2_EUlSt10unique_ptrINS2_7PromiseIS5_EESt14default_deleteISR_EEOS9_PNS2_11ProcessBaseEE_JSU_S9_St12_PlaceholderILi1EEEE13invoke_expandISY_St5tupleIJSU_S9_S10_EES13_IJOSX_EEJLm0ELm1ELm2EEEEDTcl6invokecl7forwardISD_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISG_Efp0_EEcl7forwardISK_Efp2_EEEEOSD_OSG_N5cpp1416integer_sequenceImJXspT2_EEEEOSK_
@ 0x7ffff2e1f75e 
_ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS4_8internal5slave20MesosIsolatorProcessERKNS4_11ContainerIDESB_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSG_FSE_T1_EOT2_EUlSt10unique_ptrINS2_7PromiseIS5_EESt14default_deleteISR_EEOS9_PNS2_11ProcessBaseEE_JSU_S9_St12_PlaceholderILi1EEEEclIJSX_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2EEEE_Ecl16forward_as_tuplespcl7forwardIT_Efp_EEEEDpOS16_
@ 0x7ffff2e1f20e 
_ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS6_8internal5slave20MesosIsolatorProcessERKNS6_11ContainerIDESD_EENS4_6FutureIT_EERKNS4_3PIDIT0_EEMSI_FSG_T1_EOT2_EUlSt10unique_ptrINS4_7PromiseIS7_EESt14default_deleteIST_EEOSB_PNS4_11ProcessBaseEE_JSW_SB_St12_PlaceholderILi1EEEEEJSZ_EEEDTclcl7forwardISF_Efp_Espcl7forwardIT0_Efp0_EEEOSF_DpOS14_
@ 0x7ffff2e1ef11 
_ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS7_8internal5slave20MesosIsolatorProcessERKNS7_11ContainerIDESE_EENS5_6FutureIT_EERKNS5_3PIDIT0_EEMSJ_FSH_T1_EOT2_EUlSt10unique_ptrINS5_7PromiseIS8_EESt14default_deleteISU_EEOSC_PNS5_11ProcessBaseEE_JSX_SC_St12_PlaceholderILi1EEEEEJS10_EEEvOSG_DpOT0_
@ 0x7ffff2e1ead6 
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos15ContainerStatusENSA_8internal5slave20MesosIsolatorProcessERKNSA_11ContainerIDESH_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSM_FSK_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteISX_EEOSF_S3_E_JS10_SF_St12_PlaceholderILi1EEEEEEclEOS3_
@ 0x7ffff4f0ad6b _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
@ 0x7ffff4ecdb4a process::ProcessBase::consume()
@ 0x7ffff4ef79d0 _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
@ 0x5555555f9c1e process::ProcessBase::serve()
@ 0x7ffff4eca4e8 process::ProcessManager::resume()
@ 0x7ffff4ec695e _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
@ 0x7ffff4ed5c7f 
_ZSt13__invoke_implIvZN7process14ProcessManager12init_threadsEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
@ 0x7ffff4ed28ae 
_ZSt8__invokeIZN7process14ProcessManager12init_threadsEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS4_DpOS5_
@ 0x7ffff4ef0e2c 
_ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEE9_M_invokeIJLm0EEEEDTcl8__invokespcl10_S_declvalIXT_EEEEESt12_Index_tupleIJXspT_EEE
@ 0x7ffff4eefea0 
_ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEclEv
@ 0x7ffff4eeece0 
_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEEE6_M_runEv
@ 0x7fffe6eb957f (unknown)
@ 0x7fffe69cc6db start_thread
@ 0x7fffe66f588f clone
{code}
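When checking whether an agent hit the same failure, the fatal line can be pulled out of the agent log by its glog severity prefix (`F`). This is a rough sketch: the sample log is written to a temporary file so the snippet is self-contained, its first line is a shortened copy of the one in this report, and the location of a real agent log is deployment-specific and not assumed here.

```shell
#!/bin/sh
# Sample log built from the lines in this report (first line shortened).
log=$(mktemp)
cat > "$log" <<'EOF'
I0728 16:44:34.485939 3513 slave.cpp:5702] Handling status update TASK_RUNNING
F0728 16:44:34.528841 3510 cni.cpp:1462] CHECK_SOME(containerNetwork.networkInfo): is NONE
EOF

# glog starts each line with the severity letter (I, W, E or F) followed
# by the date as MMDD, so fatal entries match '^F' plus four digits.
fatal=$(grep -E '^F[0-9]{4} ' "$log")
echo "$fatal"
rm -f "$log"
```

A match pointing at `cni.cpp` with `CHECK_SOME(containerNetwork.networkInfo)` suggests the same crash as the one reported here.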
 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
