[
https://issues.apache.org/jira/browse/MESOS-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944546#comment-15944546
]
Deshi Xiao commented on MESOS-6184:
-----------------------------------
I have rebased the patch onto the 1.2.0 branch and tested it. The HTTP health
check always aborts and leaves a core dump file:
```
I0328 11:48:12.922181 48 exec.cpp:162] Version: 1.2.0
I0328 11:48:12.929252 54 exec.cpp:237] Executor registered on agent
a29dc3a5-3e3f-4058-8ab4-dd7de2ae58d1-S4
I0328 11:48:12.931640 54 docker.cpp:850] Running docker -H
unix:///var/run/docker.sock run --cpu-shares 10 --memory 33554432 --env-file
/tmp/gvqGyb -v
/data/mesos/slaves/a29dc3a5-3e3f-4058-8ab4-dd7de2ae58d1-S4/frameworks/d7ef5d2b-f924-42d9-a274-c020afba6bce-0000/executors/0-hc-xychu-datamanmesos-2f3b47f9ffc048539c7b22baa6c32d8f/runs/458189b8-2ff4-4337-ad3a-67321e96f5cb:/mnt/mesos/sandbox
--net bridge --label=USER_NAME=xychu --label=GROUP_NAME=groupautotest
--label=APP_ID=hc --label=VCLUSTER=clusterautotest --label=USER=xychu
--label=CLUSTER=datamanmesos --label=SLOT=0 --label=APP=hc -p 31000:80/tcp
--name
mesos-a29dc3a5-3e3f-4058-8ab4-dd7de2ae58d1-S4.458189b8-2ff4-4337-ad3a-67321e96f5cb
nginx
I0328 11:48:16.145714 53 health_checker.cpp:196] Ignoring failure as health
check still in grace period
W0328 11:48:26.289958 49 health_checker.cpp:202] Health check failed 1 times
consecutively: HTTP health check failed: curl returned terminated with signal
Aborted (core dumped): ABORT:
(../../../3rdparty/libprocess/include/process/posix/subprocess.hpp:190): Failed
to execute Subprocess::ChildHook: Failed to enter the net namespace of pid
18596: Pid 18596 does not exist
*** Aborted at 1490672906 (unix time) try "date -d @1490672906" if you are
using GNU date ***
PC: @ 0x7f26bfb485f7 __GI_raise
*** SIGABRT (@0x4a) received by PID 74 (TID 0x7f26ba152700) from PID 74; stack
trace: ***
@ 0x7f26c0703100 (unknown)
@ 0x7f26bfb485f7 __GI_raise
@ 0x7f26bfb49ce8 __GI_abort
@ 0x7f26c315778e _Abort()
@ 0x7f26c31577cc _Abort()
@ 0x7f26c237a4b6 process::internal::childMain()
@ 0x7f26c2379e9c std::_Function_handler<>::_M_invoke()
@ 0x7f26c2379e53 process::internal::defaultClone()
@ 0x7f26c237b951 process::internal::cloneChild()
@ 0x7f26c237954f process::subprocess()
@ 0x7f26c15a9fb1
mesos::internal::checks::HealthCheckerProcess::httpHealthCheck()
@ 0x7f26c15ababd
mesos::internal::checks::HealthCheckerProcess::performSingleCheck()
@ 0x7f26c2331389 process::ProcessManager::resume()
@ 0x7f26c233a3f7
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
@ 0x7f26c04a1220 (unknown)
@ 0x7f26c06fbdc5 start_thread
@ 0x7f26bfc0928d __clone
W0328 11:48:36.340055 55 health_checker.cpp:202] Health check failed 2 times
consecutively: HTTP health check failed: curl returned terminated with signal
Aborted (core dumped): ABORT:
(../../../3rdparty/libprocess/include/process/posix/subprocess.hpp:190): Failed
to execute Subprocess::ChildHook: Failed to enter the net namespace of pid
18596: Pid 18596 does not exist
*** Aborted at 1490672916 (unix time) try "date -d @1490672916" if you are
using GNU date ***
PC: @ 0x7f26bfb485f7 __GI_raise
*** SIGABRT (@0x4b) received by PID 75 (TID 0x7f26b9951700) from PID 75; stack
trace: ***
@ 0x7f26c0703100 (unknown)
@ 0x7f26bfb485f7 __GI_raise
@ 0x7f26bfb49ce8 __GI_abort
@ 0x7f26c315778e _Abort()
@ 0x7f26c31577cc _Abort()
@ 0x7f26c237a4b6 process::internal::childMain()
@ 0x7f26c2379e9c std::_Function_handler<>::_M_invoke()
@ 0x7f26c2379e53 process::internal::defaultClone()
@ 0x7f26c237b951 process::internal::cloneChild()
@ 0x7f26c237954f process::subprocess()
@ 0x7f26c15a9fb1
mesos::internal::checks::HealthCheckerProcess::httpHealthCheck()
@ 0x7f26c15ababd
mesos::internal::checks::HealthCheckerProcess::performSingleCheck()
@ 0x7f26c2331389 process::ProcessManager::resume()
@ 0x7f26c233a3f7
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
@ 0x7f26c04a1220 (unknown)
@ 0x7f26c06fbdc5 start_thread
@ 0x7f26bfc0928d __clone
W0328 11:48:46.386533 49 health_checker.cpp:202] Health check failed 3 times
consecutively: HTTP health check failed: curl returned terminated with signal
Aborted (core dumped): ABORT:
(../../../3rdparty/libprocess/include/process/posix/subprocess.hpp:190): Failed
to execute Subprocess::ChildHook: Failed to enter the net namespace of pid
18596: Pid 18596 does not exist
*** Aborted at 1490672926 (unix time) try "date -d @1490672926" if you are
using GNU date ***
PC: @ 0x7f26bfb485f7 __GI_raise
*** SIGABRT (@0x4c) received by PID 76 (TID 0x7f26ba152700) from PID 76; stack
trace: ***
@ 0x7f26c0703100 (unknown)
@ 0x7f26bfb485f7 __GI_raise
@ 0x7f26bfb49ce8 __GI_abort
@ 0x7f26c315778e _Abort()
@ 0x7f26c31577cc _Abort()
@ 0x7f26c237a4b6 process::internal::childMain()
@ 0x7f26c2379e9c std::_Function_handler<>::_M_invoke()
@ 0x7f26c2379e53 process::internal::defaultClone()
@ 0x7f26c237b951 process::internal::cloneChild()
@ 0x7f26c237954f process::subprocess()
@ 0x7f26c15a9fb1
mesos::internal::checks::HealthCheckerProcess::httpHealthCheck()
@ 0x7f26c15ababd
mesos::internal::checks::HealthCheckerProcess::performSingleCheck()
@ 0x7f26c2331389 process::ProcessManager::resume()
@ 0x7f26c233a3f7
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
@ 0x7f26c04a1220 (unknown)
@ 0x7f26c06fbdc5 start_thread
@ 0x7f26bfc0928d __clone
W0328 11:48:56.531623 53 health_checker.cpp:202] Health check failed 4 times
consecutively: HTTP health check failed: curl returned terminated with signal
Aborted (core dumped): ABORT:
(../../../3rdparty/libprocess/include/process/posix/subprocess.hpp:190): Failed
to execute Subprocess::ChildHook: Failed to enter the net namespace of pid
18596: Pid 18596 does not exist
*** Aborted at 1490672936 (unix time) try "date -d @1490672936" if you are
using GNU date ***
PC: @ 0x7f26bfb485f7 __GI_raise
*** SIGABRT (@0x4d) received by PID 77 (TID 0x7f26b814e700) from PID 77; stack
trace: ***
@ 0x7f26c0703100 (unknown)
@ 0x7f26bfb485f7 __GI_raise
@ 0x7f26bfb49ce8 __GI_abort
@ 0x7f26c315778e _Abort()
@ 0x7f26c31577cc _Abort()
@ 0x7f26c237a4b6 process::internal::childMain()
@ 0x7f26c2379e9c std::_Function_handler<>::_M_invoke()
@ 0x7f26c2379e53 process::internal::defaultClone()
@ 0x7f26c237b951 process::internal::cloneChild()
@ 0x7f26c237954f process::subprocess()
@ 0x7f26c15a9fb1
mesos::internal::checks::HealthCheckerProcess::httpHealthCheck()
@ 0x7f26c15ababd
mesos::internal::checks::HealthCheckerProcess::performSingleCheck()
@ 0x7f26c2331389 process::ProcessManager::resume()
@ 0x7f26c233a3f7
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
@ 0x7f26c04a1220 (unknown)
@ 0x7f26c06fbdc5 start_thread
@ 0x7f26bfc0928d __clone
W0328 11:49:06.678515 50 health_checker.cpp:202] Health check failed 5 times
consecutively: HTTP health check failed: curl returned terminated with signal
Aborted (core dumped): ABORT:
(../../../3rdparty/libprocess/include/process/posix/subprocess.hpp:190): Failed
to execute Subprocess::ChildHook: Failed to enter the net namespace of pid
18596: Pid 18596 does not exist
*** Aborted at 1490672946 (unix time) try "date -d @1490672946" if you are
using GNU date ***
PC: @ 0x7f26bfb485f7 __GI_raise
*** SIGABRT (@0x4e) received by PID 78 (TID 0x7f26b9951700) from PID 78; stack
trace: ***
@ 0x7f26c0703100 (unknown)
@ 0x7f26bfb485f7 __GI_raise
@ 0x7f26bfb49ce8 __GI_abort
@ 0x7f26c315778e _Abort()
@ 0x7f26c31577cc _Abort()
@ 0x7f26c237a4b6 process::internal::childMain()
@ 0x7f26c2379e9c std::_Function_handler<>::_M_invoke()
@ 0x7f26c2379e53 process::internal::defaultClone()
@ 0x7f26c237b951 process::internal::cloneChild()
@ 0x7f26c237954f process::subprocess()
@ 0x7f26c15a9fb1
mesos::internal::checks::HealthCheckerProcess::httpHealthCheck()
@ 0x7f26c15ababd
mesos::internal::checks::HealthCheckerProcess::performSingleCheck()
@ 0x7f26c2331389 process::ProcessManager::resume()
@ 0x7f26c233a3f7
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
@ 0x7f26c04a1220 (unknown)
@ 0x7f26c06fbdc5 start_thread
@ 0x7f26bfc0928d __clone
I0328 11:49:06.678840 50 health_checker.cpp:130] Health checking stopped
I0328 11:49:06.880620 49 health_checker.cpp:130] Health checking stopped
```
> Health checks should use a general mechanism to enter namespaces of the task.
> -----------------------------------------------------------------------------
>
> Key: MESOS-6184
> URL: https://issues.apache.org/jira/browse/MESOS-6184
> Project: Mesos
> Issue Type: Improvement
> Reporter: haosdent
> Assignee: haosdent
> Priority: Critical
> Labels: health-check, mesosphere
>
> To perform health checks for tasks, we need to enter the corresponding
> namespaces of the container. For now, the health check uses a custom clone
> function to implement this:
> {code}
> return process::defaultClone([=]() -> int {
>   if (taskPid.isSome()) {
>     foreach (const string& ns, namespaces) {
>       Try<Nothing> setns = ns::setns(taskPid.get(), ns);
>       if (setns.isError()) {
>         ...
>       }
>     }
>   }
>   return func();
> });
> {code}
> After the childHooks patches are merged, we can change the health check to
> use childHooks to call {{setns}} and make {{process::defaultClone}} private
> again.
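
For illustration only, here is a minimal standalone sketch of the kind of work
such a hook would do in the child before exec: open the namespace handle under
/proc and call setns(2) on it. It is Linux-only and does not use the Mesos or
libprocess APIs. The names namespacePath and enterNamespace are hypothetical,
and returning an error string is an assumption made here so the caller can
decide how to react when the target pid is not visible.

```cpp
// Linux-only sketch; NOT the Mesos implementation. Helper names are
// hypothetical and chosen for this example.
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

#include <cerrno>
#include <cstring>
#include <string>

// Builds the procfs path of a namespace handle, e.g. /proc/18596/ns/net.
std::string namespacePath(pid_t pid, const std::string& ns)
{
  return "/proc/" + std::to_string(pid) + "/ns/" + ns;
}

// Tries to move the calling thread into namespace `ns` of process `pid`.
// Returns an empty string on success, otherwise an error message.
std::string enterNamespace(pid_t pid, const std::string& ns)
{
  const std::string path = namespacePath(pid, ns);

  // The open fails if `pid` is not visible in the caller's /proc.
  int fd = ::open(path.c_str(), O_RDONLY | O_CLOEXEC);
  if (fd < 0) {
    return "Failed to open " + path + ": " + std::strerror(errno);
  }

  // nstype 0 accepts whatever namespace type the descriptor refers to.
  if (::setns(fd, 0) != 0) {
    const std::string error = std::strerror(errno);
    ::close(fd);
    return "Failed to enter the " + ns + " namespace of pid " +
           std::to_string(pid) + ": " + error;
  }

  ::close(fd);
  return "";
}
```

Calling enterNamespace(pid, "net") from a process that cannot see pid in its
own /proc returns an error message instead of entering the namespace. A
successful setns(2) call usually requires CAP_SYS_ADMIN.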