[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184072#comment-16184072
 ] 

Andrei Budnik commented on MESOS-7500:
--------------------------------------

Command health checks are executed via the `LAUNCH_NESTED_CONTAINER_SESSION` 
call and launched inside a DEBUG container.
A DEBUG container is always launched together with a `mesos-io-switchboard` 
process. After spawning `mesos-io-switchboard`, the agent tries to connect to it 
via a unix domain socket. If the DEBUG container exits before 
`mesos-io-switchboard` exits, the agent sends SIGTERM to the switchboard process 
after a 5 second delay. If `mesos-io-switchboard` exits because it was killed by 
a signal, then the `LAUNCH_NESTED_CONTAINER_SESSION` call is considered failed, 
and so is the corresponding health check.
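
To illustrate the "killed by signal" condition described above, here is a minimal Python sketch. It is not Mesos code: the helper names `run_and_terminate` and `killed_by_signal` are hypothetical, the child is just a sleeping Python process standing in for the switchboard, and it assumes a POSIX system, where `subprocess` reports death by signal N as return code -N.

```python
import signal
import subprocess
import sys
import time


def run_and_terminate(delay_seconds: float) -> int:
    """Start a long-running child, send it SIGTERM after a delay, return its exit code."""
    # The child stands in for a process that is still busy when the signal arrives.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    time.sleep(delay_seconds)
    child.send_signal(signal.SIGTERM)
    return child.wait()


def killed_by_signal(returncode: int) -> bool:
    """On POSIX, a negative return code means the child was terminated by a signal."""
    return returncode < 0


if __name__ == "__main__":
    rc = run_and_terminate(0.2)
    print("return code:", rc, "killed by signal:", killed_by_signal(rc))
```

A caller that treats any signal-terminated child as a failure, as the agent does here, cannot tell a hung process from one that simply had not finished starting up.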
It turned out that `mesos-io-switchboard` is not an executable, but a special 
wrapper script generated by libtool. The first time this script is executed, 
relinking of the executable is triggered. Relinking takes quite a while on slow 
machines (e.g. in Apache CI): I've seen 8 seconds and more. So when the DEBUG 
container exits, the agent sends SIGTERM (as described above) to a process that 
is still being relinked. This happens each time a health check is launched, and 
as a result we see a bunch of failed tests in Apache CI.
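
One quick way to tell a wrapper script from a real binary is to look at the first bytes of the file. The sketch below is only an illustration under assumptions: `classify_program` is a hypothetical helper, it only distinguishes ELF binaries (as found on Linux) from `#!` scripts, and it does not verify that a script was actually generated by libtool.

```python
import sys

ELF_MAGIC = b"\x7fELF"


def classify_program(path: str) -> str:
    """Return 'elf', 'script', or 'unknown' based on the file's first bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head == ELF_MAGIC:
        return "elf"
    # Shell wrapper scripts start with a shebang line.
    if head[:2] == b"#!":
        return "script"
    return "unknown"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path of the program in the build tree as the first argument.
    print(classify_program(sys.argv[1]))
```

If the result for the program in the build tree is `script`, the first execution goes through the wrapper rather than straight to the binary.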
To fix this issue, we need to force libtool/autotools to generate a binary 
instead of a wrapper script; see:
1. https://autotools.io/libtool/wrappers.html
2. `info libtool`

> Command checks via agent lead to flaky tests.
> ---------------------------------------------
>
>                 Key: MESOS-7500
>                 URL: https://issues.apache.org/jira/browse/MESOS-7500
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Alexander Rukletsov
>            Assignee: Andrei Budnik
>              Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed runs: https://pastebin.com/g2mPgYzu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
