[
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184072#comment-16184072
]
Andrei Budnik commented on MESOS-7500:
--------------------------------------
Command health checks are executed via the `LAUNCH_NESTED_CONTAINER_SESSION`
call and are launched inside a DEBUG container.
A DEBUG container is always launched together with a `mesos-io-switchboard`
process. After spawning `mesos-io-switchboard`, the agent tries to connect to
it via a unix domain socket. If the DEBUG container exits before
`mesos-io-switchboard` exits, the agent sends SIGTERM to the switchboard
process after a 5 second delay. If `mesos-io-switchboard` exits because it was
killed by a signal, then the `LAUNCH_NESTED_CONTAINER_SESSION` call is
considered failed, and so is the corresponding health check.
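The failure rule described above can be illustrated with a toy model. This is
only a sketch, not Mesos code: the function name `exits_cleanly` and the
scaled-down timings are assumptions made for illustration, and a sleeping
child process stands in for the switchboard. It relies on POSIX signal
semantics.

```python
import signal
import subprocess
import sys
import time


def exits_cleanly(busy_seconds, grace_seconds):
    """Toy model of the agent's rule (hypothetical helper, not Mesos code).

    Spawns a child that stays busy for `busy_seconds`, waits `grace_seconds`,
    and sends SIGTERM if the child is still running. Returns True if the
    child exited on its own, False if it was killed by a signal.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", f"import time; time.sleep({busy_seconds})"]
    )

    # Stand-in for the delay the agent waits before signalling.
    time.sleep(grace_seconds)

    # Only signal the child if it has not exited yet.
    if child.poll() is None:
        child.send_signal(signal.SIGTERM)

    # On POSIX, a negative return code means "terminated by that signal".
    return child.wait() >= 0
```

A child that outlives the grace period is reported as failed even though
nothing was wrong with the work it was doing, which is the pattern the
health checks run into.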
It turned out that `mesos-io-switchboard` is not an executable, but a special
wrapper script generated by libtool. The first time this script is executed,
it triggers relinking of the executable. Relinking takes quite a while on slow
machines (e.g. in Apache CI): I've seen 8 seconds and more. So when the DEBUG
container exits, the agent sends SIGTERM (as described above) to a process
that is still being relinked. This happens each time a health check is
launched, and as a result we see a bunch of failed tests in Apache CI.
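One way to check which kind of file a build produced is to look at its first
bytes. The sketch below is an assumption-laden illustration, not part of the
Mesos build: the helper names `classify_header` and `classify_file` are
hypothetical, and the check only distinguishes an ELF binary (Linux) from a
script that starts with a `#!` line, as a libtool wrapper does.

```python
ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file


def classify_header(head):
    """Classify a file by its leading bytes: 'elf', 'script', or 'other'."""
    if head[:4] == ELF_MAGIC:
        return "elf"
    if head[:2] == b"#!":
        return "script"
    return "other"


def classify_file(path):
    """Read the first four bytes of `path` and classify them."""
    with open(path, "rb") as f:
        return classify_header(f.read(4))
```

Running `classify_file` on the built switchboard would return "script" for a
wrapper and "elf" for a real binary; the path to pass in depends on the build
tree.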
To fix this issue, we need to force libtool/autotools to generate a binary
instead of a wrapper script, see:
1. https://autotools.io/libtool/wrappers.html
2. `info libtool`
> Command checks via agent lead to flaky tests.
> ---------------------------------------------
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
> Issue Type: Bug
> Reporter: Alexander Rukletsov
> Assignee: Andrei Budnik
> Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is
> an example from one of the failed runs: https://pastebin.com/g2mPgYzu