Lately I've been seeing problems on the buildbot with tests hanging at
startup, for example this (search for the "startup-state" test):
http://buildbot.linuxcnc.org/buildbot-admin/builders/1640.rip-buster-rtpreempt-amd64/builds/2096/steps/runtests/logs/stdio
That test used the `linuxcnc` launcher script to start LinuxCNC, but
LinuxCNC failed to start up - the `iocontrol` component was never found.
Each one of those "." characters after "Waiting for component
'iocontrol' to become ready" is 10 ms, and there are over 100,000 of
them. Eventually someone emailed me and let me know that the buildbot
was hung again, and I intervened manually and killed the test. This
kind of thing has happened many times recently.
The situation on the buildbot worker VM with the hung test is always the
same: the `linuxcnc` launcher has started uspace realtime, the very
first `halcmd` has *tried* to start `io`, `io` is up and running and
seems fine, but `halcmd` doesn't notice that `io` is running. `io` is
in its "I'm up and running and waiting for NML commands and HAL pin
changes" state. `halcmd` is in its "I've spawned a user comp, I see that
it hasn't died, and I'm waiting for it to register itself in HAL's list
of components" state.
Curious!
After digging into this, I've tracked the problem down to a HAL startup
race condition that has been with us since the very first commit that
added HAL. The race usually doesn't cause problems, but I recently
redid the load balancing of VMs on the buildbot cluster (thanks Sam!),
and timing behavior in the VMs was perturbed. I guess the new
configuration just happens to trigger these races more often.
Here's what's going on...
There are two ways to start realtime stuff in the LinuxCNC world: the
`linuxcnc` launcher program and `halrun`. They both do the same thing:
1. Run the `realtime` script, which puts in place the infrastructure
needed to run realtime stuff. On RTAI this consists of loading the
RTAI kernel modules, and on uspace this is a no-op since everything
needed is already part of the Linux Preempt-RT kernel.
2. Run `halcmd` to load and configure the different realtime and
non-realtime (aka "userspace") components, etc. That first `halcmd`
does the initial setup of the HAL infrastructure, which is then used
by all later components (sort of, details below).
The first halcmd begins by running `hal_init()`[0]. `hal_init()`
connects to HAL (ie, to the `hal_data` shared memory region), creating
it if it doesn't exist (as it doesn't, this first time through).
Creating `hal_data` consists of allocating a shared memory region using
`rtapi_shmem_new()`[1] and initializing the HAL data structure in it
using `init_hal_data()`[2]. Then we're done with `hal_init()`.
[0]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/hal_lib.c#L168
[1]:
https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/rtapi/uspace_common.h#L55
[2]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/hal_lib.c#L2782
Next `halcmd` executes `loadusr iocontrol` to start `io`. It runs
`do_loadusr_cmd()`, which calls `hal_systemv_nowait()`[3]. This is
where things get funky. It first calls `hal_exit()`[4], which (since
this is the only user of HAL at the moment) *deallocates* the `hal_data`
shared memory region that it just allocated and initialized. At this
point there is now *no* HAL. halcmd then forks, and the child execs
`io`. The parent calls `hal_init()` again, to reconnect/reinitialize
HAL. But *while halcmd is doing that*, `io` is starting up and running
`hal_init()` at the same time, and this is the race.
[3]:
https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/utils/halcmd.c#L189
[4]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/hal_lib.c#L286
It's unclear to me why `halcmd` runs `hal_exit()` in
`hal_systemv_nowait()`. It seems like needless complexity - maybe
someone knows and can explain this?
But regardless, there's a real race in `hal_init()` that needs to be
addressed.
The race happens as follows:
1. Both processes (`halcmd` and `io` in this case) run `hal_init()`[5],
and both run `rtapi_shmem_new()`[6] with key=HAL_KEY. Neither process
finds HAL_KEY in its list of shared memory regions, so both call
`shmget()` and get pointers to the
same shared memory region. Since this is a new region (this key had
zero references before we got here), `shmget()` guarantees that the
region is initialized to all-bytes-zero[7].
2. Both processes continue in `hal_init()` and call
`init_hal_data()`[8], and this is where things fall apart.
3. They both do an unlocked(!) check of `hal_data->version`, they both
see it's zero which means they need to initialize it.
4. They both trylock(!) the uninitialized(!) `hal_data->mutex`, and they
don't check the return value. This means they both think they hold the
lock, and both think they are the one that should initialize `hal_data`.
5. The first process to run (`io` in the buildbot case above)
initializes the `hal_data` structure, then unlocks, `init_hal_data()`
returns to `hal_init()`, which adds the comp to the `hal_data` comp list.
6. The second process (`halcmd` in our case) then re-initializes the
`hal_data`, which throws away the list of comps that had `io` in it, and
now things are irreparably broken - `halcmd` will never find `io`,
because `io` is not in the list of hal comps any more.
[5]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/hal_lib.c#L168
[6]:
https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/rtapi/uspace_common.h#L55
[7]: https://man7.org/linux/man-pages/man2/shmget.2.html#DESCRIPTION
[8]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/hal_lib.c#L2782
Why are things this way? The obvious solution *for uspace* is to change
`init_hal_data()` to take `hal_data->mutex`, using `rtapi_mutex_get()`
not `rtapi_mutex_try()`, before checking `hal_data->version`, like
this[9]. Since `shmget()` guarantees that a newly minted shared memory
block is initialized to all bytes zero, and since all-bytes-zero is a
valid, unlocked rtapi mutex, everything is good.
[9]:
https://github.com/LinuxCNC/linuxcnc/commit/2f6c01fa09850b693e16e0dd50b6933b1a796fdb
I think the weird trylock-based startup possibly stems from a
shortcoming in RTAI, whose documentation[10][11][12] does *not* promise
to zero-fill shared memory the first time it's allocated. It seems we
thus can't rely on `hal_data->mutex` being a valid unlocked mutex the
first time, and thus: trylock, and the disaster that unfolds.
[10]:
https://www.rtai.org/userfiles/documentation/magma/html/api/group__shm.html#ga63
[11]:
https://www.rtai.org/userfiles/documentation/magma/html/api/group__shm.html#ga68
[12]:
https://www.rtai.org/userfiles/documentation/magma/html/api/group__shm.html#ga24
But in my experimentation here locally, it appears that RTAI *does* zero
shared memory the first time you allocate it. It makes sense: if it
didn't then userspace could look at old kernel memory, which is possibly
a security problem... Can any RTAI experts hit me with the clue bat here?
(There's a whole other weirdness in the `rt_shmem_new()` implementation
in RTAI RTAPI, but I haven't looked at that in any detail.)
I have tested the proposed fix above on uspace (on Buster) and RTAI 3.9
(on Wheezy) and it's been solid here. I'd really like feedback before
merging this into 2.8 & master...
--
Sebastian Kuzminsky
_______________________________________________
Emc-developers mailing list
Emc-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/emc-developers