Lately I've been seeing problems on the buildbot with tests hanging at startup, for example this (search for the "startup-state" test):

http://buildbot.linuxcnc.org/buildbot-admin/builders/1640.rip-buster-rtpreempt-amd64/builds/2096/steps/runtests/logs/stdio

That test used the `linuxcnc` launcher script to start LinuxCNC, but LinuxCNC failed to start up - the `iocontrol` component was never found. Each of those "." characters after "Waiting for component 'iocontrol' to become ready" represents 10 ms of waiting, and there are over 100,000 of them. Eventually someone emailed me to let me know that the buildbot was hung again, and I intervened manually and killed the test. This kind of thing has happened many times recently.

The situation on the buildbot worker VM with the hung test is always the same: the `linuxcnc` launcher has started uspace realtime, the very first `halcmd` has *tried* to start `io`, `io` is up and running and seems fine, but `halcmd` doesn't notice that `io` is running. `io` is in its "I'm up and running and waiting for NML commands and HAL pin changes" state. `halcmd` is in its "I've spawned a user comp, I see that it hasn't died, and I'm waiting for it to register itself in HAL's list of components" state.

Curious!

After digging into this, I've tracked the problem down to a HAL startup race condition that has been with us since the very first commit that added HAL. The race usually doesn't cause problems, but I recently redid the load balancing of VMs on the buildbot cluster (thanks Sam!), which perturbed the timing behavior in the VMs. I guess the new configuration just happens to trigger the race more often.

Here's what's going on...

There are two ways to start realtime stuff in the LinuxCNC world: the `linuxcnc` launcher program and `halrun`. They both do the same thing:

1.  Run the `realtime` script, which puts in place the infrastructure
    needed to run realtime stuff.  On RTAI this consists of loading the
    RTAI kernel modules, and on uspace this is a no-op since everything
    needed is already part of the Linux Preempt-RT kernel.

2.  Run `halcmd` to load and configure the different realtime and
    non-realtime (aka "userspace") components, etc.  That first `halcmd`
    does the initial setup of the HAL infrastructure, which is then used
    by all later components (sort of, details below).

The first halcmd begins by running `hal_init()`[0]. `hal_init()` connects to HAL (ie, to the `hal_data` shared memory region), creating it if it doesn't exist (as it doesn't, this first time through). Creating `hal_data` consists of allocating a shared memory region using `rtapi_shmem_new()`[1] and initializing the HAL data structure in it using `init_hal_data()`[2]. Then we're done with `hal_init()`.

[0]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/hal_lib.c#L168

[1]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/rtapi/uspace_common.h#L55

[2]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/hal_lib.c#L2782
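
To make that concrete, here's a minimal stand-alone illustration of the create-or-attach pattern that `rtapi_shmem_new()` boils down to on uspace. This is plain SysV shm, not the real rtapi code, and the key and size values are placeholders I made up for the example:

```
/*
 * Minimal stand-alone illustration of the create-or-attach pattern that
 * rtapi_shmem_new() relies on.  This is plain SysV shm, not the real rtapi
 * code; the key and size values are placeholders.
 */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define DEMO_KEY  0x48414c00   /* placeholder, not the real HAL_KEY */
#define DEMO_SIZE 4096

int main(void) {
    /* IPC_CREAT without IPC_EXCL: create the segment if it doesn't exist,
     * otherwise attach to the existing one.  shmget(2) guarantees that a
     * newly created segment is initialized to all-bytes-zero. */
    int shm_id = shmget(DEMO_KEY, DEMO_SIZE, IPC_CREAT | 0600);
    if (shm_id < 0) { perror("shmget"); return 1; }

    unsigned char *region = shmat(shm_id, NULL, 0);
    if (region == (void *)-1) { perror("shmat"); return 1; }

    /* On the very first attach this prints 0 -- the "version is zero, so I
     * should initialize hal_data" logic in init_hal_data() leans on this. */
    printf("first byte of the shared region: %d\n", region[0]);

    shmdt(region);
    shmctl(shm_id, IPC_RMID, NULL);   /* clean up the demo segment */
    return 0;
}
```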

Next `halcmd` executes `loadusr iocontrol` to start `io`. It runs `do_loadusr_cmd()`, which calls `hal_systemv_nowait()`[3]. This is where things get funky. It first calls `hal_exit()`[4], which (since this is the only user of HAL at the moment) *deallocates* the `hal_data` shared memory region that it just allocated and initialized. At this point there is now *no* HAL. halcmd then forks, and the child execs `io`. The parent calls `hal_init()` again, to reconnect/reinitialize HAL. But *while halcmd is doing that*, `io` is starting up and running `hal_init()` at the same time, and this is the race.

[3]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/utils/halcmd.c#L189

[4]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/hal_lib.c#L286
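
Paraphrased, the control flow looks roughly like this. It's a sketch of the sequence described above, not the actual halcmd.c; the stub functions and `spawn_usercomp()` are names I made up for the illustration:

```
/*
 * Sketch of the hal_systemv_nowait() control flow described above.  This is
 * a paraphrase for illustration, not the actual halcmd.c source; the stub
 * functions and spawn_usercomp() are invented names.
 */
#include <sys/types.h>
#include <unistd.h>

static void hal_exit_stub(void) { /* last HAL user detaches; hal_data is deallocated */ }
static void hal_init_stub(void) { /* re-create/attach hal_data and initialize it     */ }

static pid_t spawn_usercomp(char *const argv[]) {
    hal_exit_stub();             /* halcmd drops out of HAL; there is now no HAL */

    pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[0], argv);   /* child becomes io, which will call hal_init()
                                    on its own schedule */
        _exit(1);
    }

    hal_init_stub();             /* parent re-initializes HAL *while* the child
                                    may be doing the same -- the race window */
    return pid;
}

int main(void) {
    char *argv[] = { "iocontrol", NULL };   /* placeholder command line */
    spawn_usercomp(argv);
    return 0;
}
```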

It's unclear to me why `halcmd` runs `hal_exit()` in `hal_systemv_nowait()`. It seems like needless complexity - maybe someone knows and can explain this?

But regardless, there's a real race in `hal_init()` that needs to be addressed.

The race happens as follows:

1. Both processes (`halcmd` and `io` in this case) run `hal_init()`[5], and both run `rtapi_shmem_new()`[6] with key=HAL_KEY. Neither process finds HAL_KEY in its list of shared memory regions, so they both call `shmget()` and end up attached to the same shared memory region. Since this is a new region (this key had zero references before we got here), `shmget()` guarantees that the region is initialized to all-bytes-zero[7].

2. Both processes continue in `hal_init()` and call `init_hal_data()`[8], and this is where things fall apart.

3. They both do an unlocked(!) check of `hal_data->version`; both see that it's zero, which means each decides it needs to initialize `hal_data`.

4. They both trylock(!) the uninitialized(!) `hal_data->mutex`, and they don't check the return value. This means they both think they hold the lock, and both think they are the one that should initialize `hal_data`.

5. The first process to run (`io` in the buildbot case above) initializes the `hal_data` structure, then unlocks the mutex; its `init_hal_data()` returns to `hal_init()`, which adds the comp to the `hal_data` comp list.

6. The second process (`halcmd` in our case) then re-initializes `hal_data`, which throws away the comp list that had `io` in it, and now things are irreparably broken - `halcmd` will never find `io`, because `io` is not in the list of HAL comps any more.

[5]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/hal_lib.c#L168

[6]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/rtapi/uspace_common.h#L55

[7]: https://man7.org/linux/man-pages/man2/shmget.2.html#DESCRIPTION

[8]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/hal_lib.c#L2782
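
To make those six steps concrete, here's a stand-alone replay that interleaves the two racers' actions sequentially in one process. The struct layout, field names, and helpers are invented for the example - it's the failure mode in miniature, not the actual hal_lib.c code:

```
/*
 * Stand-alone replay of steps 1-6, interleaving the two racers' actions
 * sequentially in one process.  This shows the failure mode in miniature;
 * it is not the actual hal_lib.c code, and the struct layout, field names,
 * and helpers are invented for the example.
 */
#include <stdio.h>
#include <string.h>

typedef struct {
    int  version;          /* 0 in a freshly created shmget() segment */
    long mutex;            /* all-bytes-zero reads as "unlocked"      */
    char comps[4][16];     /* stand-in for the real comp list         */
    int  ncomps;
} demo_hal_data_t;

/* Test-and-set style trylock: returns 1 if we took the lock, 0 if it was
 * already held.  The real code ignores this return value -- that's the bug. */
static int demo_trylock(long *m) { int got = (*m == 0); *m = 1; return got; }

static void demo_init_hal_data(demo_hal_data_t *hd) {
    memset(hd->comps, 0, sizeof hd->comps);   /* wipes any registered comps */
    hd->ncomps  = 0;
    hd->version = 1;
    hd->mutex   = 0;                          /* unlock */
}

int main(void) {
    demo_hal_data_t hd = {0};      /* like a brand-new zeroed shm segment */

    /* Step 3: both racers do the unlocked check of version before either
     * has finished initializing, so both decide they must initialize: */
    int io_inits     = (hd.version == 0);
    int halcmd_inits = (hd.version == 0);

    /* Step 4: both trylock the zeroed mutex and ignore the result: */
    (void)demo_trylock(&hd.mutex);   /* io: succeeds                */
    (void)demo_trylock(&hd.mutex);   /* halcmd: fails, but ignored  */

    /* Step 5: io initializes hal_data and registers itself: */
    if (io_inits) demo_init_hal_data(&hd);
    strcpy(hd.comps[hd.ncomps++], "iocontrol");

    /* Step 6: halcmd re-initializes, throwing io's registration away: */
    if (halcmd_inits) demo_init_hal_data(&hd);

    printf("comps registered: %d\n", hd.ncomps);   /* prints 0: io is gone */
    return 0;
}
```

Running that prints "comps registered: 0" - `io`'s registration is gone, which is exactly the state the hung buildbot tests are stuck in.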

The obvious solution *for uspace* is to change `init_hal_data()` to take `hal_data->mutex`, using `rtapi_mutex_get()` rather than `rtapi_mutex_try()`, before checking `hal_data->version`, like this[9]. Since `shmget()` guarantees that a newly minted shared memory block is initialized to all-bytes-zero, and since all-bytes-zero is a valid, unlocked rtapi mutex, everything is good.

[9]: https://github.com/LinuxCNC/linuxcnc/commit/2f6c01fa09850b693e16e0dd50b6933b1a796fdb
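
For reference, here's the shape of that fix, paraphrased using the same invented stand-ins as the replay above (the real change is in the commit at [9]):

```
/*
 * Shape of the fix, paraphrased -- see the commit in [9] for the real
 * change.  This builds on the invented stand-ins from the replay above,
 * with a blocking lock added; it is not the actual hal_lib.c code.
 */
static void demo_lock(long *m)   { while (!demo_trylock(m)) { /* spin */ } }
static void demo_unlock(long *m) { *m = 0; }

static void demo_init_hal_data_fixed(demo_hal_data_t *hd) {
    /* Take the mutex *before* looking at version.  A brand-new shmget()
     * segment is all-bytes-zero, and all-bytes-zero is a valid unlocked
     * mutex, so blocking here is safe on the very first call. */
    demo_lock(&hd->mutex);

    if (hd->version != 0) {          /* somebody else already initialized */
        demo_unlock(&hd->mutex);
        return;
    }

    /* We hold the lock and version is still 0: we really are the first. */
    memset(hd->comps, 0, sizeof hd->comps);
    hd->ncomps  = 0;
    hd->version = 1;

    demo_unlock(&hd->mutex);
}
```

With that, whichever of `io` and `halcmd` loses the race just sees `version != 0` while holding the lock, and leaves the existing comp list alone.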

So why are things this way in the first place? I suspect the trylock stems from a shortcoming in RTAI, whose documentation[10][11][12] does *not* promise to zero the shared memory the first time it's allocated. If that's true, we can't rely on `hal_data->mutex` being a valid unlocked mutex the first time through on RTAI, hence the trylock, and the disaster that unfolds.

[10]: https://www.rtai.org/userfiles/documentation/magma/html/api/group__shm.html#ga63

[11]: https://www.rtai.org/userfiles/documentation/magma/html/api/group__shm.html#ga68

[12]: https://www.rtai.org/userfiles/documentation/magma/html/api/group__shm.html#ga24

But in my experimentation here locally, it appears that RTAI *does* zero shared memory the first time you allocate it. It makes sense: if it didn't then userspace could look at old kernel memory, which is possibly a security problem... Can any RTAI experts hit me with the clue bat here?

(There's a whole other weirdness in the `rt_shmem_new()` implementation in RTAI RTAPI, but I haven't looked at that in any detail.)

I have tested the proposed fix above on uspace (on Buster) and RTAI 3.9 (on Wheezy) and it's been solid here. I'd really like feedback before merging this into 2.8 & master...


--
Sebastian Kuzminsky

