Lately I've been seeing problems on the buildbot with tests hanging at startup, for example this (search for the "startup-state" test):

http://buildbot.linuxcnc.org/buildbot-admin/builders/1640.rip-buster-rtpreempt-amd64/builds/2096/steps/runtests/logs/stdio

That test used the `linuxcnc` launcher script to start LinuxCNC, but LinuxCNC failed to start up - the `iocontrol` component was never found. Each of those "." characters after "Waiting for component 'iocontrol' to become ready" represents 10 ms of waiting, and there are over 100,000 of them. Eventually someone emailed me to let me know that the buildbot was hung again, and I intervened manually and killed the test. This kind of thing has happened many times recently.

The situation on the buildbot worker VM with the hung test is always the same: the `linuxcnc` launcher has started uspace realtime, the very first `halcmd` has *tried* to start `io`, `io` is up and running and seems fine, but `halcmd` doesn't notice that `io` is running. `io` is in its "I'm up and running and waiting for NML commands and HAL pin changes" state. `halcmd` is in its "I've spawned a user comp, I see that it hasn't died, and I'm waiting for it to register itself in HAL's list of components" state.

Curious!

After digging into this, I've tracked the problem down to a HAL startup race condition that has been with us since the very first commit that added HAL. The race usually doesn't cause problems, but I recently redid the load balancing of VMs on the buildbot cluster (thanks Sam!), which perturbed the timing behavior in the VMs. I guess the new configuration just happens to trigger the race more often.

Here's what's going on...

There are two ways to start realtime stuff in the LinuxCNC world: the `linuxcnc` launcher program and `halrun`. They both do the same thing:

1.  Run the `realtime` script, which puts in place the infrastructure
    needed to run realtime stuff.  On RTAI this consists of loading the
    RTAI kernel modules, and on uspace this is a no-op since everything
    needed is already part of the Linux Preempt-RT kernel.

2.  Run `halcmd` to load and configure the different realtime and
    non-realtime (aka "userspace") components, etc.  That first `halcmd`
    does the initial setup of the HAL infrastructure, which is then used
    by all later components (sort of, details below).

The first halcmd begins by running `hal_init()`[0]. `hal_init()` connects to HAL (ie, to the `hal_data` shared memory region), creating it if it doesn't exist (as it doesn't, this first time through). Creating `hal_data` consists of allocating a shared memory region using `rtapi_shmem_new()`[1] and initializing the HAL data structure in it using `init_hal_data()`[2]. Then we're done with `hal_init()`.

[0]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/hal_lib.c#L168

[1]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/rtapi/uspace_common.h#L55

[2]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/hal_lib.c#L2782
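
To make that concrete, here's a minimal stand-alone illustration of the create-or-attach pattern that `rtapi_shmem_new()` boils down to on uspace. This is plain SysV shm, not the real rtapi code, and the key and size values are placeholders I made up for the example:

```
/*
 * Minimal stand-alone illustration of the create-or-attach pattern that
 * rtapi_shmem_new() relies on.  This is plain SysV shm, not the real rtapi
 * code; the key and size values are placeholders.
 */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define DEMO_KEY  0x48414c00   /* placeholder, not the real HAL_KEY */
#define DEMO_SIZE 4096

int main(void) {
    /* IPC_CREAT without IPC_EXCL: create the segment if it doesn't exist,
     * otherwise attach to the existing one.  shmget(2) guarantees that a
     * newly created segment is initialized to all-bytes-zero. */
    int shm_id = shmget(DEMO_KEY, DEMO_SIZE, IPC_CREAT | 0600);
    if (shm_id < 0) { perror("shmget"); return 1; }

    unsigned char *region = shmat(shm_id, NULL, 0);
    if (region == (void *)-1) { perror("shmat"); return 1; }

    /* On the very first attach this prints 0 -- the "version is zero, so I
     * should initialize hal_data" logic in init_hal_data() leans on this. */
    printf("first byte of the shared region: %d\n", region[0]);

    shmdt(region);
    shmctl(shm_id, IPC_RMID, NULL);   /* clean up the demo segment */
    return 0;
}
```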

Next `halcmd` executes `loadusr iocontrol` to start `io`. It runs `do_loadusr_cmd()`, which calls `hal_systemv_nowait()`[3]. This is where things get funky. It first calls `hal_exit()`[4], which (since this is the only user of HAL at the moment) *deallocates* the `hal_data` shared memory region that it just allocated and initialized. At this point there is now *no* HAL. halcmd then forks, and the child execs `io`. The parent calls `hal_init()` again, to reconnect/reinitialize HAL. But *while halcmd is doing that*, `io` is starting up and running `hal_init()` at the same time, and this is the race.

[3]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/utils/halcmd.c#L189

[4]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/hal_lib.c#L286
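
Paraphrased, the control flow looks roughly like this. It's a sketch of the sequence described above, not the actual halcmd.c; the stub functions and `spawn_usercomp()` are names I made up for the illustration:

```
/*
 * Sketch of the hal_systemv_nowait() control flow described above.  This is
 * a paraphrase for illustration, not the actual halcmd.c source; the stub
 * functions and spawn_usercomp() are invented names.
 */
#include <sys/types.h>
#include <unistd.h>

static void hal_exit_stub(void) { /* last HAL user detaches; hal_data is deallocated */ }
static void hal_init_stub(void) { /* re-create/attach hal_data and initialize it     */ }

static pid_t spawn_usercomp(char *const argv[]) {
    hal_exit_stub();             /* halcmd drops out of HAL; there is now no HAL */

    pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[0], argv);   /* child becomes io, which will call hal_init()
                                    on its own schedule */
        _exit(1);
    }

    hal_init_stub();             /* parent re-initializes HAL *while* the child
                                    may be doing the same -- the race window */
    return pid;
}

int main(void) {
    char *argv[] = { "iocontrol", NULL };   /* placeholder command line */
    spawn_usercomp(argv);
    return 0;
}
```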

It's unclear to me why `halcmd` runs `hal_exit()` in `hal_systemv_nowait()`. It seems like needless complexity - maybe someone knows and can explain this?

But regardless, there's a real race in `hal_init()` that needs to be addressed.

The race happens as follows:

1. Both processes (`halcmd` and `io` in this case) run `hal_init()`[5], and both run `rtapi_shmem_new()`[6] with key=HAL_KEY. Neither process finds HAL_KEY in its list of shared memory regions, so they both call `shmget()` and end up attached to the same shared memory region. Since this is a new region (this key had zero references before we got here), `shmget()` guarantees that the region is initialized to all-bytes-zero[7].

2. Both processes continue in `hal_init()` and call `init_hal_data()`[8], and this is where things fall apart.

3. They both do an unlocked(!) check of `hal_data->version`; both see that it's zero, which means each decides it needs to initialize `hal_data`.

4. They both trylock(!) the uninitialized(!) `hal_data->mutex`, and they don't check the return value. This means they both think they hold the lock, and both think they are the one that should initialize `hal_data`.

5. The first process to run (`io` in the buildbot case above) initializes the `hal_data` structure, then unlocks the mutex; its `init_hal_data()` returns to `hal_init()`, which adds the comp to the `hal_data` comp list.

6. The second process (`halcmd` in our case) then re-initializes `hal_data`, which throws away the comp list that had `io` in it, and now things are irreparably broken - `halcmd` will never find `io`, because `io` is not in the list of HAL comps any more.

[5]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/hal_lib.c#L168

[6]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/rtapi/uspace_common.h#L55

[7]: https://man7.org/linux/man-pages/man2/shmget.2.html#DESCRIPTION

[8]: https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/hal/hal_lib.c#L2782
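
To make those six steps concrete, here's a stand-alone replay that interleaves the two racers' actions sequentially in one process. The struct layout, field names, and helpers are invented for the example - it's the failure mode in miniature, not the actual hal_lib.c code:

```
/*
 * Stand-alone replay of steps 1-6, interleaving the two racers' actions
 * sequentially in one process.  This shows the failure mode in miniature;
 * it is not the actual hal_lib.c code, and the struct layout, field names,
 * and helpers are invented for the example.
 */
#include <stdio.h>
#include <string.h>

typedef struct {
    int  version;          /* 0 in a freshly created shmget() segment */
    long mutex;            /* all-bytes-zero reads as "unlocked"      */
    char comps[4][16];     /* stand-in for the real comp list         */
    int  ncomps;
} demo_hal_data_t;

/* Test-and-set style trylock: returns 1 if we took the lock, 0 if it was
 * already held.  The real code ignores this return value -- that's the bug. */
static int demo_trylock(long *m) { int got = (*m == 0); *m = 1; return got; }

static void demo_init_hal_data(demo_hal_data_t *hd) {
    memset(hd->comps, 0, sizeof hd->comps);   /* wipes any registered comps */
    hd->ncomps  = 0;
    hd->version = 1;
    hd->mutex   = 0;                          /* unlock */
}

int main(void) {
    demo_hal_data_t hd = {0};      /* like a brand-new zeroed shm segment */

    /* Step 3: both racers do the unlocked check of version before either
     * has finished initializing, so both decide they must initialize: */
    int io_inits     = (hd.version == 0);
    int halcmd_inits = (hd.version == 0);

    /* Step 4: both trylock the zeroed mutex and ignore the result: */
    (void)demo_trylock(&hd.mutex);   /* io: succeeds                */
    (void)demo_trylock(&hd.mutex);   /* halcmd: fails, but ignored  */

    /* Step 5: io initializes hal_data and registers itself: */
    if (io_inits) demo_init_hal_data(&hd);
    strcpy(hd.comps[hd.ncomps++], "iocontrol");

    /* Step 6: halcmd re-initializes, throwing io's registration away: */
    if (halcmd_inits) demo_init_hal_data(&hd);

    printf("comps registered: %d\n", hd.ncomps);   /* prints 0: io is gone */
    return 0;
}
```

Running that prints "comps registered: 0" - `io`'s registration is gone, which is exactly the state the hung buildbot tests are stuck in.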

The obvious solution *for uspace* is to change `init_hal_data()` to take `hal_data->mutex`, using `rtapi_mutex_get()` rather than `rtapi_mutex_try()`, before checking `hal_data->version`, like this[9]. Since `shmget()` guarantees that a newly minted shared memory block is initialized to all-bytes-zero, and since all-bytes-zero is a valid, unlocked rtapi mutex, everything is good.

[9]: https://github.com/LinuxCNC/linuxcnc/commit/2f6c01fa09850b693e16e0dd50b6933b1a796fdb
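
For reference, here's the shape of that fix, paraphrased using the same invented stand-ins as the replay above (the real change is in the commit at [9]):

```
/*
 * Shape of the fix, paraphrased -- see the commit in [9] for the real
 * change.  This builds on the invented stand-ins from the replay above,
 * with a blocking lock added; it is not the actual hal_lib.c code.
 */
static void demo_lock(long *m)   { while (!demo_trylock(m)) { /* spin */ } }
static void demo_unlock(long *m) { *m = 0; }

static void demo_init_hal_data_fixed(demo_hal_data_t *hd) {
    /* Take the mutex *before* looking at version.  A brand-new shmget()
     * segment is all-bytes-zero, and all-bytes-zero is a valid unlocked
     * mutex, so blocking here is safe on the very first call. */
    demo_lock(&hd->mutex);

    if (hd->version != 0) {          /* somebody else already initialized */
        demo_unlock(&hd->mutex);
        return;
    }

    /* We hold the lock and version is still 0: we really are the first. */
    memset(hd->comps, 0, sizeof hd->comps);
    hd->ncomps  = 0;
    hd->version = 1;

    demo_unlock(&hd->mutex);
}
```

With that, whichever of `io` and `halcmd` loses the race just sees `version != 0` while holding the lock, and leaves the existing comp list alone.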

So why are things this way in the first place? I suspect the trylock stems from a shortcoming in RTAI, whose documentation[10][11][12] does *not* promise to zero the shared memory the first time it's allocated. If that's true, we can't rely on `hal_data->mutex` being a valid unlocked mutex the first time through on RTAI, hence the trylock, and the disaster that unfolds.

[10]: https://www.rtai.org/userfiles/documentation/magma/html/api/group__shm.html#ga63

[11]: https://www.rtai.org/userfiles/documentation/magma/html/api/group__shm.html#ga68

[12]: https://www.rtai.org/userfiles/documentation/magma/html/api/group__shm.html#ga24

But in my experimentation here locally, it appears that RTAI *does* zero shared memory the first time you allocate it. It makes sense: if it didn't then userspace could look at old kernel memory, which is possibly a security problem... Can any RTAI experts hit me with the clue bat here?

(There's a whole other weirdness in the `rt_shmem_new()` implementation in RTAI RTAPI, but I haven't looked at that in any detail.)

I have tested the proposed fix above on uspace (on Buster) and RTAI 3.9 (on Wheezy) and it's been solid here. I'd really like feedback before merging this into 2.8 & master...


--
Sebastian Kuzminsky

