On Tue, Apr 05, 2022 at 02:10:36PM -0000, Michael van Elst wrote:
> [email protected] (Thomas Klausner) writes:
>
> >I never saw the cmake hang myself. I still see hangs in guile.
>
> I see both in almost every pbulk run.

please try this patch for the cmake variation of this hang:

http://www.netbsd.org/~chs/diff.pthread-park-stuck.1

this fixes the problem as seen with taylor's strdup / jemalloc reproducer, and paulg reports it fixes the hang in building guile too.

what's going on here is that the first time that libpthread calls _lwp_park when it wants to sleep to wait for a mutex, instead of calling the libc function directly it first has to call into rtld to resolve the symbol. the rtld code will call _rtld_shared_enter(), which might also need to sleep using _lwp_park to wait for the rtld internal lock. the rtld internal usage of _lwp_park can accidentally consume an unpark from another thread that was intended for the libpthread code, and if that happens, then when rtld is done resolving the symbol and libpthread actually calls the real _lwp_park function, the unpark has been lost and the libpthread call to _lwp_park will sleep forever.

the above patch simply resolves the symbol for the libpthread call to _lwp_park while the process is still single-threaded, by calling the _lwp_park to both unpark and park itself, which just returns immediately. after that, the libpthread calls to _lwp_park will no longer call into rtld, so attempts to unpark libpthread can no longer be lost.

when I wrote that patch I thought it would be a complete fix, but upon reading the previous email threads about this problem I saw a mention of signals, and signal handlers can call a wide variety of functions, so this patch turns out to only fix what is probably the most common way that this problem manifests.
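
To make the lost-unpark sequence concrete, here is a small single-threaded toy model in C. It is only a sketch under stated assumptions, not NetBSD code: every name in it (toy_lwp, toy_park, toy_unpark, toy_resolve, toy_pthread_park, toy_scenario) is hypothetical, the kernel is reduced to one boolean "already unparked" flag per thread, and the real _lwp_park does the unpark and the park in a single call where this model uses two steps.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model only: one "already unparked" flag per thread, as described
 * in the mail. None of these names exist in NetBSD. */
struct toy_lwp {
    bool unpark_pending;   /* the single per-thread flag */
    bool symbol_resolved;  /* has the lazy binding for park been done? */
};

/* Model of park: returns true if a pending unpark let it return at once,
 * false if the thread would go to sleep. */
static bool toy_park(struct toy_lwp *t)
{
    if (t->unpark_pending) {
        t->unpark_pending = false;  /* the flag is consumed */
        return true;
    }
    return false;
}

/* Model of unpark: just sets the flag on the target thread. */
static void toy_unpark(struct toy_lwp *t)
{
    t->unpark_pending = true;
}

/* Model of rtld binding the symbol. If the rtld lock is contended, rtld
 * parks too, and that nested park eats whatever unpark was pending. */
static void toy_resolve(struct toy_lwp *t, bool rtld_lock_contended)
{
    if (rtld_lock_contended)
        (void)toy_park(t);
    t->symbol_resolved = true;
}

/* Model of libpthread sleeping on a mutex: the first call detours
 * through the resolver before reaching the real park. */
static bool toy_pthread_park(struct toy_lwp *t, bool rtld_lock_contended)
{
    if (!t->symbol_resolved)
        toy_resolve(t, rtld_lock_contended);
    return toy_park(t);
}

/* Returns true if the waiter wakes up, false if it would sleep forever.
 * With preresolve set, the symbol is bound during single-threaded
 * startup by unparking and then parking ourselves. */
bool toy_scenario(bool preresolve)
{
    struct toy_lwp waiter = { false, false };

    if (preresolve) {
        toy_unpark(&waiter);
        (void)toy_pthread_park(&waiter, false);  /* returns immediately */
    }

    /* The mutex owner unparks the waiter before the waiter has parked. */
    toy_unpark(&waiter);
    return toy_pthread_park(&waiter, true);
}

#ifdef TOY_DEMO_MAIN
int main(void)
{
    printf("lazy binding: %s\n", toy_scenario(false) ? "woke" : "stuck");
    printf("pre-resolved: %s\n", toy_scenario(true) ? "woke" : "stuck");
    return 0;
}
#endif
```

Building with -DTOY_DEMO_MAIN gives a standalone program. The model covers only the first-call path through the resolver; it does not model the signal handler case mentioned above.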

the nature of lwp_park/unpark (with just a single "already unparked" flag per thread in the kernel) is such that they cannot safely be used in a nested fashion like they are in libpthread and rtld, so we need to change one or both of these callers to use some other primitive to implement sleeping to wait for a lock, such as futexes, which do not have this kind of per-thread flag that prevents safe nested usage.

-Chuck
