Excerpts from Iain Buclaw's message of September 5, 2025 6:45 pm:
> Excerpts from Iain Sandoe's message of September 3, 2025 9:59 pm:
>>
>>
>>> On 3 Sep 2025, at 20:54, Iain Buclaw <ibuc...@gdcproject.org> wrote:
>>>
>>> Excerpts from Iain Buclaw's message of September 3, 2025 9:19 pm:
>>>> Excerpts from Rainer Orth's message of September 3, 2025 10:20 am:
>>>>>>>
>>>>>>> I regularly (but not always) see timeouts on Solaris, both on sparc and
>>>>>>> x86:
>>>>>>>
>>>>>>> WARNING: libphobos.gc/forkgc2.d execution test program timed out.
>>>>>>> FAIL: libphobos.gc/forkgc2.d execution test
>>>>>>> WARNING: libphobos.gc/startbackgc.d execution test program timed out.
>>>>>>> FAIL: libphobos.gc/startbackgc.d execution test
>>>>>
>>>>> I haven't tried investigating what's wrong on Solaris with those two,
>>>>> but they sure are annoying, especially since they are so unreliable:
>>>>> sometimes both PASS, sometimes one or the other, sometimes both.
>>>>>
>>>>> I'd thought about skipping them on Solaris, too, just to avoid the noise
>>>>> and the timeouts, but haven't gotten around to that.
>>>>>
>>>>> However, fixing this at the root would certainly be best.
>>>>>
>>>>
>>>> I currently have a gdb session on cfarm, process has hung for forkgc2,
>>>> and just looking at the backtrace.
>>>>
>>>> * There are 11 threads in total (main + 10 new'd Threads)
>>>> * All threads are suspended (in sigsuspend) except for two
>>>> * The first of those threads is the one that's requested all threads to
>>>>   suspend using pthread_kill(SIGRTMIN), and is stuck inside a sem_wait
>>>>   for one more call to sem_post().
>>>> * The second is stuck in a SpinLock.lock loop, called from
>>>>   _prefork_handler() inside forkx() inside fork() - my guess would be
>>>>   the handler being called is _d_gcx_atfork_prepare().
>>>> * Specific to Solaris, I've clocked this line in the forkx
>>>>   implementation:
>>>>
>>>> https://github.com/illumos/illumos-gate/blob/a21856a054bd854f39d1d55a6b0d547cb0d2039f/usr/src/lib/libc/port/threads/scalls.c#L177
>>>>
>>>> I think what's going on is that the thread that wants to do a GC
>>>> collection has issued a signal to all threads, but Solaris has called
>>>> sigoff() in the last thread being fork'd, so the signal never reaches.
>>>>
>>>> This behaviour does not change when COLLECT_FORK is disabled, so Solaris
>>>> would still be affected.
>>>>
>>>
>>> I forgot to mention, thread #1 that wants to do a GC has control of the
>>> SpinLock. So that's why thread #2 is stuck in its current loop.
>>>
>>> The order of operations that lead to Solaris hanging at runtime are:
>>> 1. Thread #1 calls GC.lockNR() and has hold of the global GC SpinLock.
>>> 2. Thread #2 calls fork(). It too calls GC.lockNR() in
>>>    _d_gcx_atfork_prepare() and is waiting for the global lock.
>>> 3. Thread #1 decides to call thread_suspendAll() and will never release
>>>    the GC lock until all threads are suspended.
>>> 4. Thread #2 will never suspend because Solaris has set sigoff() on it
>>>    until the pthread_atfork prepare handler has returned (it won't).
>>>
>>> It would appear that there should be some other fine grained lock to
>>> prevent this kind of deadlock.
>>
>> It’s not impossible to imagine something similar happening for Darwin.
>> (i.e. masking signals during thread startup) - but I did not poke at the
>> sources so far.
>> Iain
>>
>
> @Rainers I've synthesised this in a C program, the minimum logic more or
> less copied from druntime itself.
>
> https://gist.github.com/ibuclaw/3e57a4f7690012f49834a7442977b28b
>
> On Solaris/SPARC, I get a hang in the same manner as I described once
> every 5 or so runs.
>
> Interestingly, disabling the "GC" from installing atfork prepare
> handlers does not remove the chance of a deadlock occurring (maybe one
> in every 20 runs), as it would appear that sema_wait() and fork() have
> low level libc lock in common.
>
> The implementation that is free of deadlocks is to use thr_suspend and
> thr_continue instead. However, this can only work with Druntime on
> Solaris if there is also a function available to get a given thread's
> stack and registers for the GC to scan.
>
> There is such a function here, but it would appear to be deprecated / up
> for removal once some ancient version of Java is no longer supported.
>
> https://github.com/illumos/illumos-gate/blob/80040569a359c61120972d882d97428e80dcab90/usr/src/lib/libc/port/threads/thr.c#L2477-L2496
>
@Rainers, I might have found the solution.

It turns out that fork() and thr_suspend/thr_continue() have a lock in
common - one cannot proceed without the other releasing.

So I think the correct fix would be to do something like this in
druntime's suspend() function:

    thr_suspend(t.id);
    pthread_kill(t.id, suspendSignal); // or thr_kill
    thr_continue(t.id);

Is there any reason to suggest otherwise?

Iain.