> On 5 Sep 2025, at 17:45, Iain Buclaw <ibuc...@gdcproject.org> wrote:
> 
> Excerpts from Iain Sandoe's message of September 3, 2025 9:59 pm:
>> 
>> 
>>> On 3 Sep 2025, at 20:54, Iain Buclaw <ibuc...@gdcproject.org> wrote:
>>> 
>>> Excerpts from Iain Buclaw's message of September 3, 2025 9:19 pm:
>>>> Excerpts from Rainer Orth's message of September 3, 2025 10:20 am:
>>>>>>> 
>>>>>>> I regularly (but not always) see timeouts on Solaris, both on sparc and
>>>>>>> x86:
>>>>>>> 
>>>>>>> WARNING: libphobos.gc/forkgc2.d execution test program timed out.
>>>>>>> FAIL: libphobos.gc/forkgc2.d execution test
>>>>>>> WARNING: libphobos.gc/startbackgc.d execution test program timed out.
>>>>>>> FAIL: libphobos.gc/startbackgc.d execution test
>>>>> 
>>>>> I haven't tried investigating what's wrong on Solaris with those two,
>>>>> but they sure are annoying, especially since they are so unreliable:
>>>>> sometimes both PASS, sometimes one or the other FAILs, sometimes both FAIL.
>>>>> 
>>>>> I'd thought about skipping them on Solaris, too, just to avoid the noise
>>>>> and the timeouts, but haven't gotten around to that.
>>>>> 
>>>>> However, fixing this at the root would certainly be best.
>>>>> 
>>>> 
>>>> I currently have a gdb session on cfarm where the process has hung for
>>>> forkgc2, and I'm just looking at the backtrace.
>>>> 
>>>> * There are 11 threads in total (main + 10 new'd Threads)
>>>> * All threads are suspended (in sigsuspend) except for two
>>>> * The first of those threads is the one that's requested all threads to 
>>>> suspend using pthread_kill(SIGRTMIN), and is stuck inside a sem_wait 
>>>> for one more call to sem_post().
>>>> * The second is stuck in a SpinLock.lock loop, called from 
>>>> _prefork_handler() inside forkx() inside fork() - my guess would be 
>>>> the handler being called is _d_gcx_atfork_prepare().
>>>> * Specific to Solaris, I've clocked this line in the forkx 
>>>> implementation:
>>>> 
>>>> https://github.com/illumos/illumos-gate/blob/a21856a054bd854f39d1d55a6b0d547cb0d2039f/usr/src/lib/libc/port/threads/scalls.c#L177
>>>> 
>>>> I think what's going on is that the thread that wants to do a GC 
>>>> collection has issued a signal to all threads, but Solaris has called 
>>>> sigoff() in the last thread being fork'd, so the signal never reaches.
>>>> 
>>>> This behaviour does not change when COLLECT_FORK is disabled, so Solaris 
>>>> would still be affected.
>>>> 
>>> 
>>> I forgot to mention, thread #1 that wants to do a GC has control of the 
>>> SpinLock.  So that's why thread #2 is stuck in its current loop.
>>> 
>>> The order of operations that leads to Solaris hanging at runtime is:
>>> 1. Thread #1 calls GC.lockNR() and has hold of the global GC SpinLock.
>>> 2. Thread #2 calls fork(). It too calls GC.lockNR() in 
>>>  _d_gcx_atfork_prepare() and is waiting for the global lock.
>>> 3. Thread #1 decides to call thread_suspendAll() and will never release 
>>>  the GC lock until all threads are suspended.
>>> 4. Thread #2 will never suspend because Solaris has set sigoff() on it 
>>>  until the pthread_atfork prepare handler has returned (it won't).
>>> 
>>> It would appear that there should be some other fine grained lock to 
>>> prevent this kind of deadlock.
>> 
>> It’s not impossible to imagine something similar happening for Darwin.
>> (i.e. masking signals during thread startup) - but I did not poke at the
>> sources so far.
>> Iain
>> 
> 
> @Rainers I've synthesised this in a C program, the minimum logic more or 
> less copied from druntime itself.
> 
> https://gist.github.com/ibuclaw/3e57a4f7690012f49834a7442977b28b

For the record:

Although I have not been able to reproduce the issue with this C code, after
some discussion on irc and more debugging we came to the conclusion that the
forkgc2.d test is breaking the rules by using `exit(0)` instead of `_exit(0)`.

On x86_64 darwin17, I can reliably repeat the hanging test every time (with
the D version) and then, equally reliably, it passes with _exit(0).
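
To illustrate the difference, here is a minimal sketch in plain C (not the
druntime or forkgc2.d code; the names `child_ran_atexit_handler`,
`on_exit_handler` and `handler_fd` are made up for this example). It forks a
child that terminates with either `exit(0)` or `_exit(0)` and reports whether
a process-exit handler ran in the child. In the D test the equivalent of that
handler would be the runtime's shutdown work, which is my assumption about
why `exit(0)` is a problem in a forked child of a threaded program.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of a pipe; only set in the child, stays -1 in the parent. */
static int handler_fd = -1;

/* Stand-in for "work done at normal process exit". */
static void on_exit_handler(void)
{
    if (handler_fd >= 0) {
        const char c = 'x';
        if (write(handler_fd, &c, 1) != 1) {
            /* nothing useful to do on failure in this sketch */
        }
    }
}

/*
 * Fork a child that ends with _exit(0) if use_underscore_exit is 1,
 * or with exit(0) if it is 0.
 * Returns 1 if the atexit handler ran in the child, 0 if it did not,
 * and -1 on any error.
 */
int child_ran_atexit_handler(int use_underscore_exit)
{
    static int registered = 0;
    int fds[2];
    int status;
    char c;
    ssize_t n;
    pid_t pid;

    assert(use_underscore_exit == 0 || use_underscore_exit == 1);

    if (pipe(fds) != 0)
        return -1;
    if (!registered) {
        if (atexit(on_exit_handler) != 0) {
            close(fds[0]);
            close(fds[1]);
            return -1;
        }
        registered = 1;
    }

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: let the handler report through the pipe, if it runs. */
        close(fds[0]);
        handler_fd = fds[1];
        if (use_underscore_exit)
            _exit(0);   /* terminates at once, no atexit handlers */
        exit(0);        /* runs atexit handlers inherited from the parent */
    }

    /* Parent: one byte means the handler ran, EOF means it did not. */
    close(fds[1]);
    n = read(fds[0], &c, 1);
    close(fds[0]);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return n == 1 ? 1 : 0;
}
```

Calling `child_ran_atexit_handler(0)` and `child_ran_atexit_handler(1)` from a
small `main()` shows the handler running only in the `exit(0)` case.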

So, from the Darwin perspective, I’m withdrawing the patch to disable
COLLECT_FORK since that will not solve this issue.

I still have to cater for missing `___fork()` on earlier Darwin versions, but
that is not urgent - and probably needs doing via a configure check.

Unfortunately, for the reasons below, this finding will likely not help Solaris.

Iain

> 
> On Solaris/SPARC, I get a hang in the same manner as I described once 
> every 5 or so runs.
> 
> Interestingly, disabling the "GC" from installing atfork prepare 
> handlers does not remove the chance of a deadlock occurring (maybe one 
> in every 20 runs), as it would appear that sema_wait() and fork() have a
> low-level libc lock in common.
> 
> The implementation that is free of deadlocks is to use thr_suspend and 
> thr_continue instead.  However, this can only work with Druntime on 
> Solaris if there is also a function available to get a given thread's 
> stack and registers for the GC to scan.
> 
> There is such a function here, but it would appear to be deprecated / up 
> for removal once some ancient version of Java is no longer supported.
> 
> https://github.com/illumos/illumos-gate/blob/80040569a359c61120972d882d97428e80dcab90/usr/src/lib/libc/port/threads/thr.c#L2477-L2496
> 
> Iain.
