[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2022-12-30 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

Iain Buclaw  changed:

   What|Removed |Added

 CC||bra...@puremagic.com

--- Comment #31 from Iain Buclaw  ---
*** Issue 13416 has been marked as a duplicate of this issue. ***

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2022-12-30 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #30 from Iain Buclaw  ---
*** Issue 10351 has been marked as a duplicate of this issue. ***

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2022-12-30 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

Iain Buclaw  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 CC||ibuc...@gdcproject.org
 Resolution|--- |FIXED

--- Comment #29 from Iain Buclaw  ---
PR got merged.

https://github.com/dlang/druntime/pull/3617

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2021-11-09 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

Dlang Bot  changed:

   What|Removed |Added

   Keywords||pull

--- Comment #28 from Dlang Bot  ---
@hatf0 updated dlang/druntime pull request #3617 "Move SIGUSR1/SIGUSR2 to SIGRT
for GC" fixing this issue:

- Fix Issue 15939 -- Move SIGUSR1/SIGUSR2 to SIGRT for GC

https://github.com/dlang/druntime/pull/3617

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2019-04-12 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

Илья Ярошенко  changed:

   What|Removed |Added

   Assignee|ilyayaroshe...@gmail.com|nob...@puremagic.com

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-12-17 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

safety0ff.bugz  changed:

   What|Removed |Added

   See Also||https://issues.dlang.org/show_bug.cgi?id=16979

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-10-07 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #27 from Martin Nowak  ---
(In reply to Илья Ярошенко from comment #26)
> Probably related issue
> http://forum.dlang.org/post/igqwbqawrtxnigplg...@forum.dlang.org

No, looks like an unrelated crash in a finalizer.

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-10-04 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #26 from Илья Ярошенко  ---
Probably related issue
http://forum.dlang.org/post/igqwbqawrtxnigplg...@forum.dlang.org

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-09-23 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #25 from Martin Nowak  ---
(In reply to Aleksei Preobrazhenskii from comment #24)
> Since I changed the signals to real-time and migrated to a recent kernel I
> haven't seen that issue in release builds; however, I recently tried running
> a profile build (unfortunately only on the old kernel) and it was
> consistently stuck every time.

Thanks, good to hear from you.

There is a chance that these are kernel bugs fixed in 3.10
https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db
and 3.18
https://github.com/torvalds/linux/commit/76835b0ebf8a7fe85beb03c75121419a7dec52f0.

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-09-22 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #24 from Aleksei Preobrazhenskii  ---
(In reply to Martin Nowak from comment #23)
> Anyone still experiencing this issue? Can't seem to fix it w/o reproducing
> it.

Since I changed the signals to real-time and migrated to a recent kernel I
haven't seen that issue in release builds; however, I recently tried running a
profile build (unfortunately only on the old kernel) and it was consistently
stuck every time. It might be related to this issue; I will try to reproduce
it with simpler code when I have time.

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-09-22 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #23 from Martin Nowak  ---
Anyone still experiencing this issue? Can't seem to fix it w/o reproducing it.

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-08-11 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #22 from Илья Ярошенко  ---
(In reply to Martin Nowak from comment #21)
> Nope, that doesn't seem to be the problem.
> All the thread exit code synchronizes on Thread.slock_nothrow.
> It shouldn't even be possible to send a signal to an exiting thread, b/c
> they get removed from the thread list before that, and that is synchronized
> around the suspend loop.
> 
> Might still be a problem with the synchronization of m_isRunning and/or
> thread_cleanupHandler. Did your apps by any chance use thread cancellation
> or pthread_exit?

No, but an Exception may be thrown in a thread.

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-08-11 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #21 from Martin Nowak  ---
Nope, that doesn't seem to be the problem.
All the thread exit code synchronizes on Thread.slock_nothrow.
It shouldn't even be possible to send a signal to an exiting thread, b/c they
get removed from the thread list before that, and that is synchronized around
the suspend loop.

Might still be a problem with the synchronization of m_isRunning and/or
thread_cleanupHandler. Did your apps by any chance use thread cancellation or
pthread_exit?

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-08-10 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #20 from Илья Ярошенко  ---
I don't have access to the source code anymore :/

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-08-09 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #19 from Martin Nowak  ---
(In reply to Илья Ярошенко from comment #17)
> > https://github.com/dlang/druntime/pull/1110, that would affect dmd >=
> > 2.070.0.
> > Could someone test their code with 2.069.2?
> 
> Yes, the bug was found first for 2.069.

But that change is not in 2.069.x, only in 2.070.0 and following.
Can you reproduce it somewhat reliably? That would simplify my life a lot.

Following my hypothesis, it should be fairly simple to trigger with one thread
continuously looping on GC.collect() while concurrently spawning many
short-lived threads, to increase the chance of triggering the race between
signal delivery and the thread exiting.

If real-time signals are delivered faster (before pthread_kill returns), then
they might indeed avoid the race condition by pure chance.
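
For illustration, a minimal sketch of that reproduction approach (thread
counts and the per-thread allocation are arbitrary, not taken from any
reported application): one thread loops on GC.collect() while the main thread
keeps creating and joining short-lived threads, so that suspend signals are
frequently sent to threads that are about to exit.

import core.atomic : atomicLoad, atomicStore;
import core.memory : GC;
import core.thread : Thread;

shared bool stop;

void main()
{
    // One thread continuously forces collections...
    auto collector = new Thread({
        while (!atomicLoad(stop))
            GC.collect();
    });
    collector.start();

    // ...while the main thread keeps spawning and joining short-lived
    // threads, so that suspend signals frequently race with thread exit.
    foreach (i; 0 .. 100_000)
    {
        auto t = new Thread({
            auto buf = new ubyte[](64);   // a small GC allocation per thread
            buf[] = 0;
        });
        t.start();
        t.join();
    }

    atomicStore(stop, true);
    collector.join();
}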

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-08-09 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #17 from Илья Ярошенко  ---
(In reply to Martin Nowak from comment #16)
> (In reply to Aleksei Preobrazhenskii from comment #13)
> > All suspending signals were delivered, but it seems that the number of
> > calls to sem_wait was different from the number of calls to sem_post (or
> > something similar). I have no reasonable explanation for that.
> > 
> > It doesn't invalidate the hypothesis that RT signals helped with the
> > original deadlock though.
> 
> For it to be a hypothesis it must be verifiable, but as we can't explain why
> RT signals would help, it's not a real hypothesis. Can anyone reproduce the
> issue somewhat reliably?

It is not easy to catch on a PC. The bug was found when the program was
running on multiple CPUs on multiple servers over the course of a day.

> I would suspect that this issue came with the recent parallel suspend
> feature.
> https://github.com/dlang/druntime/pull/1110, that would affect dmd >=
> 2.070.0.
> Could someone test their code with 2.069.2?

Yes, the bug was found first for 2.069.

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-08-09 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #16 from Martin Nowak  ---
(In reply to Aleksei Preobrazhenskii from comment #13)
> All suspending signals were delivered, but it seems that the number of calls
> to sem_wait was different from the number of calls to sem_post (or something
> similar). I have no reasonable explanation for that.
> 
> It doesn't invalidate the hypothesis that RT signals helped with the original
> deadlock though.

For it to be a hypothesis it must be verifiable, but as we can't explain why
RT signals would help, it's not a real hypothesis. Can anyone reproduce the
issue somewhat reliably?
I would suspect that this issue came with the recent parallel suspend feature.
https://github.com/dlang/druntime/pull/1110, that would affect dmd >= 2.070.0.
Could someone test their code with 2.069.2?

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-05-20 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

Artem Tarasov  changed:

   What|Removed |Added

 CC||lomerei...@gmail.com

--- Comment #15 from Artem Tarasov  ---
I'm apparently bumping into the same problem. Here's the last stack trace that
I've received from a user, very similar to the one posted here:
https://gist.github.com/rtnh/e2eab6afa7c0a37dbc96578d0f73c540

The prominent kernel bug mentioned here has already been ruled out. Another
hint I've got is that reportedly 'the error doesn't happen on XenServer
hypervisors, only on KVM' (the full discussion is taking place at
https://github.com/lomereiter/sambamba/issues/189).

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-05-12 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #14 from safety0ff.bugz  ---
(In reply to Aleksei Preobrazhenskii from comment #13)
> 
> All suspending signals were delivered, but it seems that the number of calls
> to sem_wait was different from the number of calls to sem_post (or something
> similar). I have no reasonable explanation for that.
>
> It doesn't invalidate the hypothesis that RT signals helped with the original
> deadlock though.

I haven't looked too closely at whether there are any races in thread
termination.
My suspicions are still on a low-level synchronization bug.
Have you tried a more recent kernel (3.19+) or a newer glibc?

I'm aware of this bug [1] which was supposed to affect kernels 3.14 - 3.18 but
perhaps there's a preexisting bug which affects your machine?

[1] https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-05-11 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #13 from Aleksei Preobrazhenskii  ---
I saw a new deadlock with different symptoms today.

Stack trace of collecting thread:

Thread XX (Thread 0x7fda6700 (LWP 32383)):
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
#1  0x007b4046 in thread_suspendAll ()
#2  0x007998dd in gc.gc.Gcx.fullcollect() ()
#3  0x00797e24 in gc.gc.Gcx.bigAlloc() ()
#4  0x0079bb5f in
gc.gc.GC.__T9runLockedS47_D2gc2gc2GC12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS21_D2gc2gc10mallocTimelS21_D2gc2gc10numMallocslTmTkTmTxC8TypeInfoZ.runLocked()
()
#5  0x0079548e in gc.gc.GC.malloc() ()
#6  0x00760ac7 in gc_qalloc ()
#7  0x0076437b in _d_arraysetlengthT ()
...application stack

Stack traces of other threads:

Thread XX (Thread 0x7fda5cff9700 (LWP 32402)):
#0  0x7fda78927454 in do_sigsuspend (set=0x7fda5cff76c0) at
../sysdeps/unix/sysv/linux/sigsuspend.c:63
#1  __GI___sigsuspend (set=) at
../sysdeps/unix/sysv/linux/sigsuspend.c:78
#2  0x0075d979 in core.thread.thread_suspendHandler() ()
#3  0x0075e220 in core.thread.callWithStackShell() ()
#4  0x0075d907 in thread_suspendHandler ()
#5  
#6  pthread_cond_wait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:160
#7  0x00760069 in core.sync.condition.Condition.wait() ()
...application stack


All suspending signals were delivered, but it seems that the number of calls to
sem_wait was different from the number of calls to sem_post (or something
similar). I have no reasonable explanation for that.

It doesn't invalidate the hypothesis that RT signals helped with the original
deadlock though.
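
These traces are easier to read against a simplified model of the suspend
handshake. The sketch below is not druntime's code; it is a self-contained
illustration of the same pattern, using SIGRTMIN/SIGRTMIN + 1 as demo signals
so it does not clash with the GC's own SIGUSR1/SIGUSR2 handlers: the
suspending side sends one signal per thread and then does one sem_wait per
signal, while each handler posts the semaphore and parks in sigsuspend until
the resume signal arrives. A lost suspend signal would mean one missing
sem_post, leaving the suspending side blocked in sem_wait exactly like the
collecting thread in the first trace.

import core.sys.posix.signal;
import core.sys.posix.semaphore;
import core.atomic : atomicLoad, atomicStore;
import core.stdc.stdio : printf;
import core.thread : Thread;

__gshared sem_t suspended;   // posted once per thread that reached its handler
shared bool quit;

extern (C) void suspendHandler(int) nothrow @nogc
{
    sem_post(&suspended);             // tell the suspending side: I am stopped

    sigset_t mask = void;
    sigfillset(&mask);
    sigdelset(&mask, SIGRTMIN + 1);   // park here until the resume signal
    sigsuspend(&mask);
}

extern (C) void resumeHandler(int) nothrow @nogc
{
    // Its only job is to interrupt sigsuspend() in suspendHandler.
}

void main()
{
    sem_init(&suspended, 0, 0);

    sigaction_t act;
    sigfillset(&act.sa_mask);         // block everything while a handler runs
    act.sa_handler = &suspendHandler;
    sigaction(SIGRTMIN, &act, null);
    act.sa_handler = &resumeHandler;
    sigaction(SIGRTMIN + 1, &act, null);

    // A few workers that just spin until told to quit.
    Thread[] workers;
    foreach (i; 0 .. 4)
    {
        auto t = new Thread({
            while (!atomicLoad(quit))
                Thread.yield();
        });
        t.start();
        workers ~= t;
    }

    // "thread_suspendAll": one signal and one sem_wait per worker.  If a
    // suspend signal were lost, one sem_post would be missing and this loop
    // would block forever -- the sem_wait frame in the first trace above.
    foreach (t; workers)
        pthread_kill(t.id, SIGRTMIN);
    foreach (_; workers)
        sem_wait(&suspended);

    // "thread_resumeAll": wake everyone up again, then shut down.
    foreach (t; workers)
        pthread_kill(t.id, SIGRTMIN + 1);

    atomicStore(quit, true);
    foreach (t; workers)
        t.join();

    printf("suspend/resume handshake completed\n");
}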

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-05-09 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #12 from Aleksei Preobrazhenskii  ---
(In reply to Martin Nowak from comment #11)
> Did you have gdb attached while the signal was sent? That sometimes causes
> issues w/ signal delivery.

No, I didn't. I attached gdb to investigate a deadlock that had already
happened at that point.

> Are there any other reasons for switching to real-time signals?

I read that traditional signals are internally mapped to real-time signals. If
that's true, I see no reason to stick with an inferior emulated mechanism with
weaker guarantees.

> Which real-time signals are usually not used for other purposes?

Basically all real-time signals in the range SIGRTMIN .. SIGRTMAX are intended
for application use (SIGRTMIN might vary from platform to platform though,
because of things like NPTL and LinuxThreads).
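
For reference, the usable range can be queried at runtime; a small sketch,
assuming the Linux/glibc bindings in core.sys.posix.signal (which resolve
SIGRTMIN at runtime because glibc reserves the lowest real-time signals for
its threading implementation):

import core.sys.posix.signal : SIGRTMIN, SIGRTMAX;
import std.stdio : writefln;

void main()
{
    // glibc keeps the lowest real-time signals for NPTL, so SIGRTMIN is a
    // runtime value rather than a compile-time constant.
    writefln("real-time signals available for applications: %s .. %s",
             SIGRTMIN, SIGRTMAX);
}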

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-05-08 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #11 from Martin Nowak  ---
Having the main thread hang while waiting for semaphore posts in
thread_suspendAll is a good indication that the signal was lost.
Did you have gdb attached while the signal was sent? That sometimes causes
issues w/ signal delivery.
The setup looks simple enough (a few threads allocating classes and extending
arrays) to be run for a few days; maybe we can reproduce the problem.

Are there any other reasons for switching to real-time signals?
Which real-time signals are usually not used for other purposes?

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-05-07 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

Илья Ярошенко  changed:

   What|Removed |Added

   Assignee|nob...@puremagic.com|ilyayaroshe...@gmail.com

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-04-27 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #10 from Aleksei Preobrazhenskii  ---
(In reply to safety0ff.bugz from comment #9)
> Could you run strace to get a log of the signal usage?

I tried it before to catch the deadlock, but I wasn't able to catch it while
strace was running. And, unfortunately, I don't have the original code running
in production anymore.

> I'm wondering if there are any other signal handler invocations in the
> "...application stack" part of your stack traces.

No, there was no signal-related code in the hidden parts of the stack traces.

> I've seem a deadlock caused by an assert firing within the
> thread_suspendHandler, which deadlocks on the GC lock.

In my case that was a release build, so I assume no asserts.

> What should happen in this case is: since the signal is masked upon signal
> handler invocation, the new suspend signal is marked as "pending" and handled
> once thread_suspendHandler returns and the signal is unblocked.

Yeah, my reasoning was wrong. I did a quick test and saw that signals weren't
delivered; apparently I forgot that pthread_kill is asynchronous, so the
signals would have coalesced in my test.

> Their queuing and ordering guarantees should be irrelevant due to 
> synchronization and signal masks.

Ideally, yeah, but as I said, I just changed SIGUSR1/SIGUSR2 to
SIGRTMIN/SIGRTMIN+1 and didn't see any deadlocks for a long time, whereas I saw
them pretty consistently before. So either the "irrelevant" part is wrong, or
there is something else that is different and relevant (and probably not
documented) about real-time signals. The other explanation is that the bug is
still there and real-time signals just somehow reduced the probability of it
happening.

Also, I have no other explanation for why the stack traces look like that; the
simplest one is that the signal wasn't delivered.

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-04-27 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

safety0ff.bugz  changed:

   What|Removed |Added

 CC||safety0ff.b...@gmail.com

--- Comment #9 from safety0ff.bugz  ---
Could you run strace to get a log of the signal usage?

For example:

strace -f -e signal -o signals.log command_to_run_program

Then add the output signals.log to the bug report?
I don't know if it'll be useful, but it will give us something more to look
through for hints.

I'm wondering if there are any other signal handler invocations in the
"...application stack" part of your stack traces.
I've seen a deadlock caused by an assert firing within the
thread_suspendHandler, which deadlocks on the GC lock.

(In reply to Aleksei Preobrazhenskii from comment #6)
> For example, if thread_suspendAll happens while some threads are still in
> thread_suspendHandler (they have already handled the resume signal but have
> not yet left the suspend handler).

What should happen in this case is: since the signal is masked upon signal
handler invocation, the new suspend signal is marked as "pending" and handled
once thread_suspendHandler returns and the signal is unblocked.

The suspended thread cannot receive another resume or suspend signal until
after the sem_post in thread_suspendHandler.

I've mocked up the suspend / resume code and it does not deadlock from the
situation you've described.
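
The "pending" behaviour described above can be demonstrated in isolation. A
small sketch (demo only; it installs its own SIGUSR1 handler, which a real D
program should not do since SIGUSR1 belongs to the GC's suspend machinery):
while the signal is blocked, sending it does not run the handler and does not
lose it; the handler runs as soon as the signal is unblocked.

import core.sys.posix.signal;
import core.stdc.signal : raise;
import core.stdc.stdio : printf;

__gshared int handled;

extern (C) void onSignal(int) nothrow @nogc
{
    ++handled;
}

void main()
{
    // Install a handler for SIGUSR1 (for this short-lived demo only).
    sigaction_t act;
    act.sa_handler = &onSignal;
    sigfillset(&act.sa_mask);
    sigaction(SIGUSR1, &act, null);

    // Block SIGUSR1, as if we were inside the suspend handler...
    sigset_t block, old;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &block, &old);

    // ...send it while it is blocked: it is marked pending, not lost...
    raise(SIGUSR1);
    printf("while blocked: handled = %d\n", handled);    // prints 0

    // ...and it is delivered as soon as the mask is restored.
    pthread_sigmask(SIG_SETMASK, &old, null);
    printf("after unblocking: handled = %d\n", handled); // prints 1
}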

> Real-time POSIX signals (SIGRTMIN .. SIGRTMAX) have stronger delivery
> guarantees

Their queuing and ordering guarantees should be irrelevant due to 
synchronization and signal masks.

I don't see any other benefits of RT signals.

(In reply to Walter Bright from comment #8)
> 
> Since you've written the code to fix it, please write a Pull Request for it.
> That way you get the credit!

He modified his code to use the thread_setGCSignals function:
https://dlang.org/phobos/core_thread.html#.thread_setGCSignals


P.S.: I don't mean to sound doubtful; I just want a sound explanation of the
deadlock so it can be properly addressed at its cause.

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-04-26 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

Walter Bright  changed:

   What|Removed |Added

 CC||bugzi...@digitalmars.com

--- Comment #8 from Walter Bright  ---
(In reply to Aleksei Preobrazhenskii from comment #7)
> I have been running tests for the past five days and haven't seen any
> deadlocks since I switched the GC to real-time POSIX signals
> (thread_setGCSignals(SIGRTMIN, SIGRTMIN + 1)). I would recommend changing
> the default signals accordingly.

Since you've written the code to fix it, please write a Pull Request for it.
That way you get the credit!

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-04-26 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

Vladimir Panteleev  changed:

   What|Removed |Added

 CC||c...@dawg.eu,
   ||thecybersha...@gmail.com

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-04-25 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #7 from Aleksei Preobrazhenskii  ---
I have been running tests for the past five days and haven't seen any deadlocks
since I switched the GC to real-time POSIX signals
(thread_setGCSignals(SIGRTMIN, SIGRTMIN + 1)). I would recommend changing the
default signals accordingly.

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-04-20 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

Aleksei Preobrazhenskii  changed:

   What|Removed |Added

   See Also||https://issues.dlang.org/show_bug.cgi?id=10351

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-04-20 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #6 from Aleksei Preobrazhenskii  ---
I think I saw the same behaviour in debug builds; I will try to verify it. As
for the 32-bit question, due to the nature of the program I can't test it in a
32-bit environment.

After investigating the problem a little further, I think that the issue might
be in the GC relying on traditional POSIX signals. One way to get such stack
traces is if the suspend signal (SIGUSR1 by default) wasn't delivered, which
could happen for traditional POSIX signals if they occur in quick succession.
For example, if thread_suspendAll happens while some threads are still in
thread_suspendHandler (they have already handled the resume signal but have
not yet left the suspend handler).

Real-time POSIX signals (SIGRTMIN .. SIGRTMAX) have stronger delivery
guarantees; I'm going to try the same code but with
thread_setGCSignals(SIGRTMIN, SIGRTMIN + 1).
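
A minimal sketch of that workaround, assuming a Linux/glibc build where
core.sys.posix.signal exposes SIGRTMIN. The reporter doesn't say where he
placed the call; the druntime documentation asks for thread_setGCSignals to
run before druntime installs the default suspend/resume handlers, and a C
constructor (pragma(crt_constructor), available in reasonably recent
compilers) is one way to satisfy that:

import core.sys.posix.signal : SIGRTMIN;
import core.thread : thread_setGCSignals;

// Runs before druntime initialization, so the GC installs its handlers for
// SIGRTMIN/SIGRTMIN + 1 instead of SIGUSR1/SIGUSR2.
pragma(crt_constructor)
extern (C) void useRealTimeGCSignals()
{
    thread_setGCSignals(SIGRTMIN, SIGRTMIN + 1);
}

void main()
{
    // ... spawn threads and allocate as usual; collections now suspend and
    // resume threads with the real-time signals ...
}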

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-04-20 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #5 from Sobirari Muhomori  ---
Also what about 32-bit mode?

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-04-20 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #4 from Sobirari Muhomori  ---
(In reply to Aleksei Preobrazhenskii from comment #0)
> dmd 2.071.0 with -O -release -inline -boundscheck=off

Do these flags affect the hang?

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-04-19 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

Ivan Kazmenko  changed:

   What|Removed |Added

 CC||ga...@mail.ru

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-04-19 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

Marco Leise  changed:

   What|Removed |Added

 CC||marco.le...@gmx.de

--- Comment #3 from Marco Leise  ---
This issue has a smell of https://issues.dlang.org/show_bug.cgi?id=10351
In the absence of a repro case that works without the profiler I just kept it
open for future reference. Note how the GC hangs in thread_suspendAll() in
both cases.

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-04-19 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

Илья Ярошенко  changed:

   What|Removed |Added

 CC||ilyayaroshe...@gmail.com

--- Comment #2 from Илья Ярошенко  ---
+1
I had the same problems

--


[Issue 15939] GC.collect causes deadlock in multi-threaded environment

2016-04-18 Thread via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #1 from Aleksei Preobrazhenskii  ---
I wasn't able to reproduce the issue using simpler code with GC operations
only. I noticed that nanosleep is a syscall which should be interrupted by the
GC signal, so there is probably something else involved besides the GC. I use
the standard library only and have no custom signal-related code.

--