[perl #129949] [BUG] Failure to close channel causes GC crash

jn...@jnthn.net via RT Tue, 01 Nov 2016 07:40:35 -0700

On Sun Oct 23 18:01:55 2016, d...@theschrags.net wrote:
> This probably is an issue with moarvm.
> 
It was indeed.


> When trying to benchmark the concurrent and non-concurrent versions of 
> Damian Conway's 'bogosort' algorithm, I ran into numerous problems with 
> Rakudo Star 2016.07. Depending on the RAKUDO_MAX_THREADS setting, the 
> process might succeed, or it might slowly eat up all of memory, and if 
> more than a few threads were used only some would actually process 
> anything (accumulate CPU time). rakudo-star-2016.10-RC0 looks like it 
> solved the deadlock problem, so that all threads are actually working.
> 
> However, now this version of the program crashes after the first thread 
> completes with:
>    Internal error: zeroed target thread ID in work pass
> 
> or occasionally:
>    Internal error: invalid thread ID 8770096 in GC work pass
> 
> Test runs:
> 
> doug@ender:~/study/Perl6/benchmarks$ RAKUDO_MAX_THREADS=2 perl6 bench_bogosort_simple.pl6
> [e i l p r s x]
> 1 seconds.
> Internal error: invalid thread ID 8770096 in GC work pass
> doug@ender:~/study/Perl6/benchmarks$ RAKUDO_MAX_THREADS=2 perl6 bench_bogosort_simple.pl6
> [e i l p r s x]
> 4 seconds.
> Internal error: zeroed target thread ID in work pass
> doug@ender:~/study/Perl6/benchmarks$ RAKUDO_MAX_THREADS=2 perl6 bench_bogosort_simple.pl6
> [e i l p r s x]
> 24 seconds.
> Internal error: zeroed target thread ID in work pass
> doug@ender:~/study/Perl6/benchmarks$ RAKUDO_MAX_THREADS=2 perl6 bench_bogosort_simple.pl6
> [e i l p r s x]
> 6 seconds.
> Internal error: zeroed target thread ID in work pass
> 
> 
> I also found that the problem is likely related to a channel not being 
> explicitly closed, so I was able to get normal test runs as follows:
> 
> [ With a LEAVE block to close the channels ]
> 
> doug@ender:~/study/Perl6/benchmarks$ RAKUDO_MAX_THREADS=2 perl6 bench_bogosort_simple.pl6
> [e i l p r s x]
> 4 seconds.
> [e i l p r s x]
> 29 seconds.
> [e i l p r s x]
> 57 seconds.
> [e i l p r s x]
> 58 seconds.
> doug@ender:~/study/Perl6/benchmarks$ RAKUDO_MAX_THREADS=2 perl6 bench_bogosort_simple.pl6
> [e i l p r s x]
> 4 seconds.
> [e i l p r s x]
> 10 seconds.
> [e i l p r s x]
> 20 seconds.
> [e i l p r s x]
> 31 seconds.
> 
> I figure that failing to close the channels is bad practice to start with, 
> but we don't really want to crash the VM with an obscure error message. 
> Also, I saw a bug report earlier this year where this message was 
> produced, but it was closed because of uncertainty in reproducing it. This 
> test, however, fails quite repeatably, at least on my system (Linux).
> 
The channel really does want closing; otherwise you end up with a bunch of 
threads sat in a hot loop eating CPU (and each run sets off another thread 
doing exactly that, which is why the program gets slower with every run when 
the Channel.close is missing). Of course, a VM crash is the wrong response; 
I've now hunted that down and committed a fix.

It turns out the bug wasn't actually anything to do with closing the channel 
per se; it's just that leaving the `LEAVE` out resulted in a lot more load and 
more GC runs. The problem could almost certainly have happened with the `LEAVE` 
in place; it was just a lot less likely.

I've added this example as a stresstest, to catch any regressions.

Thanks!

/jnthn