#3838: Performance issues with blackholes
---------------------------------+------------------------------------------
Reporter: simonmar | Owner: simonmar
Type: bug | Status: new
Priority: normal | Milestone: 6.14 branch
Component: Runtime System | Version: 6.12.1
Keywords: | Difficulty:
Os: Unknown/Multiple | Testcase:
Architecture: Unknown/Multiple | Failure: Runtime performance bug
---------------------------------+------------------------------------------
Description changed by simonmar:
Old description:
> The attached program demonstrates an acute performance problem when many
> threads are blocking on blackholes.
>
> To run it:
>
> {{{
> $ cd event
> $ cabal configure && cabal build
> $ cd benchmarks
> $ make thread-delay
> $ ./thread-delay -n 20000
> }}}
>
> often the program completes in under a second, but sometimes it takes
> much longer (15s up to a minute).
>
> I diagnosed the problem: in this program, thousands of threads compete to
> update a single `IORef` using `atomicModifyIORef`. If one of these
> threads is pre-empted while evaluating the thunk left in the `IORef` by
> `atomicModifyIORef`, then the thunk becomes a blackhole. From that point
> on, all the other threads become blocked on blackholes, as they each call
> `atomicModifyIORef` and evaluate the result.
>
> Eventually there are thousands of threads on the blackhole queue, and
> since this queue is traversed at least once and possibly several times
> during GC, performance drops severely.
>
> Some thoughts I've had follow. There isn't an obvious good solution yet.
>
> * We can't attach blocked threads to the blackhole directly, due to the
> race conditions this would cause with parallel execution (we replaced the
> blackhole blocking queues with the global blackhole queue in 6.6 when
> implementing parallel execution).
>
> * maintaining a hash-table mapping blackholes to blocked threads would
> improve wakeup times, but wouldn't help here, because we'd then have a
> large hash table to update on each GC. I have an unpushed patch to do
> just this sitting around (attached); it didn't improve performance when I
> tried it, and is rather more complicated than the current design.
>
> * we could divide the blackhole queue into per-generation queues as
> we've done with several other data structures. This would help GC to
> scale, but doesn't address the root problems.
>
> * If instead of blackholing the thunk we suspended it (ala
> `raiseAsync`), that would solve the problem. It could be done in
> `threadPaused`, but how do we know when to do this?
>
> * Another solution along similar lines would be to just not blackhole
> the thunk at all: we let other threads evaluate it again. This would
> make sense for thunks with a fixed known-small evaluation cost, such as
> selectors. Unfortunately in this case the thunk in the `IORef` is not
> known to be small, although the two selector thunks also created by
> `atomicModifyIORef` do fall into this category.
>
> * If we knew which threads owned which blackholes, then we could arrange
> to schedule threads on which others were blocked more quickly. This is
> like a priority-inversion problem: one thread is blocking all the others,
> we need to increase its priority. Unfortunately we don't have a mapping
> from blackholes to threads available, and maintaining it would introduce
> an overhead. In this case we have a cascade of threads blocked on each
> other, so choosing the right scheduling is particularly difficult.
>
> * Using STM would avoid the problem, as long as the program is careful
> to not store any thunks in a `TVar`. However, STM is rather more
> expensive, and the reason for using `atomicModifyIORef` in the first
> place was speed.
>
> * Using `MVar` would ''not'' solve the problem, as `modifyMVar` can be
> pre-empted while holding the `MVar`, which would lead to a similar
> problem (in fact, it would be impossible for the RTS to do anything in
> this case, since we can't revert an `MVar` or know which thread is
> holding it).
New description:
The attached program demonstrates an acute performance problem when many
threads are blocking on blackholes.
To run it:
{{{
$ cd event
$ cabal configure && cabal build
$ cd benchmarks
$ make thread-delay
$ ./thread-delay -n 20000
}}}
often the program completes in under a second, but sometimes it takes much
longer (15s up to a minute).
I diagnosed the problem: in this program, thousands of threads compete to
update a single `IORef` using `atomicModifyIORef`. If one of these
threads is pre-empted while evaluating the thunk left in the `IORef` by
`atomicModifyIORef`, then the thunk becomes a blackhole. From that point
on, all the other threads become blocked on blackholes, as they each call
`atomicModifyIORef` and evaluate the result.
Eventually there are thousands of threads on the blackhole queue, and
since this queue is traversed at least once and possibly several times
during GC, performance drops severely.
Some thoughts I've had follow. There isn't an obvious good solution yet.
* We can't attach blocked threads to the blackhole directly, due to the
race conditions this would cause with parallel execution (we replaced the
blackhole blocking queues with the global blackhole queue in 6.6 when
implementing parallel execution).
* maintaining a hash-table mapping blackholes to blocked threads would
improve wakeup times, but wouldn't help here, because we'd then have a
large hash table to update on each GC. I have an unpushed patch to do
just this sitting around (attached); it didn't improve performance when I
tried it, and is rather more complicated than the current design.
* we could divide the blackhole queue into per-generation queues as we've
done with several other data structures. This would help GC to scale, but
doesn't address the root problems.
* If instead of blackholing the thunk we suspended it (ala `raiseAsync`),
that would solve the problem. It could be done in `threadPaused`, but how
do we know when to do this?
* Another solution along similar lines would be to just not blackhole the
thunk at all: we let other threads evaluate it again. This would make
sense for thunks with a fixed known-small evaluation cost, such as
selectors. Unfortunately in this case the thunk in the `IORef` is not
known to be small, although the two selector thunks also created by
`atomicModifyIORef` do fall into this category. (Note however that
blackholing is currently mandatory due to a problem that otherwise occurs
in parallel GC: see `NOTE [upd-black-hole]` in
[[GhcFile(rts/sm/Scav.c)]]).
* If we knew which threads owned which blackholes, then we could arrange
to schedule threads on which others were blocked more quickly. This is
like a priority-inversion problem: one thread is blocking all the others,
we need to increase its priority. Unfortunately we don't have a mapping
from blackholes to threads available, and maintaining it would introduce
an overhead. In this case we have a cascade of threads blocked on each
other, so choosing the right scheduling is particularly difficult.
* Using STM would avoid the problem, as long as the program is careful to
not store any thunks in a `TVar`. However, STM is rather more expensive,
and the reason for using `atomicModifyIORef` in the first place was speed.
* Using `MVar` would ''not'' solve the problem, as `modifyMVar` can be
pre-empted while holding the `MVar`, which would lead to a similar problem
(in fact, it would be impossible for the RTS to do anything in this case,
since we can't revert an `MVar` or know which thread is holding it).
--
--
Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/3838#comment:2>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
_______________________________________________
Glasgow-haskell-bugs mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs