#3838: Performance issues with blackholes
---------------------------------+------------------------------------------
    Reporter:  simonmar          |        Owner:  simonmar               
        Type:  bug               |       Status:  new                    
    Priority:  normal            |    Milestone:  6.14 branch            
   Component:  Runtime System    |      Version:  6.12.1                 
    Keywords:                    |   Difficulty:                         
          Os:  Unknown/Multiple  |     Testcase:                         
Architecture:  Unknown/Multiple  |      Failure:  Runtime performance bug
---------------------------------+------------------------------------------
Description changed by simonmar:

New description:

 The attached program demonstrates an acute performance problem when many
 threads are blocking on blackholes.

 To run it:

 {{{
 $ cd event
 $ cabal configure && cabal build
 $ cd benchmarks
 $ make thread-delay
 $ ./thread-delay -n 20000
 }}}

 Often the program completes in under a second, but sometimes it takes much
 longer (15 seconds up to a minute).

 I diagnosed the problem: in this program, thousands of threads compete to
 update a single `IORef` using `atomicModifyIORef`.  If one of these
 threads is pre-empted while evaluating the thunk left in the `IORef` by
 `atomicModifyIORef`, then the thunk becomes a blackhole. From that point
 on, all the other threads become blocked on blackholes, as they each call
 `atomicModifyIORef` and evaluate the result.
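
 A minimal sketch of the pattern (hypothetical names, not the attached
 benchmark itself): each forked thread does an `atomicModifyIORef` on one
 shared `IORef` and then forces the result, so it ends up evaluating the
 chain of thunks left behind by earlier updates.

 {{{
 import Control.Concurrent (forkIO)
 import Control.Concurrent.MVar
 import Control.Monad (forM_, replicateM)
 import Data.IORef

 -- Hypothetical reduction of the benchmark: many threads bump a shared
 -- counter with atomicModifyIORef and then force the result.  The IORef
 -- accumulates a chain of (+1) thunks; if one thread is pre-empted while
 -- forcing that chain, the thunk becomes a blackhole and every other
 -- thread forcing it blocks on that blackhole.
 main :: IO ()
 main = do
   ref   <- newIORef (0 :: Int)
   dones <- replicateM 20000 newEmptyMVar
   forM_ dones $ \d -> forkIO $ do
     x <- atomicModifyIORef ref (\n -> (n + 1, n + 1))
     x `seq` putMVar d ()      -- forcing x is where the pile-up happens
   mapM_ takeMVar dones        -- wait for all the threads
   readIORef ref >>= print
 }}}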

 Eventually there are thousands of threads on the blackhole queue, and
 since this queue is traversed at least once and possibly several times
 during GC, performance drops severely.

 Some thoughts I've had follow.  There isn't an obvious good solution yet.

  * We can't attach blocked threads to the blackhole directly, due to the
 race conditions this would cause with parallel execution (we replaced the
 blackhole blocking queues with the global blackhole queue in 6.6 when
 implementing parallel execution).

  * Maintaining a hash table mapping blackholes to blocked threads would
 improve wakeup times, but it wouldn't help here, because we'd then have a
 large hash table to update on each GC.  I have an unpushed patch that does
 just this (attached); it didn't improve performance when I tried it, and
 it is rather more complicated than the current design.

  * We could divide the blackhole queue into per-generation queues, as
 we've done with several other data structures.  This would help GC to
 scale, but it doesn't address the root problem.

  * If instead of blackholing the thunk we suspended it (à la `raiseAsync`),
 that would solve the problem.  It could be done in `threadPaused`, but how
 do we know when to do this?

  * Another solution along similar lines would be not to blackhole the
 thunk at all: we let other threads evaluate it again.  This would make
 sense for thunks with a fixed, known-small evaluation cost, such as
 selectors.  Unfortunately in this case the thunk in the `IORef` is not
 known to be small, although the two selector thunks also created by
 `atomicModifyIORef` do fall into this category (a sketch of the thunks
 involved follows this list).

  * If we knew which threads owned which blackholes, then we could arrange
 to schedule the threads that others are blocked on more promptly.  This is
 like a priority-inversion problem: one thread is blocking all the others,
 so we need to raise its priority.  Unfortunately we don't have a mapping
 from blackholes to threads available, and maintaining one would introduce
 overhead.  In this case we have a cascade of threads blocked on each
 other, so choosing the right scheduling is particularly difficult.

  * Using STM would avoid the problem, as long as the program is careful
 not to store any thunks in a `TVar` (a strict-update sketch follows this
 list).  However, STM is rather more expensive, and the reason for using
 `atomicModifyIORef` in the first place was speed.

  * Using `MVar` would ''not'' solve the problem: a thread can be
 pre-empted in the middle of `modifyMVar`, while it still holds the
 `MVar`, which leads to a similar pile-up (a simplified model follows
 this list).  In fact, it would be impossible for the RTS to do anything
 in this case, since we can't revert an `MVar` or know which thread is
 holding it.
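
 Regarding the selector-thunk point above, here is a non-atomic model of
 what `atomicModifyIORef` builds (the real version goes through a primop
 and a CAS loop; the name below is only for illustration).  Only the
 application `f x` can be expensive; the two selector thunks wrapped
 around it are known-small:

 {{{
 import Data.IORef

 -- Non-atomic model of atomicModifyIORef, showing which thunks it creates.
 atomicModifyIORefModel :: IORef a -> (a -> (a, b)) -> IO b
 atomicModifyIORefModel ref f = do
   x <- readIORef ref
   let r = f x              -- unevaluated, potentially expensive thunk
   writeIORef ref (fst r)   -- selector thunk left in the IORef
   return (snd r)           -- selector thunk handed back to the caller
 }}}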
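
 For the STM point, a sketch (hypothetical helper, written against the stm
 package) of the kind of strict update that avoids storing a thunk in the
 `TVar`: the new value is forced inside the transaction before it is
 written.

 {{{
 import Control.Concurrent.STM

 -- Force the new value to WHNF before writing it, so no large thunk is
 -- ever stored in the TVar.
 modifyTVarStrict :: TVar a -> (a -> a) -> STM a
 modifyTVarStrict tv f = do
   x <- readTVar tv
   let y = f x
   writeTVar tv $! y        -- ($!) forces y before the write happens
   return y

 -- e.g.  atomically (modifyTVarStrict counter (+1))
 }}}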
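
 And a simplified model of `modifyMVar_` (the real one adds exception
 safety with mask/onException), showing why pre-emption while the `MVar`
 is held blocks every other caller:

 {{{
 import Control.Concurrent.MVar

 -- Simplified model of modifyMVar_: between takeMVar and putMVar the MVar
 -- is empty, so if the thread is pre-empted here every other caller blocks
 -- on the empty MVar, and the RTS cannot revert it or find the holder.
 modifyMVarModel :: MVar a -> (a -> IO a) -> IO ()
 modifyMVarModel m f = do
   x  <- takeMVar m     -- MVar is now empty; other callers block here
   x' <- f x
   putMVar m x'         -- only now can blocked callers proceed
 }}}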

--

-- 
Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/3838#comment:1>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler