#3518: GHC GC rises greatly on -N8 compared to -N7
-------------------------------------+--------------------------------------
Reporter:  nccb                      |          Owner:                
    Type:  run-time performance bug  |         Status:  new           
Priority:  normal                    |      Component:  Runtime System
 Version:  6.10.4                    |       Severity:  normal        
Keywords:                            |       Testcase:                
      Os:  Linux                     |   Architecture:  x86           
-------------------------------------+--------------------------------------
 I've recently been benchmarking some parallel code on an 8-core x86
 machine.  I have graphed performance for -N1 through to -N8, but I often
 get a huge spike on -N8.  I can produce this fairly reliably with the
 attached program, that simply spawns 8 threads that each spin writing to a
 TVar.  Here is an example of the timings I get:


 {{{
 time ./Main +RTS -N1
 real    0m43.697s
 user    0m43.683s
 sys     0m0.016s

 time ./Main +RTS -N4
 real    0m24.424s
 user    1m13.621s
 sys     0m0.208s

 time ./Main +RTS -N7
 real    0m18.631s
 user    1m42.778s
 sys     0m0.260s

 time ./Main +RTS -N8
 real    0m37.859s
 user    3m38.506s
 sys     0m0.632s
 }}}

 Spot the odd one out!  It's not to do with having exactly 8 threads for 8
 cores -- I get this problem with ~150 threads on 8 cores.  When I run with
 -s, it seems like the GC is the reason:

 {{{
 time ./Main +RTS -N7 -s
    7,313,502,412 bytes allocated in the heap
      108,812,664 bytes copied during GC
           11,924 bytes maximum residency (100 sample(s))
           61,828 bytes maximum slop
                5 MB total memory in use (0 MB lost due to fragmentation)

   Generation 0:  4207 collections,     0 parallel,  1.18s,  1.17s elapsed
   Generation 1:   100 collections,    99 parallel,  0.18s,  0.04s elapsed

   Parallel GC work balance: 3.25 (285256 / 87661, ideal 7)

   Task  0 (worker) :  MUT time:   5.56s  ( 11.70s elapsed)
                       GC  time:   0.01s  (  0.01s elapsed)

   Task  1 (worker) :  MUT time:   6.65s  ( 11.70s elapsed)
                       GC  time:   1.16s  (  1.13s elapsed)

   Task  2 (worker) :  MUT time:   0.03s  ( 11.70s elapsed)
                       GC  time:   0.00s  (  0.00s elapsed)

   Task  3 (worker) :  MUT time:   0.00s  ( 11.70s elapsed)
                       GC  time:   0.00s  (  0.00s elapsed)

   Task  4 (worker) :  MUT time:   9.93s  ( 11.70s elapsed)
                       GC  time:   0.01s  (  0.01s elapsed)

   Task  5 (worker) :  MUT time:   6.97s  ( 11.70s elapsed)
                       GC  time:   0.00s  (  0.00s elapsed)

   Task  6 (worker) :  MUT time:   8.25s  ( 11.70s elapsed)
                       GC  time:   0.04s  (  0.02s elapsed)

   Task  7 (worker) :  MUT time:   6.79s  ( 11.72s elapsed)
                       GC  time:   0.00s  (  0.00s elapsed)

   Task  8 (worker) :  MUT time:   3.70s  ( 11.70s elapsed)
                       GC  time:   0.14s  (  0.03s elapsed)

   INIT  time    0.00s  (  0.00s elapsed)
   MUT   time   61.91s  ( 11.70s elapsed)
   GC    time    1.35s  (  1.21s elapsed)
   EXIT  time    0.01s  (  0.02s elapsed)
   Total time   63.27s  ( 12.93s elapsed)

   %GC time       2.1%  (9.4% elapsed)

   Alloc rate    118,112,344 bytes per MUT second

   Productivity  97.9% of total user, 478.9% of total elapsed

 recordMutableGen_sync: 383
 gc_alloc_block_sync: 2
 whitehole_spin: 0
 gen[0].steps[0].sync_todo: 0
 gen[0].steps[0].sync_large_objects: 0
 gen[0].steps[1].sync_todo: 16
 gen[0].steps[1].sync_large_objects: 0
 gen[1].steps[0].sync_todo: 175
 gen[1].steps[0].sync_large_objects: 0

 real    0m12.933s
 user    1m3.276s
 sys     0m0.248s
 time ./Main +RTS -N8 -s
    7,316,581,044 bytes allocated in the heap
      153,424,584 bytes copied during GC
           11,964 bytes maximum residency (132 sample(s))
           65,916 bytes maximum slop
                6 MB total memory in use (0 MB lost due to fragmentation)

   Generation 0:  4711 collections,  4711 parallel, 136.08s, 21.38s elapsed
   Generation 1:   132 collections,   132 parallel,  6.12s,  0.93s elapsed

   Parallel GC work balance: 1.01 (38276826 / 38012855, ideal 8)

   Task  0 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                       GC  time:  30.31s  (  4.74s elapsed)

   Task  1 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                       GC  time:  25.09s  (  3.68s elapsed)

   Task  2 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                       GC  time:   0.09s  (  0.01s elapsed)

   Task  3 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                       GC  time:   0.00s  (  0.00s elapsed)

   Task  4 (worker) :  MUT time:   0.98s  ( 14.56s elapsed)
                       GC  time:   7.62s  (  1.17s elapsed)

   Task  5 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                       GC  time:  13.09s  (  2.01s elapsed)

   Task  6 (worker) :  MUT time:   5.81s  ( 14.56s elapsed)
                       GC  time:   1.92s  (  0.28s elapsed)

   Task  7 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                       GC  time:  25.12s  (  4.24s elapsed)

   Task  8 (worker) :  MUT time:   0.95s  ( 14.56s elapsed)
                       GC  time:   8.01s  (  1.20s elapsed)

   Task  9 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                       GC  time:  30.94s  (  4.98s elapsed)

   INIT  time    0.00s  (  0.00s elapsed)
   MUT   time   65.37s  ( 14.56s elapsed)
   GC    time  142.20s  ( 22.31s elapsed)
   EXIT  time    0.00s  (  0.01s elapsed)
   Total time  207.57s  ( 36.87s elapsed)

   %GC time      68.5%  (60.5% elapsed)

   Alloc rate    111,922,106 bytes per MUT second

   Productivity  31.5% of total user, 177.3% of total elapsed

 recordMutableGen_sync: 597
 gc_alloc_block_sync: 0
 whitehole_spin: 0
 gen[0].steps[0].sync_todo: 0
 gen[0].steps[0].sync_large_objects: 0
 gen[0].steps[1].sync_todo: 35296
 gen[0].steps[1].sync_large_objects: 0
 gen[1].steps[0].sync_todo: 557515
 gen[1].steps[0].sync_large_objects: 0

 real    0m36.873s
 user    3m27.569s
 sys     0m0.524s
 }}}

 See how the GC goes from 1s to 142s, just by virtue of -N8 instead of -N7.
 Increasing the heap size seems to alleviate the problem, but the
 difference is so severe (and unexpected, at least to me) that I thought
 I'd file a ticket.  I am happy to try other flags if you want.  This is
 all on GHC 6.10.4.

-- 
Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/3518>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
_______________________________________________
Glasgow-haskell-bugs mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs

Reply via email to