Bulat Ziganshin wrote:
Hello Simon,

Thursday, December 15, 2005, 4:53:27 PM, you wrote:

SM> The 3k threads are still GC'd, but they are not actually *copied* during
SM> GC.

SM> It'll increase the memory overhead per thread from 2k (1k * 2 for
SM> copying) to 4k (4k block, no overhead for copying).

Simon, why not to include this in the "base package"? either change
something so that a 1k-threads will be not copied during GC, or at
least increment default stack size? this will improve performance of
other hyper-threaded programs. memory expenses seems not so great

Because it doesn't always improve things. This is a slightly modified version of the "cheap concurrency" benchmark from the shootout, first without tweaking -k:

> ./threads003 10000 +RTS -sstderr
./threads003 10000 +RTS -sstderr
 93,908,920 bytes allocated in the heap
159,724,208 bytes copied during GC (scavenged)
  1,559,376 bytes copied during GC (not scavenged)
 10,415,848 bytes maximum residency (4 sample(s))

        177 collections in generation 0 (  1.05s)
          4 collections in generation 1 (  0.02s)

         21 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.28s  (  1.28s elapsed)
  GC    time    1.06s  (  1.09s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    2.35s  (  2.37s elapsed)

  %GC time      45.3%  (45.9% elapsed)

  Alloc rate    73,149,011 bytes per MUT second

  Productivity  54.7% of total user, 54.1% of total elapsed

and now tweaking -k (using -k6k, because this is a 64-bit machine and storage manager blocks are 8k):

 > ./threads003 10000 +RTS -sstderr -k6k
./threads003 10000 +RTS -sstderr -k6k
168,837,736 bytes allocated in the heap
109,203,160 bytes copied during GC (scavenged)
  1,497,728 bytes copied during GC (not scavenged)
 71,180,464 bytes maximum residency (2 sample(s))

        156 collections in generation 0 (  1.06s)
          2 collections in generation 1 (  0.01s)

         86 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    2.48s  (  2.58s elapsed)
  GC    time    1.08s  (  1.08s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    3.56s  (  3.65s elapsed)

  %GC time      30.3%  (29.5% elapsed)

  Alloc rate    68,007,748 bytes per MUT second

  Productivity  69.7% of total user, 67.9% of total elapsed

My hypothesis is that when we give each thread its own memory block, all the thread stacks occupy the same cache lines and we end up with a lot more cache misses (notice it's the MUT time that increased, not the GC time).

Test program attached, if anyone's interested in digging further.

Cheers,
        Simon
-- $Id: message-ghc-2.code,v 1.3 2005/09/17 04:36:26 bfulgham Exp $
-- The Great Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- Contributed by Einar Karttunen
-- Modified by Simon Marlow

-- This is the shootout "cheap concurrency" benchmark, modified
-- slightly.  Modification noted below (***) to add more concurrency
-- and make a speedup on multiple processors available.

-- Creates 500 threads arranged in a sequence where each takes a value
-- from the left, adds 1, and passes it to the right (via MVars).
-- N more threads pump zeros in at the left.  A sub-thread
-- takes N values from the right and sums them.
-- 

import Control.Concurrent
import Control.Monad
import System

thread :: MVar Int -> MVar Int -> IO ()
thread inp out = do x <- takeMVar inp; putMVar out $! x+1; thread inp out

spawn cur _ = do next <- newEmptyMVar
                 forkIO $ thread cur next
                 return next

main = do n <- getArgs >>= readIO.head
          s <- newEmptyMVar
          e <- foldM spawn s [1..500]
          f <- newEmptyMVar
          forkIO $ replicateM n (takeMVar e) >>= putMVar f . sum
          replicateM n (forkIO $ putMVar s 0)
-- ***    replicateM n (putMVar s 0)
          takeMVar f

-- vim: ts=4 ft=haskell
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Reply via email to