Re: ghci and ghc -threaded [slowdown]

2008-12-15 Thread Malcolm Wallace
 It seems that the problem you have is that moving to the multithreaded
 runtime imposes an overhead on the communication between your two
 threads, when run on a *single CPU*.  But performance on a single CPU
 is not what you're interested in - you said you wanted parallelism,
 and for that you need multiple CPUs, and hence multiple OS threads.

Well, I'm interested in getting an absolute speedup.  If the threaded
performance on a single core is slightly slower than the non-threaded
performance on a single core, that would be OK provided that the
threaded performance using multiple cores was better than the same
non-threaded baseline.

However, it doesn't seem to work like that at all.  In fact, threaded on
multiple cores was _even_slower_ than threaded on a single core!

Here are some figures:

ghc-6.8.2 -O2
            apply    MVar  strict  thr-N2  thr-N1
silicium     7.30    7.95    7.23   15.25   14.71
neghip       4.25    4.43    4.18    6.67    6.48
hydrogen    11.75   10.82   10.99   13.45   12.96
lobster      55.8    51.5    57.6    76.6    74.5

The first three columns are variations of the program using slightly
different communications mechanisms, including threads/MVars with the
non-threaded RTS.  The final two columns are for the MVar mechanism
with threaded RTS and either 1 or 2 cores.  -N2 is slowest.

 I suspect the underlying problem in your program is that the
 communication is synchronous.  To get good parallelism you'll need to
 use asynchronous communication, otherwise even on multiple CPUs
 you'll see little parallelism.

I tried using Chans instead of MVars, to provide for different speeds of
reader/writer, but the timings were even worse.  (Add another 15-100%.)

When I have time to look at this again (probably in the New Year), I
will try some other strategies for communication that vary in their
synchronous/asynchronous chunk size, to see if I can pin things down
more closely.

Regards,
Malcolm
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: ghci and ghc -threaded [slowdown]

2008-12-15 Thread Simon Marlow

Malcolm Wallace wrote:

  It seems that the problem you have is that moving to the multithreaded
  runtime imposes an overhead on the communication between your two
  threads, when run on a *single CPU*.  But performance on a single CPU
  is not what you're interested in - you said you wanted parallelism,
  and for that you need multiple CPUs, and hence multiple OS threads.


 Well, I'm interested in getting an absolute speedup.  If the threaded
 performance on a single core is slightly slower than the non-threaded
 performance on a single core, that would be OK provided that the
 threaded performance using multiple cores was better than the same
 non-threaded baseline.

 However, it doesn't seem to work like that at all.  In fact, threaded on
 multiple cores was _even_slower_ than threaded on a single core!


Entirely possible - unless there's some actual parallelism, running on 
multiple cores will probably slow things down due to thread migration.



 Here are some figures:

 ghc-6.8.2 -O2
             apply    MVar  strict  thr-N2  thr-N1
 silicium     7.30    7.95    7.23   15.25   14.71
 neghip       4.25    4.43    4.18    6.67    6.48
 hydrogen    11.75   10.82   10.99   13.45   12.96
 lobster      55.8    51.5    57.6    76.6    74.5

 The first three columns are variations of the program using slightly
 different communications mechanisms, including threads/MVars with the
 non-threaded RTS.  The final two columns are for the MVar mechanism
 with threaded RTS and either 1 or 2 cores.  -N2 is slowest.


So you're not getting any parallelism at all; for some reason your program 
is sequentialised.  There could be any number of reasons for this.



  I suspect the underlying problem in your program is that the
  communication is synchronous.  To get good parallelism you'll need to
  use asynchronous communication, otherwise even on multiple CPUs
  you'll see little parallelism.


 I tried using Chans instead of MVars, to provide for different speeds of
 reader/writer, but the timings were even worse.  (Add another 15-100%.)


That would seem to indicate that your program is doing a lot of 
communication - I'd look at trying to reduce that, for example by increasing 
the task size.  However, the amount of communication is obviously not the 
only issue; there also seems to be some kind of dependency that 
sequentialises the program.
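
A minimal sketch of what "increasing the task size" could look like, assuming
the worker produces a stream of small results.  None of this is taken from the
program under discussion: batchesOf, work, producer, consume and runBatched are
hypothetical names, and work is a trivial stand-in for the real computation.
The idea is to send one message per batch rather than one per item, so the
per-message cost is paid less often.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Exception (evaluate)

-- Split a list into batches of at most n elements (n is assumed positive).
batchesOf :: Int -> [a] -> [[a]]
batchesOf _ [] = []
batchesOf n xs = let (b, rest) = splitAt n xs in b : batchesOf n rest

-- Hypothetical stand-in for the real per-item computation.
work :: Int -> Int
work x = x * x

-- Worker: compute each batch, send it as one message, then send Nothing
-- as an end-of-stream marker.
producer :: Chan (Maybe [Int]) -> Int -> [Int] -> IO ()
producer ch n inputs = do
  mapM_ send (batchesOf n inputs)
  writeChan ch Nothing
  where
    send batch = do
      let results = map work batch
      -- Summing forces every element, so the work happens in this thread.
      _ <- evaluate (sum results)
      writeChan ch (Just results)

-- Consumer: collect batches until the end-of-stream marker arrives.
consume :: Chan (Maybe [Int]) -> IO [Int]
consume ch = do
  msg <- readChan ch
  case msg of
    Nothing -> pure []
    Just b  -> (b ++) <$> consume ch

-- Fork the worker and gather its results in the calling thread.
runBatched :: Int -> [Int] -> IO [Int]
runBatched n inputs = do
  ch <- newChan
  _ <- forkIO (producer ch n inputs)
  consume ch
```

Timing runBatched with different batch sizes would be one way to see how much
of the cost is per-message overhead, although the right size depends on the
program.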


Are you sure that you're not accidentally communicating thunks, and hence 
doing all the computation in one of the threads?  That's a common pitfall 
that has caught me more than once.
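
A minimal sketch of this pitfall, not taken from the program under discussion:
expensive, lazyWorker, strictWorker and runWorker are hypothetical names.  In
the lazy version the worker hands over an unevaluated thunk, so whichever
thread later demands the value does the computation; in the strict version
evaluate forces it in the worker first.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)

-- Hypothetical stand-in for the expensive computation.
expensive :: Int -> Int
expensive n = sum [1 .. n]

-- Lazy version: only a thunk is put into the MVar, so the thread that
-- takes it and uses the result ends up doing the real work.
lazyWorker :: MVar Int -> Int -> IO ()
lazyWorker box n = putMVar box (expensive n)

-- Strict version: evaluate forces the result to weak head normal form
-- in the worker thread before it is handed over.
strictWorker :: MVar Int -> Int -> IO ()
strictWorker box n = do
  r <- evaluate (expensive n)
  putMVar box r

-- Run a worker in its own Haskell thread and wait for its result.
runWorker :: (MVar Int -> Int -> IO ()) -> Int -> IO Int
runWorker worker n = do
  box <- newEmptyMVar
  _ <- forkIO (worker box n)
  takeMVar box
```

Both versions return the same value; they differ only in which thread computes
it.  For an Int, weak head normal form is the whole value.  For a structured
result it is only the outermost constructor, so a deeper forcing step (for
example force from Control.DeepSeq in the deepseq package) would be needed.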


Do you know roughly the amount of parallelism you expect - i.e. the amount 
of work done by each thread?



 When I have time to look at this again (probably in the New Year), I
 will try some other strategies for communication that vary in their
 synchronous/asynchronous chunk size, to see if I can pin things down
 more closely.


That would be good.  At some point we hope to provide some kind of 
visualisation to let you see where the parallel performance bottlenecks in 
your program are; there are various ongoing efforts but nothing usable as yet.


Cheers,
Simon


Re: ghci and ghc -threaded [slowdown]

2008-12-15 Thread Simon Marlow

Malcolm Wallace wrote:

 Simon Marlow marlo...@gmail.com wrote:

  Malcolm Wallace wrote:

   For the only application I tried, using the threaded RTS imposes a
   100% performance penalty - i.e. computation time doubles, compared
   to the non-threaded RTS.  This was with ghc-6.8.2, and maybe the
   overhead has improved since then?

  This is a guess, but I wonder if this program is concurrent, and does
  a lot of communication between the main thread and other threads?

 Exactly so - it hits the worst case behaviour.  This was a naive attempt
 to parallelise an algorithm by shifting some work onto a spare
 processor.  Unfortunately, there is a lot of communication to the main
 thread, because the work that was shifted elsewhere computes a large
 data structure in chunks, and passes those chunks back.  The main thread
 then runs OpenGL calls using this data -- and I believe OpenGL calls must
 run in a bound thread.

 This all suggests that one consequence of ghc's RTS implementation
 choices is that it will never be cheap to compute visualization data in
 parallel with rendering it in OpenGL.  That would be a shame.  This was
 exactly the parallelism I was hoping for.


I'm not sure how we could do any better here.  To get parallelism you need 
to run the OpenGL thread and the worker thread on separate OS threads, 
which we do.  So what aspect of the RTS design is preventing you from 
getting the parallelism you want?
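
A rough sketch of that arrangement, under stated assumptions: buildChunk and
render are hypothetical stand-ins (render makes no real OpenGL calls), and
computeAndRender is an illustrative name.  The worker is an ordinary forkIO
thread, while the rendering side runs in a bound thread when the RTS supports
them.

```haskell
import Control.Concurrent
  ( forkIO, isCurrentThreadBound, rtsSupportsBoundThreads, runInBoundThread )
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)

-- Hypothetical stand-in for building one chunk of visualisation data.
buildChunk :: Int -> [Int]
buildChunk k = [k * 10 .. k * 10 + 9]

-- Hypothetical stand-in for the OpenGL calls: it sums the chunk and
-- reports whether the calling thread is bound.
render :: [Int] -> IO (Int, Bool)
render chunk = do
  bound <- isCurrentThreadBound
  pure (sum chunk, bound)

-- The worker is an ordinary (unbound) Haskell thread; the renderer runs
-- in a bound thread when the RTS supports bound threads.
computeAndRender :: Int -> IO (Int, Bool)
computeAndRender k = do
  box <- newEmptyMVar
  _ <- forkIO $ do
    let chunk = buildChunk k
    _ <- evaluate (sum chunk)  -- force the chunk in the worker thread
    putMVar box chunk
  let renderer = takeMVar box >>= render
  if rtsSupportsBoundThreads
    then runInBoundThread renderer
    else renderer
```

For the two threads to actually run on separate cores, the program has to be
linked with -threaded and run with something like +RTS -N2.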


It seems that the problem you have is that moving to the multithreaded 
runtime imposes an overhead on the communication between your two threads, 
when run on a *single CPU*.  But performance on a single CPU is not what 
you're interested in - you said you wanted parallelism, and for that you 
need multiple CPUs, and hence multiple OS threads.


I suspect the underlying problem in your program is that the communication 
is synchronous.  To get good parallelism you'll need to use asynchronous 
communication, otherwise even on multiple CPUs you'll see little 
parallelism.  If you still do asynchronous communication and yet don't get 
good parallelism, then we should look into what's causing that.


Cheers,
Simon



Re: ghci and ghc -threaded [slowdown]

2008-12-12 Thread Malcolm Wallace
Simon Marlow marlo...@gmail.com wrote:

 Malcolm Wallace wrote:
  
  For the only application I tried, using the threaded RTS imposes a
  100% performance penalty - i.e. computation time doubles, compared
  to the non-threaded RTS.  This was with ghc-6.8.2, and maybe the
  overhead has improved since then?
 
 This is a guess, but I wonder if this program is concurrent, and does
 a lot of communication between the main thread and other threads?

Exactly so - it hits the worst case behaviour.  This was a naive attempt
to parallelise an algorithm by shifting some work onto a spare
processor.  Unfortunately, there is a lot of communication to the main
thread, because the work that was shifted elsewhere computes a large
data structure in chunks, and passes those chunks back.  The main thread
then runs OpenGL calls using this data -- and I believe OpenGL calls must
run in a bound thread.

This all suggests that one consequence of ghc's RTS implementation
choices is that it will never be cheap to compute visualization data in
parallel with rendering it in OpenGL.  That would be a shame.  This was
exactly the parallelism I was hoping for.

Regards,
Malcolm