Re: [Haskell-beginners] Performance of parallel mergesort

2009-12-30 Thread Simon Marlow

On 28/12/09 12:56, Yitzchak Gale wrote:


On Mon, Dec 28, 2009 at 5:19 AM, Jon Harrop wrote:

On Sunday 27 December 2009 20:56:51 Stephen Blackheath wrote:

Jon Harrop wrote:

This is something that concerns me. Lots of discussions of parallelism,
including the Haskell literature I have read, neglect this critical
problem
of making sure that more time is spent doing useful work than spawning
tasks
(sparks). How is this solved in Haskell? Do you just put magic numbers in
that work on the machine you're currently using?


It is simply not true that Haskell literature neglects the question of
spark granularity - this is very basic and is always covered. Read
"Real World Haskell" (available free online).  There's no 'magic
number'. You must explicitly write your code to give the right granule
size.


There is no "right granule" size. That's the whole point: the optimum is a
function of the machine. If you hardcode the granularity then your code isn't
future proof and isn't portable.


While that's true, in practice it's often not a problem.

Sometimes you can pick a granularity that is small enough to give you 
thousands of parallel tasks, each of which is plenty big enough to 
outweigh the overheads of task creation (which are very low in GHC 
anyway).  Scaling to thousands of cores is likely to be good enough for 
quite a while.


You can always pick a granularity based on the number of cores (GHC 
gives you access to that as GHC.Conc.numCapabilities).


Or, the program might have a natural granularity, which leaves you 
little room for parallelising it in a different way.



Moreover, their approach of subdividing a fixed number of times is suboptimal
because it inhibits load balancing.


If you create enough tasks, then load balancing works just fine.  GHC 
(in 6.12.1) uses work-stealing for load balancing, incidentally.
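
As a rough sketch of both points — picking the task count from 
numCapabilities and creating many more tasks than cores so the 
work-stealing scheduler can balance them — consider the following. 
The names (chunksOf, parSum) and the "* 4" oversubscription factor are 
made up for illustration, not taken from any library:

```haskell
import GHC.Conc (numCapabilities, par, pseq)
import Data.List (foldl')

-- Split a list into chunks of the given size.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

-- Sum a list by sparking one task per chunk.  The chunk count scales
-- with the core count, so there are always spare tasks to steal.
parSum :: [Int] -> Int
parSum xs = sparkAll `pseq` sum partials
  where
    size     = max 1 (length xs `div` (numCapabilities * 4))
    partials = map (foldl' (+) 0) (chunksOf size xs)
    sparkAll = foldr par () partials   -- create one spark per partial sum

main :: IO ()
main = print (parSum [1 .. 100000])
```

Compile with -threaded and run with +RTS -N to get actual parallelism; 
without it the sparks just fizzle and the program still computes the 
same answer sequentially.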



Later, about parallelized IO, they give the code:

  chunkedReadWith :: (NFData a) =>
                     ([LB.ByteString] -> a) -> FilePath -> IO a
  chunkedReadWith func path =
      withChunks (lineChunks (numCapabilities * 4)) func path

where "4" is one magic number that gets multiplied by the magic number the
user supplied via the +RTS -N  command-line option.

They make no attempt to adapt the granularity to the machine at all and rely
entirely upon magic numbers. Consequently, their parallel sort that got a 25%
speedup on two cores achieves a 30% slowdown on my 8 core.


I'd recommend trying again with 6.12.1.  You might also want to 
experiment with the parallel GC settings - the default settings aren't 
perfect for every program.



I don't know the exact cost of sparking, but in my experience it
is quite small - so - as long as your granule is doing *some* real work,
it should speed up.


Can you quantify it, e.g. How many FLOPS?


You can't measure time in FLOPS.  But my guess is that a spark could be 
as low as 20 cycles if everything is in the L1 cache and the branches 
are predicted correctly (there are 2 function calls, 2 branches, and 
about 3 memory references; we could inline to remove the function 
calls).  It's just a push onto a lock-free work-stealing queue, and 
involves no atomic instructions.



2. The defaults generally work better than giving huge heap sizes.  Your
-K1 - maximum heap size per thread - will either do nothing or
cause an artificial slowdown (I have observed this with the minimum heap
size option).  Don't use it, unless experimentation proves it makes
things better.


Yes, this is because cache use and locality become more important with 
multicore execution.  The default GC settings use a 512KB young 
generation and we use locality-optimised parallel GC in 6.12.1.



On the contrary, the Haskell program dies without it:

$ time ./mergesort +RTS -N8
Stack space overflow: current size 8388608 bytes.


Don't confuse stack size (-K) with heap size (-H or -M).  The stack size 
is just a limit and shouldn't have any effect on performance; the stack 
itself grows dynamically.



It says 78.5% GC time (with GHC 6.10).


A clear sign that there's a problem.  I haven't tried parallelising 
merge sort recently, but it has been a tricky one to parallelise with 
GHC in the past due to the amount of GC going on.
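
For reference, a depth-limited parallel merge sort in the style 
discussed in this thread might look like the sketch below.  Deriving 
the cut-off depth from numCapabilities is an assumption for 
illustration, not a tuned recommendation; forcing each half's spine 
before merging is one way to keep the work inside the spark rather 
than deferring it to the consumer:

```haskell
import GHC.Conc (numCapabilities, par, pseq)

-- Merge two sorted lists.
merge :: Ord a => [a] -> [a] -> [a]
merge [] ys = ys
merge xs [] = xs
merge xxs@(x:xs) yys@(y:ys)
  | x <= y    = x : merge xs yys
  | otherwise = y : merge xxs ys

-- Plain sequential merge sort, used below the cut-off depth.
msort :: Ord a => [a] -> [a]
msort []  = []
msort [x] = [x]
msort xs  = merge (msort as) (msort bs)
  where (as, bs) = splitAt (length xs `div` 2) xs

-- Spark the two halves until the depth bound runs out, forcing each
-- half so the sorting happens in the spark, not during the merge.
parMsort :: Ord a => Int -> [a] -> [a]
parMsort d xs
  | d <= 0 || length xs < 2 = msort xs
  | otherwise =
      forceList left `par` (forceList right `pseq` merge left right)
  where
    (as, bs)  = splitAt (length xs `div` 2) xs
    left      = parMsort (d - 1) as
    right     = parMsort (d - 1) bs
    forceList = foldr seq ()   -- evaluate spine and elements

main :: IO ()
main = print (parMsort depth [9, 1, 8, 2, 7, 3, 6, 4, 5])
  where depth = numCapabilities   -- a guess; tune per machine
```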



5. There seems to be a scalability problem with the parallel gc for
larger numbers of cores (namely 8).  I am guessing somewhat, but my
experiments tend to confirm the issue raised in Simon Marlow's (the
implementor of GHC parallelization) recent paper that it's to do with
"stopping the world" for gc.


Do you mean this bug:

  http://hackage.haskell.org/trac/ghc/ticket/3553


Probably, yes.  I'm working on independent young-generation collection 
at the moment, which will fix that bug properly.  For now, make sure your 
cores are all available (nice and +RTS -qa are useful here); on 8-core 
machines I usually drop back to -N7.



If GHC's lack of perfection at this point in time makes Haskell "look
bad" I don't mind.  I am not 

Re: [Haskell-beginners] Performance of parallel mergesort

2009-12-29 Thread Simon Marlow

On 28/12/09 19:01, Jon Harrop wrote:

On Monday 28 December 2009 12:56:17 Yitzchak Gale wrote:

This discussion definitely does not belong on the
Haskell-Beginners list. Besides not being a topic
for beginners, being there is keeping it under the
radar of many or most of the people who work on
these things in Haskell.

I am moving the discussion to the GHC users list,
as suggested by Antoine. For those joining us in
progress, the first part of the thread starts here:

http://www.haskell.org/pipermail/beginners/2009-December/003045.html


I am not familiar with the GHC users list but is this discussion about basic
parallelism really more relevant here? The interesting bits aren't GHC
specific...


Discussion about support for parallelism in GHC is very on-topic here. 
Parallelism is not part of the Haskell language per se, and although the 
basic par/pseq are not implementation-specific, using them effectively 
may require knowing something about the implementation.  For instance, 
in GHC 6.12 we reduced some overheads and made it possible to 
parallelise some fine-grained problems that previously resulted 
in a slowdown on a multicore.
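
The basic combinators themselves fit in a few lines; a minimal 
example of the standard idiom (the naive fib is chosen arbitrarily, 
just to have some work to split):

```haskell
import GHC.Conc (par, pseq)

-- Deliberately naive Fibonacci, purely to generate work.
fib :: Int -> Integer
fib n | n < 2     = fromIntegral n
      | otherwise = fib (n - 1) + fib (n - 2)

main :: IO ()
main =
  let a = fib 24
      b = fib 23
  -- Spark 'a' on another capability, evaluate 'b' here, then combine.
  in a `par` (b `pseq` print (a + b))
```

Whether this actually runs in parallel - and whether the spark pays 
for itself - depends on the implementation details mentioned above: 
compile with ghc -threaded and run with +RTS -N, or the spark is 
simply evaluated on demand.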


(I'll respond to the particular issues raised in the earlier message 
later, I'm still on holiday right now)



Also, is this list censored like the Haskell Cafe or is it open?


Haskell lists are only ever moderated on an ad-hoc basis as necessary. 
We've never used moderation on this particular list.


Cheers,
Simon
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: [Haskell-beginners] Performance of parallel mergesort

2009-12-28 Thread Antoine Latter
On Mon, Dec 28, 2009 at 2:01 PM, Jon Harrop  wrote:
> On Monday 28 December 2009 12:56:17 Yitzchak Gale wrote:
>> This discussion definitely does not belong on the
>> Haskell-Beginners list. Besides not being a topic
>> for beginners, being there is keeping it under the
>> radar of many or most of the people who work on
>> these things in Haskell.
>>
>> I am moving the discussion to the GHC users list,
>> as suggested by Antoine. For those joining us in
>> progress, the first part of the thread starts here:
>>
>> http://www.haskell.org/pipermail/beginners/2009-December/003045.html
>
> I am not familiar with the GHC users list but is this discussion about basic
> parallelism really more relevant here? The interesting bits aren't GHC
> specific...

I don't think that any other implementation has support for the 'par'
operator, but I could see a case for either haskell-cafe or ghc-users
being a good spot for discussion. The beginners list is targeted at
raw recruits or students overcoming their first encounter with
Haskell.

>
> Also, is this list censored like the Haskell Cafe or is it open?
>

I've never had any messages removed or moderated in any of the
@haskell.org lists, but maybe I'm not saying the right things :-)

Antoine


Re: [Haskell-beginners] Performance of parallel mergesort

2009-12-28 Thread Jon Harrop
On Monday 28 December 2009 12:56:17 Yitzchak Gale wrote:
> This discussion definitely does not belong on the
> Haskell-Beginners list. Besides not being a topic
> for beginners, being there is keeping it under the
> radar of many or most of the people who work on
> these things in Haskell.
>
> I am moving the discussion to the GHC users list,
> as suggested by Antoine. For those joining us in
> progress, the first part of the thread starts here:
>
> http://www.haskell.org/pipermail/beginners/2009-December/003045.html

I am not familiar with the GHC users list but is this discussion about basic 
parallelism really more relevant here? The interesting bits aren't GHC 
specific...

Also, is this list censored like the Haskell Cafe or is it open?

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e


Re: [Haskell-beginners] Performance of parallel mergesort

2009-12-28 Thread Yitzchak Gale
This discussion definitely does not belong on the
Haskell-Beginners list. Besides not being a topic
for beginners, being there is keeping it under the
radar of many or most of the people who work on
these things in Haskell.

I am moving the discussion to the GHC users list,
as suggested by Antoine. For those joining us in
progress, the first part of the thread starts here:

http://www.haskell.org/pipermail/beginners/2009-December/003045.html

On Mon, Dec 28, 2009 at 5:19 AM, Jon Harrop wrote:
> On Sunday 27 December 2009 20:56:51 Stephen Blackheath wrote:
>> Jon Harrop wrote:
>> > This is something that concerns me. Lots of discussions of parallelism,
>> > including the Haskell literature I have read, neglect this critical
>> > problem
>> > of making sure that more time is spent doing useful work than spawning
>> > tasks
>> > (sparks). How is this solved in Haskell? Do you just put magic numbers in
>> > that work on the machine you're currently using?
>>
>> It is simply not true that Haskell literature neglects the question of
>> spark granularity - this is very basic and is always covered. Read
>> "Real World Haskell" (available free online).  There's no 'magic
>> number'. You must explicitly write your code to give the right granule
>> size.
>
> There is no "right granule" size. That's the whole point: the optimum is a
> function of the machine. If you hardcode the granularity then your code isn't
> future proof and isn't portable.
>
> From chapter 24 of Real World Haskell on sorting:
>
>  "At this fine granularity, the cost of using par outweighs any possible
> usefulness. To reduce this effect, we switch to our non-parallel sort after
> passing some threshold."
>
> From the Sorting.hs file, parSort2 accepts a threshold "d":
>
>  parSort2 :: (Ord a) => Int -> [a] -> [a]
>  parSort2 d list@(x:xs)
>    | d <= 0     = sort list
>    | otherwise = force greater `par` (force lesser `pseq`
>                                      (lesser ++ x:greater))
>        where lesser      = parSort2 d' [y | y <- xs, y <  x]
>              greater     = parSort2 d' [y | y <- xs, y >= x]
>              d' = d - 1
>  parSort2 _ _              = []
>
> From the SortMain.hs file, it is always invoked with the magic number "2":
>
>  testFunction = parSort2 2
>
> Moreover, their approach of subdividing a fixed number of times is suboptimal
> because it inhibits load balancing.
>
> Later, about parallelized IO, they give the code:
>
>  chunkedReadWith :: (NFData a) =>
>                    ([LB.ByteString] -> a) -> FilePath -> IO a
>  chunkedReadWith func path =
>      withChunks (lineChunks (numCapabilities * 4)) func path
>
> where "4" is one magic number that gets multiplied by the magic number the
> user supplied via the +RTS -N command-line option.
>
> They make no attempt to adapt the granularity to the machine at all and rely
> entirely upon magic numbers. Consequently, their parallel sort that got a 25%
> speedup on two cores achieves a 30% slowdown on my 8 core.
>
>> I don't know the exact cost of sparking, but in my experience it
>> is quite small - so - as long as your granule is doing *some* real work,
>> it should speed up.
>
> Can you quantify it, e.g. How many FLOPS?
>
>> Jon Harrop wrote:
>> > The parallelism is obviously not obtaining any real speedup so something
>> > is wrong. But can anyone fix it?
>>
>> I've spent a bit of time measuring parallel speedup on real commercial
>> projects, and this is what I've found:
>>
>> 1. ghc-6.12 performs significantly better than ghc-6.10, and has now
>> been released, therefore don't use ghc-6.10.
>
> Ok.
>
>> 2. The defaults generally work better than giving huge heap sizes.  Your
>> -K1 - maximum heap size per thread - will either do nothing or
>> cause an artificial slowdown (I have observed this with the minimum heap
>> size option).  Don't use it, unless experimentation proves it makes
>> things better.
>
> On the contrary, the Haskell program dies without it:
>
> $ time ./mergesort +RTS -N8
> Stack space overflow: current size 8388608 bytes.
> Use `+RTS -Ksize' to increase it.
> [1]+  Done                    kwrite mergesort.hs
>
> real    0m33.320s
> user    3m29.397s
> sys     0m0.592s
>
> I had to add that -K command line option just to get the program to run to
> completion.
>
>> 3. +RTS -s is well worth using. It breaks the time down into MUT
>> (mutator) and GC (garbage collector).
>
> It says 78.5% GC time (with GHC 6.10).
>
>> 4. MUT parallelization is excellent, but the parallel GC is not so good.
>>  If your code is GC-heavy it can spend around half of its time garbage
>> collecting, which doesn't parallelize so well, and this eats into the
>> speedup.
>
> Ok.
>
>> 5. There seems to be a scalability problem with the parallel gc for
>> larger numbers of cores (namely 8).  I am guessing somewhat, but my
>> experiments tend to confirm the issue raised in Simon Marlow's (the
>> implementor of GHC parallelization) recent paper that it's to do with
>> "