Re: [Haskell-cafe] A problem with par and modules boundaries...

2009-05-23 Thread Duncan Coutts
On Fri, 2009-05-22 at 05:30 -0700, Don Stewart wrote:
 Answer recorded at:
 
 http://haskell.org/haskellwiki/Performance/Parallel

I have to complain, this answer doesn't explain anything. This isn't
like straight-line performance, there's no reason as far as I can see
that inlining should change the operational behaviour of parallel
evaluation, unless there's some mistake in the original such as
accidentally relying on an unspecified evaluation order.

Now, I tried the example using two versions of ghc and I get different
behaviour from what other people are seeing. With the original code (i.e.
the parallelize function in the same module), with ghc-6.10.1 I get no
speedup at all from -N2, and with 6.11 I get a very good speedup (though
single-threaded performance is slightly lower in 6.11).

Original code
  ghc-6.10.1    -N1         -N2
  real          0m9.435s    0m9.328s
  user          0m9.369s    0m9.249s

  ghc-6.11      -N1         -N2
  real          0m10.262s   0m6.117s
  user          0m10.161s   0m11.093s

With the parallelize function moved into another module I get no change
whatsoever. Indeed even when I force it *not* to be inlined with {-#
NOINLINE parallelize #-} then I still get no change in behaviour (as
indeed I expected).
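
For illustration, here is a minimal sketch of the kind of setup being tested. The definition of parallelize is a hypothetical stand-in (the original is not shown in this message), and fib and fibPair are made-up names used only to give it some work. par and pseq are imported from GHC.Conc so that only base is needed; Control.Parallel in the parallel package is the more usual import and exports the same two functions.

```haskell
-- Hypothetical stand-in for the thread's `parallelize`; the original
-- definition is not shown in this message.
import GHC.Conc (par, pseq)

-- Spark `a`, evaluate `b` in the current thread, then return both.
parallelize :: a -> b -> (a, b)
parallelize a b = a `par` (b `pseq` (a, b))
-- The pragma tells GHC not to inline this function at its call sites.
{-# NOINLINE parallelize #-}

-- Deliberately slow Fibonacci, used only to have some work to do.
fib :: Int -> Integer
fib n
  | n < 2     = toInteger n
  | otherwise = fib (n - 1) + fib (n - 2)

-- Two independent computations handed to `parallelize`.
fibPair :: Int -> Int -> (Integer, Integer)
fibPair m n = parallelize (fib m) (fib n)
```

Swapping NOINLINE for INLINE, or moving parallelize into its own module, changes what the optimiser may do with the call, but not the value that fibPair returns.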

So I view this advice to force inlining with great suspicion (at worst
it encourages people not to think and to look at it as magic). That
said, why it does not get any speedup with ghc-6.10 is also a mystery to
me (there's very little GC going on).

Don: can we change the advice on the wiki please? It currently makes it
look like a known and understood issue. If anything we should suggest
using a later ghc version.

Duncan

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: [Haskell-cafe] A problem with par and modules boundaries...

2009-05-23 Thread Duncan Coutts
On Fri, 2009-05-22 at 16:34 +0200, Daniel Fischer wrote:

  That's great, thank you. I am still baffled, though.

I'm baffled too! I don't see the same behaviour at all (see the other
email).

  Must every exported function that uses `par' be INLINEd? Does every
  exported caller of such a function need the same treatment?

It really should not be necessary.

  Is `par' really a macro, rather than a function?

It's a function.
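
As a small sketch of what that means in practice: par has the type a -> b -> b, so it can be passed to a higher-order function like any other value. The helper name sparkAll below is hypothetical, and par is imported from GHC.Conc (Control.Parallel exports the same function).

```haskell
import GHC.Conc (par)

-- Spark every element of the list, then return `result` unchanged.
-- `par` is passed to `foldr` like any other two-argument function:
--   sparkAll [x1, x2] r  ==  x1 `par` (x2 `par` r)
sparkAll :: [a] -> b -> b
sparkAll xs result = foldr par result xs
```

Whatever the runtime does with the sparks, sparkAll xs r denotes the same value as r.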

 As far as I understand, par doesn't guarantee that both arguments are
 evaluated in parallel, it's just a suggestion to the compiler, and if
 whatever heuristics the compiler uses say it may be favourable to do
 it in parallel, it will produce code to calculate it in parallel
 (given appropriate compile- and run-time flags), otherwise it produces
 purely sequential code.
 
 With parallelize in a separate module, when compiling that, the
 compiler has no way to see whether parallelizing the computation may
 be beneficial, so doesn't produce (potentially) parallel code. At the
 use site, in the other module, it doesn't see the 'par', so has no 
 reason to even consider producing parallel code.

I don't think this is right. As I understand it, par always creates a
spark. It has nothing to do with heuristics.

Whether the spark actually gets evaluated in parallel depends on the
runtime system and whether the spark fizzles before it gets a chance
to run. Of course when using the single threaded rts then the sparks are
never evaluated in parallel. With the threaded rts and given enough
CPUs, the rts will try to schedule the sparks onto idle CPUs. This
business of getting sparks running on other CPUs has improved
significantly since ghc-6.10. The current development version uses a
better concurrent queue data structure to manage the spark pool. That's
probably the underlying reason for why the example works well in
ghc-6.11 but works badly in 6.10. I'm afraid I'm not sure of what
exactly is going wrong that means it doesn't work well in 6.10.
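
To make the spark model concrete, here is a minimal sketch with hypothetical names (fib, parFibSum, seqFibSum). It assumes nothing beyond par and pseq, imported here from GHC.Conc; Control.Parallel exports the same functions. The parallel and sequential versions always compute the same value; the only question is whether the runtime picks up the spark before the main thread needs it.

```haskell
import GHC.Conc (par, pseq)

-- Deliberately slow Fibonacci, used only to have some work to do.
fib :: Int -> Integer
fib n
  | n < 2     = toInteger n
  | otherwise = fib (n - 1) + fib (n - 2)

-- `par` records a spark for x. `pseq` then evaluates y in the current
-- thread before x + y is computed, which gives an idle CPU a chance to
-- evaluate x in the meantime. If nothing picks the spark up, x is simply
-- evaluated by the current thread when the sum demands it.
parFibSum :: Int -> Int -> Integer
parFibSum m n = x `par` (y `pseq` (x + y))
  where
    x = fib m
    y = fib n

-- Purely sequential reference version.
seqFibSum :: Int -> Int -> Integer
seqFibSum m n = fib m + fib n
```

To observe the runtime's side of this, one could call parFibSum from a main function, build with ghc -O2 -threaded (newer GHC versions also need -rtsopts) and run the program with +RTS -N2 -s. Recent GHC versions include a spark summary in the -s output.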

Generally I'd expect the effect of par to be pretty insensitive to
inlining. I'm cc'ing the ghc users list so perhaps we'll get some expert
commentary.

Duncan



Re: [Haskell-cafe] A problem with par and modules boundaries...

2009-05-23 Thread Daniel Fischer
Am Samstag 23 Mai 2009 13:06:04 schrieb Duncan Coutts:
 On Fri, 2009-05-22 at 16:34 +0200, Daniel Fischer wrote:
   That's great, thank you. I am still baffled, though.

 I'm baffled too! I don't see the same behaviour at all (see the other
 email).

   Must every exported function that uses `par' be INLINEd? Does every
   exported caller of such a function need the same treatment?

 It really should not be necessary.

   Is `par' really a macro, rather than a function?

 It's a function.

  As far as I understand, par doesn't guarantee that both arguments are
  evaluated in parallel, it's just a suggestion to the compiler, and if
  whatever heuristics the compiler uses say it may be favourable to do
  it in parallel, it will produce code to calculate it in parallel
  (given appropriate compile- and run-time flags), otherwise it produces
  purely sequential code.
 
  With parallelize in a separate module, when compiling that, the
  compiler has no way to see whether parallelizing the computation may
  be beneficial, so doesn't produce (potentially) parallel code. At the
  use site, in the other module, it doesn't see the 'par', so has no
  reason to even consider producing parallel code.

 I don't think this is right. As I understand it, par always creates a
 spark. It has nothing to do with heuristics.

Quite possible.
I was only guessing from the fact that sometimes par evaluates things in
parallel and sometimes not. When thinking about what might cause the described
behaviour, cross-module inlining came to mind; I tried adding an INLINE pragma
and it worked - or so it seemed. Then I threw together an explanation of the
observed behaviour. That explanation must be wrong, though, see below.


 Whether the spark actually gets evaluated in parallel depends on the
 runtime system and whether the spark fizzles before it gets a chance
 to run. Of course when using the single threaded rts then the sparks are
 never evaluated in parallel. With the threaded rts and given enough
 CPUs, the rts will try to schedule the sparks onto idle CPUs. This
 business of getting sparks running on other CPUs has improved
 significantly since ghc-6.10. The current development version uses a
 better concurrent queue data structure to manage the spark pool. That's
 probably the underlying reason for why the example works well in
 ghc-6.11 but works badly in 6.10. I'm afraid I'm not sure of what
 exactly is going wrong that means it doesn't work well in 6.10.

I have tried with 6.10.3 and 6.10.1, with parallelize in the same module and
in a separate module
- with no pragma
- with an INLINE pragma
- with a NOINLINE pragma

6.10.1 did not parallelize in any of these settings
6.10.3 parallelized in all these settings except separate module, no pragma.

Then I tried a few other settings with 6.10.3. I got parallel evaluation if
there's an INLINE or a NOINLINE pragma on parallelize, or if the module header
of Main is
module Main (main) where,
but not if Main exports all top level definitions and parallelize is neither
INLINEd nor NOINLINEd.

Weird.


 Generally I'd expect the effect of par to be pretty insensitive to
 inlining. I'm cc'ing the ghc users list so perhaps we'll get some expert
 commentary.

That would be good.


 Duncan


Daniel



Re: [Haskell-cafe] A problem with par and modules boundaries...

2009-05-23 Thread Don Stewart
duncan.coutts:
 On Fri, 2009-05-22 at 05:30 -0700, Don Stewart wrote:
  Answer recorded at:
  
  http://haskell.org/haskellwiki/Performance/Parallel
 
 I have to complain, this answer doesn't explain anything. This isn't
 like straight-line performance, there's no reason as far as I can see
 that inlining should change the operational behaviour of parallel
 evaluation, unless there's some mistake in the original such as
 accidentally relying on an unspecified evaluation order.
 
 Now, I tried the example using two versions of ghc and I get different
 behaviour from what other people are seeing. With the original code (i.e.
 the parallelize function in the same module), with ghc-6.10.1 I get no
 speedup at all from -N2, and with 6.11 I get a very good speedup (though
 single-threaded performance is slightly lower in 6.11).
 
 Original code
   ghc-6.10.1    -N1         -N2
   real          0m9.435s    0m9.328s
   user          0m9.369s    0m9.249s

   ghc-6.11      -N1         -N2
   real          0m10.262s   0m6.117s
   user          0m10.161s   0m11.093s
 
 With the parallelize function moved into another module I get no change
 whatsoever. Indeed even when I force it *not* to be inlined with {-#
 NOINLINE parallelize #-} then I still get no change in behaviour (as
 indeed I expected).
 
 So I view this advice to force inlining with great suspicion (at worst
 it encourages people not to think and to look at it as magic). That
 said, why it does not get any speedup with ghc-6.10 is also a mystery to
 me (there's very little GC going on).
 
 Don: can we change the advice on the wiki please? It currently makes it
 look like a known and understood issue. If anything we should suggest
 using a later ghc version.

Please do so. Especially if GHC HEAD *does the right thing*. Then the
advice should be first: upgrade to GHC HEAD.