Re: [Haskell-cafe] A problem with par and modules boundaries...
On Fri, 2009-05-22 at 05:30 -0700, Don Stewart wrote:
> Answer recorded at:
>
>   http://haskell.org/haskellwiki/Performance/Parallel

I have to complain, this answer doesn't explain anything. This isn't
like straight-line performance, there's no reason as far as I can see
that inlining should change the operational behaviour of parallel
evaluation, unless there's some mistake in the original such as
accidentally relying on an unspecified evaluation order.

Now, I tried the example using two versions of ghc and I get different
behaviour from what other people are seeing. With the original code
(i.e. parallelize function in the same module), with ghc-6.10.1 I get
no speedup at all from -N2 and with 6.11 I get a very good speedup
(though single-threaded performance is slightly lower in 6.11).

Original code

ghc-6.10.1    -N1          -N2
real          0m9.435s     0m9.328s
user          0m9.369s     0m9.249s

ghc-6.11      -N1          -N2
real          0m10.262s    0m6.117s
user          0m10.161s    0m11.093s

With the parallelize function moved into another module I get no
change whatsoever. Indeed even when I force it *not* to be inlined with
{-# NOINLINE parallelize #-} then I still get no change in behaviour
(as indeed I expected).

So I view this advice to force inlining with great suspicion (at worst
it encourages people not to think and to look at it as magic). That
said, why it does not get any speedup with ghc-6.10 is also a mystery
to me (there's very little GC going on).

Don: can we change the advice on the wiki please? It currently makes it
look like a known and understood issue. If anything we should suggest
using a later ghc version.

Duncan

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Re: [Haskell-cafe] A problem with par and modules boundaries...
On Fri, 2009-05-22 at 16:34 +0200, Daniel Fischer wrote:
> > That's great, thank you. I am still baffled, though.

I'm baffled too! I don't see the same behaviour at all (see the other
email).

> > Must every exported function that uses `par' be INLINEd? Does every
> > exported caller of such a function need the same treatment?

It really should not be necessary.

> > Is `par' really a macro, rather than a function?

It's a function.

> As far as I understand, par doesn't guarantee that both arguments are
> evaluated in parallel, it's just a suggestion to the compiler, and if
> whatever heuristics the compiler uses say it may be favourable to do
> it in parallel, it will produce code to calculate it in parallel
> (given appropriate compile- and run-time flags), otherwise it
> produces purely sequential code.
>
> With parallelize in a separate module, when compiling that, the
> compiler has no way to see whether parallelizing the computation may
> be beneficial, so doesn't produce (potentially) parallel code. At the
> use site, in the other module, it doesn't see the 'par', so has no
> reason to even consider producing parallel code.

I don't think this is right. As I understand it, par always creates a
spark. It has nothing to do with heuristics.

Whether the spark actually gets evaluated in parallel depends on the
runtime system and whether the spark fizzles before it gets a chance
to run. Of course when using the single threaded rts then the sparks
are never evaluated in parallel. With the threaded rts and given
enough CPUs, the rts will try to schedule the sparks onto idle CPUs.

This business of getting sparks running on other CPUs has improved
significantly since ghc-6.10. The current development version uses a
better concurrent queue data structure to manage the spark pool. That's
probably the underlying reason for why the example works well in
ghc-6.11 but works badly in 6.10. I'm afraid I'm not sure of what
exactly is going wrong that means it doesn't work well in 6.10.

Generally I'd expect the effect of par to be pretty insensitive to
inlining. I'm cc'ing the ghc users list so perhaps we'll get some
expert commentary.

Duncan
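To make the spark behaviour described above concrete, here is a
minimal sketch. The thread never shows the body of parallelize, so the
definition below is only an assumption about its general shape, and
`work` and `bothSums` are hypothetical names invented for
illustration. par and pseq are imported from GHC.Conc, which ships
with base; Control.Parallel from the parallel package exports the same
two functions.

```haskell
import GHC.Conc (par, pseq)

-- Assumed shape of parallelize (the thread does not show its body):
-- record a spark for `a`, evaluate `b` on the current thread,
-- then return both values as a pair.
parallelize :: a -> b -> (a, b)
parallelize a b = a `par` (b `pseq` (a, b))

-- Hypothetical CPU-bound workload: the sum of squares from 1 to n.
work :: Int -> Integer
work n = sum [fromIntegral i * fromIntegral i | i <- [1 .. n]]

-- par only creates the spark for `work m`. Whether another core picks
-- it up before the current thread needs it is decided by the runtime
-- system, not by the compiler. Either way the result is the same.
bothSums :: Int -> Int -> Integer
bothSums m n =
  let (x, y) = parallelize (work m) (work n)
  in x + y
```

If bothSums is called from a main of your own, built with -threaded
and run with +RTS -N2 (recent GHC releases may also need -rtsopts at
link time), the spark can be scheduled onto a second core. With -N1 or
the non-threaded rts the same value is computed sequentially.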
Re: [Haskell-cafe] A problem with par and modules boundaries...
On Saturday, 23 May 2009 13:06:04, Duncan Coutts wrote:
> On Fri, 2009-05-22 at 16:34 +0200, Daniel Fischer wrote:
> > > That's great, thank you. I am still baffled, though.
>
> I'm baffled too! I don't see the same behaviour at all (see the other
> email).
>
> > > Must every exported function that uses `par' be INLINEd? Does
> > > every exported caller of such a function need the same treatment?
>
> It really should not be necessary.
>
> > > Is `par' really a macro, rather than a function?
>
> It's a function.
>
> > As far as I understand, par doesn't guarantee that both arguments
> > are evaluated in parallel, it's just a suggestion to the compiler,
> > and if whatever heuristics the compiler uses say it may be
> > favourable to do it in parallel, it will produce code to calculate
> > it in parallel (given appropriate compile- and run-time flags),
> > otherwise it produces purely sequential code.
> >
> > With parallelize in a separate module, when compiling that, the
> > compiler has no way to see whether parallelizing the computation
> > may be beneficial, so doesn't produce (potentially) parallel code.
> > At the use site, in the other module, it doesn't see the 'par', so
> > has no reason to even consider producing parallel code.
>
> I don't think this is right. As I understand it, par always creates a
> spark. It has nothing to do with heuristics.

Quite possible. I was only guessing from the fact that sometimes par
evaluates things in parallel and sometimes not. When I thought about
what might cause the described behaviour, cross-module inlining came
to mind; I tried adding an INLINE pragma and it worked, or so it
seemed. Then I threw together an explanation of the observed
behaviour. That explanation must be wrong, though, see below.

> Whether the spark actually gets evaluated in parallel depends on the
> runtime system and whether the spark fizzles before it gets a chance
> to run. Of course when using the single threaded rts then the sparks
> are never evaluated in parallel. With the threaded rts and given
> enough CPUs, the rts will try to schedule the sparks onto idle CPUs.
>
> This business of getting sparks running on other CPUs has improved
> significantly since ghc-6.10. The current development version uses a
> better concurrent queue data structure to manage the spark pool.
> That's probably the underlying reason for why the example works well
> in ghc-6.11 but works badly in 6.10. I'm afraid I'm not sure of what
> exactly is going wrong that means it doesn't work well in 6.10.

I have tried with 6.10.3 and 6.10.1, with parallelize in the same
module and in a separate module:

- with no pragma
- with an INLINE pragma
- with a NOINLINE pragma

6.10.1 did not parallelize in any of these settings.
6.10.3 parallelized in all these settings except separate module, no
pragma.

Then I tried a few other settings with 6.10.3. I got parallel
evaluation if there's an INLINE or a NOINLINE pragma on parallelize,
or if the module header of Main is "module Main (main) where", but not
if Main exports all top-level definitions and parallelize is neither
INLINEd nor NOINLINEd.

Weird.

> Generally I'd expect the effect of par to be pretty insensitive to
> inlining. I'm cc'ing the ghc users list so perhaps we'll get some
> expert commentary.

That would be good.

> Duncan

Daniel
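For anyone who wants to repeat the experiment, one of the
configurations listed above can be sketched as a single file. This is
not the program from the thread: the body of parallelize is an assumed
shape, and countMultiples is a hypothetical workload chosen only to
keep two cores busy. par and pseq come from GHC.Conc in base;
Control.Parallel exports the same functions.

```haskell
-- The export list exposes only main, so parallelize stays private to
-- the Main module (one of the settings tried above).
module Main (main) where

import GHC.Conc (par, pseq)

-- Swap NOINLINE for INLINE, or delete the pragma, to try the other
-- variants from the list above.
{-# NOINLINE parallelize #-}
parallelize :: a -> b -> (a, b)
parallelize a b = a `par` (b `pseq` (a, b))

-- Hypothetical workload: count the multiples of k in [1 .. limit].
countMultiples :: Int -> Int -> Int
countMultiples k limit = length [x | x <- [1 .. limit], x `mod` k == 0]

main :: IO ()
main = do
  let (p, q) = parallelize (countMultiples 3 3000000)
                           (countMultiples 7 3000000)
  print (p + q)
```

Comparing real and user times at +RTS -N1 and +RTS -N2 for each
variant, as in the timings earlier in the thread, shows whether the
spark actually ran on a second core.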
Re: [Haskell-cafe] A problem with par and modules boundaries...
duncan.coutts:
> On Fri, 2009-05-22 at 05:30 -0700, Don Stewart wrote:
> > Answer recorded at:
> >
> >   http://haskell.org/haskellwiki/Performance/Parallel
>
> I have to complain, this answer doesn't explain anything. This isn't
> like straight-line performance, there's no reason as far as I can see
> that inlining should change the operational behaviour of parallel
> evaluation, unless there's some mistake in the original such as
> accidentally relying on an unspecified evaluation order.
>
> Now, I tried the example using two versions of ghc and I get
> different behaviour from what other people are seeing. With the
> original code (i.e. parallelize function in the same module), with
> ghc-6.10.1 I get no speedup at all from -N2 and with 6.11 I get a
> very good speedup (though single-threaded performance is slightly
> lower in 6.11).
>
> Original code
>
> ghc-6.10.1    -N1          -N2
> real          0m9.435s     0m9.328s
> user          0m9.369s     0m9.249s
>
> ghc-6.11      -N1          -N2
> real          0m10.262s    0m6.117s
> user          0m10.161s    0m11.093s
>
> With the parallelize function moved into another module I get no
> change whatsoever. Indeed even when I force it *not* to be inlined
> with {-# NOINLINE parallelize #-} then I still get no change in
> behaviour (as indeed I expected).
>
> So I view this advice to force inlining with great suspicion (at
> worst it encourages people not to think and to look at it as magic).
> That said, why it does not get any speedup with ghc-6.10 is also a
> mystery to me (there's very little GC going on).
>
> Don: can we change the advice on the wiki please? It currently makes
> it look like a known and understood issue. If anything we should
> suggest using a later ghc version.

Please do so. Especially if GHC HEAD *does the right thing*. Then the
advice should be first: upgrade to GHC HEAD.