Re: Performance degradation when factoring out common code

Harendra Kumar Sat, 09 Sep 2017 06:24:54 -0700

I could pinpoint one part of the problem. Please see the ticket:
https://ghc.haskell.org/trac/ghc/ticket/14208. Here is the description that
I wrote in the ticket:


In this particular case -O2 is 2x slower than -O0, and -O0 is 2x slower than
runghc. Please see the GitHub repo
https://github.com/harendra-kumar/ghc-perf to reproduce the issue; the README
file in the repo has the instructions.

The issue seems to occur when the code is placed in a different module.
When all the code is in the same module the problem does not occur. In that
case -O2 is faster than -O0. However, when the code is split into two
modules the performance gets inverted.

Also, it does not always occur: when I tried to simplify the code for a
smaller repro, the problem went away.
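
One mechanism that can differ between the one-module and two-module layouts
is the visibility of unfoldings: GHC can only inline or specialise an imported
function if its definition is recorded in the defining module's interface
file. The following is a minimal sketch of that idea, not the code from the
ticket. `sumWith` is a hypothetical stand-in for a library function, and in a
real two-module reproduction it would live in its own module imported by Main.
Whether missing unfoldings are the cause of the slowdown reported here is not
established.

```haskell
module Main (main) where

import Data.List (foldl')

-- Hypothetical library function (an assumption for illustration only).
-- INLINABLE asks GHC to keep the definition in the interface file so that
-- importing modules can specialise or inline it at their call sites.
{-# INLINABLE sumWith #-}
sumWith :: Num a => (a -> a) -> [a] -> a
sumWith f = foldl' (\acc x -> acc + f x) 0

main :: IO ()
main = print (sumWith (* 2) [1 .. 10 :: Int])
```

When everything is in one module the definition is always visible to the
optimiser, which is one reason the two layouts can behave differently.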


-harendra

On 9 September 2017 at 14:08, Harendra Kumar <[email protected]>
wrote:

> The code is at: https://github.com/harendra-kumar/asyncly. The benchmark
> code is in "benchmark/Main.hs".  The relevant function is "asyncly_basic".
>
> If you want to run it, you can use the following steps to reproduce the
> behavior I reported below:
>
> 1) Run "stack build"
> 2) Run "stack runghc benchmark/Main.hs" for runghc figures
> 3) Run "stack ghc benchmark/Main.hs && benchmark/Main" to compile and run
> normally
> 4) Run "stack ghc -- -O2 benchmark/Main.hs && benchmark/Main" to compile
> and run with -O2 flag
>
> Just look at the first benchmark (asyncly-serial), you can comment out all
> others if you want to. Note that the library gets compiled without any
> optimization flags (see the ghc options in the cabal file). So what we are
> seeing here is just the effect of -O2 on compiling benchmark/Main.hs.
>
> I am also trying to isolate the problem to a minimal case. I tried
> removing all the INLINE pragmas in the library to make sure that I am not
> screwing it up by asking the compiler to inline aggressively, but that does
> not seem to make any difference to the situation. Let me know if you need
> any information from me or help in running it.
>
> There are three issues that I am trying to get answers for:
>
> 1) Why is runghc faster? It means that it is possible for the program to
> run as fast as it does under runghc. How do I get that performance, or
> an explanation of it?
>
> 2) Why does -O1/-O2 degrade performance so much, by 4-5x?
>
> 3) The third one is the original problem that I posted in this thread:
> the compiler is unable to match manual inlining. It is possible that this
> is an issue only when -O1/-O2 is used and not when -O0 is used.
>
> Thanks for the help.
>
> -harendra
>
>
> On 9 September 2017 at 13:30, Matthew Pickering <
> [email protected]> wrote:
>
>> Do you have the code?
>>
>> On Sat, Sep 9, 2017 at 6:05 AM, Harendra Kumar <[email protected]>
>> wrote:
>> > While trying to come up with a minimal example I discovered one more
>> > puzzling thing. runghc is fastest, ghc is slower, ghc with optimization is
>> > slowest. This is the complete reverse of the expected order.
>> >
>> > ghc -O1 (-O2 is similar):
>> >
>> > time                 15.23 ms   (14.72 ms .. 15.73 ms)
>> >
>> > ghc -O0:
>> >
>> > time                 3.612 ms   (3.548 ms .. 3.728 ms)
>> >
>> > runghc:
>> >
>> > time                 2.250 ms   (2.156 ms .. 2.348 ms)
>> >
>> >
>> > I am digging into it further. Any pointers will be helpful. I understand that
>> > -O2 can sometimes be slower, e.g. aggressive inlining can sometimes be
>> > counterproductive. But a 4x variation is a lot, and this is the case with -O1
>> > as well, which should be relatively safer than -O2 in general. Worst of all,
>> > runghc is significantly faster than ghc. What's going on?
>> >
>> > -harendra
>> >
>> >
>> > On 8 September 2017 at 18:49, Harendra Kumar <[email protected]>
>> > wrote:
>> >>
>> >> I will try creating a minimal example and open a ticket for the inlining
>> >> problem, the one I am sure about.
>> >>
>> >> -harendra
>> >>
>> >> On 8 September 2017 at 18:35, Simon Peyton Jones <
>> [email protected]>
>> >> wrote:
>> >>>
>> >>> I know that this is not an easy request, but can either of you produce a
>> >>> small example that demonstrates your problem? If so, please open a ticket.
>> >>>
>> >>> I don’t like hearing about people having to use trial and error with
>> >>> INLINE or SPECIALISE pragmas. But I can’t even begin to solve the problem
>> >>> unless I can reproduce it.
>> >>>
>> >>>
>> >>>
>> >>> Simon
>> >>>
>> >>>
>> >>>
>> >>> From: ghc-devs [mailto:[email protected]] On Behalf Of
>> >>> Harendra Kumar
>> >>> Sent: 08 September 2017 13:50
>> >>> To: Mikolaj Konarski <[email protected]>
>> >>> Cc: [email protected]
>> >>> Subject: Re: Performance degradation when factoring out common code
>> >>>
>> >>>
>> >>>
>> >>> I should also point out that I saw performance improvements by manually
>> >>> factoring out and propagating some common expressions to outer loops in
>> >>> performance-sensitive paths. I have now made it a habit to do this
>> >>> manually. I am not sure whether something like this has also been fixed
>> >>> by that ticket or some other ticket.
>> >>>
>> >>>
>> >>>
>> >>> -harendra
>> >>>
>> >>>
>> >>>
>> >>> On 8 September 2017 at 17:34, Harendra Kumar <
>> [email protected]>
>> >>> wrote:
>> >>>
>> >>> Thanks Mikolaj! I have seen some surprising behavior quite a few times
>> >>> recently and I was wondering whether GHC should do better. In one case I had
>> >>> to use SPECIALIZE very aggressively; in another version of the same code it
>> >>> worked well without that. I have been doing a lot of trial and error with
>> >>> the INLINE/NOINLINE pragmas to figure out what the right combination is.
>> >>> Sometimes it just feels like black magic, because I cannot find a rationale
>> >>> to explain the behavior. I am not sure if there are any more such problems
>> >>> lurking; perhaps this is an area where some improvement looks possible.
>> >>>
>> >>>
>> >>>
>> >>> -harendra
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On 8 September 2017 at 17:10, Mikolaj Konarski
>> >>> <[email protected]> wrote:
>> >>>
>> >>> Hello,
>> >>>
>> >>> I've had a similar problem that's been fixed in 8.2.1:
>> >>>
>> >>> https://ghc.haskell.org/trac/ghc/ticket/12603
>> >>>
>> >>> You can also use some extreme global flags, such as
>> >>>
>> >>> ghc-options: -fexpose-all-unfoldings -fspecialise-aggressively
>> >>>
>> >>> to get most of the GHC subtlety and shyness out of the way
>> >>> when experimenting.
>> >>>
>> >>> Good luck
>> >>> Mikolaj
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Fri, Sep 8, 2017 at 11:21 AM, Harendra Kumar
>> >>> <[email protected]> wrote:
>> >>> > Hi,
>> >>> >
>> >>> > I have this code snippet for the bind implementation of a Monad:
>> >>> >
>> >>> >     AsyncT m >>= f = AsyncT $ \_ stp yld ->
>> >>> >         let run x = (runAsyncT x) Nothing stp yld
>> >>> >             yield a _ Nothing  = run $ f a
>> >>> >             yield a _ (Just r) = run $ f a <> (r >>= f)
>> >>> >         in m Nothing stp yield
>> >>> >
>> >>> > I want to have multiple versions of this implementation parameterized
>> >>> > by a function, like this:
>> >>> >
>> >>> > bindWith k (AsyncT m) f = AsyncT $ \_ stp yld ->
>> >>> >     let run x = (runAsyncT x) Nothing stp yld
>> >>> >         yield a _ Nothing  = run $ f a
>> >>> >         yield a _ (Just r) = run $ f a `k` (bindWith k r f)
>> >>> >     in m Nothing stp yield
>> >>> >
>> >>> > And then the bind function becomes:
>> >>> >
>> >>> > (>>=) = bindWith (<>)
>> >>> >
>> >>> > But this leads to a performance degradation of more than 10%. Inlining
>> >>> > does not help; I tried the INLINE pragma as well as the "inline" GHC
>> >>> > builtin. I thought this should be a more or less straightforward
>> >>> > replacement, making the second version equivalent to the first one. But
>> >>> > apparently there is something going on here that makes it perform worse.
>> >>> >
>> >>> > I did not look at the Core, STG or asm yet. Hoping someone can quickly
>> >>> > comment on it. Any ideas why this is so? Can this be worked around
>> >>> > somehow?
>> >>> >
>> >>> > Thanks,
>> >>> > Harendra
>> >>> >
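
For readers who want to experiment with the refactoring described in the
message above, here is a minimal self-contained sketch. It does not use the
library's AsyncT: `Stream`, `append`, `fromList` and `toList` are hypothetical
simplifications introduced for illustration (no monad, no context argument),
with `append` standing in for (<>). It only shows that the direct bind and the
`bindWith`-based bind compute the same result; it makes no claim about their
relative speed.

```haskell
{-# LANGUAGE RankNTypes #-}
module Main (main) where

-- Hypothetical simplified CPS stream: a "stop" result and a "yield"
-- continuation that receives an element and an optional rest of the stream.
newtype Stream a = Stream
    { runStream :: forall r. r -> (a -> Maybe (Stream a) -> r) -> r }

-- Serial append, standing in for (<>) in the original snippet.
append :: Stream a -> Stream a -> Stream a
append (Stream m1) m2 = Stream (\stp yld ->
    let yield a Nothing  = yld a (Just m2)
        yield a (Just r) = yld a (Just (append r m2))
    in m1 (runStream m2 stp yld) yield)

-- Bind written directly, with the combining function fixed to append.
bindDirect :: Stream a -> (a -> Stream b) -> Stream b
bindDirect (Stream m) f = Stream (\stp yld ->
    let run x = runStream x stp yld
        yield a Nothing  = run (f a)
        yield a (Just r) = run (f a `append` bindDirect r f)
    in m stp yield)

-- Bind parameterized by the combining function k, as in the thread.
bindWith :: (Stream b -> Stream b -> Stream b)
         -> Stream a -> (a -> Stream b) -> Stream b
bindWith k (Stream m) f = Stream (\stp yld ->
    let run x = runStream x stp yld
        yield a Nothing  = run (f a)
        yield a (Just r) = run (f a `k` bindWith k r f)
    in m stp yield)

bindFactored :: Stream a -> (a -> Stream b) -> Stream b
bindFactored = bindWith append

fromList :: [a] -> Stream a
fromList []       = Stream (\stp _ -> stp)
fromList (x : xs) = Stream (\_ yld -> yld x (Just (fromList xs)))

toList :: Stream a -> [a]
toList s = runStream s [] (\a rest -> a : maybe [] toList rest)

main :: IO ()
main = do
    let f x = fromList [x, x * 10]
        xs  = fromList [1, 2, 3 :: Int]
    print (toList (bindDirect xs f))    -- [1,10,2,20,3,30]
    print (toList (bindFactored xs f))  -- [1,10,2,20,3,30]
```

One possible factor, not confirmed anywhere in this thread: `bindWith` is
self-recursive, and GHC does not inline self-recursive bindings, so `k` may
remain an unknown function call inside the loop, whereas the direct version
calls a statically known function. Comparing the Core of the two versions
would be the way to check this.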
>> >>>
>> >>> > _______________________________________________
>> >>> > ghc-devs mailing list
>> >>> > [email protected]
>> >>> > http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>> >>> >
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> >
>> >
>> >
>>
>
>