[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550

rguenther at suse dot de Fri, 07 Apr 2017 00:57:08 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #18 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 7 Apr 2017, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390
> 
> --- Comment #16 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Has somebody the benchmark around to retry with current trunk, with
> -f{,no-}split-paths and compare that to some older trunk and gcc6?

On a broadwell machine I get (-O3 -march=native)

gcc6: 5507.42 Mflops
gcc7: 4787.26 Mflops
gcc7: 5435.08 Mflops [-fno-split-paths]

so the RTL if-conversion works now unless inhibited by path splitting:
gcc7 is about 13% slower than gcc6 by default, but only about 1.3%
slower with -fno-split-paths.

What path splitting does is mostly undone by loop disambiguation, which
re-creates the merger, so path splitting has just made the loop
multi-exit (without simplifying the duplicated exit condition).
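
As a rough source-level picture of that effect, here is a minimal sketch
of the LU pivot-search loop quoted later in this message, written the
way it looks once the join block (the increment plus the exit test) has
been copied into both arms of the if.  This is only an analogy for the
GIMPLE, not compiler output; the function name, the flat column array
and the initialisation of jp and t are assumptions made for
illustration.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: pivot search over a contiguous column col[0..m-1],
   starting at row j (requires j < m).  The loop body is written as it
   would look after the joiner has been duplicated into both arms.  */
size_t pivot_split_paths(const double *col, size_t j, size_t m)
{
    size_t jp = j;                /* assumed initial index */
    double t = fabs(col[j]);      /* assumed initial maximum */
    size_t i = j + 1;

    if (i >= m)
        return jp;

    for (;;) {
        double ab = fabs(col[i]);
        if (ab > t) {
            jp = i;
            t = ab;
            i++;                  /* copy 1 of the joiner */
            if (i >= m)
                break;            /* exit 1 */
        } else {
            i++;                  /* copy 2 of the joiner */
            if (i >= m)
                break;            /* exit 2, same condition as exit 1 */
        }
    }
    return jp;
}
```

Both arms now end in the same increment and the same exit test, so the
loop has two exits and nothing has been simplified.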

So we can add more heuristics to tame path splitting, for example
never duplicating a joiner that has an exit.  Or add to the
quite stupid if-cvt mitigation code (which misses the minmax case).

Or add even more outs to the threading opportunity detection code...
We currently find that

  t_175 = PHI <t_184(6), ab_177(7)>

in the merger exposes a threading opportunity because it has one
arg that is unchanged over the latch (t_184 over 6->8) and it has
a use in the threading destination (in the controlling condition
even).

This all just exposes that path splitting is not well integrated
into what it tries to expose (threading).  IMHO it should have been
part of backwards/forward threading.

But that ship has sailed (Jeff approved it).

I've tried to fix up after the MIA authors.  But well.

I can fix this up by removing the pass again.  Or by adding more oddball
heuristics.  This case, which seems important for x86_64, is

        for (i=j+1; i<M; i++)
        {
            double ab = fabs(A[i][j]);
            if ( ab > t)
            {
                jp = i;
                t = ab;
            }
        }

i.e. a MAX reduction that also remembers the index of the maximum value.
We're not phiopt-ing that to MAX because it might not be profitable
(the condition has to remain).  So path splitting could be profitable
on some archs, if only we didn't re-create that shared latch
right afterwards anyway (and forget to propagate the single-arg PHIs
resulting from the BB duplication).
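
For experimenting with this outside the benchmark, the loop can be put
into a self-contained function.  The sketch below is an assumption-laden
reduction, not the SciMark2 source: the function names, the flat column
array and the initial values of jp and t are hypothetical.  The second
function spells out the shape that if-conversion aims for, where the one
comparison that has to remain feeds two selects instead of a branch.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: the loop as written, with a conditional branch.
   Searches col[j..m-1] (requires j < m) for the largest |value| and
   returns its index; ties keep the earliest index.  */
size_t pivot_branchy(const double *col, size_t j, size_t m)
{
    size_t jp = j;                /* assumed initial index */
    double t = fabs(col[j]);      /* assumed initial maximum */

    for (size_t i = j + 1; i < m; i++) {
        double ab = fabs(col[i]);
        if (ab > t) {
            jp = i;
            t = ab;
        }
    }
    return jp;
}

/* Same computation with the comparison kept and both updates expressed
   as selects, which a target may implement with conditional moves.  */
size_t pivot_select(const double *col, size_t j, size_t m)
{
    size_t jp = j;
    double t = fabs(col[j]);

    for (size_t i = j + 1; i < m; i++) {
        double ab = fabs(col[i]);
        int take = ab > t;        /* the condition has to remain */
        jp = take ? i : jp;       /* index of the maximum so far */
        t = take ? ab : t;        /* the MAX part of the reduction */
    }
    return jp;
}
```

Both functions return the same index for the same input; whether the
compiler emits branches or conditional moves for either one depends on
the target and the options used.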
