Re: [PATCH] loop distribution bug fix

2016-11-14 Thread Richard Biener
On Fri, Nov 11, 2016 at 7:55 PM, Jim Wilson  wrote:
> On Thu, Nov 10, 2016 at 2:53 AM, Richard Biener
>  wrote:
>> The biggest "lack" of loop distribution is the ability to undo CSE so for
>
> I hadn't noticed this problem yet.  I will have to take a look.
>
>> Then of course the cost model is purely modeled for STREAM (reduce the number
>> of memory streams).  So loop distribution is expected to pessimize code for
>> the CSE case in case you are not memory bound and improve things if you
>> are memory bound.
>
> I noticed this problem.  I think loop distribution should be callable
> from inside the vectorizer or vice versa.  if a loop can't be
> vectorized, but distributing the loop allows the sub loops to be
> vectorized, then we should go ahead and ditsribute, even if that
> increases the number of memory streams slightly, as the gain from
> vectorizing should be greater than the loss from the additional memory
> streams.  We could have a cost model that tries to compute the
> gain/loss here and make a better decision of when to distribute to
> increase vectorization at the expense of the number of memory streams.
> This looks like a major project though, and may be more work than I
> have time for.

Yes.  That's true for most enabling transforms (an easier one that comes to
my mind is final value replacement, which, when required from the vectorizer
could use a different cost model).

Richard.

> Jim


Re: [PATCH] loop distribution bug fix

2016-11-11 Thread Jim Wilson
On Thu, Nov 10, 2016 at 2:53 AM, Richard Biener
 wrote:
> The biggest "lack" of loop distribution is the ability to undo CSE so for

I hadn't noticed this problem yet.  I will have to take a look.

> Then of course the cost model is purely modeled for STREAM (reduce the number
> of memory streams).  So loop distribution is expected to pessimize code for
> the CSE case in case you are not memory bound and improve things if you
> are memory bound.

I noticed this problem.  I think loop distribution should be callable
from inside the vectorizer or vice versa.  if a loop can't be
vectorized, but distributing the loop allows the sub loops to be
vectorized, then we should go ahead and ditsribute, even if that
increases the number of memory streams slightly, as the gain from
vectorizing should be greater than the loss from the additional memory
streams.  We could have a cost model that tries to compute the
gain/loss here and make a better decision of when to distribute to
increase vectorization at the expense of the number of memory streams.
This looks like a major project though, and may be more work than I
have time for.

Jim


Re: [PATCH] loop distribution bug fix

2016-11-10 Thread Richard Biener
On Thu, Nov 10, 2016 at 6:25 AM, Jim Wilson  wrote:
> This fixes a bug in code adding edges to the dependence graph. Values
> for this_dir can be -1 (backward edge), 0 (no edge), 1 (forward edge),
> and 2 (both backward and forward edges). There can be multiple
> dependencies checked, creating multiple edges that have to be merged
> together. this_dir contains the current edge. dir contains the
> previous edges. The code fails to handle the case where this_dir is 2,
> in which case we can return immediately. This is a minor optimization
> to improve compile time. The code handles the case where dir is
> non-zero and this_dir is zero by returning 2, which is incorrect. dir
> should be unmodified in this case. We can return 2 only if both dir
> and this_dir are non-zero and unequal, i.e. one is -1 and the other is
> 1. This problem creates extra unnecessary edges, which can prevent
> loops from being distributed. The patch fixes both problems.
>
> This passed an x86_64 gcc bootstrap with -ftree-loop-distribution
> added to BOOT_CFLAGS and a testsuite regression check.  Curiously, I
> see that I get different results for the C/C++ ubsan tests every time
> I run them, but this has nothing to do with my patch, as it happens
> with or without my patch.  I haven't tried debugging this yet.  Might
> be related to my recent upgrade to Ubuntu 16.04 LTS.  Otherwise, there
> are no regressions.
>
> On SPEC CPU2006, on aarch64, I see 5879 loops distributed without the
> patch, and 5906 loops distributed with the patch. So 27 extra loops
> are distributed which is about 0.5% more loop distributions. There is
> no measurable performance gain from the bug fix on the CPU2006 run
> time though I plan to spend some more time looking at this code to see
> if I can find other improvements.

The biggest "lack" of loop distribution is the ability to undo CSE so for

 for (;;)
  {
 a[i] = a[i] + 1;
 b[i] = a[i];
  }

CSE makes us see

  for (;;)
{
   tem = a[i];
   tem2 = tem + 1;
   a[i] = tem;
   b[i] = tem;
}

and loop distribution cannot re-materialize tem3 from a[i] thus most of the
time it ends up pulling redundant computations into each partition (sometimes
that can reduce memory bandwith if one less stream is used but sometimes
not, like in the above case).

Then of course the cost model is purely modeled for STREAM (reduce the number
of memory streams).  So loop distribution is expected to pessimize code for
the CSE case in case you are not memory bound and improve things if you
are memory bound.

> OK?

Ok.

Thanks for the improvement!
Richard.

> Jim