> Richard Biener wrote: 
> On Thu, Feb 26, 2015 at 11:20 PM, Jeff Law <l...@redhat.com> wrote:
> > On 02/26/15 10:30, Wilco Dijkstra wrote:
> >>
> >> Several GCC versions ago a conditional negate optimization was introduced
> >> as a workaround for PR45685. However, the branchless expansion for
> >> conditional negate is extremely inefficient on most targets (5 sequentially
> >> dependent instructions rather than 2 on AArch64). Since the underlying
> >> issue has been resolved (the example in PR45685 no longer generates a
> >> branch on x64), remove the workaround so that conditional negates are
> >> treated in exactly the same way as conditional invert, add, subtract,
> >> and, orr, xor, etc. Simple example:
> >>
> >> int f(int x) { if (x > 3) x = -x; return x; }
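
For reference, the workaround expands the conditional negate into the classic
branchless mask-and-add sequence. A rough C sketch of that expansion
(illustrative only, not the exact GIMPLE GCC produces):

int f_expanded (int x)
{
  int c = x > 3;        /* compare: 0 or 1                        */
  int m = -c;           /* 0 or all-ones mask                     */
  return (x ^ m) + c;   /* flip the bits and add 1 when c is set,
                           giving -x; leave x unchanged otherwise */
}

That is roughly five sequentially dependent operations, whereas without the
workaround the same function can become a compare plus a single conditional
negate (csneg on AArch64).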
> >
> > You need to bootstrap and regression test the change before it can be
> > approved.
> 
> As Jeff added a testcase for the PHI opt transform to happen, I'm sure
> testing would have shown this as fallout.

Yes, that's the only test that starts to fail. I've changed it to scan
for cmov/csel instead. Bootstrap is fine on AArch64 and x64, of course.

Should the test be moved somewhere else now that it is no longer a tree-ssa test?
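
The asm scan is along these lines (illustrative only -- the exact patterns
and target selectors are still open):

/* { dg-final { scan-assembler "cmov" { target { i?86-*-* x86_64-*-* } } } } */
/* { dg-final { scan-assembler "csel" { target aarch64*-*-* } } } */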

> > You should turn this little example into a testcase.  It's fine with me if
> > this new test is ARM specific.
> >
> >
> > You should also find a way to change the test gcc.dg/tree-ssa/pr45685.c in
> > such a way that it ensures there aren't any undesirable branches.
>
> I'd also be interested in results of vectorizing a loop with a conditional
> negate. I can very well imagine reverting this patch causing code quality
> regressions there.

Well, vectorized code improves just as you'd expect:

void f(int *p, int *q) 
{ 
  int i; 
  for (i = 0; i < 1000; i++) p[i] = (q[i] > 3) ? -q[i] : q[i]; 
}

Before:
.L6:
        vmovdqa (%r9,%rax), %ymm2
        addl    $1, %edx
        vpcmpgtd        %ymm5, %ymm2, %ymm0
        vpand   %ymm4, %ymm0, %ymm1
        vpsubd  %ymm1, %ymm3, %ymm0
        vpxor   %ymm0, %ymm2, %ymm0
        vpaddd  %ymm0, %ymm1, %ymm0
        vmovups %xmm0, (%rcx,%rax)
        vextracti128    $0x1, %ymm0, 16(%rcx,%rax)
        addq    $32, %rax
        cmpl    %r8d, %edx
        jb      .L6

After:
.L6:
        vmovdqa (%r9,%rax), %ymm0
        addl    $1, %edx
        vpcmpgtd        %ymm4, %ymm0, %ymm2
        vpsubd  %ymm0, %ymm3, %ymm1
        vpblendvb       %ymm2, %ymm1, %ymm0, %ymm0
        vmovups %xmm0, (%rcx,%rax)
        vextracti128    $0x1, %ymm0, 16(%rcx,%rax)
        addq    $32, %rax
        cmpl    %r8d, %edx
        jb      .L6
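
The "after" loop body is just a compare, a negate and a blend. In hand-written
AVX2 intrinsics the same computation would look roughly like this (illustrative
sketch only, not what GCC generates internally; f_intrin is a made-up name):

#include <immintrin.h>

/* Select between q[i] and -q[i] using the compare mask,
   mirroring the vpcmpgtd/vpsubd/vpblendvb sequence above.  */
void f_intrin (int *p, const int *q)
{
  const __m256i three = _mm256_set1_epi32 (3);
  const __m256i zero = _mm256_setzero_si256 ();
  for (int i = 0; i < 1000; i += 8)
    {
      __m256i x = _mm256_loadu_si256 ((const __m256i *) (q + i));
      __m256i mask = _mm256_cmpgt_epi32 (x, three);    /* q[i] > 3           */
      __m256i neg = _mm256_sub_epi32 (zero, x);        /* -q[i]              */
      __m256i res = _mm256_blendv_epi8 (x, neg, mask); /* pick -q[i] or q[i] */
      _mm256_storeu_si256 ((__m256i *) (p + i), res);
    }
}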

> > I've got enough history to know this is fixing a regression of sorts for the
> > ARM platform.  So once the issues above are addressed it can go forward even
> > without a BZ noting the regression.
> 
> But I'd say this is stage1 material at this point.

I suppose it doesn't matter as it'll have to be backported anyway.

Wilco


