https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

            Bug ID: 99971
           Summary: GCC generates partially vectorized and scalar code at
                    once
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: andysem at mail dot ru
  Target Milestone: ---

Consider the following code sample:

struct A
{
    unsigned int a, b, c, d;

    A& operator+= (A const& that)
    {
        a += that.a;
        b += that.b;
        c += that.c;
        d += that.d;
        return *this;
    }

    A& operator-= (A const& that)
    {
        a -= that.a;
        b -= that.b;
        c -= that.c;
        d -= that.d;
        return *this;
    }
};

void test(A& x, A const& y1, A const& y2)
{
    x += y1;
    x -= y2;
}

The code, when compiled with options "-O3 -march=nehalem", generates:

test(A&, A const&, A const&):
        pushq   %rbp
        movdqu  (%rdi), %xmm1
        pushq   %rbx
        movl    4(%rsi), %r8d
        movdqu  (%rsi), %xmm0
        movl    (%rsi), %r9d
        paddd   %xmm1, %xmm0
        movl    8(%rsi), %ecx
        movl    12(%rsi), %eax
        movl    %r8d, %esi
        movl    (%rdi), %ebp
        movl    4(%rdi), %ebx
        movl    8(%rdi), %r11d
        movl    12(%rdi), %r10d
        movups  %xmm0, (%rdi)
        subl    (%rdx), %r9d
        subl    4(%rdx), %esi
        subl    8(%rdx), %ecx
        subl    12(%rdx), %eax
        addl    %ebp, %r9d
        addl    %ebx, %esi
        movl    %r9d, (%rdi)
        popq    %rbx
        addl    %r11d, %ecx
        popq    %rbp
        movl    %esi, 4(%rdi)
        addl    %r10d, %eax
        movl    %ecx, 8(%rdi)
        movl    %eax, 12(%rdi)
        ret

https://gcc.godbolt.org/z/Mzchj8bxG

Here you can see that the compiler has partially vectorized the test function:
it converted "x += y1" to paddd, as expected, but failed to vectorize "x -=
y2". At the same time, the compiler also generated scalar code, including
for the already vectorized "x += y1" line, effectively duplicating it.

Note that when either "x += y1" or "x -= y2" is commented out, the compiler is
able to vectorize the line that is left. It is also able to vectorize both lines
when the += and -= operators are applied to different objects instead of x.
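
For reference, the variants described above can be sketched as follows. The
function names test_add_only, test_sub_only and test_two_objects are
hypothetical and are not part of the original test case; struct A is repeated
so that the snippet compiles on its own.

```cpp
struct A
{
    unsigned int a, b, c, d;

    A& operator+= (A const& that)
    {
        a += that.a;
        b += that.b;
        c += that.c;
        d += that.d;
        return *this;
    }

    A& operator-= (A const& that)
    {
        a -= that.a;
        b -= that.b;
        c -= that.c;
        d -= that.d;
        return *this;
    }
};

// Variant 1: "x -= y2" commented out, only the addition is left.
void test_add_only(A& x, A const& y1)
{
    x += y1;
}

// Variant 2: "x += y1" commented out, only the subtraction is left.
void test_sub_only(A& x, A const& y2)
{
    x -= y2;
}

// Variant 3: += and -= are applied to two different objects instead of x.
void test_two_objects(A& x1, A& x2, A const& y1, A const& y2)
{
    x1 += y1;
    x2 -= y2;
}
```

Compiling each of these with the same options ("-O3 -march=nehalem") is a
convenient way to compare against the output of the combined test function.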

This is reproducible with gcc 8 up to and including 10.2. gcc 7 doesn't
vectorize this code. With the current trunk on godbolt the generated code is
different:

test(A&, A const&, A const&):
        movdqu  (%rsi), %xmm0
        movdqu  (%rdi), %xmm1
        paddd   %xmm1, %xmm0
        movups  %xmm0, (%rdi)
        movd    %xmm0, %eax
        subl    (%rdx), %eax
        movl    %eax, (%rdi)
        pextrd  $1, %xmm0, %eax
        subl    4(%rdx), %eax
        movl    %eax, 4(%rdi)
        pextrd  $2, %xmm0, %eax
        subl    8(%rdx), %eax
        movl    %eax, 8(%rdi)
        pextrd  $3, %xmm0, %eax
        subl    12(%rdx), %eax
        movl    %eax, 12(%rdi)
        ret

Here the compiler is able to vectorize "x += y1" but not "x -= y2". At least
it no longer emits the duplicate scalar version of "x += y1".

Given that the compiler is able to vectorize each line in isolation, I would
expect it to be able to vectorize them combined. Generating duplicate versions
of code is certainly not expected.
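
As a rough illustration of the expected result, the combined operation can be
written by hand using the GCC vector extensions. This is only a sketch: the
names v4su and test_expected are hypothetical, the struct is reduced to its
data members, and the code the compiler generates for this version has not
been checked here. The intermediate store to x is kept so that the behaviour
matches the original when y2 refers to the same object as x.

```cpp
#include <cstring>

struct A
{
    unsigned int a, b, c, d;
};

// Four 32-bit unsigned lanes (GCC/Clang vector extension).
typedef unsigned int v4su __attribute__((vector_size(16)));

static_assert(sizeof(A) == sizeof(v4su), "A must be exactly four 32-bit lanes");

void test_expected(A& x, A const& y1, A const& y2)
{
    v4su vx, vy1, vy2;

    // x += y1 as a single vector addition.
    std::memcpy(&vx, &x, sizeof vx);
    std::memcpy(&vy1, &y1, sizeof vy1);
    vx += vy1;
    std::memcpy(&x, &vx, sizeof vx);

    // y2 is loaded only after x has been stored, because y2 may alias x.
    std::memcpy(&vy2, &y2, sizeof vy2);

    // x -= y2 as a single vector subtraction.
    vx -= vy2;
    std::memcpy(&x, &vx, sizeof vx);
}
```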
