[Bug tree-optimization/52459] New: [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt

M8R-ynb11d at mailinator dot com Thu, 01 Mar 2012 23:03:57 -0800

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459


             Bug #: 52459
           Summary: [x86] loop vectorization performance very bad (worse
                    than -O0) when using sse4.2 popcnt
    Classification: Unclassified
           Product: gcc
           Version: 4.6.3
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: m8r-ynb...@mailinator.com


Created attachment 26808
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26808
testcase

gcc 4.6.3 on x86_64-unknown-linux-gnu, running on Core i7 2600K (Sandy Bridge)

The attached testcase simply runs the popcnt instruction over every unsigned
int value and builds a histogram of the results.  But with -O2 -ftree-vectorize
or with -O3, the vectorizer emits two popcnt instructions per loop iteration
instead of one, which makes performance worse than the unoptimized (-O0)
build, and about 3x slower than -Os.
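
Attachment 26808 is not reproduced in this message. As a rough illustration only, a loop of the kind described above might look like the following sketch. The function name popcnt_histogram, the BUCKETS constant, and the use of GCC's __builtin_popcount are assumptions made for this example and are not taken from the attachment.

```c
/* Hypothetical sketch of a popcount histogram loop; NOT the actual
 * attachment 26808.  All names here are illustrative assumptions. */

/* A 32-bit value has between 0 and 32 bits set, so 33 buckets. */
enum { BUCKETS = 33 };

/* Adds one to hist[popcount(i)] for every i in [first, last], inclusive.
 * The caller is expected to zero hist beforehand. */
static void popcnt_histogram(unsigned int first, unsigned int last,
                             unsigned int hist[BUCKETS])
{
    unsigned int i = first;
    do {
        /* GCC builtin; it can be compiled to a single popcnt
         * instruction when built with -msse4.2 or -mpopcnt. */
        hist[__builtin_popcount(i)]++;
    } while (i++ != last);  /* unsigned wraparound is well defined */
}
```

Calling popcnt_histogram(0, UINT_MAX, hist) with a zeroed hist array (UINT_MAX comes from <limits.h>) would cover every unsigned int, matching the description in this report. The do/while form lets the loop reach the last value without needing a wider counter.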

Here are the timings and the resulting asm for the loop:

With -O0 -m32 -msse4.2: [7.40 seconds]
.L2:
    mov    eax, DWORD PTR [ebp-12]
    add    DWORD PTR [ebp-12], 1
    popcnt    eax, eax
    mov    edx, DWORD PTR [ebp-144+eax*4]
    add    edx, 1
    mov    DWORD PTR [ebp-144+eax*4], edx
    cmp    DWORD PTR [ebp-12], 0
    jne    .L2


With -O1 -m32 -msse4.2: [2.90 seconds]
.L2:
    lea    edx, [eax+1]
    popcnt    eax, eax
    add    DWORD PTR [esp+12+eax*4], 1
    mov    eax, edx
    test    edx, edx
    jne    .L2


With -O2 -m32 -msse4.2: [2.91 seconds]
.L5:
    popcnt    edx, eax
    mov    ecx, DWORD PTR [esp+12+edx*4]
    add    eax, 1
.L3:
    add    ecx, 1
    test    eax, eax
    mov    DWORD PTR [esp+12+edx*4], ecx
    jne    .L5


With -Os -m32 -msse4.2: [2.82 seconds]
.L2:
    popcnt    edx, eax
    inc    DWORD PTR [ebp-136+edx*4]
    inc    eax
    jne    .L2


With -O3 -m32 -msse4.2: [8.45 seconds]
.L5:
    popcnt    edx, eax
    mov    edx, DWORD PTR [esp+edx*4]
.L3:
    popcnt    ecx, eax
    add    edx, 1
    add    eax, 1
    mov    DWORD PTR [esp+ecx*4], edx
    jne    .L5


The relative results are about the same with -m64, though everything is
somewhat slower; I assume this is due to the extra edx -> rdx sign extension
step.
