http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459
             Bug #: 52459
           Summary: [x86] loop vectorization performance very bad (worse than
                    -O0) when using sse4.2 popcnt
    Classification: Unclassified
           Product: gcc
           Version: 4.6.3
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: m8r-ynb...@mailinator.com

Created attachment 26808
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26808
testcase

gcc 4.6.3 on x86_64-unknown-linux-gnu, running on Core i7 2600K (Sandy Bridge)

The attached testcase simply exercises the popcnt instruction over every
unsigned int and creates a histogram. But with -O2 -ftree-vectorize or with
-O3, the vectorizer emits two popcnt instructions per loop iteration, which
makes performance worse than the unoptimized version, and about 3x slower
than -Os.

Here are the timings and the resulting asm of the loop:

With -O0 -m32 -msse4.2: [7.40 seconds]

.L2:
    mov     eax, DWORD PTR [ebp-12]
    add     DWORD PTR [ebp-12], 1
    popcnt  eax, eax
    mov     edx, DWORD PTR [ebp-144+eax*4]
    add     edx, 1
    mov     DWORD PTR [ebp-144+eax*4], edx
    cmp     DWORD PTR [ebp-12], 0
    jne     .L2

With -O1 -m32 -msse4.2: [2.90 seconds]

.L2:
    lea     edx, [eax+1]
    popcnt  eax, eax
    add     DWORD PTR [esp+12+eax*4], 1
    mov     eax, edx
    test    edx, edx
    jne     .L2

With -O2 -m32 -msse4.2: [2.91 seconds]

.L5:
    popcnt  edx, eax
    mov     ecx, DWORD PTR [esp+12+edx*4]
    add     eax, 1
.L3:
    add     ecx, 1
    test    eax, eax
    mov     DWORD PTR [esp+12+edx*4], ecx
    jne     .L5

With -Os -m32 -msse4.2: [2.82 seconds]

.L2:
    popcnt  edx, eax
    inc     DWORD PTR [ebp-136+edx*4]
    inc     eax
    jne     .L2

With -O3 -m32 -msse4.2: [8.45 seconds]

.L5:
    popcnt  edx, eax
    mov     edx, DWORD PTR [esp+edx*4]
.L3:
    popcnt  ecx, eax
    add     edx, 1
    add     eax, 1
    mov     DWORD PTR [esp+ecx*4], edx
    jne     .L5

The relative results are about the same with -m64, though somewhat slower
overall; I'm assuming this is due to the extra edx -> rdx sign extension step.
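For readers without access to the attachment, here is a minimal sketch of the kind of loop described above. It is a reconstruction from the description and the asm listings, not the contents of attachment 26808. The names `popcount_histogram` and `POPCNT_BINS` and the range parameters are hypothetical, and the use of `__builtin_popcount` (which GCC can lower to popcnt when built with -msse4.2) is an assumption about how the testcase reaches the instruction.

```c
#include <assert.h>

/* A 32-bit value has between 0 and 32 set bits, so 33 bins are needed. */
#define POPCNT_BINS 33

/*
 * Hypothetical sketch of the reported loop: count how many values in the
 * inclusive range [first, last] have each possible population count.
 * Assumption: the real testcase covers 0 .. 0xFFFFFFFF.
 */
static void popcount_histogram(unsigned int first, unsigned int last,
                               unsigned int hist[POPCNT_BINS])
{
    unsigned int i = first;

    assert(first <= last);

    for (;;) {
        /* GCC builtin; may compile to a single popcnt with -msse4.2. */
        hist[__builtin_popcount(i)]++;

        /* Test before incrementing so last == 0xFFFFFFFF cannot wrap
           around and loop forever. */
        if (i == last)
            break;
        i++;
    }
}
```

Calling `popcount_histogram(0u, 0xFFFFFFFFu, hist)` on a zero-initialized array of `POPCNT_BINS` counters would cover every unsigned int, as the report describes. A smaller range is convenient for checking the counts quickly.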