https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492
Bug ID: 65492
Summary: Bad optimization in -O3 on SSE intrinsics
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: linux at carewolf dot com

After investigating a loop using SSE intrinsics that was significantly faster in clang than in gcc, I discovered that gcc has the same performance as clang at -O2, and only performs significantly worse at -O3.

Disabling all the switches that the documentation lists as activated by -O3 (or enabling them for -O2) doesn't fully account for the difference, but the switch -f(no-)tree-loop-vectorize accounts for roughly half of it.

I have attached the files I used to test it. Using gcc -O2, or clang -O2 or -O3, it times in at 1.8s on my machine. Using g++ (4.9 or 5.0) -O3 it times in at 2.5s. Using -O3 -fno-tree-loop-vectorize it runs in 2.3s, and -O2 -ftree-vectorize at 2.25s.

Using callgrind, it seems the performance difference is mainly spent on accessing the integers in the vector union.
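
For readers without the attachment, the pattern in question looks roughly like the sketch below. This is not the attached test case: the union `Vec4` and the function `sum_doubled` are hypothetical names made up for illustration, and the arithmetic is arbitrary. It only shows a loop that does its work with SSE2 intrinsics and then reads the individual integers back through a union, which is the kind of access callgrind points at. Reading the inactive union member is type punning, which GCC documents as supported; that is assumed here.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical vector union (illustration only, not from the attached files):
// the same 128 bits viewed as one SSE register or as four 32-bit integers.
union Vec4 {
    __m128i v;
    std::uint32_t i[4];
};

// Doubles every element with an SSE2 shift, then sums the lanes by reading
// them back through the union. Returns the sum modulo 2^32.
std::uint32_t sum_doubled(const std::uint32_t* data, std::size_t n) {
    std::uint32_t total = 0;
    std::size_t k = 0;
    for (; k + 4 <= n; k += 4) {
        Vec4 t;
        // Unaligned load of four integers, each shifted left by one bit.
        t.v = _mm_slli_epi32(
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + k)), 1);
        // Per-lane integer access through the union.
        total += t.i[0] + t.i[1] + t.i[2] + t.i[3];
    }
    // Scalar tail for the elements that do not fill a whole vector.
    for (; k < n; ++k) {
        total += data[k] << 1;
    }
    return total;
}
```

Building a loop of this shape with the option sets listed above (-O2, -O3, -O3 -fno-tree-loop-vectorize, -O2 -ftree-vectorize) and comparing the generated code for the union reads should be one way to narrow the problem down; whether this exact sketch reproduces the slowdown is untested.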