[Bug target/52572] suboptimal assignment to avx element
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52572

Andrew Pinski changed:

           What              |Removed      |Added
 -----------------------------------------------------------------
   Severity                  |normal       |enhancement
   Last reconfirmed           |             |2021-12-25
   Target                     |             |x86_64-linux-gnu
   Status                     |UNCONFIRMED  |NEW
   Ever confirmed             |0            |1

--- Comment #4 from Andrew Pinski ---
LLVM produces:

        vxorps          %xmm1, %xmm1, %xmm1
        vblendps        $3, %ymm1, %ymm0, %ymm0   # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]

and

        vxorps          %xmm0, %xmm0, %xmm0
        vblendps        $252, (%rdi), %ymm0, %ymm0   # ymm0 = ymm0[0,1],mem[2,3,4,5,6,7]

Which I suspect is better.
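For illustration, here is an intrinsics-level sketch of the single-blend sequence LLVM emits (this is an assumption about the intent of the testcase, not code quoted from the PR): view the __m256d as floats, blend the two low floats (one double) with a zeroed register, and view the result as doubles again.

    #include <immintrin.h>

    /* Sketch only: zero element 0 of a __m256d with one 256-bit blend.
       Mask 0x3 selects float lanes 0 and 1 from the zero operand,
       matching the "vblendps $3" sequence above.  */
    static __m256d zero_elem0_blendps(__m256d x)
    {
        __m256 xs = _mm256_castpd_ps(x);
        __m256 z  = _mm256_setzero_ps();
        return _mm256_castps_pd(_mm256_blend_ps(xs, z, 0x3));
    }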
Jakub Jelinek jakub at gcc dot gnu.org changed:

           What              |Removed      |Added
 -----------------------------------------------------------------
   CC                        |             |jakub at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek jakub at gcc dot gnu.org 2012-03-13 07:54:14 UTC ---
Have you actually tried that?  Mixing VEX-encoded insns with legacy-encoded
SSE* insns is very costly; for good performance there needs to be a vzeroupper
in between (but then you lose the upper bits).  See e.g. 2.8 in the AVX
Programming Reference.
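As a hedged illustration of the transition cost described here (the helper name and callback are hypothetical, not from the PR): when 256-bit AVX code is about to hand off to code that may use legacy-encoded SSE instructions, the _mm256_zeroupper() intrinsic emits the vzeroupper that avoids the penalty, at the cost of clearing the upper 128 bits of the ymm registers, exactly the trade-off the comment points out.

    #include <immintrin.h>

    /* Illustrative only: insert vzeroupper before calling into code
       that may contain legacy-encoded SSE instructions.  */
    void call_legacy_sse_code(void (*legacy_fn)(void))
    {
        _mm256_zeroupper();   /* compiles to vzeroupper; upper ymm halves are lost */
        legacy_fn();
    }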
--- Comment #2 from Marc Glisse marc.glisse at normalesup dot org 2012-03-13 08:16:58 UTC ---
(In reply to comment #1)
> Have you actually tried that?

Ah, no, sorry, I only have occasional access to such a machine to benchmark
the code.  From a -Os perspective it is still shorter (but indeed that matters
less to me than -O3 performance).

> Mixing VEX-encoded insns with legacy-encoded SSE* insns is very costly; for
> good performance there needs to be a vzeroupper in between (but then you
> lose the upper bits).  See e.g. 2.8 in the AVX Programming Reference.

Thanks, I'd missed that.  The vblendpd solution should still apply (from the
initial 'v' it sounds safe), no?
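A sketch of what that all-VEX vblendpd variant could look like at the intrinsics level (assumed intent; the function name is made up): a single 256-bit blend against a zeroed register, so no legacy-encoded SSE instruction is mixed in and no vzeroupper is needed.

    #include <immintrin.h>

    /* Sketch only: zero element 0 of a __m256d with one vblendpd.
       Bit 0 of the immediate selects element 0 from the zero operand.  */
    static __m256d zero_elem0_blendpd(__m256d x)
    {
        return _mm256_blend_pd(x, _mm256_setzero_pd(), 0x1);
    }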
--- Comment #3 from Marc Glisse marc.glisse at normalesup dot org 2012-03-13 17:57:58 UTC ---
Or for this variant:

    __m256d f(__m256d *y){
        __m256d x = *y;
        x[0] = 0;   // or x[3]
        return x;
    }

it looks like vmaskmovpd could replace:

        vmovapd         (%rdi), %ymm0
        vmovapd         %xmm0, %xmm1
        vmovlpd         .LC0(%rip), %xmm1, %xmm1
        vinsertf128     $0x0, %xmm1, %ymm0, %ymm0

(I tried a version with __builtin_shuffle but it wouldn't generate vmaskmovpd
either.)

(Sorry for the naive suggestions, there are too many possibilities to optimize
them all...)
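A sketch of the vmaskmovpd idea for this load variant, assuming element 0 should end up zero (the function name and pointer type are illustrative, not from the PR): load only elements 1..3 of the source and let the masked-off lane be written as zero, so the whole function collapses to a single masked load.

    #include <immintrin.h>

    /* Sketch only: lanes whose mask high bit is clear are zeroed
       instead of loaded, so lane 0 of the result is 0.0.  */
    __m256d f_maskload(const double *y)
    {
        const __m256i mask = _mm256_set_epi64x(-1, -1, -1, 0);
        return _mm256_maskload_pd(y, mask);
    }

The same intrinsic with the mask flipped around (0 in lane 3) would cover the x[3] = 0 case mentioned in the comment.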