http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52908
--- Comment #2 from Uros Bizjak <ubizjak at gmail dot com> 2012-04-09 11:48:05 UTC ---
Created attachment 27117
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27117
Proposed patch

There are indeed two problems with the XOP patterns:

a) duplication of the *sse4_1_mulv2siv2di3 pattern
b) wrong order of operands in all (!!!) XOP patterns. XOP patterns consider
   element 0 as the MSB.

The attached patch solves this by simply removing the fake
xop_mulv2div2di3_{low,high} patterns and expanding to the (fixed)
xop_pmacsdq{h,l} patterns directly. There is simply no need to use vpmacsdql
instead of vpmuldq. For consistency, the patch expands to the xop_pmacsdql
pattern, but gcc figures out that the addition of 0 is unneeded and
substitutes the MAC insn with a plain MUL.

The attached patch does not even try to fix the other intrinsics. Someone
familiar with the AMD documentation should review all of these, since the
documentation (43479.pdf) is somewhat inconsistent (e.g. the figure that
explains VPMADCSSWD is inconsistent with its description).

Since I don't have an XOP processor, I can only eyeball the asm, in this case:

	vpxor	%xmm3, %xmm3, %xmm3
	xorl	%eax, %eax
.L3:
	vpshufd	$216, c2(%rax), %xmm1
	vpshufd	$216, c3(%rax), %xmm0
	vpmuldq	%xmm0, %xmm1, %xmm2
	vpmacsdqh	%xmm3, %xmm0, %xmm1, %xmm0
	vmovdqa	%xmm2, e1(%rax,%rax)
	vmovdqa	%xmm0, e1+16(%rax,%rax)
	addq	$16, %rax
	cmpq	$2048, %rax
	jne	.L3

Please also note the hoisting of the constant load (%xmm3 is zeroed once,
before the loop).