[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

Andrew Pinski changed:

           What    |Removed                       |Added
 ----------------------------------------------------------------
           Assignee|pinskia at gcc dot gnu.org    |unassigned at gcc dot gnu.org
             Status|ASSIGNED                      |NEW

--- Comment #11 from Andrew Pinski ---
No longer working on this.
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

Andrew Pinski changed:

           What    |Removed                       |Added
 ----------------------------------------------------------------
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |pinskia at gcc dot gnu.org

--- Comment #10 from Andrew Pinski ---
The bit-insert-expr issue is mine for GCC 13.
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

--- Comment #9 from Thomas Koenig ---
The generated code for the loop seems to be on par with what clang and icc do, so that part is fixed. Initialization is strange for icc. For clang, it is really quite short:

foo:                                    # @foo
        .cfi_startproc
# BB#0:
        vxorps  %ymm2, %ymm2, %ymm2
        vmovapd .LCPI0_0(%rip), %ymm8   # ymm8 = [4.00e+00,4.00e+00,4.00e+00,4.00e+00]
        vmovapd %ymm1, %ymm4
        vmovapd %ymm0, %ymm5
        .p2align        4, 0x90
.LBB0_1:                                # =>This I

vs. gcc:

foo:
.LFB0:
        .cfi_startproc
        vmovsd  .LC0(%rip), %xmm2
        vmovapd %ymm1, %ymm7
        vpxor   %xmm5, %xmm5, %xmm5
        vmovq   %xmm2, %xmm9
        vmulpd  %ymm1, %ymm1, %ymm10
        vmovapd %xmm9, %xmm9
        vunpcklpd       %xmm2, %xmm9, %xmm3
        vinsertf128     $0x0, %xmm3, %ymm9, %ymm9
        vextractf128    $0x1, %ymm9, %xmm3
        vmovsd  %xmm2, %xmm3, %xmm3
        vinsertf128     $0x1, %xmm3, %ymm9, %ymm9
        vextractf128    $0x1, %ymm9, %xmm3
        vunpcklpd       %xmm2, %xmm3, %xmm3
        vmovsd  .LC1(%rip), %xmm2
        vmovq   %xmm2, %xmm8
        vinsertf128     $0x1, %xmm3, %ymm9, %ymm9
        vmovapd %xmm8, %xmm8
        vunpcklpd       %xmm2, %xmm8, %xmm3
        vinsertf128     $0x0, %xmm3, %ymm8, %ymm8
        vextractf128    $0x1, %ymm8, %xmm3
        vmovsd  %xmm2, %xmm3, %xmm3
        vinsertf128     $0x1, %xmm3, %ymm8, %ymm8
        vextractf128    $0x1, %ymm8, %xmm3
        vunpcklpd       %xmm2, %xmm3, %xmm3
        vinsertf128     $0x1, %xmm3, %ymm8, %ymm8
        vmovapd %ymm0, %ymm3
        jmp     .L3
        .p2align 4,,10
        .p2align 3
.L13:
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

--- Comment #8 from Marc Glisse ---
Thomas, the code generated by gcc has changed (after some patches by Jakub IIRC). Do you consider the issue fixed, or is the generated asm still problematic?

.L13:
        vpextrq $1, %xmm2, %rax
        testq   %rax, %rax
        je      .L2
        vextractf128    $0x1, %ymm2, %xmm2
        vmovq   %xmm2, %rax
        testq   %rax, %rax
        jne     .L2
        vpextrq $1, %xmm2, %rax
        vmovapd %ymm4, %ymm3
        testq   %rax, %rax
        jne     .L2
.L3:
        vmulpd  %ymm3, %ymm3, %ymm4
        vmulpd  %ymm8, %ymm3, %ymm3
        vsubpd  %ymm10, %ymm4, %ymm4
        vmulpd  %ymm9, %ymm3, %ymm3
        vaddpd  %ymm0, %ymm4, %ymm4
        vaddpd  %ymm1, %ymm3, %ymm9
        vaddpd  %ymm4, %ymm4, %ymm2
        vmulpd  %ymm9, %ymm9, %ymm10
        vaddpd  %ymm10, %ymm2, %ymm2
        vcmpltpd        %ymm7, %ymm2, %ymm2
        vpaddq  %xmm2, %xmm5, %xmm3
        vextractf128    $1, %ymm2, %xmm6
        vmovq   %xmm2, %rax
        vextractf128    $1, %ymm5, %xmm5
        testq   %rax, %rax
        vpaddq  %xmm6, %xmm5, %xmm5
        vinsertf128     $0x1, %xmm5, %ymm3, %ymm5
        jne     .L13
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

--- Comment #7 from Marc Glisse ---
Created attachment 41331
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41331&action=edit
recognize a VEC_CONCAT from a constructor (not clean)

One piece of the issue is v4di = { v2di, v2di }, where we currently generate

        vmovdqa %xmm3, -48(%rsp)
        vmovdqa %xmm5, -32(%rsp)
        vmovdqa -48(%rsp), %ymm0

and the attached patch generates

        vinsertf128     $0x1, %xmm1, %ymm0, %ymm0

I am not very familiar with expansion and RTL; the patch probably has many issues. I don't know if there is something significantly more general to try. I was tempted to cast (aka subreg) V2DI to TI, construct a V2TI, and cast back to V4DI, since the code nearby is supposed to handle constructors with only scalar elements, but an experiment with __int128 seems to indicate that we don't discover vec_concat for scalars either :-(
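At the C level, the v4di = { v2di, v2di } constructor the patch targets arises when a 256-bit vector is assembled from two 128-bit halves. A minimal sketch of the semantics (an element-wise concatenation written in GNU C; the names `concat`, `v2di`, and `v4di` are illustrative, not from the patch):

```c
#include <assert.h>

typedef long long v2di __attribute__((vector_size(16)));
typedef long long v4di __attribute__((vector_size(32)));

/* Build a 256-bit vector from two 128-bit halves, low half first.
   The patch's goal is for this kind of constructor to expand to a
   single vinsertf128 instead of a round trip through the stack. */
static v4di concat(v2di lo, v2di hi)
{
    return (v4di){ lo[0], lo[1], hi[0], hi[1] };
}
```

The test of the patch is then purely about code generation: the result is the same either way, but the asm should use register-to-register inserts rather than stack slots.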
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

Jakub Jelinek changed:

           What    |Removed |Added
 ----------------------------------------------------------------
                 CC|        |jakub at gcc dot gnu.org

--- Comment #6 from Jakub Jelinek ---
For the BIT_INSERT_EXPR foldings, the question is whether we want to handle them one by one even when there is an undefined value in the first one, or whether in that case we should fold only if we eliminate all the undefined values and get a constant VECTOR_CST.
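The distinction Jakub draws can be illustrated at the source level (a sketch; the function name is illustrative): when every lane of an otherwise-undefined vector is overwritten by constant stores, the chain of BIT_INSERT_EXPRs leaves no undefined lanes, so folding the whole thing to a constant VECTOR_CST is clearly safe. The delicate case is folding partially, while some lanes are still undefined.

```c
#include <assert.h>

typedef long long v2di __attribute__((vector_size(16)));

/* Both lanes of the (otherwise undefined) vector are overwritten, so
   after the two BIT_INSERT_EXPRs the value is the constant {7, 7} and
   could fold to a VECTOR_CST with no undefined lanes remaining. */
static v2di all_lanes_defined(void)
{
    v2di v;      /* contents undefined until the stores below */
    v[0] = 7;
    v[1] = 7;
    return v;
}
```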
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

--- Comment #5 from Richard Biener ---
Best split this bug into the BIT_INSERT_EXPR foldings (yeah, those mentioned are missing) and the ira/reload issue.
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

Marc Glisse changed:

           What          |Removed     |Added
 ----------------------------------------------------------------
             Status      |UNCONFIRMED |NEW
   Last reconfirmed      |            |2017-02-24
     Ever confirmed      |0           |1

--- Comment #4 from Marc Glisse ---
(In reply to Marc Glisse from comment #2)
> In reload, subregs are extracted via the stack, whereas the low subreg
> should already be available (NOP) and the high one can be extracted by a
> single insn. That's probably the first thing to investigate. (-mtune doesn't
> change what happens)

To concentrate on this, with -O3 -mavx:

typedef long int v4i __attribute__((vector_size (32)));
v4i foo(v4i a, v4i b) { return a+b; }

        vmovdqa %ymm0, -80(%rbp)
        vmovdqa %ymm1, -112(%rbp)
        vmovdqa -80(%rbp), %xmm4
        vmovdqa -64(%rbp), %xmm6
        vpaddq  -112(%rbp), %xmm4, %xmm3
        vpaddq  -96(%rbp), %xmm6, %xmm5
        vmovaps %xmm3, -48(%rbp)
        vmovaps %xmm5, -32(%rbp)
        vmovdqa -48(%rbp), %ymm0

(plus overhead to align the stack, etc.) compared to clang's

        vextractf128    $1, %ymm0, %xmm2
        vextractf128    $1, %ymm1, %xmm3
        vpaddq  %xmm2, %xmm3, %xmm2
        vpaddq  %xmm0, %xmm1, %xmm0
        vinsertf128     $1, %xmm2, %ymm0, %ymm0
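The reduced testcase above is ordinary GNU C, and -mavx only changes code generation, not semantics, so it can be wrapped in a quick correctness harness while comparing the asm (the harness itself is an editor's sketch, not part of the report):

```c
#include <assert.h>

/* The reduced testcase from comment #4; compile with -O3 -mavx and
   inspect the asm (gcc -S) to see whether the two 128-bit halves are
   shuffled through stack slots or handled in registers. */
typedef long int v4i __attribute__((vector_size (32)));

v4i foo(v4i a, v4i b) { return a + b; }
```

Note that GCC lowers generic vector types on targets without native 256-bit support, so the harness runs correctly even without AVX; only the generated instruction sequence differs.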
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

--- Comment #3 from Marc Glisse ---
(In reply to Marc Glisse from comment #2)
> We have trouble with your VSET macro (known issue):
> two_28 = BIT_INSERT_EXPR <...>;
> two_29 = BIT_INSERT_EXPR <...>;
> two_30 = BIT_INSERT_EXPR <...>;
> two_31 = BIT_INSERT_EXPR <...>;
> it is easier for gcc if you write:
> v4do two={2,2,2,2};
> or you could even replace two with 2 in the expressions, gcc handles it just
> fine.

This part is not at all central to this PR, but we are really missing optimizations on the new BIT_INSERT_EXPR.

typedef long vec __attribute__((vector_size(16)));
vec f(){
  long l;
  vec v={l,l};
  v[0]=0;
  v[1]=0;
  return v;
}

_1 = {l_2(D), l_2(D)};
v_4 = BIT_INSERT_EXPR <_1, 0, 0 (64 bits)>;
v_5 = BIT_INSERT_EXPR <v_4, 0, 64 (64 bits)>;

so gcc could replace {l,l} with {0,l} after the first bit_insert_expr (_1 has a single use), and then with {0,0} after the second, one element at a time, the easy case (though constructors are always tricky).
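The expected end result of the folding chain described above is that f() simply returns the constant vector {0, 0}. A runnable version of the testcase (with l initialized only so the harness itself has no undefined behavior; the original deliberately leaves it uninitialized to exercise the undefined-value case):

```c
#include <assert.h>

typedef long vec __attribute__((vector_size(16)));

/* Both lanes of {l, l} are overwritten with 0, so after the two
   BIT_INSERT_EXPRs the whole function should fold to returning the
   constant vector {0, 0}. */
vec f(void)
{
    long l = 1;   /* initialized here for the harness only */
    vec v = { l, l };
    v[0] = 0;
    v[1] = 0;
    return v;
}
```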
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

--- Comment #2 from Marc Glisse ---
We have trouble with your VSET macro (known issue):

two_28 = BIT_INSERT_EXPR <...>;
two_29 = BIT_INSERT_EXPR <...>;
two_30 = BIT_INSERT_EXPR <...>;
two_31 = BIT_INSERT_EXPR <...>;

it is easier for gcc if you write:

v4do two={2,2,2,2};

or you could even replace two with 2 in the expressions, gcc handles it just fine.

In reload, subregs are extracted via the stack, whereas the low subreg should already be available (NOP) and the high one can be extracted by a single insn. That's probably the first thing to investigate. (-mtune doesn't change what happens)

res could be kept in a register (or even better a pair of registers) through the whole loop.
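The reporter's VSET macro is not shown in this thread; a hypothetical reconstruction of the pattern being discussed (element-by-element stores, which GIMPLE turns into a chain of BIT_INSERT_EXPRs) next to the initializer form the comment recommends:

```c
#include <assert.h>

typedef double v4do __attribute__((vector_size(32)));

/* Hypothetical VSET: a per-element store. Each use becomes one
   BIT_INSERT_EXPR in GIMPLE, producing the chain quoted above. */
#define VSET(v, i, x) ((v)[i] = (x))

static v4do two_via_vset(void)
{
    v4do two = { 0, 0, 0, 0 };   /* zero-init to avoid undefined lanes */
    VSET(two, 0, 2);
    VSET(two, 1, 2);
    VSET(two, 2, 2);
    VSET(two, 3, 2);
    return two;
}

/* The form the comment recommends: one constructor GCC folds easily. */
static v4do two_via_initializer(void)
{
    v4do two = { 2, 2, 2, 2 };
    return two;
}
```

Both produce the same value; the difference is only in how much work the optimizers must do to recover the constant vector.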
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

Andrew Pinski changed:

           What    |Removed |Added
 ----------------------------------------------------------------
           Keywords|        |missed-optimization, ra

--- Comment #1 from Andrew Pinski ---
Does -march=intel fix the issue? I suspect this is just a tuning issue where the default (generic) tuning turns on an option which is needed for better performance on some AMD machines.
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

Thomas Koenig changed:

           What    |Removed |Added
 ----------------------------------------------------------------
             Target|        |x86_64-pc-linux-gnu
           Severity|normal  |enhancement