[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

Andrew Pinski changed:

           What    |Removed                       |Added
 ----------------------------------------------------------------
           Assignee|pinskia at gcc dot gnu.org    |unassigned at gcc dot gnu.org
             Status|ASSIGNED                      |NEW

--- Comment #11 from Andrew Pinski ---
No longer working on this.
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

Andrew Pinski changed:

           What    |Removed                       |Added
 ----------------------------------------------------------------
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |pinskia at gcc dot gnu.org

--- Comment #10 from Andrew Pinski ---
The bit-insert-expr issue is mine for GCC 13.
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

--- Comment #9 from Thomas Koenig ---
The generated code for the loop seems to be on par with what clang and icc do, so that part is fixed. Initialization is strange for icc. For clang, it is really quite short:

foo:                                    # @foo
        .cfi_startproc
# BB#0:
        vxorps  %ymm2, %ymm2, %ymm2
        vmovapd .LCPI0_0(%rip), %ymm8   # ymm8 = [4.00e+00,4.00e+00,4.00e+00,4.00e+00]
        vmovapd %ymm1, %ymm4
        vmovapd %ymm0, %ymm5
        .p2align        4, 0x90
.LBB0_1:                                # =>This I

vs. gcc:

foo:
.LFB0:
        .cfi_startproc
        vmovsd  .LC0(%rip), %xmm2
        vmovapd %ymm1, %ymm7
        vpxor   %xmm5, %xmm5, %xmm5
        vmovq   %xmm2, %xmm9
        vmulpd  %ymm1, %ymm1, %ymm10
        vmovapd %xmm9, %xmm9
        vunpcklpd       %xmm2, %xmm9, %xmm3
        vinsertf128     $0x0, %xmm3, %ymm9, %ymm9
        vextractf128    $0x1, %ymm9, %xmm3
        vmovsd  %xmm2, %xmm3, %xmm3
        vinsertf128     $0x1, %xmm3, %ymm9, %ymm9
        vextractf128    $0x1, %ymm9, %xmm3
        vunpcklpd       %xmm2, %xmm3, %xmm3
        vmovsd  .LC1(%rip), %xmm2
        vmovq   %xmm2, %xmm8
        vinsertf128     $0x1, %xmm3, %ymm9, %ymm9
        vmovapd %xmm8, %xmm8
        vunpcklpd       %xmm2, %xmm8, %xmm3
        vinsertf128     $0x0, %xmm3, %ymm8, %ymm8
        vextractf128    $0x1, %ymm8, %xmm3
        vmovsd  %xmm2, %xmm3, %xmm3
        vinsertf128     $0x1, %xmm3, %ymm8, %ymm8
        vextractf128    $0x1, %ymm8, %xmm3
        vunpcklpd       %xmm2, %xmm3, %xmm3
        vinsertf128     $0x1, %xmm3, %ymm8, %ymm8
        vmovapd %ymm0, %ymm3
        jmp     .L3
        .p2align 4,,10
        .p2align 3
.L13:
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

--- Comment #8 from Marc Glisse ---
Thomas, the code generated by gcc has changed (after some patches by Jakub IIRC). Do you consider the issue fixed, or is the generated asm still problematic?

.L13:
        vpextrq $1, %xmm2, %rax
        testq   %rax, %rax
        je      .L2
        vextractf128    $0x1, %ymm2, %xmm2
        vmovq   %xmm2, %rax
        testq   %rax, %rax
        jne     .L2
        vpextrq $1, %xmm2, %rax
        vmovapd %ymm4, %ymm3
        testq   %rax, %rax
        jne     .L2
.L3:
        vmulpd  %ymm3, %ymm3, %ymm4
        vmulpd  %ymm8, %ymm3, %ymm3
        vsubpd  %ymm10, %ymm4, %ymm4
        vmulpd  %ymm9, %ymm3, %ymm3
        vaddpd  %ymm0, %ymm4, %ymm4
        vaddpd  %ymm1, %ymm3, %ymm9
        vaddpd  %ymm4, %ymm4, %ymm2
        vmulpd  %ymm9, %ymm9, %ymm10
        vaddpd  %ymm10, %ymm2, %ymm2
        vcmpltpd        %ymm7, %ymm2, %ymm2
        vpaddq  %xmm2, %xmm5, %xmm3
        vextractf128    $1, %ymm2, %xmm6
        vmovq   %xmm2, %rax
        vextractf128    $1, %ymm5, %xmm5
        testq   %rax, %rax
        vpaddq  %xmm6, %xmm5, %xmm5
        vinsertf128     $0x1, %xmm5, %ymm3, %ymm5
        jne     .L13
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

--- Comment #7 from Marc Glisse ---
Created attachment 41331
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41331&action=edit
recognize a VEC_CONCAT from a constructor (not clean)

One piece of the issue is v4di = { v2di, v2di }, where we currently generate

        vmovdqa %xmm3, -48(%rsp)
        vmovdqa %xmm5, -32(%rsp)
        vmovdqa -48(%rsp), %ymm0

and the attached patch generates

        vinsertf128     $0x1, %xmm1, %ymm0, %ymm0

I am not very familiar with expansion and RTL; the patch probably has many issues. I don't know if there is something significantly more general to try. I was tempted to cast (aka subreg) V2DI to TI, construct a V2TI, and cast back to V4DI, since the code nearby is supposed to handle constructors with only scalar elements, but an experiment with __int128 seems to indicate that we don't discover vec_concat for scalars either :-(
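At the C level, the v4di = { v2di, v2di } constructor the patch targets arises when a 256-bit vector is assembled from two 128-bit halves. A minimal sketch of the semantics (an element-wise concatenation written in GNU C; the names `concat`, `v2di`, and `v4di` are illustrative, not from the patch):

```c
#include <assert.h>

typedef long long v2di __attribute__((vector_size(16)));
typedef long long v4di __attribute__((vector_size(32)));

/* Build a 256-bit vector from two 128-bit halves, low half first.
   The patch's goal is for this kind of constructor to expand to a
   single vinsertf128 instead of a round trip through the stack. */
static v4di concat(v2di lo, v2di hi)
{
    return (v4di){ lo[0], lo[1], hi[0], hi[1] };
}
```

The test of the patch is then purely about code generation: the result is the same either way, but the asm should use register-to-register inserts rather than stack slots.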
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

Jakub Jelinek changed:

           What    |Removed |Added
 ----------------------------------------------------------------
                 CC|        |jakub at gcc dot gnu.org

--- Comment #6 from Jakub Jelinek ---
For the BIT_INSERT_EXPR foldings, the question is whether we want to handle them one by one even when there is an undefined value in the first one, or whether in that case we should fold only if we eliminate all the undefined values and get a constant VECTOR_CST.
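The distinction Jakub draws can be illustrated at the source level (a sketch; the function name is illustrative): when every lane of an otherwise-undefined vector is overwritten by constant stores, the chain of BIT_INSERT_EXPRs leaves no undefined lanes, so folding the whole thing to a constant VECTOR_CST is clearly safe. The delicate case is folding partially, while some lanes are still undefined.

```c
#include <assert.h>

typedef long long v2di __attribute__((vector_size(16)));

/* Both lanes of the (otherwise undefined) vector are overwritten, so
   after the two BIT_INSERT_EXPRs the value is the constant {7, 7} and
   could fold to a VECTOR_CST with no undefined lanes remaining. */
static v2di all_lanes_defined(void)
{
    v2di v;      /* contents undefined until the stores below */
    v[0] = 7;
    v[1] = 7;
    return v;
}
```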
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

--- Comment #5 from Richard Biener ---
Best split this bug into the BIT_INSERT_EXPR foldings (yeah, those mentioned are missing) and the ira/reload issue.
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

Marc Glisse changed:

           What          |Removed     |Added
 ----------------------------------------------------------------
             Status      |UNCONFIRMED |NEW
   Last reconfirmed      |            |2017-02-24
     Ever confirmed      |0           |1

--- Comment #4 from Marc Glisse ---
(In reply to Marc Glisse from comment #2)
> In reload, subregs are extracted via the stack, whereas the low subreg
> should already be available (NOP) and the high one can be extracted by a
> single insn. That's probably the first thing to investigate. (-mtune doesn't
> change what happens)

To concentrate on this, with -O3 -mavx:

typedef long int v4i __attribute__((vector_size (32)));
v4i foo(v4i a, v4i b) { return a+b; }

        vmovdqa %ymm0, -80(%rbp)
        vmovdqa %ymm1, -112(%rbp)
        vmovdqa -80(%rbp), %xmm4
        vmovdqa -64(%rbp), %xmm6
        vpaddq  -112(%rbp), %xmm4, %xmm3
        vpaddq  -96(%rbp), %xmm6, %xmm5
        vmovaps %xmm3, -48(%rbp)
        vmovaps %xmm5, -32(%rbp)
        vmovdqa -48(%rbp), %ymm0

(plus overhead to align the stack, etc.) compared to clang's

        vextractf128    $1, %ymm0, %xmm2
        vextractf128    $1, %ymm1, %xmm3
        vpaddq  %xmm2, %xmm3, %xmm2
        vpaddq  %xmm0, %xmm1, %xmm0
        vinsertf128     $1, %xmm2, %ymm0, %ymm0
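The reduced testcase above is ordinary GNU C, and -mavx only changes code generation, not semantics, so it can be wrapped in a quick correctness harness while comparing the asm (the harness itself is an editor's sketch, not part of the report):

```c
#include <assert.h>

/* The reduced testcase from comment #4; compile with -O3 -mavx and
   inspect the asm (gcc -S) to see whether the two 128-bit halves are
   shuffled through stack slots or handled in registers. */
typedef long int v4i __attribute__((vector_size (32)));

v4i foo(v4i a, v4i b) { return a + b; }
```

Note that GCC lowers generic vector types on targets without native 256-bit support, so the harness runs correctly even without AVX; only the generated instruction sequence differs.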
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

--- Comment #3 from Marc Glisse ---
(In reply to Marc Glisse from comment #2)
> We have trouble with your VSET macro (known issue):
> two_28 = BIT_INSERT_EXPR <...>;
> two_29 = BIT_INSERT_EXPR <...>;
> two_30 = BIT_INSERT_EXPR <...>;
> two_31 = BIT_INSERT_EXPR <...>;
> it is easier for gcc if you write:
> v4do two={2,2,2,2};
> or you could even replace two with 2 in the expressions, gcc handles it just
> fine.

This part is not at all central to this PR, but we are really missing optimizations on the new BIT_INSERT_EXPR.

typedef long vec __attribute__((vector_size(16)));
vec f(){
  long l;
  vec v={l,l};
  v[0]=0;
  v[1]=0;
  return v;
}

_1 = {l_2(D), l_2(D)};
v_4 = BIT_INSERT_EXPR <_1, 0, 0 (64 bits)>;
v_5 = BIT_INSERT_EXPR <v_4, 0, 64 (64 bits)>;

so gcc could replace {l,l} with {0,l} after the first bit_insert_expr (_1 has a single use), and then with {0,0} after the second, one element at a time, the easy case (though constructors are always tricky).
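The expected end result of the folding chain described above is that f() simply returns the constant vector {0, 0}. A runnable version of the testcase (with l initialized only so the harness itself has no undefined behavior; the original deliberately leaves it uninitialized to exercise the undefined-value case):

```c
#include <assert.h>

typedef long vec __attribute__((vector_size(16)));

/* Both lanes of {l, l} are overwritten with 0, so after the two
   BIT_INSERT_EXPRs the whole function should fold to returning the
   constant vector {0, 0}. */
vec f(void)
{
    long l = 1;   /* initialized here for the harness only */
    vec v = { l, l };
    v[0] = 0;
    v[1] = 0;
    return v;
}
```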
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

--- Comment #2 from Marc Glisse ---
We have trouble with your VSET macro (known issue):

two_28 = BIT_INSERT_EXPR <...>;
two_29 = BIT_INSERT_EXPR <...>;
two_30 = BIT_INSERT_EXPR <...>;
two_31 = BIT_INSERT_EXPR <...>;

it is easier for gcc if you write:

v4do two={2,2,2,2};

or you could even replace two with 2 in the expressions, gcc handles it just fine.

In reload, subregs are extracted via the stack, whereas the low subreg should already be available (NOP) and the high one can be extracted by a single insn. That's probably the first thing to investigate. (-mtune doesn't change what happens)

res could be kept in a register (or even better a pair of registers) through the whole loop.
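The reporter's VSET macro is not shown in this thread; a hypothetical reconstruction of the pattern being discussed (element-by-element stores, which GIMPLE turns into a chain of BIT_INSERT_EXPRs) next to the initializer form the comment recommends:

```c
#include <assert.h>

typedef double v4do __attribute__((vector_size(32)));

/* Hypothetical VSET: a per-element store. Each use becomes one
   BIT_INSERT_EXPR in GIMPLE, producing the chain quoted above. */
#define VSET(v, i, x) ((v)[i] = (x))

static v4do two_via_vset(void)
{
    v4do two = { 0, 0, 0, 0 };   /* zero-init to avoid undefined lanes */
    VSET(two, 0, 2);
    VSET(two, 1, 2);
    VSET(two, 2, 2);
    VSET(two, 3, 2);
    return two;
}

/* The form the comment recommends: one constructor GCC folds easily. */
static v4do two_via_initializer(void)
{
    v4do two = { 2, 2, 2, 2 };
    return two;
}
```

Both produce the same value; the difference is only in how much work the optimizers must do to recover the constant vector.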
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

Andrew Pinski changed:

           What    |Removed |Added
 ----------------------------------------------------------------
           Keywords|        |missed-optimization, ra

--- Comment #1 from Andrew Pinski ---
Does -march=intel fix the issue? I suspect this is just a tuning issue where the default (generic) tuning turns on an option which is needed for better performance on some AMD machines.
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709

Thomas Koenig changed:

           What    |Removed |Added
 ----------------------------------------------------------------
             Target|        |x86_64-pc-linux-gnu
           Severity|normal  |enhancement