[Bug target/112943] [14 Regression] ICE: in gen_reg_rtx, at emit-rtl.cc:1176 with -O2 -march=westmere -mapxf

2023-12-11 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112943

Hongyu Wang  changed:

   What|Removed |Added

 CC||wwwhhhyyy333 at gmail dot com

--- Comment #2 from Hongyu Wang  ---
Sorry for introducing this. A patch is posted at
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640174.html

[Bug middle-end/112824] Stack spills and vector splitting with vector builtins

2023-12-06 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824

Hongyu Wang  changed:

   What|Removed |Added

 CC||wwwhhhyyy333 at gmail dot com

--- Comment #9 from Hongyu Wang  ---
(In reply to Hongtao Liu from comment #4)
> there're 2 reasons.

> 2. There's still spills for (subreg:DF (reg: V8DF) since
> ix86_modes_tieable_p return false for DF and V8DF.
> 

There could be an issue in SRA: the aggregates are not fully scalarized
because of the size limit.

SRA computes the maximum aggregate size as move_ratio * UNITS_PER_WORD, but
here the aggregate Dual, 2l> actually contains several V8DF
components that can be held in zmm registers under avx512f.

Adding --param sra-max-scalarization-size-Ospeed=2048 eliminates those spills.

So for SRA we could consider using MOVE_MAX * move_ratio as the size limit for
Ospeed, which reflects the real backend instruction count.
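
To make the difference between the two limits concrete, here is a small
sketch. The constants are assumptions for illustration only (8-byte words,
64-byte maximum move for zmm registers, 64-byte V8DF members), and the helper
names are hypothetical; none of this is taken from the GCC sources.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for x86-64 with AVX512F; illustrative only.  */
enum { SKETCH_UNITS_PER_WORD = 8, SKETCH_MOVE_MAX = 64, SKETCH_V8DF_BYTES = 64 };

/* Size limit as described for the current heuristic:
   move_ratio * UNITS_PER_WORD.  */
static size_t
sra_limit_word_based (unsigned move_ratio)
{
  assert (move_ratio > 0);
  return (size_t) move_ratio * SKETCH_UNITS_PER_WORD;
}

/* Size limit as proposed in the comment: MOVE_MAX * move_ratio.  */
static size_t
sra_limit_move_max_based (unsigned move_ratio)
{
  assert (move_ratio > 0);
  return (size_t) move_ratio * SKETCH_MOVE_MAX;
}

/* Nonzero if an aggregate made of N_VECTORS V8DF members fits LIMIT.  */
static int
aggregate_fits (size_t n_vectors, size_t limit)
{
  return n_vectors * SKETCH_V8DF_BYTES <= limit;
}
```

With an assumed move ratio of 16, the word-based limit is 128 bytes, so an
aggregate of four V8DF members (256 bytes) is rejected, while the
MOVE_MAX-based limit of 1024 bytes accepts it.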

[Bug testsuite/112729] gcc.target/i386/apx-interrupt-1.c etc. FAIL

2023-11-28 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112729

--- Comment #7 from Hongyu Wang  ---
(In reply to r...@cebitec.uni-bielefeld.de from comment #5)
> 
> Is there a reason to have -fomit-frame-pointer once before and once
> after -mapx-features=push2pop2?

Ah, thanks for pointing that out. Will adjust the order to keep them after
-mapx-features.

[Bug testsuite/112729] gcc.target/i386/apx-interrupt-1.c etc. FAIL

2023-11-27 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112729

--- Comment #3 from Hongyu Wang  ---
Created attachment 56703
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56703&action=edit
A patch

Hi Rainer, could you help verify whether the change makes these tests pass on
Solaris/FreeBSD?

[Bug testsuite/112729] gcc.target/i386/apx-interrupt-1.c etc. FAIL

2023-11-27 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112729

--- Comment #2 from Hongyu Wang  ---
The CFI scan failures are caused by -fno-omit-frame-pointer, which forces the
frame pointer to be pushed first, so the CFI info differs. By default we have
-fomit-frame-pointer on Linux, but not on other targets. I'd just add
-fomit-frame-pointer to these tests.

[Bug target/112394] ICE: in extract_constrain_insn, at recog.cc:2705 insn does not satisfy its constraints: {*vec_extractv2di_1} with -O -mavx512vbmi2 -mapxf -mno-sse4.2

2023-11-07 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112394

--- Comment #2 from Hongyu Wang  ---
Should be fixed.

[Bug tree-optimization/112325] New: Missed vectorization after cunrolli

2023-10-31 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

Bug ID: 112325
   Summary: Missed vectorization after cunrolli
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wwwhhhyyy333 at gmail dot com
  Target Milestone: ---

testcase:

#include <stdint.h>
#include <string.h>

typedef struct {
    float s;
    int8_t qs[32];
} block;

void foo (const int n, float * restrict s, const int8_t q[4],
          const block * restrict y) {
    const int qk = 32;
    const int nb = n / qk;

    float sumf = 0.0;
    int sumi = 0;

    for (int i = 0; i < nb; i++) {
        uint32_t qh;
        memcpy(&qh, q, 4);

        for (int j = 0; j < qk/2; ++j) {
            sumi += (qh >> j) * y[i].qs[j];
        }
        sumf += (y[i].s * (float) sumi);
    }
    *s = sumf;
}

This can be vectorized under -O2 -mavx512vl but not -O3 -mavx512vl, see
https://godbolt.org/z/csPr4cPen

Under -O3 -mavx512vl -fdisable-tree-cunrolli the loop can also be vectorized.

[Bug target/111127] [13/14 regression] Wrong code for avx512ne2ps2bf16_maskz intrinsics since gcc13

2023-08-24 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111127

Hongyu Wang  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from Hongyu Wang  ---
Fixed on trunk and gcc13.

[Bug target/111127] New: Wrong code for avx512ne2ps2bf16_maskz intrinsics since gcc13

2023-08-24 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111127

Bug ID: 111127
   Summary: Wrong code for avx512ne2ps2bf16_maskz intrinsics since
gcc13
   Product: gcc
   Version: 13.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wwwhhhyyy333 at gmail dot com
  Target Milestone: ---

cat test.c

#include <immintrin.h>

__m512bh cvttest(__mmask32 k, __m512 a, __m512 b)
{
  return _mm512_maskz_cvtne2ps_pbh (k,a,b);  
}

gcc -O2 -mavx512bf16

kmovd   %edi, %k1
vcvtne2ps2bf16  %zmm0, %zmm1, %zmm0{%k1}{z}
ret

The code is wrong compared to clang: the input operand order is inverted.

See https://godbolt.org/z/b161deerY
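
For reference, a scalar model of the expected element order can help when
checking the generated code. The sketch below assumes the semantics as I read
them from Intel's description of the intrinsic: the low 16 result elements
come from b, the high 16 from a, and masked-off elements are zeroed. The
function names are hypothetical, and the conversion is simplified to
truncation (the real instruction rounds to nearest even), so only the
ordering and masking are modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified float -> bf16 conversion: keep the top 16 bits.
   This truncates; the hardware instruction rounds to nearest even.  */
static uint16_t
bf16_trunc (float f)
{
  uint32_t bits;
  memcpy (&bits, &f, sizeof bits);
  return (uint16_t) (bits >> 16);
}

/* Scalar model of the zero-masked two-source conversion:
   dst[0..15] come from b, dst[16..31] come from a.  */
static void
maskz_cvtne2ps_pbh_model (uint16_t dst[32], uint32_t k,
                          const float a[16], const float b[16])
{
  assert (dst != NULL && a != NULL && b != NULL);
  for (int j = 0; j < 32; j++)
    {
      float src = (j < 16) ? b[j] : a[j - 16];
      dst[j] = ((k >> j) & 1) ? bf16_trunc (src) : 0;
    }
}
```

Under this model, swapping the two source operands exchanges the low and high
halves of the result, which is the kind of difference the report describes.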

[Bug rtl-optimization/110215] RA fails to allocate register when loop invariant lives across calls and eh

2023-06-27 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110215

--- Comment #6 from Hongyu Wang  ---
Thanks for the fix. For the attached test, the main loop no longer has any
load.

There is a remaining issue: the loop epilogue still contains loads from the
stack and the constant pool.

.L9:
        movslq  %edx, %rax
        movss   72(%rsp), %xmm5
        salq    $2, %rax
        leaq    (%rbx,%rax), %rcx
        movaps  %xmm5, %xmm1
        subss   (%rcx), %xmm1
        andps   .LC4(%rip), %xmm1
        movss   %xmm1, (%rcx)
        leal    1(%rdx), %ecx
        addss   %xmm1, %xmm0
        cmpl    %ecx, %r12d
        jle     .L8

The IRA dump shows the pseudos do not conflict, but they still fail to get a
register allocated. This issue does not exist on aarch64.

[Bug lto/110424] New: Bogus ODR warning for FMV member function with -flto

2023-06-26 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110424

Bug ID: 110424
   Summary: Bogus ODR warning for FMV member function with -flto
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: lto
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wwwhhhyyy333 at gmail dot com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

cat m1.h

---
#pragma once 

class A  
{
public:  
  int foo1();
};   
---

cat m1.cpp

---
#include "m1.h"

__attribute__((target_clones("default","arch=icelake-server"))) 
int A::foo1()  
{  
return 0;  
}  

---

cat m2.cpp

---
#include "m1.h"   

int main()
{ 
  A a;
  return a.foo1();
} 
---

g++ -flto -Werror m1.cpp m2.cpp -o m2

m1.h:6:7: error: ‘foo1’ violates the C++ One Definition Rule [-Werror=odr]
6 |   int foo1(); 
  |   ^   
m1.cpp:9:1: note: ‘_ZN1A4foo1Ev’ was previously declared here 
9 | } 
  | ^ 
lto1: all warnings being treated as errors

The output binary should be essentially the same as the one built without LTO,
so the warning seems to be bogus.

[Bug rtl-optimization/110215] New: RA fails to allocate register when loop invariant lives through EH region

2023-06-12 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110215

Bug ID: 110215
   Summary: RA fails to allocate register when loop invariant
lives through EH region
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wwwhhhyyy333 at gmail dot com
  Target Milestone: ---

Created attachment 55305
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55305&action=edit
A Testcase

Compiled with -Ofast, the innermost loop is

.L41:
        movups  (%rax), %xmm3
        movaps  (%rsp), %xmm0
        addq    $16, %rax
        subps   %xmm3, %xmm0
        andps   %xmm2, %xmm0
        movups  %xmm0, -16(%rax)
        addps   %xmm0, %xmm1
        cmpq    %rax, %rdx
        jne     .L41

While for Clang it produces

.LBB0_14:   #   Parent Loop BB0_3 Depth=1
        movups  (%rbp,%rax), %xmm1
        movaps  %xmm3, %xmm2
        subps   %xmm1, %xmm2
        andps   %xmm4, %xmm2
        movups  %xmm2, (%rbp,%rax)
        addps   %xmm2, %xmm0
        addq    $16, %rax
        cmpq    %rax, %r12
        jne     .LBB0_14

GCC spills the loop invariant `base` to the stack, while clang keeps it
directly in an SSE register.

Godbolt: https://godbolt.org/z/TTvG8M6E8

[Bug libstdc++/110138] Extra constructor called when using basic_string::operator+

2023-06-08 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110138

Hongyu Wang  changed:

   What|Removed |Added

 Resolution|--- |INVALID
 Status|UNCONFIRMED |RESOLVED

--- Comment #4 from Hongyu Wang  ---
(In reply to Jonathan Wakely from comment #3)
> (In reply to Hongyu Wang from comment #0)
> > GCC 12.3/Clang 16 outputs:
> > Alloc: 3
> > Alloc: 6
> > Alloc: 9
> > Alloc: 12
> 
> "Clang 16" here actually means "Any version of Clang with libstdc++ headers
> from GCC 12".
> 
> The figures for Clang's own libc++ are different:
> 
> Alloc: 0
> Alloc: 4
> Alloc: 8
> Alloc: 12
> 
> But again, this is meaningless. Nobody cares how many times an allocator is
> copied.

The original test intends to verify the P1165R1 implementation. It uses a global
counter in the allocator constructor to see whether the allocator is correctly
selected, and the current change makes it copied twice, so the result is not as
expected.

But yes, I agree the allocator constructor for string should be cheap, and the
original test should not rely on how many times the constructor was called to
verify P1165R1 (I suppose it should check whether soccc was called instead).

Thanks for the explanation, I will close this as invalid.

[Bug libstdc++/110138] Extra constructor called when using basic_string::operator+

2023-06-06 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110138

--- Comment #1 from Hongyu Wang  ---
operator+ now calls std::__cxx11::basic_string,
myAlloc_ >::get_allocator, and it will call the constructor again after
gimplify

__attribute__((nodiscard))
struct allocator_type std::__cxx11::basic_string,
myAlloc_ >::get_allocator (
const struct basic_string * const this)
{
  try
{
  _1 = std::__cxx11::basic_string,
myAlloc_ >::_M_get_allocator (this);
  myAlloc_::myAlloc_ (, _1);
  return ;
}
  catch
{
  <<>>
}
  __builtin_unreachable trap ();
}

Possibly caused by r13-3814-gc93baa93df2d45

[Bug libstdc++/110138] New: Extra constructor called when using basic_string::operator+

2023-06-06 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110138

Bug ID: 110138
   Summary: Extra constructor called when using
basic_string::operator+
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wwwhhhyyy333 at gmail dot com
  Target Milestone: ---

Created attachment 55268
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55268&action=edit
Simplified test

Compiled with -std=c++20 -O0

GCC 12.3/Clang 16 outputs:
Alloc: 3
Alloc: 6
Alloc: 9
Alloc: 12

GCC 13.1 outputs:
Alloc: 3
Alloc: 7
Alloc: 11
Alloc: 15

[Bug libgomp/109062] [13 regression] Default value of GOMP_SPINCOUNT changes since r13-2545

2023-03-08 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109062

Hongyu Wang  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from Hongyu Wang  ---
Fixed on trunk so far.

[Bug libgomp/109062] New: [13 regression] Default value of GOMP_SPINCOUNT changes since r13-2545

2023-03-07 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109062

Bug ID: 109062
   Summary: [13 regression] Default value of GOMP_SPINCOUNT
changes since r13-2545
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libgomp
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wwwhhhyyy333 at gmail dot com
CC: jakub at gcc dot gnu.org
  Target Milestone: ---

Recently we found several big regressions in the Phoronix OpenMP benchmarks with
GCC 13. The regressions are caused by r13-2545-g9f2fca56593a2b.

The issue is that the default value of GOMP_SPINCOUNT is now 0, instead of 30
before this patch, which makes every OpenMP program behave as if
OMP_WAIT_POLICY=passive were set.

As the comment in libgomp/env.c says:

 /* Using a rough estimation of 10 spins per msec,
use 5 min blocking for OMP_WAIT_POLICY=active,
3 msec blocking when OMP_WAIT_POLICY is not specificed
and 0 when OMP_WAIT_POLICY=passive.
Depending on the CPU speed, this can be e.g. 5 times longer
or 5 times shorter.  */

The current code for wait_policy is

if (none != NULL && gomp_get_icv_flag (none->flags, GOMP_ICV_WAIT_POLICY))
  wait_policy = none->icvs.wait_policy;
else if (all != NULL && gomp_get_icv_flag (all->flags, GOMP_ICV_WAIT_POLICY))
  wait_policy = all->icvs.wait_policy;

If OMP_WAIT_POLICY is not specified, neither branch is entered, since
gomp_get_icv_flag returns 0 by default, so wait_policy is never assigned and
keeps its initial value. Prior to this patch, wait_policy was set to -1 (not
specified) by parse_wait_policy ().
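
The following is a minimal sketch of one way to keep the "not specified"
state distinct, under the assumption that -1 means unspecified, 0 passive and
1 active. The struct and function are hypothetical stand-ins, not the libgomp
data structures, and this is not necessarily how the actual fix was done.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one parsed environment entry.  */
struct env_entry_sketch
{
  int wait_policy_set;   /* nonzero if OMP_WAIT_POLICY was given here */
  int wait_policy;       /* 0 = passive, 1 = active */
};

/* Returns -1 when no entry specifies a policy, otherwise the policy of
   the first entry that does ("none" takes precedence over "all").  */
static int
resolve_wait_policy (const struct env_entry_sketch *none,
                     const struct env_entry_sketch *all)
{
  int wait_policy = -1;  /* explicit "not specified" default */

  if (none != NULL && none->wait_policy_set)
    wait_policy = none->wait_policy;
  else if (all != NULL && all->wait_policy_set)
    wait_policy = all->wait_policy;

  assert (wait_policy >= -1 && wait_policy <= 1);
  return wait_policy;
}
```

The point of the sketch is only that the default is assigned before the
branches, so falling through both of them still yields the "not specified"
value rather than one that looks like passive.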

[Bug target/107692] [13 regression] r13-3950-g071e428c24ee8c breaks many test cases

2022-11-23 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107692

--- Comment #12 from Hongyu Wang  ---
Fixed for GCC 13. Sorry for introducing this.

[Bug target/107692] [13 regression] r13-3950-g071e428c24ee8c breaks many test cases

2022-11-18 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107692

--- Comment #9 from Hongyu Wang  ---
(In reply to Segher Boessenkool from comment #8)
> (In reply to Jiu Fu Guo from comment #5)
> > > -munroll-only-small-loops does not turn on or off -funroll-loops, and it
> > > should not, so that it does what it says, if nothing else.
> > 
> > Yes, and -funroll-loops would win over -munroll-only-small-loops
> 
> -funroll-loops is the only thing that enables loop unrolling.
> -munroll-only-small-loops, like the name says, says to only unroll small
> loops,
> and no others.  It is not something at the same level as -funroll-loops, that
> would be insanity: other code likes to see if the user requested loops to be
> unrolled as well!

I can understand the logic; my initial patch
https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604345.html is something
similar to rs6000 and x86 only.
The difference is that -mno-unroll-only-small-loops -O2 would cause
rtl-loop-unroll to take effect, and cunroll would also work if we followed the
rs6000 change. We do not really want these, so the patch became ugly, as said :(
I think the intention of -munroll-only-small-loops is just to adjust
rtl-loop-unrolling and not to touch middle-end unroll/cunroll. But I think your
point is also reasonable. Maybe we can split flag_unroll_loops into separate
tree and rtl flags?
Anyway, I will propose a patch and re-discuss with the maintainers later. Thanks!

[Bug tree-optimization/107717] [13 Regression] ICEs expanding permutes after g:dc95e1e9702f2f6367bbc108c8d01169be1b66d2

2022-11-18 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107717

--- Comment #4 from Hongyu Wang  ---
(In reply to Tamar Christina from comment #3)
> Fixed

Thanks for the fix! It also gives me a good tip for writing match patterns :)

[Bug middle-end/107734] [13 Regression] valgrind error for gcc/testsuite/gcc.target/i386/pr46051.c

2022-11-18 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107734

--- Comment #12 from Hongyu Wang  ---
(In reply to Andrew Pinski from comment #9)
> Fixed.

Thanks for the fix! I was not aware that sbitmap does not have a default
constructor :(.

[Bug target/107692] [13 regression] r13-3950-g071e428c24ee8c breaks many test cases

2022-11-17 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107692

--- Comment #6 from Hongyu Wang  ---
(In reply to Jiu Fu Guo from comment #4)
> (In reply to Hongyu Wang from comment #2)
> > Created attachment 53897 [details]
> > A patch
> > 
> > Sorry for introducing these fails. Here is the patch.
> > 
> > I've tested the patch with cross-compler and all the fails disappeared, but
> > I don't have a powerpc to do full bootstrap & regtest (I'm still applying
> > for gcc farm account).
> > 
> > I'll send out the patch after I can access gcc farm for a power machine, or
> > hopefully someone can help testing the patch.
> > 
> > I suppose s390 has similar issue and I will update that accordingly.
> Hi,
> 
> One small comment, for code "if (!(flag_unroll_loops ||
> flag_unroll_all_loops))"
> we may need to add one more condition "|| loop->unroll", like what does in
> r13-3950 for i386.cc.  Otherwise, unroll pragma may be affected.

Yes, I've already posted the patch at
https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606478.html
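
The condition discussed in the quoted comment can be sketched in isolation as
follows. The struct is a hypothetical stand-in that models only the unroll
field, and the function name is made up for illustration; it is not the code
from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the loop structure; nonzero unroll means
   an unroll pragma requested a factor for this loop.  */
struct loop_sketch
{
  unsigned unroll;
};

/* True when unrolling was explicitly requested, either through the
   unroll flags or through a per-loop pragma.  A small-loops-only
   adjustment would apply only when this returns false.  */
static bool
unroll_explicitly_requested (bool flag_unroll_loops,
                             bool flag_unroll_all_loops,
                             const struct loop_sketch *loop)
{
  assert (loop != NULL);
  return flag_unroll_loops || flag_unroll_all_loops || loop->unroll != 0;
}
```

Without the loop->unroll term, a loop carrying an unroll pragma would be
treated like any other loop, which is the concern raised in the quoted text.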

[Bug target/107692] [13 regression] r13-3950-g071e428c24ee8c breaks many test cases

2022-11-14 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107692

--- Comment #2 from Hongyu Wang  ---
Created attachment 53897
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53897&action=edit
A patch

Sorry for introducing these failures. Here is the patch.

I've tested the patch with a cross-compiler and all the failures disappeared,
but I don't have a powerpc machine for a full bootstrap & regtest (I'm still
applying for a gcc farm account).

I'll send out the patch once I can access a power machine on the gcc farm, or
hopefully someone can help test the patch.

I suppose s390 has a similar issue and I will update that accordingly.

[Bug target/107676] Nonsensical docs for -mrelax-cmpxchg-loop

2022-11-14 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107676

--- Comment #6 from Hongyu Wang  ---
(In reply to Andrew Pinski from comment #5)
> (In reply to Jonathan Wakely from comment #4)
> > I don't think __atomic_compare_exchange emits such a loop. This is about
> > __atomic_fetch_xor and friends, which do emit cmpxchg loops. But there are
> > four such functions to name.
> 
> Oh yes right.
> Then this:
> For compare and exchange loops that are emitted by some __atomic_* builtins
> (e.g. ), emit an atomic load before the loop and if the value was not
> the expected value, emit a pause instruction. This might reduce execussive
> cache bouncing of the memory.
> 
> 
> I think that is better wording than it was before. I hope the person who
> added this option can take over this to get it closer to what it should be.

Thanks for all the suggestions. A patch has been posted at
https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606212.html

[Bug target/107304] internal compiler error: in convert_move, at expr.cc:220 with -march=tigerlake

2022-10-18 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107304

--- Comment #10 from Hongyu Wang  ---
(In reply to H.J. Lu from comment #9)
> (In reply to Hongtao.liu from comment #8)
> > (In reply to H.J. Lu from comment #7)
> > > (In reply to Hongtao.liu from comment #6)
> > > > (In reply to Hongtao.liu from comment #5)
> > > > > (In reply to H.J. Lu from comment #4)
> > > > > > Since the default is -march=tigerlake, it enables AVX512 in the 
> > > > > > middle end.
> > > > > > When "arch=alderlake" disables AVX512, we fails to expand AVX512 to
> > > > > > non-AVX512
> > > > > > ISAs. It means that target_clones can't be more restrictive than the
> > > > > > default. We
> > > > > > should provide better diagnostics.
> > > > > 
> > > > > Is there any place checking ISA difference for target_clones?
> > > > 
> > > > ix86_valid_target_attribute_inner_p?
> > > 
> > > It may not have all ISA infos.  Will this
> > > 
> > > diff --git a/gcc/config/i386/i386-options.cc
> > > b/gcc/config/i386/i386-options.cc
> > > index acb2291e70f..1efaae132e9 100644
> > > --- a/gcc/config/i386/i386-options.cc
> > > +++ b/gcc/config/i386/i386-options.cc
> > > @@ -2953,6 +2953,14 @@ ix86_option_override_internal (bool main_args_p,
> > >   fine grained control & costing.  */
> > >SET_OPTION_IF_UNSET (opts, opts_set, param_vect_partial_vector_usage, 
> > > 0);
> > >  
> > > +  if (!main_args_p
> > > +  && &global_options != opts
> > > +  && (((opts->x_ix86_isa_flags & global_options.x_ix86_isa_flags)
> > > + != global_options.x_ix86_isa_flags)
> > > +|| ((opts->x_ix86_isa_flags2 & global_options.x_ix86_isa_flags2)
> > > +!= global_options.x_ix86_isa_flags2)))
> > > +error ("Target ISAs are more restrictive than the default");
> > > +
> > >return true;
> > >  }
> > >  
> > > work?
> > 
> > Looks reasonable to me.
> 
> It doesn't work since we may use target attribute to disable MMX/SSE/SSE2.
> This problem seems to be __builtin_shuffle related.

Clang works properly because it overrides -march= for each target clone. I
suppose we can do something similar in ix86_valid_target_attribute_p.

https://godbolt.org/z/v7xT1zahd

[Bug target/106180] [13 Regression] ICE in extract_insn, at recog.cc:2791 since r13-1418-g73f942c08deef3

2022-07-04 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106180

Hongyu Wang  changed:

   What|Removed |Added

 CC||wwwhhhyyy333 at gmail dot com

--- Comment #2 from Hongyu Wang  ---
(In reply to Jakub Jelinek from comment #1)

> I think the r13-1418 change was just wrong.  It is fine to add a pattern
> with V2SF input rather than vec_select of first half of V4SF input, but I
> don't understand why you need to restrict one to memory_operand and the
> other to register_operand, why vector_operand "vm" can't be used for both.
> Not doing that ties hands of the register allocator, if something is memory
> during expansion, it would be always in memory, if something isn't memory,
> it couldn't ever be memory.
> Is your concern not getting a SIGSEGV if first 2 SF elts are at the end of a
> page and 2 further SF elts are in a non-mapped page?

The instruction cvtps2pd takes m64 as its memory input, so the original pattern
is not correct, since it allows a V4SF memory input, although the generated code
may work because for unpack_lo the address is the same. The cross-page issue is
one of the potential problems we could hit.

For this pattern, I think we can add

if (MEM_P (operands[1]))
  operands[1] = gen_lowpart (V2SFmode, operands[1]);

There are many other unpacks_low expanders that allow memory input but fall
through directly to cvt instructions. We plan to fix all of them soon.

[Bug target/105339] [x86] missing AVX-512F scalef functions when optimization is disabled

2022-04-27 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105339

--- Comment #7 from Hongyu Wang  ---
Fixed for gcc-9/10/11/12.

[Bug target/105288] AVX/AVX512 casts should use the "v" constraint

2022-04-15 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105288

--- Comment #1 from Hongyu Wang  ---
I think it should be these two?

(define_insn_and_split "avx512f__"
  [(set (match_operand:AVX512MODE2P 0 "nonimmediate_operand" "=x,m")
(vec_concat:AVX512MODE2P
  (vec_concat:
(match_operand: 1 "nonimmediate_operand" "xm,x")
(unspec: [(const_int 0)] UNSPEC_CAST))
  (unspec: [(const_int 0)] UNSPEC_CAST)))]
  "TARGET_AVX512F && !(MEM_P (operands[0]) && MEM_P (operands[1]))"

(define_insn_and_split "avx512f__256"
  [(set (match_operand:AVX512MODE2P 0 "nonimmediate_operand" "=x,m")
(vec_concat:AVX512MODE2P
  (match_operand: 1 "nonimmediate_operand" "xm,x")
  (unspec: [(const_int 0)] UNSPEC_CAST)))]
  "TARGET_AVX512F && !(MEM_P (operands[0]) && MEM_P (operands[1]))"

The AVX insns shouldn't have the "v" constraint.

[Bug target/105034] [10/11/12 regression]Suboptimal codegen for min/max with -Os

2022-03-27 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105034

--- Comment #2 from Hongyu Wang  ---
For -O2, stv doesn't do this transform:
Computing gain for chain #1...
  Instruction gain 8 for 7: {r84:SI=smax(r85:SI,0);clobber flags:CC;}
  REG_DEAD r85:SI
  REG_UNUSED flags:CC
  Instruction conversion gain: 8
  Registers conversion cost: 12
  Total gain: -4

This is because the sse->integer reg move cost is 6 in the generic cost table.

But for -Os the cost is 3, so the conversion is considered profitable:
Computing gain for chain #1...
  Instruction gain 8 for 7: {r84:SI=smax(r85:SI,0);clobber flags:CC;}
  REG_DEAD r85:SI
  REG_UNUSED flags:CC
  Instruction conversion gain: 8
  Registers conversion cost: 6
  Total gain: 2

FWIW, the solution would be either to adjust the ix86_size cost, or to block out
optimize_size in the stv gate.
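
The arithmetic in the two dumps above can be reproduced with a tiny sketch.
The function names are hypothetical, and the rule that a chain is converted
only when the total gain is positive is an assumption consistent with the
dumps, not a quote from the pass.

```c
#include <assert.h>
#include <stdbool.h>

/* Total gain as printed in the dumps: instruction conversion gain
   minus the cost of moving registers between the two domains.  */
static int
stv_total_gain (int insn_conversion_gain, int reg_conversion_cost)
{
  assert (reg_conversion_cost >= 0);
  return insn_conversion_gain - reg_conversion_cost;
}

/* Assumed profitability rule: convert only on a positive total gain.  */
static bool
stv_profitable (int insn_conversion_gain, int reg_conversion_cost)
{
  return stv_total_gain (insn_conversion_gain, reg_conversion_cost) > 0;
}
```

With the numbers from the dumps, a gain of 8 against a cost of 12 gives -4
under -O2, and a gain of 8 against a cost of 6 gives 2 under -Os, which is
why only the -Os build performs the conversion.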

[Bug target/104978] [avx512fp16] wrong code for _mm_mask_fcmadd_round_sch

2022-03-21 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104978

--- Comment #5 from Hongyu Wang  ---
Fixed for GCC 12.

[Bug target/104977] [avx512fp16] wrong code for vfmaddcsh when -masm=intel.

2022-03-20 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104977

--- Comment #3 from Hongyu Wang  ---
Fixed for GCC 12.

[Bug target/104726] gcc.target/i386/pr104551.c FAILs

2022-03-01 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104726

--- Comment #7 from Hongyu Wang  ---
Fixed for GCC 12.

[Bug target/104724] gcc.target/i386/avx512fp16-vcvtsi2sh-1b.c etc. FAIL

2022-03-01 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104724

--- Comment #4 from Hongyu Wang  ---
Fixed for GCC 12.

[Bug target/104726] gcc.target/i386/pr104551.c FAILs

2022-03-01 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104726

Hongyu Wang  changed:

   What|Removed |Added

  Attachment #52532|0   |1
is obsolete||

--- Comment #4 from Hongyu Wang  ---
Created attachment 52535
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52535&action=edit
Updated patch



(In reply to Jakub Jelinek from comment #3)
> Also, the builtin at the start of main compiled with -mavx2 is risky, there
> could be avx2 insns e.g. in the prologue. avx2-check.h is the usual way

Thanks for pointing it out; updated accordingly.

Hi Rainer, sorry for the previous mistake. Could you try the updated one?

[Bug target/104726] gcc.target/i386/pr104551.c FAILs

2022-03-01 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104726

--- Comment #1 from Hongyu Wang  ---
Created attachment 52532
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52532&action=edit
A patch

Hi Rainer, can you try this on your Solaris system? We don't have such a
platform to confirm it works.

I'll install it if it passes, or you can directly push it as an obvious fix.

[Bug target/104724] gcc.target/i386/avx512fp16-vcvtsi2sh-1b.c etc. FAIL

2022-03-01 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104724

--- Comment #1 from Hongyu Wang  ---
Created attachment 52531
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52531&action=edit
A patch

Hi Rainer, can you try this on your Solaris system? We don't have such a
platform to confirm it works.

I'll install it if it passes, or you can directly push it as an obvious fix.

[Bug rtl-optimization/104664] [12 Regression] ICE: in extract_constrain_insn, at recog.cc:2670 (insn does not satisfy its constraints) with -Og -ffinite-math-only

2022-02-28 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104664

--- Comment #6 from Hongyu Wang  ---
Fixed for GCC 12.

[Bug rtl-optimization/104664] [12 Regression] ICE: in extract_constrain_insn, at recog.cc:2670 (insn does not satisfy its constraints) with -Og -ffinite-math-only

2022-02-27 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104664

--- Comment #4 from Hongyu Wang  ---
(In reply to Uroš Bizjak from comment #3)

> Reconfirmed as RA issue.

I'm afraid we should avoid patterns like

(insn 180 179 182 2 (set (reg:V8HF 220)
(subreg:V8HF (reg:HF 221) 0)) "pr104664.c":12:7 1710 {movv8hf_internal}

since we don't have a corresponding pattern with subreg. Reload might not be
properly aware of the newly inserted regs, as the message shows

Set class ALL_REGS for r221
Set class ALL_REGS for r220

I'm testing

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 6cf1a0b9cb6..658516d86a2 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -14883,7 +14883,12 @@ ix86_expand_vector_init_duplicate (bool mmx_ok,
machine_mode mode,
  dperm.one_operand_p = true;

  if (mode == V8HFmode)
-   tmp1 = lowpart_subreg (V8HFmode, force_reg (HFmode, val), HFmode);  
+   {
+ tmp1 = force_reg (HFmode, val);
+ tmp2 = gen_reg_rtx (mode);
+ emit_insn (gen_vec_setv8hf_0 (tmp2, CONST0_RTX (mode), tmp1));
+ tmp1 = gen_lowpart (mode, tmp2);
+   }

[Bug target/104664] ICE: in extract_constrain_insn, at recog.cc:2670 (insn does not satisfy its constraints) with -Og -ffinite-math-only

2022-02-24 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104664

--- Comment #2 from Hongyu Wang  ---
starting from r12-6021

[Bug target/103069] cmpxchg isn't optimized

2022-02-22 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #19 from Hongyu Wang  ---
(In reply to Thiago Macieira from comment #18)
> (In reply to Jakub Jelinek from comment #17)
> > _Pragma("GCC target \"relax-cmpxchg-loop\"")
> > should do that (ditto target("relax-cmpxchg-loop") attribute).
> 
> The attribute is applied to a function. I'm hoping to do it for s block of
> code:
> 
>  _Pragma("GCC push_options")
>  _Pragma("GCC target \"relax-cmpxchg-loop\"")
>  __atomic_compare_exchange_weak();
>  _Pragma("GCC pop_options")

I'm not aware of any target __attribute__ or #pragma that can be applied to a
code block. At this level users can change their code directly, so I don't see
why it is needed.

[Bug target/103069] cmpxchg isn't optimized

2022-02-22 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #15 from Hongyu Wang  ---
(In reply to Thiago Macieira from comment #14)
> I'd restrict relaxations to loops emitted by the compiler. All other atomic
> operations shouldn't be modified at all, unless the user asks for it. That
> includes non-looping atomic operations (like LOCK BTC, LOCK XADD) as well as
> a pure LOCK CMPXCHG that came from a single __atomic_compare_exchange by the
> user.
> 
> I'd welcome the ability to relax the latter, especially if with one codebase
> I could be efficient in CAS architectures as well as LL/SC ones.

The latest patch relaxed the pure LOCK CMPXCHG with -mrelax-cmpxchg-loop as the
commit message shows. So if you want, I can split this part to another switch
like -mrelax-cmpxchg-insn.

[Bug target/103069] cmpxchg isn't optimized

2022-02-21 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #13 from Hongyu Wang  ---
All the glibc cases above are now relaxed by a load/cmp that skips the cmpxchg
under -mrelax-cmpxchg-loop,

but for

>   do  
> {   
>   flags = THREAD_GETMEM (self, cancelhandling);
>   newval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling,
>   flags & ~SETXID_BITMASK, flags);
> }   
>   while (flags != newval);

If we want to optimize it to lock btc, we need to know the cmpxchg lies in a
loop. So it may require an extra pass to do further analysis and optimize,
which is not a good idea to do in stage 4.
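
To make the point about the loop concrete, here is a minimal sketch of a
compare-and-swap loop that clears a bit mask, next to the single atomic
read-modify-write it is equivalent to. DEMO_BITMASK and both function names
are hypothetical stand-ins (the real SETXID_BITMASK and THREAD_ATOMIC_CMPXCHG_VAL
live in glibc); only the __atomic builtins are real GCC interfaces. Which
instruction the compiler picks for the second form is not claimed here.

```c
#include <stdint.h>

/* Hypothetical stand-in for SETXID_BITMASK; the real value is in glibc.  */
#define DEMO_BITMASK 0x40u

/* Compare-and-swap loop: retry until no other thread changed the word
   between the read and the exchange.  Returns the previous value.  */
static uint32_t
clear_bits_cas_loop (uint32_t *word)
{
  uint32_t old = __atomic_load_n (word, __ATOMIC_RELAXED);
  /* On failure the builtin writes the current value back into OLD, so
     the desired value is recomputed from fresh data on each retry.  */
  while (!__atomic_compare_exchange_n (word, &old, old & ~DEMO_BITMASK,
                                       0 /* strong */, __ATOMIC_ACQ_REL,
                                       __ATOMIC_RELAXED))
    ;
  return old;
}

/* The same effect as one atomic read-modify-write operation.  */
static uint32_t
clear_bits_rmw (uint32_t *word)
{
  return __atomic_fetch_and (word, ~DEMO_BITMASK, __ATOMIC_ACQ_REL);
}
```

Recognizing that the first form can be rewritten as the second requires
seeing the load, the cmpxchg and the loop exit test together, which is why
it needs loop-level analysis rather than a change to the expander alone.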

[Bug target/103069] cmpxchg isn't optimized

2022-02-15 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #11 from Hongyu Wang  ---


For the case with atomic_compare_exchange_weak_release, it can be expanded as

loop: mov    %eax,%r8d
  and    $0xfff8,%r8d
  mov    (%r8),%rsi <--- load lock first
  cmp    %rsi,%rax <--- compare with expected input
  jne    .L2 <--- lock ne expected
  lock cmpxchg %r8d,(%rdi)
  mov    %rsi,%rax <--- perform the behavior of failed cmpxchg
  jne    loop

But this is not suitable for atomic_compare_exchange_strong. As the
documentation says:

"Unlike atomic_compare_exchange_weak, this strong version is required to always
return true when expected indeed compares equal to the contained object, not
allowing spurious failures."

If we expand cmpxchg as above, it would result in spurious failures since the
load is not atomic.

So for

 do
   pd->nextevent = __nptl_last_event;
 while (atomic_compare_and_exchange_bool_acq (&__nptl_last_event,
  pd, pd->nextevent));

which invokes atomic_compare_exchange_strong, we may not simply adjust the
expander. It is better to know that the call is in a loop condition and relax
it accordingly.
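
The weak expansion shown in the assembly above can be sketched in C as
follows. This is only an illustration of the idea, not the code GCC emits:
relaxed_cas_weak is a hypothetical name, and the memory orders are an
assumption chosen to mirror atomic_compare_exchange_weak_release.

```c
#include <stdbool.h>
#include <stdint.h>

/* Weak compare-exchange with a plain load in front of the locked
   instruction.  If the load already sees a mismatch, the function
   reports failure without executing cmpxchg at all.  */
static bool
relaxed_cas_weak (uint64_t *ptr, uint64_t *expected, uint64_t desired)
{
  uint64_t seen = __atomic_load_n (ptr, __ATOMIC_RELAXED);
  if (seen != *expected)
    {
      /* Behave like a failed cmpxchg: hand back the observed value.  */
      *expected = seen;
      return false;
    }
  return __atomic_compare_exchange_n (ptr, expected, desired,
                                      true /* weak */, __ATOMIC_RELEASE,
                                      __ATOMIC_RELAXED);
}
```

A caller is expected to retry in a loop, as is usual for the weak form.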

[Bug target/103771] New: Missed vectorization under -mavx512f -mavx512vl after r12-5489

2021-12-20 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

Bug ID: 103771
   Summary: Missed vectorization under -mavx512f -mavx512vl after
r12-5489
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wwwhhhyyy333 at gmail dot com
  Target Milestone: ---

cat vect.c

typedef unsigned char uint8_t;

static uint8_t x264_clip_uint8( int x )
{
    return x&(~255) ? (-x)>>31 : x;
}

void mc_weight( uint8_t * __restrict dst, uint8_t * __restrict src,
                int i_width, int i_scale)
{
    for( int x = 0; x < i_width; x++ )
        dst[x] = x264_clip_uint8(src[x] * i_scale);
}

It cannot be vectorized with -mavx512f -mavx512vl, but can be vectorized with
-mavx2. See https://godbolt.org/z/M1jx161f6

The commit https://gcc.gnu.org/cgi-bin/gcc-gitref.cgi?r=r12-5489 converts  (x &
(~255)) == 0 to x <= 255, which may trigger some missing pattern with
-mavx512vl. 

Also, a 1.5% regression was found on -march=cascadelake due to a missing
128-bit epilogue for this loop.
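
For reference, a minimal sketch of the two equivalent forms of the clip
test. The names clip_mask and clip_range are made up for illustration, and
writing "x <= 255" as an unsigned comparison is an assumption about what the
canonicalized form means for a signed int. Right-shifting a negative value
is implementation-defined in C; GCC uses an arithmetic shift, which the
original code relies on.

```c
#include <stdint.h>

/* Form used in the testcase: test the high bits with a mask.  */
static uint8_t
clip_mask (int x)
{
  return x & (~255) ? (-x) >> 31 : x;
}

/* Same function with the mask test written as an unsigned range check.
   Negative values wrap to large unsigned values and so fail the check.  */
static uint8_t
clip_range (int x)
{
  return (unsigned) x <= 255u ? x : (-x) >> 31;
}
```

Both return 0 for negative inputs, 255 for inputs above 255, and the input
itself otherwise.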

[Bug target/103571] ABI: V2HF, V4HF and V8HFmode argument passing issues

2021-12-06 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103571

Hongyu Wang  changed:

   What|Removed |Added

 CC||wwwhhhyyy333 at gmail dot com

--- Comment #3 from Hongyu Wang  ---
(In reply to Hongtao.liu from comment #2)
> > 
> > Also, baz is highly un-optimal for 32bit targets.
> 
> Yes, it needs to be fixed, note w/ -mavx512fp16 codegen for baz is optimal
> on 32-bit target, maybe related to vector_mode_supported_p, but then why
> codegen for baz on 64-bit target is optimal w/o TARGET_AVX512FP16?

For V8HFmode that is unsupported in VALID_SSE2_REG_MODE, function_value_32 has

return gen_rtx_REG (orig_mode, regno); 

so the retval is (reg:BLK 20 xmm0),

while function_value_64 uses construct_container and returns

(parallel:BLK [
        (expr_list:REG_DEP_TRUE (reg:V8HF 20 xmm0)
            (const_int 0 [0]))
    ])

This can eventually be optimized to a simple movaps.

So we may need to support V8HFmode in VALID_SSE2_REG_MODE if we don't want to
modify the function_arg and function_value code.

[Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized

2021-11-04 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066

--- Comment #1 from Hongyu Wang  ---
__sync_val_compare_and_swap will be expanded to atomic_compare_exchange_strong
by default. Should we restrict the check-and-return to
atomic_compare_exchange_weak, which is allowed to fail spuriously?

[Bug target/102812] Unoptimal (and wrong) code for _Float16 insert

2021-10-20 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102812

--- Comment #3 from Hongyu Wang  ---
(In reply to Uroš Bizjak from comment #2)
> Please note that the code above should compile via ix86_expand_vector_set,
> similar to:
> 
> --cut here--
> typedef short v8hi __attribute__((__vector_size__(16)));
> 
> v8hi foo (short a)
> {
>   return (v8hi) {a, 0, 0, 0, 0, 0, 0, 0 };
> }
> --cut here--
> 
> that results in:
> 
> vpxor   %xmm0, %xmm0, %xmm0
> vpinsrw $0, %edi, %xmm0, %xmm0
> ret

Currently we have

if (TARGET_AVX512FP16 && VALID_AVX512FP16_REG_MODE (mode))
  return true;

in ix86_vector_mode_supported_p, so for an SSE2 target V8HFmode would be
returned in BLKmode.

After I add V8HFmode to VALID_SSE2_REG_MODE, the code looks like

vmovss  %xmm0, %xmm0, %xmm1
vpxor   %xmm0, %xmm0, %xmm0
pextrw  $0, %xmm1, -10(%rsp)   
vpinsrw $0, -10(%rsp), %xmm0, %xmm0

It seems IRA spills the HF reg to memory.

I wonder whether we should move vector mode support to sse2 for now, as we
don't have sufficient HF vector arithmetic emulation for non-avx512fp16 targets.

[Bug target/102835] gcc.target/i386/avx512fp16-trunchf.c FAILs

2021-10-19 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102835

--- Comment #1 from Hongyu Wang  ---
(In reply to Rainer Orth from comment #0)

> 
> I wonder what's the best way to handle the difference?  Just add
> -fomit-frame-pointer
> to the testcase or allow for the %ebp vs. %esp difference?

For this test we just want to check that the mnemonics are properly generated,
so I think we can allow either esp or ebp in the output for different systems.

[Bug target/102806] New: [x86] Suboptimal codegen for v4hi vector concat under -mavx512bw and -mavx512vl

2021-10-18 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102806

Bug ID: 102806
   Summary: [x86] Suboptimal codegen for v4hi vector concat under
-mavx512bw and -mavx512vl
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wwwhhhyyy333 at gmail dot com
  Target Milestone: ---

For

typedef short v8hi __attribute__((vector_size (16)));
typedef short v4hi __attribute__((vector_size (8))); 

v8hi foov (v4hi a, v4hi b)   
{   
 return __builtin_shufflevector (a, b, 0, 1, 2, 3, 4, 5, 6, 7);
}

gcc -O2 -mavx512vl -mavx512bw:

vmovq   %xmm0, %xmm2
vmovq   %xmm1, %xmm1
vmovdqa .LC0(%rip), %xmm0
vpermi2w %xmm1, %xmm2, %xmm0
ret

While clang with the same options generates:

vmovlhps %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[0]
retq

It looks like the expansion order of the permutation should be adjusted.

[Bug tree-optimization/101993] Potential vectorization opportunity when condition checks array address

2021-08-20 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101993

--- Comment #2 from Hongyu Wang  ---
(In reply to Richard Biener from comment #1)
> We can vectorize this with masked moves when using AVX2.  clang seems to
> simply remove the test completely - C seems to guarantee that a + i is a
> valid pointer
> if any of a + i is accessed and thus a + i is never NULL.
> 
> But then - just don't write such stupid checks?  What real-world code was
> this testcase created from?

It came from 538.imagick_r

---
#define GetPixelIndex(indexes) \
  ((indexes == (const unsigned short *) NULL) ? 0 : (*(indexes)))

for (v=0; v < (ssize_t) kernel->height; v++) {
  for (u=0; u < (ssize_t) kernel->width; u++, k--) {
    if ( IsNaN(*k) ) continue;
    result.red     += (*k)*k_pixels[u].red;
    result.green   += (*k)*k_pixels[u].green;
    result.blue    += (*k)*k_pixels[u].blue;
    result.opacity += (*k)*k_pixels[u].opacity;
    if ( image->colorspace == CMYKColorspace)
      result.index += (*k)*GetPixelIndex(k_indexes+u);
  }
  k_pixels += virt_width;
  k_indexes += virt_width;
}
---

I extracted it to a small test in https://godbolt.org/z/G5h6nWvb5
which can be vectorized by clang but not by gcc because of this pattern.

> 
> There is currently no optimization phase that would use loop info to elide
> NULL pointer checks and I'm not sure where I'd put such.  Note the argument
> for
> GCC would be that the access *(a + i) infers that a + i does not "overflow"
> to another object (including NULL).  That's sth the points-to solver would
> assume here (but the points-to solver is bad at tracking NULL).

[Bug tree-optimization/101993] New: Potential vectorization opportunity when condition checks array address

2021-08-19 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101993

Bug ID: 101993
   Summary: Potential vectorization opportunity when condition
checks array address
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wwwhhhyyy333 at gmail dot com
  Target Milestone: ---

For

float foo(int * restrict a, int * restrict res, int n)
{
  int i;
  for (i = 0; i < 8; i++)
  {
    if (a + i)
      res[i] = *(a + i) * 2;
  }
}

Compile with -O3

Clang generates

foo:                                    # @foo
        testq   %rdi, %rdi
        je      .LBB0_2
        movdqu  (%rdi), %xmm0
        paddd   %xmm0, %xmm0
        movdqu  %xmm0, (%rsi)
.LBB0_2:
        retq

While GCC generates

foo:
        testq   %rdi, %rdi
        je      .L5
        movl    (%rdi), %eax
        leaq    8(%rdi), %rdx
        addl    %eax, %eax
        movl    %eax, (%rsi)
        movl    4(%rdi), %eax
        addl    %eax, %eax
.L3:
        movl    %eax, 4(%rsi)
        movl    (%rdx), %eax
        addl    %eax, %eax
        movl    %eax, 8(%rsi)
        movl    12(%rdi), %eax
        addl    %eax, %eax
        movl    %eax, 12(%rsi)
        ret
.L5:
        movl    4, %eax
        movl    $8, %edx
        addl    %eax, %eax
        jmp     .L3

If a is 0 or negative then it should be an invalid pointer. It seems clang
makes this assumption: it tests a once up front and then optimizes the loop
body.

Is it possible for GCC to do such optimization?
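
The transformation being asked for can be sketched by hand as below. Both
function names are hypothetical, and the hoisted version relies on the
assumption discussed above: if a is not NULL, then a + i is not NULL for any
element that is actually accessed.

```c
#include <stddef.h>

/* Loop as in the report: the address a + i is tested on every iteration.  */
static void
scale_checked (int *restrict a, int *restrict res)
{
  for (int i = 0; i < 8; i++)
    if (a + i)
      res[i] = *(a + i) * 2;
}

/* Hand-hoisted form: test the base pointer once, which leaves a
   branch-free loop body that is straightforward to vectorize.  */
static void
scale_hoisted (int *restrict a, int *restrict res)
{
  if (a == NULL)
    return;
  for (int i = 0; i < 8; i++)
    res[i] = a[i] * 2;
}
```

For a valid input array the two functions store the same values.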

[Bug target/101395] [11/12 regression] Compile failure with -march=native -m32 on sapphirerapids

2021-07-12 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101395

--- Comment #10 from Hongyu Wang  ---
(In reply to H.J. Lu from comment #9)
> Created attachment 51143 [details]
> A patch
> 
> Try this instead.

This also works.

[Bug target/101395] [11/12 regression] Compile failure with -march=native -m32 on sapphirerapids

2021-07-10 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101395

--- Comment #4 from Hongyu Wang  ---
(In reply to H.J. Lu from comment #3)
> Created attachment 51125 [details]
> An updated patch

This works, thanks.

[Bug target/101395] [11/12 regression] Compile failure with -march=native -m32 on sapphirerapids

2021-07-09 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101395

--- Comment #2 from Hongyu Wang  ---
(In reply to H.J. Lu from comment #1)
> Created attachment 51124 [details]
> A patch
> 
> Please test this patch.

It doesn't work.

I use ./sde-external-8.63.0-2021-01-18-lin/sde -spr -- gcc test.c -march=native
-m32 to verify it.

[Bug target/101395] New: Compile failure with -march=native -m32 on sapphirerapids

2021-07-09 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101395

Bug ID: 101395
   Summary: Compile failure with -march=native -m32 on
sapphirerapids
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wwwhhhyyy333 at gmail dot com
  Target Milestone: ---

cat test.c

int main()
{
   return 0;
}

On a Sapphire Rapids machine,

gcc test.c -march=native -m32

will get 

cc1: error: ‘-muintr’ not supported for 32-bit code

[Bug tree-optimization/98176] Loop invariant memory could not be hoisted when nonpure_call in loop body

2021-07-07 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98176

--- Comment #9 from Hongyu Wang  ---
(In reply to Richard Biener from comment #8)

> I'm failing to reproduce with the sincos example since sincos is transformed
> to __builtin_cexpi for me.  When using

I always generate sincosf with g++ -Ofast -fopenmp-simd -std=c++11, perhaps it
is related to libm? I'm using RHEL8 with glibc 2.28.

> so I don't think it buys us anything to handle calls yet.  sincos would
> also be considered as possibly not returning.
> 

Perhaps, since the sincosf case could only be vectorized with #pragma omp simd.
But I think it is better to allow those functions with libmvec implementation
if the input params are proved to be safe (such as local variables).

[Bug target/101276] [i386] Keylocker output should be cleared when instruction reports runtime error.

2021-07-02 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101276

Hongyu Wang  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Hongyu Wang  ---
Fixed by
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=1aeefa5720a71e622e2f26bf10ec8e7ecbd76f4c

[Bug target/101276] New: [i386] Keylocker output should be cleared when instruction reports runtime error.

2021-06-30 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101276

Bug ID: 101276
   Summary: [i386] Keylocker output should be cleared when
instruction reports runtime error.
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wwwhhhyyy333 at gmail dot com
  Target Milestone: ---

Some keylocker instructions set ZF when a runtime error occurs, and the output
data should then be treated as invalid.

The current intrinsics just copy the input data to the output regardless of ZF,
like

 movdqa  k2(%rip), %xmm0
 aesdec128kl h1(%rip), %xmm0
 sete    %al
 movups  %xmm0, k1(%rip)

This could be a safety issue, since the unencrypted data is returned when a
runtime error occurs. So the code should be like

        movdqa  k2(%rip), %xmm0
        aesdec128kl h1(%rip), %xmm0
        je      .L4
.L2:
        sete    %al
        movups  %xmm0, k1(%rip)
        ret
.L4:
        pxor    %xmm0, %xmm0
        jmp     .L2

This clears the output data before it is stored.
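
The intended behaviour can be modelled in portable C as below. This is only
a sketch: demo_decrypt is a hypothetical stand-in for a keylocker-style
operation (it is not a real intrinsic and does no real cryptography), and
its nonzero return value plays the role of ZF being set.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a keylocker-style operation.  Returns
   nonzero when a runtime error is reported (the ZF = 1 case).  */
static int
demo_decrypt (uint8_t block[16], int handle_ok)
{
  for (int i = 0; i < 16; i++)
    block[i] ^= 0x5a;          /* Placeholder transform, not AES.  */
  return !handle_ok;
}

/* Wrapper with the proposed behaviour: when the operation reports an
   error, store zeros instead of whatever is left in the working buffer.  */
static int
demo_decrypt_checked (uint8_t out[16], const uint8_t in[16], int handle_ok)
{
  uint8_t tmp[16];
  memcpy (tmp, in, sizeof tmp);
  int err = demo_decrypt (tmp, handle_ok);
  if (err)
    memset (tmp, 0, sizeof tmp);
  memcpy (out, tmp, sizeof tmp);
  return err;
}
```

The status is still returned to the caller, as with sete in the assembly
above; only the stored data differs on the error path.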

[Bug tree-optimization/98339] New: GCC could not vectorize loop with conditional reduced add and store

2020-12-16 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98339

Bug ID: 98339
   Summary: GCC could not vectorize loop with conditional reduced
add and store
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wwwhhhyyy333 at gmail dot com
  Target Milestone: ---

For testcase

void foo(
    int* restrict x,
    int n,
    int start,
    int m,
    int* restrict ret
)
{
    for (int i = 0; i < n; i++)
    {
        int pos = start + i;
        if ( pos <= m)
            ret[0] += x[i];
    }
}

With -O3 -mavx2 it cannot be vectorized because ret[0] += x[i] becomes a
zero-step MASK_STORE inside the loop, and data-reference analysis fails for
zero-step stores.

But with loop store motion applied manually,

void foo2(
    int* restrict x,
    int n,
    int start,
    int m,
    int* restrict ret
)
{
    int tmp = 0;

    for (int i = 0; i < n; i++)
    {
        int pos = start + i;
        if (pos <= m)
            tmp += x[i];
    }

    ret[0] += tmp;
}

it can be vectorized.

godbolt: https://godbolt.org/z/Kcv8hP

There is no LIM between ifcvt and vect, and current LIM could not handle
MASK_STORE. Is there any possibility to vectorize foo, like by doing loop store
motion in ifcvt instead of creating MASK_STORE?

[Bug tree-optimization/98176] Loop invariant memory could not be hoisted when nonpure_call in loop body

2020-12-15 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98176

--- Comment #7 from Hongyu Wang  ---
(In reply to Richard Biener from comment #5)
> Yes.
> 
> For a LIM testcase an example with a memcpy might be more practically
> relevant.
> 
> For refactoring I'd start with classifying the unanalyzable refs as
> separate ref ID, marking it with another bit like ref_unanalyzed in
> in_mem_ref and asserting there's a single access of such refs.
> The mem_refs_may_alias_p code then needs to use stmt-based alias
> queries instead of refs_may_alias_p_1 using accesses_in_loop[0]->stmt.
> 
> And code testing for UNANALYZABLE_MEM_ID now needs to look at the
> ref_unanalyzed flag to not consider those refs for transforms.
> 
> Note this may blow up the memory requirements for testcases with lots
> of "unanalyzable" refs.
> 
> The nonpure-call code is more difficult to improve, even sincos can not
> return
> when the access to s or c traps.  Analyzing the arguments might help here.
> If you disregard that detail I think all ECF_LEAF|ECF_NOTHROW functions
> return normally.

Thanks for the suggestion. I did some refactoring accordingly, and this case
can now be vectorized.

diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 92e5a8dd774..3e3e81bc36f 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -119,6 +119,8 @@ public:
   (its index in memory_accesses.refs_list)  */
   unsigned ref_canonical : 1;   /* Whether mem.ref was canonicalized.  */
   unsigned ref_decomposed : 1;  /* Whether the ref was hashed from mem.  */
+  unsigned ref_unanalyzed : 1; /* Whether the ref was unanalyzed memory.  */
+
   hashval_t hash;  /* Its hash value.  */

   /* The memory access itself and associated caching of alias-oracle
@@ -260,7 +262,14 @@ static bool refs_independent_p (im_mem_ref *, im_mem_ref
*, bool = true);
 #define UNANALYZABLE_MEM_ID 0

 /* Whether the reference was analyzable.  */
-#define MEM_ANALYZABLE(REF) ((REF)->id != UNANALYZABLE_MEM_ID)
+#define MEM_ANALYZABLE(REF) ((REF)->id != UNANALYZABLE_MEM_ID  \
+&& !(REF)->ref_unanalyzed)
+
+#define REF_ID_UNANALYZABLE(id)   
\
+  (id == UNANALYZABLE_MEM_ID   \
+   || ((memory_accesses.refs_list[id]) \
+   && (memory_accesses.refs_list[id]->ref_unanalyzed)) \
+   )

 static struct lim_aux_data *
 init_lim_data (gimple *stmt)
@@ -829,7 +838,8 @@ set_profitable_level (gimple *stmt)
   set_level (stmt, gimple_bb (stmt)->loop_father, get_lim_data
(stmt)->max_loop);
 }

-/* Returns true if STMT is a call that has side effects.  */
+/* Returns true if STMT is a call that has side effects, or it is
+   not a function call with ECF_LEAF | ECF_NOTHROW.  */

 static bool
 nonpure_call_p (gimple *stmt)
@@ -837,6 +847,11 @@ nonpure_call_p (gimple *stmt)
   if (gimple_code (stmt) != GIMPLE_CALL)
 return false;

+  /* Simplified here, better to analyze call parameter.  */
+  int flags = gimple_call_flags (stmt);
+  if (flags & (ECF_LEAF | ECF_NOTHROW))
+return false;
+
   return gimple_has_side_effects (stmt);
 }

@@ -1377,6 +1392,7 @@ mem_ref_alloc (ao_ref *mem, unsigned hash, unsigned id)
   ref->id = id;
   ref->ref_canonical = false;
   ref->ref_decomposed = false;
+  ref->ref_unanalyzed = false;
   ref->hash = hash;
   ref->stored = NULL;
   ref->loaded = NULL;
@@ -1461,9 +1477,13 @@ gather_mem_refs_stmt (class loop *loop, gimple *stmt)
   mem = simple_mem_ref_in_stmt (stmt, &is_stored);
   if (!mem)
 {
-  /* We use the shared mem_ref for all unanalyzable refs.  */
-  id = UNANALYZABLE_MEM_ID;
-  ref = memory_accesses.refs_list[id];
+  /* Mark unanaylzable refs with different id and skip analysis. */
+  id = memory_accesses.refs_list.length ();
+  ref = mem_ref_alloc (NULL, 0, id);
+  ref->ref_unanalyzed = true;
+  memory_accesses.refs_list.safe_push (ref);
+  record_mem_ref_loc (ref, stmt, NULL);
+
   if (dump_file && (dump_flags & TDF_DETAILS))
{
  fprintf (dump_file, "Unanalyzed memory reference %u: ", id);
@@ -1576,7 +1596,7 @@ gather_mem_refs_stmt (class loop *loop, gimple *stmt)
   mark_ref_stored (ref, loop);
 }
   /* A not simple memory op is also a read when it is a write.  */
-  if (!is_stored || id == UNANALYZABLE_MEM_ID)
+  if (!is_stored || REF_ID_UNANALYZABLE (id))
 {
   bitmap_set_bit (&memory_accesses.refs_loaded_in_loop[loop->num],
ref->id);
   mark_ref_loaded (ref, loop);
@@ -1701,6 +1721,31 @@ mem_refs_may_alias_p (im_mem_ref *mem1, im_mem_ref
*mem2,
   poly_widest_int size1, size2;
   aff_tree off1, off2;

+  /* For refs marked as unanalyzed, use stmt_based alias analysis
+ and returns false when one mem_ref used by this unanalyzed stmt*/
+  if (mem1->ref_unanalyzed
+  || mem2->ref_unanalyzed)
+{
+  if (mem1->ref_unanalyzed
+ && 

[Bug tree-optimization/98176] Loop invariant memory could not be hoisted when nonpure_call in loop body

2020-12-08 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98176

--- Comment #6 from Hongyu Wang  ---
(In reply to Richard Biener from comment #5)
> (In reply to Hongyu Wang from comment #4)
> > (In reply to Richard Biener from comment #3)
> >  
> > > I see ret[0] has store-motion applied.  You don't see it vectorized
> > > because GCC doesn't know how to vectorize sincos (or cexpi which is
> > > what it lowers it to).
> > 
> > I doubt so, after manually store motion
> > 
> > #include <math.h>
> > 
> > float foo(
> > int *x,  
> > int n, 
> > float tx
> > )   
> > {
> > float ret[n];
> > float tmp;
> > 
> > #pragma omp simd
> > for (int i = 0; i < n; i++)
> > {
> > float s, c;
> > 
> > sincosf( tx * x[i] , &s, &c );
> > 
> > tmp += s*c; 
> > }
> > 
> > ret[0] += tmp; 
> > 
> > return ret[0];
> > }
> > 
> > with -Ofast -fopenmp-simd -std=c++11 it could be vectorized to call   
> > _ZGVbN4vvv_sincosf
> > 
> > ret[0] is moved for sinf() case, but not sincosf() with above options.
> 
> What target are you targeting?  Can you provide the sincosf prototype
> from your math.h?  (please attach preprocessed source).
> 
> I cannot reproduce sincosf _not_ being lowered to cexpif and thus
> no longer having memory writes.
> 

I used g++ on godbolt:

https://gcc.godbolt.org/z/rv45MK

The extern declaration below is sufficient for g++ to vectorize the code

__attribute__ ((__simd__ ("notinbranch"))) extern void sincosf (float __x,
float *__sinx, float *__cosx); 

compiled with -Ofast -fopenmp-simd -std=c++11 -march=x86-64

[Bug tree-optimization/98176] Loop invariant memory could not be hoisted when nonpure_call in loop body

2020-12-08 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98176

--- Comment #4 from Hongyu Wang  ---
(In reply to Richard Biener from comment #3)

> I see ret[0] has store-motion applied.  You don't see it vectorized
> because GCC doesn't know how to vectorize sincos (or cexpi which is
> what it lowers it to).

I doubt it. After manual store motion:

#include <math.h>

float foo(
int *x,  
int n, 
float tx
)   
{
float ret[n];
float tmp;

#pragma omp simd
for (int i = 0; i < n; i++)
{
float s, c;

sincosf( tx * x[i] , &s, &c );

tmp += s*c; 
}

ret[0] += tmp; 

return ret[0];
}

With -Ofast -fopenmp-simd -std=c++11 it can be vectorized to call
_ZGVbN4vvv_sincosf.

ret[0] is moved in the sinf() case, but not in the sincosf() case with the
above options.

> 
> If you replace sincosf with a random call then you'll hit the issue
> that LIMs dependence analysis doesn't handle it at all since it cannot
> represent it.  That will block further optimization in the loop.
> 
> That can possibly be improved.
> 

So could LIM's dependence analysis handle known library functions and just
analyze their memory parameters? A random call may have unknown behavior.

> > if (nonpure_call_p (stmt))
> >   {
> >  maybe_never = true; 
> >  outermost = NULL;  
> >   }
> > 
> > So no store-motion chance for any future statement in such block.
> 
> That's another issue - the call may not return.  Here the granularity
> is per BB and thus loads/stores in the same BB are not considered for
> sinking.
> 

IMHO the condition may be too strict for known library calls.

[Bug tree-optimization/98176] Loop invariant memory could not be hoisted when nonpure_call in loop body

2020-12-07 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98176

--- Comment #2 from Hongyu Wang  ---
>> I doubt the call is the issue btw.

The aliasing could be removed by 

float foo(int *x, int n, float tx)   
{
float ret[n];

#pragma omp simd
for (int i = 0; i < n; i++)
{
float s, c;

s = c = tx * x[i];

ret[0] += s*c;
}

return ret[0];
}

This is successfully vectorized, and the dump from lim2 has:

Moving statement
ret.1__I_lsm.7 = (*ret.1_18)[0];

But for 

float foo(int *x, int n, float tx)   
{
float ret[n];

#pragma omp simd
for (int i = 0; i < n; i++)
{
float s, c;

sincosf( tx * x[i] , &s, &c );

ret[0] += s*c;
}

return ret[0];
}

It still cannot be vectorized. Some initial debugging shows that
tree-ssa-loop-im.c has

if (nonpure_call_p (stmt))
  {
 maybe_never = true; 
 outermost = NULL;  
  }

So there is no chance of store motion for any later statement in such a block.

As a comparison, this could also be vectorized with simd clone:

float foo(int *x, int n, float tx)   
{
float ret[n];

#pragma omp simd
for (int i = 0; i < n; i++)
{
float s, c;

s = c = sinf( tx * x[i]);

ret[0] += s*c;
}

return ret[0];
}

[Bug tree-optimization/98176] New: Loop invariant memory could not be hoisted when nonpure_call in loop body

2020-12-07 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98176

Bug ID: 98176
   Summary: Loop invariant memory could not be hoisted when
nonpure_call in loop body
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wwwhhhyyy333 at gmail dot com
  Target Milestone: ---

For testcase

#include <math.h>

void foo(float *x, float tx, float *ret, int n)
{

#pragma omp simd
    for (int i = 0; i < n; i++)
    {
        float s,c;

        sincosf(x[i] * tx, &s, &c);

        *ret += s * c;
    }
}

It could not be vectorized with -Ofast -fopenmp-simd -std=c++11

https://gcc.godbolt.org/z/ba77az

With manual hoisting it can be vectorized with a simd clone:

void foo(float *x, float tx, float *ret, int n)
{
    float tmp = 0.0f;

#pragma omp simd
    for (int i = 0; i < n; i++)
    {
        float s,c;

        sincosf(x[i] * tx, &s, &c);

        tmp += s*c;
    }
    *ret += tmp;
}

https://gcc.godbolt.org/z/bea17x

Is it possible for lim to perform store motion on a case like this?

[Bug target/97231] Missing FSF copyright notes for some x86 intrinsic headers

2020-09-28 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97231

--- Comment #1 from Hongyu Wang  ---
Created attachment 49280
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49280&action=edit
A patch

[Bug target/97231] New: Missing FSF copyright notes for some x86 intrinsic headers

2020-09-28 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97231

Bug ID: 97231
   Summary: Missing FSF copyright notes for some x86 intrinsic
headers
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wwwhhhyyy333 at gmail dot com
  Target Milestone: ---

Many x86 intrinsic header files don't have an FSF copyright notice:

amxbf16intrin.h
amxint8intrin.h
amxtileintrin.h
avx512vp2intersectintrin.h
avx512vp2intersectvlintrin.h
pconfigintrin.h
tsxldtrkintrin.h
wbnoinvdintrin.h