[Bug target/112943] [14 Regression] ICE: in gen_reg_rtx, at emit-rtl.cc:1176 with -O2 -march=westmere -mapxf
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112943 Hongyu Wang changed: What|Removed |Added CC||wwwhhhyyy333 at gmail dot com --- Comment #2 from Hongyu Wang --- Sorry for introducing this, a patch is posted at https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640174.html
[Bug middle-end/112824] Stack spills and vector splitting with vector builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 Hongyu Wang changed: What|Removed |Added CC||wwwhhhyyy333 at gmail dot com --- Comment #9 from Hongyu Wang --- (In reply to Hongtao Liu from comment #4)
> there're 2 reasons.
> 2. There's still spills for (subreg:DF (reg: V8DF) since
> ix86_modes_tieable_p return false for DF and V8DF.
>

There could be an issue in SRA where the aggregates are not properly scalarized because of the size limit. SRA computes the maximum aggregate size as move_ratio * UNITS_PER_WORD, but here the aggregate Dual, 2l> actually contains several V8DF components, each of which can be handled in a zmm register under avx512f. Adding --param sra-max-scalarization-size-Ospeed=2048 eliminates those spills.

So for SRA we could consider using MOVE_MAX * move_ratio as the size limit for Ospeed, which reflects the real backend instruction count.
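To make the size-limit arithmetic in the comment above concrete, here is a minimal sketch in C. All constants are assumptions chosen for illustration (an 8-byte word, a 64-byte maximum move for one zmm register, and a made-up move ratio of 16); the real values come from the target and its tuning tables, and the function names are hypothetical, not GCC internals.

```c
#include <stddef.h>

/* Illustrative values only (assumptions, not read from GCC sources).  */
enum
{
  ASSUMED_UNITS_PER_WORD = 8,  /* bytes per general-purpose register */
  ASSUMED_MOVE_MAX = 64,       /* bytes moved by one zmm load/store */
  ASSUMED_MOVE_RATIO = 16      /* hypothetical number of moves allowed */
};

/* Limit as described for current SRA: move_ratio * UNITS_PER_WORD.  */
size_t
sra_limit_word_based (void)
{
  return (size_t) ASSUMED_MOVE_RATIO * ASSUMED_UNITS_PER_WORD;
}

/* Limit suggested for Ospeed: MOVE_MAX * move_ratio.  */
size_t
sra_limit_move_max_based (void)
{
  return (size_t) ASSUMED_MOVE_MAX * ASSUMED_MOVE_RATIO;
}

/* Nonzero when an aggregate of SIZE bytes is within LIMIT.  */
int
fits_limit (size_t size, size_t limit)
{
  return size <= limit;
}
```

With these assumed values, an aggregate holding four 64-byte vectors (256 bytes) exceeds the word-based limit of 128 bytes but fits under the MOVE_MAX-based limit of 1024 bytes, even though it needs only four vector moves.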
[Bug testsuite/112729] gcc.target/i386/apx-interrupt-1.c etc. FAIL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112729 --- Comment #7 from Hongyu Wang --- (In reply to r...@cebitec.uni-bielefeld.de from comment #5) > > Is there a reason to have -fomit-frame-pointer once before and once > after -mapx-features=push2pop2? Ah, thanks for pointing that out. Will adjust the order to keep them after -mapx-features.
[Bug testsuite/112729] gcc.target/i386/apx-interrupt-1.c etc. FAIL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112729 --- Comment #3 from Hongyu Wang --- Created attachment 56703 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56703&action=edit A patch Hi Rainer, can you help verify whether the change makes these tests pass on Solaris/FreeBSD?
[Bug testsuite/112729] gcc.target/i386/apx-interrupt-1.c etc. FAIL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112729 --- Comment #2 from Hongyu Wang --- The CFI scan failures were caused by -fno-omit-frame-pointer, which forces the frame pointer to be pushed first, so the CFI info becomes different. -fomit-frame-pointer is the default on Linux, but not on other targets. I'd just add -fomit-frame-pointer to these tests.
[Bug target/112394] ICE: in extract_constrain_insn, at recog.cc:2705 insn does not satisfy its constraints: {*vec_extractv2di_1} with -O -mavx512vbmi2 -mapxf -mno-sse4.2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112394 --- Comment #2 from Hongyu Wang --- Should be fixed.
[Bug tree-optimization/112325] New: Missed vectorization after cunrolli
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
Bug ID: 112325
Summary: Missed vectorization after cunrolli
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: wwwhhhyyy333 at gmail dot com
Target Milestone: ---

testcase:

#include <stdint.h>
#include <string.h>

typedef struct {
  float s;
  int8_t qs[32];
} block;

void foo (const int n, float * restrict s, const int8_t q[4], const block * restrict y)
{
  const int qk = 32;
  const int nb = n / qk;
  float sumf = 0.0;
  int sumi = 0;
  for (int i = 0; i < nb; i++)
    {
      uint32_t qh;
      memcpy(&qh, q, 4);
      for (int j = 0; j < qk/2; ++j)
        {
          sumi += (qh >> j) * y[i].qs[j];
        }
      sumf += (y[i].s * (float) sumi);
    }
  *s = sumf;
}

This can be vectorized under -O2 -mavx512vl but not under -O3 -mavx512vl, see https://godbolt.org/z/csPr4cPen

Under -O3 -mavx512vl -fdisable-tree-cunrolli the loop can also be vectorized.
[Bug target/111127] [13/14 regression] Wrong code for avx512ne2ps2bf16_maskz intrinsics since gcc13
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111127 Hongyu Wang changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #3 from Hongyu Wang --- Fixed on trunk and gcc13.
[Bug target/111127] New: Wrong code for avx512ne2ps2bf16_maskz intrinsics since gcc13
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111127
Bug ID: 111127
Summary: Wrong code for avx512ne2ps2bf16_maskz intrinsics since gcc13
Product: gcc
Version: 13.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: wwwhhhyyy333 at gmail dot com
Target Milestone: ---

cat test.c

#include <immintrin.h>

__m512bh cvttest(__mmask32 k, __m512 a, __m512 b)
{
  return _mm512_maskz_cvtne2ps_pbh (k,a,b);
}

gcc -O2 -mavx512bf16

        kmovd   %edi, %k1
        vcvtne2ps2bf16  %zmm0, %zmm1, %zmm0{%k1}{z}
        ret

The code is wrong compared to clang: the input operand order is inverted. See https://godbolt.org/z/b161deerY
[Bug rtl-optimization/110215] RA fails to allocate register when loop invariant lives across calls and eh
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110215 --- Comment #6 from Hongyu Wang --- Thanks for the fix. Now, for the attached test, the main loop no longer has any load. There is a remaining issue: the loop epilogue still contains loads from the stack and the constant pool.

.L9:
        movslq  %edx, %rax
        movss   72(%rsp), %xmm5
        salq    $2, %rax
        leaq    (%rbx,%rax), %rcx
        movaps  %xmm5, %xmm1
        subss   (%rcx), %xmm1
        andps   .LC4(%rip), %xmm1
        movss   %xmm1, (%rcx)
        leal    1(%rdx), %ecx
        addss   %xmm1, %xmm0
        cmpl    %ecx, %r12d
        jle     .L8

The IRA dump shows that the pseudos do not conflict, but they still fail to be allocated a register. This issue does not exist on aarch64.
[Bug lto/110424] New: Bogus ODR warning for FMV member function with -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110424
Bug ID: 110424
Summary: Bogus ODR warning for FMV member function with -flto
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: lto
Assignee: unassigned at gcc dot gnu.org
Reporter: wwwhhhyyy333 at gmail dot com
CC: marxin at gcc dot gnu.org
Target Milestone: ---

cat m1.h
---
#pragma once

class A
{
public:
  int foo1();
};
---

cat m1.cpp
---
#include "m1.h"

__attribute__((target_clones("default","arch=icelake-server")))
int A::foo1()
{
  return 0;
}
---

cat m2.cpp
---
#include "m1.h"

int main()
{
  A a;
  return a.foo1();
}
---

g++ -flto -Werror m1.cpp m2.cpp -o m2

m1.h:6:7: error: ‘foo1’ violates the C++ One Definition Rule [-Werror=odr]
    6 |   int foo1();
      |       ^
m1.cpp:9:1: note: ‘_ZN1A4foo1Ev’ was previously declared here
    9 | }
      | ^
lto1: all warnings being treated as errors

The output binary should be essentially the same as the one built without LTO, so the warning seems to be bogus.
[Bug rtl-optimization/110215] New: RA fails to allocate register when loop invariant lives through EH region
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110215
Bug ID: 110215
Summary: RA fails to allocate register when loop invariant lives through EH region
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: wwwhhhyyy333 at gmail dot com
Target Milestone: ---

Created attachment 55305 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55305&action=edit
A Testcase

Compiled with -Ofast, the innermost loop is

.L41:
        movups  (%rax), %xmm3
        movaps  (%rsp), %xmm0
        addq    $16, %rax
        subps   %xmm3, %xmm0
        andps   %xmm2, %xmm0
        movups  %xmm0, -16(%rax)
        addps   %xmm0, %xmm1
        cmpq    %rax, %rdx
        jne     .L41

while Clang produces

.LBB0_14:                               # Parent Loop BB0_3 Depth=1
        movups  (%rbp,%rax), %xmm1
        movaps  %xmm3, %xmm2
        subps   %xmm1, %xmm2
        andps   %xmm4, %xmm2
        movups  %xmm2, (%rbp,%rax)
        addps   %xmm2, %xmm0
        addq    $16, %rax
        cmpq    %rax, %r12
        jne     .LBB0_14

The loop invariant `base` is spilled to the stack by GCC, whereas Clang keeps it directly in an SSE register.

Godbolt: https://godbolt.org/z/TTvG8M6E8
[Bug libstdc++/110138] Extra constructor called when using basic_string::operator+
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110138 Hongyu Wang changed: What|Removed |Added Resolution|--- |INVALID Status|UNCONFIRMED |RESOLVED --- Comment #4 from Hongyu Wang --- (In reply to Jonathan Wakely from comment #3) > (In reply to Hongyu Wang from comment #0) > > GCC 12.3/Clang 16 outputs: > > Alloc: 3 > > Alloc: 6 > > Alloc: 9 > > Alloc: 12 > > "Clang 16" here actually means "Any version of Clang with libstdc++ headers > from GCC 12". > > The figures for Clang's own libc++ are different: > > Alloc: 0 > Alloc: 4 > Alloc: 8 > Alloc: 12 > > But again, this is meaningless. Nobody cares how many times an allocator is > copied. The original test intends to verify P1165R1 implementation and it uses a global counter on allocator constructor to see if it is correctly selected, and current change makes it copied twice so the result is not expected. But yes, I agree the allocator constructor for string should be cheap, and the original test should not rely on how many times the constructor was called to verify P1165R1 (I suppose checks if soccc was called instead). Thanks for the explanation, I will close this as invalid.
[Bug libstdc++/110138] Extra constructor called when using basic_string::operator+
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110138 --- Comment #1 from Hongyu Wang --- operator+ now calls std::__cxx11::basic_string, myAlloc_ >::get_allocator, and it will call the constructor again after gimplify

__attribute__((nodiscard))
struct allocator_type std::__cxx11::basic_string, myAlloc_ >::get_allocator (const struct basic_string * const this)
{
  try
    {
      _1 = std::__cxx11::basic_string, myAlloc_ >::_M_get_allocator (this);
      myAlloc_::myAlloc_ (, _1);
      return ;
    }
  catch
    {
      <<>>
    }
  __builtin_unreachable trap ();
}

Possibly caused by r13-3814-gc93baa93df2d45
[Bug libstdc++/110138] New: Extra constructor called when using basic_string::operator+
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110138
Bug ID: 110138
Summary: Extra constructor called when using basic_string::operator+
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: wwwhhhyyy333 at gmail dot com
Target Milestone: ---

Created attachment 55268 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55268&action=edit
Simplified test

Compiled with -std=c++20 -O0.

GCC 12.3/Clang 16 outputs:
Alloc: 3
Alloc: 6
Alloc: 9
Alloc: 12

GCC 13.1 outputs:
Alloc: 3
Alloc: 7
Alloc: 11
Alloc: 15
[Bug libgomp/109062] [13 regression] Default value of GOMP_SPINCOUNT changes since r13-2545
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109062 Hongyu Wang changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #2 from Hongyu Wang --- Fixed on trunk so far.
[Bug libgomp/109062] New: [13 regression] Default value of GOMP_SPINCOUNT changes since r13-2545
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109062
Bug ID: 109062
Summary: [13 regression] Default value of GOMP_SPINCOUNT changes since r13-2545
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: wwwhhhyyy333 at gmail dot com
CC: jakub at gcc dot gnu.org
Target Milestone: ---

Recently we found several big regressions in the Phoronix OpenMP benchmarks with GCC 13. The regressions are caused by r13-2545-g9f2fca56593a2b.

The issue is that the default value of GOMP_SPINCOUNT is now 0, instead of 30 before this patch, which makes every OpenMP program behave as if OMP_WAIT_POLICY=passive were set.

As the comment in libgomp/env.c says:

/* Using a rough estimation of 10 spins per msec, use 5 min blocking for OMP_WAIT_POLICY=active, 3 msec blocking when OMP_WAIT_POLICY is not specificed and 0 when OMP_WAIT_POLICY=passive. Depending on the CPU speed, this can be e.g. 5 times longer or 5 times shorter. */

The current code for wait_policy is

  if (none != NULL && gomp_get_icv_flag (none->flags, GOMP_ICV_WAIT_POLICY))
    wait_policy = none->icvs.wait_policy;
  else if (all != NULL && gomp_get_icv_flag (all->flags, GOMP_ICV_WAIT_POLICY))
    wait_policy = all->icvs.wait_policy;

If OMP_WAIT_POLICY is not specified, neither branch is entered, since gomp_get_icv_flag returns 0 by default, so wait_policy keeps its uninitialized value. Prior to this patch, wait_policy was set to -1 (not specified) by parse_wait_policy ().
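The fall-through described in this report can be modelled with a small, self-contained C sketch. The struct, the flag bit, and the function name below are hypothetical simplifications, not libgomp's real data structures, and the sketch is not the fix that was committed; it only shows how giving wait_policy an explicit "not specified" default of -1 makes the no-match case well defined.

```c
#include <stddef.h>

/* Hypothetical flag bit meaning "OMP_WAIT_POLICY was given".  */
#define ICV_WAIT_POLICY_FLAG 0x1u

/* Hypothetical, simplified stand-in for one ICV list entry.  */
struct icv_entry
{
  unsigned flags;   /* which ICVs were explicitly set */
  int wait_policy;  /* 1 = active, 0 = passive */
};

/* Returns -1 (not specified), 0 (passive) or 1 (active).  */
int
select_wait_policy (const struct icv_entry *none,
                    const struct icv_entry *all)
{
  /* Explicit default: without it, the value would be indeterminate
     whenever neither branch below is taken.  */
  int wait_policy = -1;

  if (none != NULL && (none->flags & ICV_WAIT_POLICY_FLAG))
    wait_policy = none->wait_policy;
  else if (all != NULL && (all->flags & ICV_WAIT_POLICY_FLAG))
    wait_policy = all->wait_policy;

  return wait_policy;
}
```

A caller can then choose the spin count from the three distinct results instead of reading an indeterminate value.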
[Bug target/107692] [13 regression] r13-3950-g071e428c24ee8c breaks many test cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107692 --- Comment #12 from Hongyu Wang --- Fixed for GCC 13. Sorry for introducing this.
[Bug target/107692] [13 regression] r13-3950-g071e428c24ee8c breaks many test cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107692 --- Comment #9 from Hongyu Wang --- (In reply to Segher Boessenkool from comment #8)
> (In reply to Jiu Fu Guo from comment #5)
> > > -munroll-only-small-loops does not turn on or off -funroll-loops, and it
> > > should not, so that it does what it says, if nothing else.
> >
> > Yes, and -funroll-loops would win over -munroll-only-small-loops
>
> -funroll-loops is the only thing that enables loop unrolling.
> -munroll-only-small-loops, like the name says, says to only unroll small
> loops,
> and no others. It is not something at the same level as -funroll-loops, that
> would be insanity: other code likes to see if the user requested loops to be
> unrolled as well!

I can understand the logic, my initial patch https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604345.html is something similar to rs6000 and x86 only. The difference is that -mno-unroll-only-small-loops -O2 would cause rtl-loop-unroll to take effect, and cunroll would also work, if we follow the rs6000 change. We do not really want these, so the patch becomes ugly as said :(

I think the intention of -munroll-only-small-loops is just to adjust rtl-loop-unrolling and not to touch middle-end unroll/cunroll. But I think your point is also reasonable. Maybe we can split flag_unroll_loops into tree and rtl parts separately?

Anyway, I will propose a patch and re-discuss with the maintainers later. Thanks!
[Bug tree-optimization/107717] [13 Regression] ICEs expanding permutes after g:dc95e1e9702f2f6367bbc108c8d01169be1b66d2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107717 --- Comment #4 from Hongyu Wang --- (In reply to Tamar Christina from comment #3) > Fixed Thanks for the fix! It also gives me a good tip for writing match patterns :)
[Bug middle-end/107734] [13 Regression] valgrind error for gcc/testsuite/cc.target/i386/pr46051.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107734 --- Comment #12 from Hongyu Wang --- (In reply to Andrew Pinski from comment #9) > Fixed. Thanks for the fix! I was not aware that sbitmap does not have a default constructor :(.
[Bug target/107692] [13 regression] r13-3950-g071e428c24ee8c breaks many test cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107692 --- Comment #6 from Hongyu Wang --- (In reply to Jiu Fu Guo from comment #4) > (In reply to Hongyu Wang from comment #2) > > Created attachment 53897 [details] > > A patch > > > > Sorry for introducing these fails. Here is the patch. > > > > I've tested the patch with cross-compler and all the fails disappeared, but > > I don't have a powerpc to do full bootstrap & regtest (I'm still applying > > for gcc farm account). > > > > I'll send out the patch after I can access gcc farm for a power machine, or > > hopefully someone can help testing the patch. > > > > I suppose s390 has similar issue and I will update that accordingly. > Hi, > > One small comment, for code "if (!(flag_unroll_loops || > flag_unroll_all_loops))" > we may need to add one more condition "|| loop->unroll", like what does in > r13-3950 for i386.cc. Otherwise, unroll pragma may be affected. Yes, I've already posted the patch at https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606478.html
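The condition discussed in this exchange can be written out as a small predicate. This is only a sketch: the struct and function below are hypothetical stand-ins rather than GCC's real loop descriptor or target hook, and the reading that a true result means "restrict unrolling to small loops" is an assumption based on the quoted review comment.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the loop descriptor; only the field
   referenced in the review comment is modelled.  */
struct loop_info
{
  unsigned unroll;  /* nonzero when an unroll pragma applies */
};

/* True when the small-loops-only restriction should apply, that is,
   when unrolling was requested neither by -funroll-loops, nor by
   -funroll-all-loops, nor by an unroll pragma on this loop.  */
bool
restrict_to_small_loops (bool flag_unroll_loops,
                         bool flag_unroll_all_loops,
                         const struct loop_info *loop)
{
  return !(flag_unroll_loops
           || flag_unroll_all_loops
           || loop->unroll != 0);
}
```

Without the third term, a loop carrying an unroll pragma would be treated like any other loop when neither flag is given, which is the case the reviewer points out.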
[Bug target/107692] [13 regression] r13-3950-g071e428c24ee8c breaks many test cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107692 --- Comment #2 from Hongyu Wang --- Created attachment 53897 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53897&action=edit A patch Sorry for introducing these failures. Here is the patch. I've tested the patch with a cross-compiler and all the failures disappeared, but I don't have a powerpc machine to do a full bootstrap & regtest (I'm still applying for a gcc farm account). I'll send out the patch once I can access a power machine on the gcc farm, or hopefully someone can help test the patch. I suppose s390 has a similar issue and I will update that accordingly.
[Bug target/107676] Nonsensical docs for -mrelax-cmpxchg-loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107676 --- Comment #6 from Hongyu Wang --- (In reply to Andrew Pinski from comment #5) > (In reply to Jonathan Wakely from comment #4) > > I don't think __atomic_compare_exchange emits such a loop. This is about > > __atomic_fetch_xor and friends, which do emit cmpxchg loops. But there are > > four such functions to name. > > Oh yes right. > Then this: > For compare and exchange loops that are emitted by some __atomic_* builtins > (e.g. ), emit an atomic load before the loop and if the value was not > the expected value, emit a pause instruction. This might reduce execussive > cache bouncing of the memory. > > > I think that is better wording than it was before. I hope the person who > added this option can take over this to get it closer to what it should be. Thanks for all the suggestions, a patch has been posted at https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606212.html
[Bug target/107304] internal compiler error: in convert_move, at expr.cc:220 with -march=tigerlake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107304 --- Comment #10 from Hongyu Wang --- (In reply to H.J. Lu from comment #9) > (In reply to Hongtao.liu from comment #8) > > (In reply to H.J. Lu from comment #7) > > > (In reply to Hongtao.liu from comment #6) > > > > (In reply to Hongtao.liu from comment #5) > > > > > (In reply to H.J. Lu from comment #4) > > > > > > Since the default is -march=tigerlake, it enables AVX512 in the > > > > > > middle end. > > > > > > When "arch=alderlake" disables AVX512, we fails to expand AVX512 to > > > > > > non-AVX512 > > > > > > ISAs. It means that target_clones can't be more restrictive than the > > > > > > default. We > > > > > > should provide better diagnostics. > > > > > > > > > > Is there any place checking ISA difference for target_clones? > > > > > > > > ix86_valid_target_attribute_inner_p? > > > > > > It may not have all ISA infos. Will this > > > > > > diff --git a/gcc/config/i386/i386-options.cc > > > b/gcc/config/i386/i386-options.cc > > > index acb2291e70f..1efaae132e9 100644 > > > --- a/gcc/config/i386/i386-options.cc > > > +++ b/gcc/config/i386/i386-options.cc > > > @@ -2953,6 +2953,14 @@ ix86_option_override_internal (bool main_args_p, > > > fine grained control & costing. */ > > >SET_OPTION_IF_UNSET (opts, opts_set, param_vect_partial_vector_usage, > > > 0); > > > > > > + if (!main_args_p > > > + && _options != opts > > > + && (((opts->x_ix86_isa_flags & global_options.x_ix86_isa_flags) > > > + != global_options.x_ix86_isa_flags) > > > +|| ((opts->x_ix86_isa_flags2 & global_options.x_ix86_isa_flags2) > > > +!= global_options.x_ix86_isa_flags2))) > > > +error ("Target ISAs are more restrictive than the default"); > > > + > > >return true; > > > } > > > > > > work? > > > > Looks reasonable to me. > > It doesn't work since we may use target attribute to disable MMX/SSE/SSE2. > This problem seems to be __builtin_shuffle related. Clang works properly as it overrides -march= to any target clones. 
I suppose we can do similar things in ix86_valid_target_attribute_p https://godbolt.org/z/v7xT1zahd
[Bug target/106180] [13 Regression] ICE in extract_insn, at recog.cc:2791 since r13-1418-g73f942c08deef3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106180 Hongyu Wang changed: What|Removed |Added CC||wwwhhhyyy333 at gmail dot com --- Comment #2 from Hongyu Wang --- (In reply to Jakub Jelinek from comment #1)
> I think the r13-1418 change was just wrong. It is fine to add a pattern
> with V2SF input rather than vec_select of first half of V4SF input, but I
> don't understand why you need to restrict one to memory_operand and the
> other to register_operand, why vector_operand "vm" can't be used for both.
> Not doing that ties hands of the register allocator, if something is memory
> during expansion, it would be always in memory, if something isn't memory,
> it couldn't ever be memory.
> Is your concern not getting a SIGSEGV if first 2 SF elts are at the end of a
> page and 2 further SF elts are in a non-mapped page?

The cvtps2pd instruction takes m64 as its memory input, so the original pattern is not proper, since it allows a V4SF memory input, although the generated code may work because for unpack_lo the address is the same. The cross-page issue is one of the potential problems we can hit.

For this pattern, I think we can add

  if (MEM_P (operands[1]))
    operands[1] = gen_lowpart (V2SFmode, operands[1])

There are many other unpacks_low expanders that allow memory input but fall directly to cvt instructions. We plan to fix all of them soon.
[Bug target/105339] [x86] missing AVX-512F scalef functions when optimization is disabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105339 --- Comment #7 from Hongyu Wang --- Fixed for gcc-9/10/11/12.
[Bug target/105288] AVX/AVX512 casts should use the "v" constraint
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105288 --- Comment #1 from Hongyu Wang --- I think should be these 2? (define_insn_and_split "avx512f__" [(set (match_operand:AVX512MODE2P 0 "nonimmediate_operand" "=x,m") (vec_concat:AVX512MODE2P (vec_concat: (match_operand: 1 "nonimmediate_operand" "xm,x") (unspec: [(const_int 0)] UNSPEC_CAST)) (unspec: [(const_int 0)] UNSPEC_CAST)))] "TARGET_AVX512F && !(MEM_P (operands[0]) && MEM_P (operands[1]))" (define_insn_and_split "avx512f__256" [(set (match_operand:AVX512MODE2P 0 "nonimmediate_operand" "=x,m") (vec_concat:AVX512MODE2P (match_operand: 1 "nonimmediate_operand" "xm,x") (unspec: [(const_int 0)] UNSPEC_CAST)))] "TARGET_AVX512F && !(MEM_P (operands[0]) && MEM_P (operands[1]))" The AVX insn shouldn't have constraints "v"
[Bug target/105034] [10/11/12 regression]Suboptimal codegen for min/max with -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105034 --- Comment #2 from Hongyu Wang --- For -O2, stv doesn't do such a transform:

Computing gain for chain #1...
  Instruction gain 8 for 7: {r84:SI=smax(r85:SI,0);clobber flags:CC;}
      REG_DEAD r85:SI
      REG_UNUSED flags:CC
  Instruction conversion gain: 8
  Registers conversion cost: 12
  Total gain: -4

since the sse->integer reg move cost is 6 for the generic cost. But for -Os the cost is 3, so the conversion is considered profitable:

Computing gain for chain #1...
  Instruction gain 8 for 7: {r84:SI=smax(r85:SI,0);clobber flags:CC;}
      REG_DEAD r85:SI
      REG_UNUSED flags:CC
  Instruction conversion gain: 8
  Registers conversion cost: 6
  Total gain: 2

FWIW, the solution would be either to adjust the ix86_size cost, or to block optimize_size in the stv gate.
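The two dumps above follow the same arithmetic: the total gain is the instruction conversion gain minus the registers conversion cost. The sketch below restates that in C with hypothetical function names; the rule that a chain is converted only when the total gain is positive is an assumption made for illustration. In both dumps the registers conversion cost happens to be twice the quoted sse->integer move cost (12 for 6, 6 for 3), but that factor is only an observation from these two dumps.

```c
/* Total gain as printed in the dumps: conversion gain minus the
   cost of moving values between register files.  */
int
stv_total_gain (int insn_conversion_gain, int reg_conversion_cost)
{
  return insn_conversion_gain - reg_conversion_cost;
}

/* Assumed decision rule: convert only when the total gain is positive.  */
int
stv_is_profitable (int insn_conversion_gain, int reg_conversion_cost)
{
  return stv_total_gain (insn_conversion_gain, reg_conversion_cost) > 0;
}
```

Plugging in the dump values gives 8 - 12 = -4 for the -O2 case and 8 - 6 = 2 for the -Os case, matching the "Total gain" lines.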
[Bug target/104978] [avx512fp16] wrong code for _mm_mask_fcmadd_round_sch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104978 --- Comment #5 from Hongyu Wang --- Fixed for GCC 12.
[Bug target/104977] [avx512fp16] wrong code for vfmaddcsh when -masm=intel.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104977 --- Comment #3 from Hongyu Wang --- Fixed for GCC 12.
[Bug target/104726] gcc.target/i386/pr104551.c FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104726 --- Comment #7 from Hongyu Wang --- Fixed for GCC 12.
[Bug target/104724] gcc.target/i386/avx512fp16-vcvtsi2sh-1b.c etc. FAIL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104724 --- Comment #4 from Hongyu Wang --- Fixed for GCC 12.
[Bug target/104726] gcc.target/i386/pr104551.c FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104726 Hongyu Wang changed: What|Removed |Added Attachment #52532|0 |1 is obsolete|| --- Comment #4 from Hongyu Wang --- Created attachment 52535 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52535&action=edit Updated patch (In reply to Jakub Jelinek from comment #3) > Also, the builtin at the start of main compiled with -mavx2 is risky, there > could be avx2 insns e.g. in the prologue. avx2-check.h is the usual way Thanks for pointing it out, updated accordingly. Hi Rainer, sorry for the previous mistake, can you try the updated one?
[Bug target/104726] gcc.target/i386/pr104551.c FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104726 --- Comment #1 from Hongyu Wang --- Created attachment 52532 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52532&action=edit A patch Hi Rainer, can you try this on your Solaris system? We don't have such a platform to confirm that it works. I'll install it if it passes, or you can directly push it as an obvious fix.
[Bug target/104724] gcc.target/i386/avx512fp16-vcvtsi2sh-1b.c etc. FAIL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104724 --- Comment #1 from Hongyu Wang --- Created attachment 52531 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52531&action=edit A patch Hi Rainer, can you try this on your Solaris system? We don't have such a platform to confirm that it works. I'll install it if it passes, or you can directly push it as an obvious fix.
[Bug rtl-optimization/104664] [12 Regression] ICE: in extract_constrain_insn, at recog.cc:2670 (insn does not satisfy its constraints) with -Og -ffinite-math-only
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104664 --- Comment #6 from Hongyu Wang --- Fixed for GCC 12.
[Bug rtl-optimization/104664] [12 Regression] ICE: in extract_constrain_insn, at recog.cc:2670 (insn does not satisfy its constraints) with -Og -ffinite-math-only
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104664 --- Comment #4 from Hongyu Wang --- (In reply to Uroš Bizjak from comment #3)
> Reconfirmed as RA issue.

I'm afraid we should avoid patterns like

(insn 180 179 182 2 (set (reg:V8HF 220)
        (subreg:V8HF (reg:HF 221) 0)) "pr104664.c":12:7 1710 {movv8hf_internal}

since we don't have a corresponding pattern with subreg. Reload might not be properly aware of the newly inserted regs, as the message shows:

Set class ALL_REGS for r221
Set class ALL_REGS for r220

I'm testing

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 6cf1a0b9cb6..658516d86a2 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -14883,7 +14883,12 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
   dperm.one_operand_p = true;

   if (mode == V8HFmode)
-    tmp1 = lowpart_subreg (V8HFmode, force_reg (HFmode, val), HFmode);
+    {
+      tmp1 = force_reg (HFmode, val);
+      tmp2 = gen_reg_rtx (mode);
+      emit_insn (gen_vec_setv8hf_0 (tmp2, CONST0_RTX (mode), tmp1));
+      tmp1 = gen_lowpart (mode, tmp2);
+    }
[Bug target/104664] ICE: in extract_constrain_insn, at recog.cc:2670 (insn does not satisfy its constraints) with -Og -ffinite-math-only
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104664 --- Comment #2 from Hongyu Wang --- starting from r12-6021
[Bug target/103069] cmpxchg isn't optimized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069 --- Comment #19 from Hongyu Wang --- (In reply to Thiago Macieira from comment #18)
> (In reply to Jakub Jelinek from comment #17)
> > _Pragma("GCC target \"relax-cmpxchg-loop\"")
> > should do that (ditto target("relax-cmpxchg-loop") attribute).
>
> The attribute is applied to a function. I'm hoping to do it for s block of
> code:
>
> _Pragma("GCC push_options")
> _Pragma("GCC target \"relax-cmpxchg-loop\"")
> __atomic_compare_exchange_weak();
> _Pragma("GCC pop_options")

I'm not aware of any target __attribute__ or #pragma that can be applied to a code block. At this level users can change their code directly, so I don't see why it is needed.
[Bug target/103069] cmpxchg isn't optimized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069 --- Comment #15 from Hongyu Wang --- (In reply to Thiago Macieira from comment #14) > I'd restrict relaxations to loops emitted by the compiler. All other atomic > operations shouldn't be modified at all, unless the user asks for it. That > includes non-looping atomic operations (like LOCK BTC, LOCK XADD) as well as > a pure LOCK CMPXCHG that came from a single __atomic_compare_exchange by the > user. > > I'd welcome the ability to relax the latter, especially if with one codebase > I could be efficient in CAS architectures as well as LL/SC ones. The latest patch relaxed the pure LOCK CMPXCHG with -mrelax-cmpxchg-loop as the commit message shows. So if you want, I can split this part to another switch like -mrelax-cmpxchg-insn.
[Bug target/103069] cmpxchg isn't optimized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069 --- Comment #13 from Hongyu Wang --- All of the glibc cases above are now relaxed under -mrelax-cmpxchg-loop by a load/cmp that skips the cmpxchg, but for

> do
> {
> flags = THREAD_GETMEM (self, cancelhandling);
> newval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling,
> flags & ~SETXID_BITMASK, flags);
> }
> while (flags != newval);

if we want to optimize it to lock btc, we need to know that the cmpxchg lies in a loop. That may require an extra pass for further analysis and optimization, which is not a good idea in stage 4.
[Bug target/103069] cmpxchg isn't optimized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069 --- Comment #11 from Hongyu Wang --- For the case with atomic_compare_exchange_weak_release, it can be expanded as

loop:
        mov    %eax,%r8d
        and    $0xfff8,%r8d
        mov    (%r8),%rsi       <--- load lock first
        cmp    %rsi,%rax        <--- compare with expected input
        jne    .L2              <--- lock ne expected
        lock cmpxchg %r8d,(%rdi)
        mov    %rsi,%rax        <--- perform the behavior of failed cmpxchg
        jne    loop

But this is not suitable for atomic_compare_exchange_strong, as the documentation says:

  Unlike atomic_compare_exchange_weak, this strong version is required to always return true when expected indeed compares equal to the contained object, not allowing spurious failures.

If we expand cmpxchg as above, it would result in spurious failures, since the load is not atomic. So for

  do
    pd->nextevent = __nptl_last_event;
  while (atomic_compare_and_exchange_bool_acq (&__nptl_last_event, pd, pd->nextevent));

which invokes atomic_compare_exchange_strong, we cannot simply adjust the expander. It is better to know that the call is in a loop condition and relax it accordingly.
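The load-then-compare shape of the weak expansion above can be expressed at the source level with C11 atomics. This is a rough sketch only: the function name is hypothetical, it operates on a plain _Atomic long rather than on the objects in the glibc code, and it illustrates the idea rather than what the compiler emits under -mrelax-cmpxchg-loop.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Weak compare-and-swap with a plain load in front of it: when the
   current value already differs from *expected, report failure the
   way a failed cmpxchg would (by updating *expected) without
   issuing the locked instruction at all.  */
bool
relaxed_cas_weak (_Atomic long *obj, long *expected, long desired)
{
  long current = atomic_load_explicit (obj, memory_order_relaxed);
  if (current != *expected)
    {
      *expected = current;  /* mimic the result of a failed cmpxchg */
      return false;
    }
  return atomic_compare_exchange_weak_explicit (obj, expected, desired,
                                                memory_order_release,
                                                memory_order_relaxed);
}
```

A caller would use it in the usual retry loop, recomputing the desired value from the updated *expected on each failure.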
[Bug target/103771] New: Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
Bug ID: 103771
Summary: Missed vectorization under -mavx512f -mavx512vl after r12-5489
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: wwwhhhyyy333 at gmail dot com
Target Milestone: ---

cat vect.c

typedef unsigned char uint8_t;

static uint8_t x264_clip_uint8( int x )
{
  return x&(~255) ? (-x)>>31 : x;
}

void mc_weight( uint8_t * __restrict dst, uint8_t * __restrict src, int i_width, int i_scale)
{
  for( int x = 0; x < i_width; x++ )
    dst[x] = x264_clip_uint8(src[x] * i_scale);
}

It cannot be vectorized with -mavx512f -mavx512vl, but can be vectorized with -mavx2. See https://godbolt.org/z/M1jx161f6

The commit https://gcc.gnu.org/cgi-bin/gcc-gitref.cgi?r=r12-5489 converts (x & (~255)) == 0 to x <= 255, which may trigger some missing pattern with -mavx512vl.

Also, a 1.5% regression was found with -march=cascadelake due to the missing 128-bit epilogue for this loop.
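The rewrite mentioned in this report relies on the mask test and the range test being equivalent. The sketch below checks that in plain C over a range of inputs; the function names are hypothetical, and the comparison is written as an unsigned one, which is an assumption needed for the two forms to agree on negative inputs.

```c
/* Mask form used by the testcase: true when no bit above bit 7 is set.  */
int
fits_by_mask (int x)
{
  return (x & ~255) == 0;
}

/* Comparison form, done as unsigned so negative values compare large.  */
int
fits_by_compare (int x)
{
  return (unsigned) x <= 255u;
}

/* Returns 1 when both forms agree for every x in [lo, hi].  */
int
forms_agree (int lo, int hi)
{
  /* long long counter avoids overflow when hi is INT_MAX.  */
  for (long long x = lo; x <= hi; x++)
    if (fits_by_mask ((int) x) != fits_by_compare ((int) x))
      return 0;
  return 1;
}
```

Both forms accept exactly the values 0 through 255, so either can guard the clipping branch in x264_clip_uint8.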
[Bug target/103571] ABI: V2HF, V4HF and V8HFmode argument passing issues
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103571 Hongyu Wang changed: What|Removed |Added CC||wwwhhhyyy333 at gmail dot com --- Comment #3 from Hongyu Wang --- (In reply to Hongtao.liu from comment #2) > > > > Also, baz iz highly un-optimal for 32bit targets. > > Yes, it needs to be fixed, note w/ -mavx512fp16 codegen for baz is optimal > on 32-bit target, maybe related to vector_mode_supported_p, but then why > codegen for baz on 64-bit target is optimal w/o TARGET_AVX512FP16? For V8HFmode that is unsupported in VALID_SSE2_REG_MODE, function_value_32 has return gen_rtx_REG (orig_mode, regno); so the retval is (reg:BLK 20 xmm0). while function_value_64 uses construct_container and returns (parallel:BLK [ (expr_list:REG_DEP_TRUE (reg:V8HF 20 xmm0) (const_int 0 [0])) ]) This could be optimized to simple movaps finally. So we may need to support V8HFmode in VALID_SSE2_REG_MODE if we don't want to modify those function_args and function_value stuff.
[Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066 --- Comment #1 from Hongyu Wang --- __sync_val_compare_and_swap will be expanded to atomic_compare_exchange_strong by default, should we restrict the check and return under atomic_compare_exchange_weak which is allowed to fail spuriously?
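To make the strong/weak distinction in this comment concrete, here is a small sketch using the GCC __atomic builtins. The wrapper names cas_strong and fetch_add_weak are hypothetical; only the builtins themselves are real.

```c
#include <stdbool.h>

/* Strong compare-and-swap: fails only when *p differs from expected. */
bool cas_strong(int *p, int expected, int desired)
{
  return __atomic_compare_exchange_n(p, &expected, desired,
                                     false /* strong */,
                                     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

/* Weak compare-and-swap may fail spuriously, so it is used in a retry loop.
   On failure the builtin writes the current value of *p back into old. */
int fetch_add_weak(int *p, int delta)
{
  int old = __atomic_load_n(p, __ATOMIC_RELAXED);
  while (!__atomic_compare_exchange_n(p, &old, old + delta,
                                      true /* weak */,
                                      __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
    ;  /* retry with the refreshed value in old */
  return old;
}
```

With the strong form a failed exchange tells the caller that the value really differed, which is what code written against __sync_val_compare_and_swap expects. With the weak form a failure carries no such guarantee, which is why the question of where an early check-and-return is valid matters.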
[Bug target/102812] Unoptimal (and wrong) code for _Float16 insert
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102812 --- Comment #3 from Hongyu Wang --- (In reply to Uroš Bizjak from comment #2) > Please note that the code above should compile via ix86_expand_vector_set, > similar to: > > --cut here-- > typedef short v8hi __attribute__((__vector_size__(16))); > > v8hi foo (short a) > { > return (v8hi) {a, 0, 0, 0, 0, 0, 0, 0 }; > } > --cut here-- > > that results in: > > vpxor %xmm0, %xmm0, %xmm0 > vpinsrw $0, %edi, %xmm0, %xmm0 > ret Currently we have if (TARGET_AVX512FP16 && VALID_AVX512FP16_REG_MODE (mode)) return true; in ix86_vector_mode_supported_p, so for SSE2 target V8HFmode would be returned in BLKmode. After I put V8HFmode to VALID_SSE2_REG_MODE the code would be like vmovss %xmm0, %xmm0, %xmm1 vpxor %xmm0, %xmm0, %xmm0 pextrw $0, %xmm1, -10(%rsp) vpinsrw $0, -10(%rsp), %xmm0, %xmm0 Seems IRA spills the HF reg to memory.. I wonder whether we should move vector mode support to sse2 for now, as we don't have sufficient HF vector arithmetic emulation for non-avx512fp16 target.
[Bug target/102835] gcc.target/i386/avx512fp16-trunchf.c FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102835 --- Comment #1 from Hongyu Wang --- (In reply to Rainer Orth from comment #0) > > I wonder what's the best way to handle the difference? Just add > -fomit-frame-pointer > to the testcase or allow for the %ebp vs. %esp difference? For this test we just want to check mnemonics are properly generated, so I think we can allow either esp/ebp output for different system.
[Bug target/102806] New: [x86] Suboptimal codegen for v4hi vector concat under -mavx512bw and -mavx512vl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102806

Bug ID: 102806
Summary: [x86] Suboptimal codegen for v4hi vector concat under -mavx512bw and -mavx512vl
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: wwwhhhyyy333 at gmail dot com
Target Milestone: ---

For

typedef short v8hi __attribute__((vector_size (16)));
typedef short v4hi __attribute__((vector_size (8)));

v8hi foov (v4hi a, v4hi b)
{
  return __builtin_shufflevector (a, b, 0, 1, 2, 3, 4, 5, 6, 7);
}

gcc -O2 -mavx512vl -mavx512bw:

        vmovq   %xmm0, %xmm2
        vmovq   %xmm1, %xmm1
        vmovdqa .LC0(%rip), %xmm0
        vpermi2w        %xmm1, %xmm2, %xmm0
        ret

While clang with the same options:

        vmovlhps        %xmm1, %xmm0, %xmm0     # xmm0 = xmm0[0],xmm1[0]
        retq

It looks like the expansion order for permutations should be adjusted.
[Bug tree-optimization/101993] Potential vectorization opportunity when condition checks array address
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101993 --- Comment #2 from Hongyu Wang --- (In reply to Richard Biener from comment #1) > We can vectorize this with masked moves when using AVX2. clang seems to > simply remove the test completely - C seems to guarantee that a + i is a > valid pointer > if any of a + i is accessed and thus a + i is never NULL. > > But then - just don't write such stupid checks? What real-world code was > this testcase created from? It came from 538.imagick_r --- #define GetPixelIndex(indexes) \ ((indexes == (const unsigned short *) NULL) ? 0 : (*(indexes))) for (v=0; v < (ssize_t) kernel->height; v++) { for (u=0; u < (ssize_t) kernel->width; u++, k--) { if ( IsNaN(*k) ) continue; result.red += (*k)*k_pixels[u].red; result.green += (*k)*k_pixels[u].green; result.blue+= (*k)*k_pixels[u].blue; result.opacity += (*k)*k_pixels[u].opacity; if ( image->colorspace == CMYKColorspace) result.index += (*k)*GetPixelIndex(k_indexes+u); } k_pixels += virt_width; k_indexes += virt_width; } --- I extracted it to a small test in https://godbolt.org/z/G5h6nWvb5 which can be vectorized by clang but not gcc due to such pattern. > > There is currently no optimization phase that would use loop info to elide > NULL pointer checks and I'm not sure where I'd put such. Note the argument > for > GCC would be that the access *(a + i) infers that a + i does not "overflow" > to another object (including NULL). That's sth the points-to solver would > assume here (but the points-to solver is bad at tracking NULL).
[Bug tree-optimization/101993] New: Potential vectorization opportunity when condition checks array address
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101993

Bug ID: 101993
Summary: Potential vectorization opportunity when condition checks array address
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: wwwhhhyyy333 at gmail dot com
Target Milestone: ---

For

float foo(int * restrict a, int * restrict res, int n)
{
  int i;
  for (i = 0; i < 8; i++)
    {
      if (a + i)
        res[i] = *(a + i) * 2;
    }
}

Compile with -O3

Clang generates

foo:                                    # @foo
        testq   %rdi, %rdi
        je      .LBB0_2
        movdqu  (%rdi), %xmm0
        paddd   %xmm0, %xmm0
        movdqu  %xmm0, (%rsi)
.LBB0_2:
        retq

While GCC generates

foo:
        testq   %rdi, %rdi
        je      .L5
        movl    (%rdi), %eax
        leaq    8(%rdi), %rdx
        addl    %eax, %eax
        movl    %eax, (%rsi)
        movl    4(%rdi), %eax
        addl    %eax, %eax
.L3:
        movl    %eax, 4(%rsi)
        movl    (%rdx), %eax
        addl    %eax, %eax
        movl    %eax, 8(%rsi)
        movl    12(%rdi), %eax
        addl    %eax, %eax
        movl    %eax, 12(%rsi)
        ret
.L5:
        movl    4, %eax
        movl    $8, %edx
        addl    %eax, %eax
        jmp     .L3

If a is 0 or negative then it should be an invalid pointer. Clang seems to rely on that assumption: it tests a once and then optimizes the loop body. Is it possible for GCC to do such an optimization?
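The transformation clang appears to perform can be written by hand as follows. This is only a sketch of the idea, not what either compiler emits internally; scale_checked and scale_hoisted are hypothetical names, and the void return type is a simplification of the testcase.

```c
/* As in the report: the address test is repeated for every element. */
void scale_checked(int *restrict a, int *restrict res)
{
  for (int i = 0; i < 8; i++)
    if (a + i)
      res[i] = *(a + i) * 2;
}

/* Hand-hoisted form: test the base pointer once, then run a
   branch-free loop that is straightforward to vectorize. */
void scale_hoisted(int *restrict a, int *restrict res)
{
  if (!a)
    return;
  for (int i = 0; i < 8; i++)
    res[i] = a[i] * 2;
}
```

For any valid a the two functions store the same values. The hoisted form is only legitimate under the assumption discussed in this bug, namely that a + i cannot be null once a points to an object that is accessed.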
[Bug target/101395] [11/12 regression] Compile failure with -march=native -m32 on sapphirerapids
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101395 --- Comment #10 from Hongyu Wang --- (In reply to H.J. Lu from comment #9) > Created attachment 51143 [details] > A patch > > Try this instead. This also works.
[Bug target/101395] [11/12 regression] Compile failure with -march=native -m32 on sapphirerapids
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101395 --- Comment #4 from Hongyu Wang --- (In reply to H.J. Lu from comment #3) > Created attachment 51125 [details] > An updated patch This works, thanks.
[Bug target/101395] [11/12 regression] Compile failure with -march=native -m32 on sapphirerapids
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101395 --- Comment #2 from Hongyu Wang --- (In reply to H.J. Lu from comment #1) > Created attachment 51124 [details] > A patch > > Please test this patch. It doesn't work. I use ./sde-external-8.63.0-2021-01-18-lin/sde -spr -- gcc test.c -march=native -m32 to verify it.
[Bug target/101395] New: Compile failure with -march=native -m32 on sapphirerapids
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101395

Bug ID: 101395
Summary: Compile failure with -march=native -m32 on sapphirerapids
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: wwwhhhyyy333 at gmail dot com
Target Milestone: ---

cat test.c

int main() { return 0; }

On a Sapphire Rapids machine, gcc test.c -march=native -m32 fails with

cc1: error: ‘-muintr’ not supported for 32-bit code
[Bug tree-optimization/98176] Loop invariant memory could not be hoisted when nonpure_call in loop body
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98176 --- Comment #9 from Hongyu Wang --- (In reply to Richard Biener from comment #8) > I'm failing to reproduce with the sincos example since sincos is transformed > to __builtin_cexpi for me. When using I always generate sincosf with g++ -Ofast -fopenmp-simd -std=c++11, perhaps it is related to libm? I'm using RHEL8 with glibc 2.28. > so I don't think it buys us anything to handle calls yet. sincos would > also be considered as possibly not returning. > Perhaps, since the sincosf case could only be vectorized with #pragma omp simd. But I think it is better to allow those functions with libmvec implementation if the input params are proved to be safe (such as local variables).
[Bug target/101276] [i386] Keylocker output should be cleared when instruction reports runtime error.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101276 Hongyu Wang changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #4 from Hongyu Wang --- Fixed by https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=1aeefa5720a71e622e2f26bf10ec8e7ecbd76f4c
[Bug target/101276] New: [i386] Keylocker output should be cleared when instruction reports runtime error.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101276

Bug ID: 101276
Summary: [i386] Keylocker output should be cleared when instruction reports runtime error.
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: wwwhhhyyy333 at gmail dot com
Target Milestone: ---

Some Key Locker instructions set ZF when a runtime error occurs, and in that case the output data should be treated as invalid. The current intrinsics copy the data to the output regardless of ZF, like

        movdqa  k2(%rip), %xmm0
        aesdec128kl h1(%rip), %xmm0
        sete    %al
        movups  %xmm0, k1(%rip)

This is a safety issue: the unencrypted data is returned when a runtime error occurs. So the code should be like

        movdqa  k2(%rip), %xmm0
        aesdec128kl h1(%rip), %xmm0
        je      .L4
.L2:
        sete    %al
        movups  %xmm0, k1(%rip)
        ret
.L4:
        pxor    %xmm0, %xmm0
        jmp     .L2

so that the output data is cleared.
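The requested behaviour can be modelled in portable C. This is a sketch only: model_aesdec128kl is a hypothetical stand-in for the hardware instruction (the XOR is a placeholder, not AES), and block128 and checked_decrypt are made-up names, not the real intrinsic interface.

```c
#include <string.h>

typedef struct { unsigned char b[16]; } block128;

/* Hypothetical stand-in for the instruction: transforms *data in place
   and returns the ZF value (1 = runtime error).  On error the data is
   left untouched, mirroring the first asm sequence in the report. */
static int model_aesdec128kl(block128 *data, int handle_ok)
{
  if (!handle_ok)
    return 1;
  for (int i = 0; i < 16; i++)
    data->b[i] ^= 0x5a;          /* placeholder transform */
  return 0;
}

/* Wrapper with the behaviour asked for in the report: when ZF is set,
   the output is zeroed instead of exposing the untouched input. */
unsigned char checked_decrypt(block128 *out, const block128 *in, int handle_ok)
{
  block128 tmp = *in;
  int zf = model_aesdec128kl(&tmp, handle_ok);
  if (zf)
    memset(&tmp, 0, sizeof tmp);
  *out = tmp;
  return (unsigned char) zf;
}
```

The branch on zf corresponds to the je .L4 / pxor path in the second asm sequence: the status is still returned to the caller, but the data written on failure is all zeros.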
[Bug tree-optimization/98339] New: GCC could not vectorize loop with conditional reduced add and store
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98339

Bug ID: 98339
Summary: GCC could not vectorize loop with conditional reduced add and store
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: wwwhhhyyy333 at gmail dot com
Target Milestone: ---

For testcase

void foo( int* restrict x, int n, int start, int m, int* restrict ret )
{
  for (int i = 0; i < n; i++)
    {
      int pos = start + i;
      if ( pos <= m)
        ret[0] += x[i];
    }
}

With -O3 -mavx2 it cannot be vectorized, because ret[0] += x[i] becomes a zero-step MASK_STORE inside the loop and data-reference analysis fails for zero-step stores.

But with loop store motion applied manually,

void foo2( int* restrict x, int n, int start, int m, int* restrict ret )
{
  int tmp = 0;
  for (int i = 0; i < n; i++)
    {
      int pos = start + i;
      if (pos <= m)
        tmp += x[i];
    }
  ret[0] += tmp;
}

it can be vectorized.

godbolt: https://godbolt.org/z/Kcv8hP

There is no LIM pass between ifcvt and vect, and the current LIM cannot handle MASK_STORE. Is there any possibility to vectorize foo, for example by doing loop store motion in ifcvt instead of creating a MASK_STORE?
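One way to picture the form a vectorizer can handle is a reduction whose condition only selects the addend. The sketch below is an illustration, not GCC's internal representation; foo_ref restates the testcase and masked_sum is a hypothetical name.

```c
/* The testcase from the report: conditional update of ret[0] in the loop. */
void foo_ref(int *restrict x, int n, int start, int m, int *restrict ret)
{
  for (int i = 0; i < n; i++)
    {
      int pos = start + i;
      if (pos <= m)
        ret[0] += x[i];
    }
}

/* Store-motion plus select form: the accumulator lives in a local and the
   condition picks x[i] or 0, so the loop body contains no store at all. */
int masked_sum(const int *restrict x, int n, int start, int m)
{
  int tmp = 0;
  for (int i = 0; i < n; i++)
    tmp += (start + i <= m) ? x[i] : 0;
  return tmp;
}
```

A caller would write ret[0] += masked_sum(x, n, start, m) to get the same result as the testcase, with the single store to ret[0] happening after the loop.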
[Bug tree-optimization/98176] Loop invariant memory could not be hoisted when nonpure_call in loop body
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98176 --- Comment #7 from Hongyu Wang --- (In reply to Richard Biener from comment #5) > Yes. > > For a LIM testcase an example with a memcpy might be more practically > relevant. > > For refactoring I'd start with classifying the unanalyzable refs as > separate ref ID, marking it with another bit like ref_unanalyzed in > in_mem_ref and asserting there's a single access of such refs. > The mem_refs_may_alias_p code then needs to use stmt-based alias > queries instead of refs_may_alias_p_1 using accesses_in_loop[0]->stmt. > > And code testing for UNANALYZABLE_MEM_ID now needs to look at the > ref_unanalyzed flag to not consider those refs for transforms. > > Note this may blow up the memory requirements for testcases with lots > of "unanalyzable" refs. > > The nonpure-call code is more difficult to improve, even sincos can not > return > when the access to s or c traps. Analyzing the arguments might help here. > If you disregard that detail I think all ECF_LEAF|ECF_NOTHROW functions > return normally. Thanks for the suggestion, I did some refactor accordingly and this case could be vectorized. diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index 92e5a8dd774..3e3e81bc36f 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -119,6 +119,8 @@ public: (its index in memory_accesses.refs_list) */ unsigned ref_canonical : 1; /* Whether mem.ref was canonicalized. */ unsigned ref_decomposed : 1; /* Whether the ref was hashed from mem. */ + unsigned ref_unanalyzed : 1; /* Whether the ref was unanalyzed memory. */ + hashval_t hash; /* Its hash value. */ /* The memory access itself and associated caching of alias-oracle @@ -260,7 +262,14 @@ static bool refs_independent_p (im_mem_ref *, im_mem_ref *, bool = true); #define UNANALYZABLE_MEM_ID 0 /* Whether the reference was analyzable. 
*/ -#define MEM_ANALYZABLE(REF) ((REF)->id != UNANALYZABLE_MEM_ID) +#define MEM_ANALYZABLE(REF) ((REF)->id != UNANALYZABLE_MEM_ID \ +&& !(REF)->ref_unanalyzed) + +#define REF_ID_UNANALYZABLE(id) \ + (id == UNANALYZABLE_MEM_ID \ + || ((memory_accesses.refs_list[id]) \ + && (memory_accesses.refs_list[id]->ref_unanalyzed)) \ + ) static struct lim_aux_data * init_lim_data (gimple *stmt) @@ -829,7 +838,8 @@ set_profitable_level (gimple *stmt) set_level (stmt, gimple_bb (stmt)->loop_father, get_lim_data (stmt)->max_loop); } -/* Returns true if STMT is a call that has side effects. */ +/* Returns true if STMT is a call that has side effects, or it is + not a function call with ECF_LEAF | ECF_NOTHROW. */ static bool nonpure_call_p (gimple *stmt) @@ -837,6 +847,11 @@ nonpure_call_p (gimple *stmt) if (gimple_code (stmt) != GIMPLE_CALL) return false; + /* Simplified here, better to analyze call parameter. */ + int flags = gimple_call_flags (stmt); + if (flags & (ECF_LEAF | ECF_NOTHROW)) +return false; + return gimple_has_side_effects (stmt); } @@ -1377,6 +1392,7 @@ mem_ref_alloc (ao_ref *mem, unsigned hash, unsigned id) ref->id = id; ref->ref_canonical = false; ref->ref_decomposed = false; + ref->ref_unanalyzed = false; ref->hash = hash; ref->stored = NULL; ref->loaded = NULL; @@ -1461,9 +1477,13 @@ gather_mem_refs_stmt (class loop *loop, gimple *stmt) mem = simple_mem_ref_in_stmt (stmt, _stored); if (!mem) { - /* We use the shared mem_ref for all unanalyzable refs. */ - id = UNANALYZABLE_MEM_ID; - ref = memory_accesses.refs_list[id]; + /* Mark unanaylzable refs with different id and skip analysis. 
*/ + id = memory_accesses.refs_list.length (); + ref = mem_ref_alloc (NULL, 0, id); + ref->ref_unanalyzed = true; + memory_accesses.refs_list.safe_push (ref); + record_mem_ref_loc (ref, stmt, NULL); + if (dump_file && (dump_flags & TDF_DETAILS)) { fprintf (dump_file, "Unanalyzed memory reference %u: ", id); @@ -1576,7 +1596,7 @@ gather_mem_refs_stmt (class loop *loop, gimple *stmt) mark_ref_stored (ref, loop); } /* A not simple memory op is also a read when it is a write. */ - if (!is_stored || id == UNANALYZABLE_MEM_ID) + if (!is_stored || REF_ID_UNANALYZABLE (id)) { bitmap_set_bit (_accesses.refs_loaded_in_loop[loop->num], ref->id); mark_ref_loaded (ref, loop); @@ -1701,6 +1721,31 @@ mem_refs_may_alias_p (im_mem_ref *mem1, im_mem_ref *mem2, poly_widest_int size1, size2; aff_tree off1, off2; + /* For refs marked as unanalyzed, use stmt_based alias analysis + and returns false when one mem_ref used by this unanalyzed stmt*/ + if (mem1->ref_unanalyzed + || mem2->ref_unanalyzed) +{ + if (mem1->ref_unanalyzed + &&
[Bug tree-optimization/98176] Loop invariant memory could not be hoisted when nonpure_call in loop body
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98176

--- Comment #6 from Hongyu Wang ---
(In reply to Richard Biener from comment #5)
> (In reply to Hongyu Wang from comment #4)
> > (In reply to Richard Biener from comment #3)
> >
> > > I see ret[0] has store-motion applied. You don't see it vectorized
> > > because GCC doesn't know how to vectorize sincos (or cexpi which is
> > > what it lowers it to).
> >
> > I doubt so, after manually store motion
> >
> > #include <math.h>
> >
> > float foo(
> > int *x,
> > int n,
> > float tx
> > )
> > {
> > float ret[n];
> > float tmp;
> >
> > #pragma omp simd
> > for (int i = 0; i < n; i++)
> > {
> > float s, c;
> >
> > sincosf( tx * x[i] , &s, &c );
> >
> > tmp += s*c;
> > }
> >
> > ret[0] += tmp;
> >
> > return ret[0];
> > }
> >
> > with -Ofast -fopenmp-simd -std=c++11 it could be vectorized to call
> > _ZGVbN4vvv_sincosf
> >
> > ret[0] is moved for sinf() case, but not sincosf() with above options.
>
> What target are you targeting? Can you provide the sincosf prototype
> from your math.h? (please attach preprocessed source).
>
> I cannot reproduce sincosf _not_ being lowered to cexpif and thus
> no longer having memory writes.
>

I used g++ on godbolt: https://gcc.godbolt.org/z/rv45MK

The extern declaration below is sufficient for g++ to vectorize the code

__attribute__ ((__simd__ ("notinbranch")))
extern void sincosf (float __x, float *__sinx, float *__cosx);

compiled with -Ofast -fopenmp-simd -std=c++11 -march=x86-64
[Bug tree-optimization/98176] Loop invariant memory could not be hoisted when nonpure_call in loop body
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98176

--- Comment #4 from Hongyu Wang ---
(In reply to Richard Biener from comment #3)
> I see ret[0] has store-motion applied. You don't see it vectorized
> because GCC doesn't know how to vectorize sincos (or cexpi which is
> what it lowers it to).

I doubt it. After manual store motion,

#include <math.h>

float foo(
    int *x,
    int n,
    float tx
)
{
  float ret[n];
  float tmp;

#pragma omp simd
  for (int i = 0; i < n; i++)
    {
      float s, c;

      sincosf( tx * x[i] , &s, &c );

      tmp += s*c;
    }

  ret[0] += tmp;

  return ret[0];
}

with -Ofast -fopenmp-simd -std=c++11 it can be vectorized to call _ZGVbN4vvv_sincosf.

ret[0] is moved for the sinf() case, but not for sincosf() with the above options.

>
> If you replace sincosf with a random call then you'll hit the issue
> that LIMs dependence analysis doesn't handle it at all since it cannot
> represent it. That will block further optimization in the loop.
>
> That can possibly be improved.
>

So could LIM's dependence analysis handle known library functions and just analyze their memory parameters? A random call may have unknown behavior.

> > if (nonpure_call_p (stmt))
> > {
> > maybe_never = true;
> > outermost = NULL;
> > }
> >
> > So no store-motion chance for any future statement in such block.
>
> That's another issue - the call may not return. Here the granularity
> is per BB and thus loads/stores in the same BB are not considered for
> sinking.
>

IMHO the condition may be too strict for known library calls.
[Bug tree-optimization/98176] Loop invariant memory could not be hoisted when nonpure_call in loop body
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98176

--- Comment #2 from Hongyu Wang ---
>> I doubt the call is the issue btw.

The aliasing could be removed by

float foo(int *x, int n, float tx)
{
  float ret[n];

#pragma omp simd
  for (int i = 0; i < n; i++)
    {
      float s, c;
      s = c = tx * x[i];
      ret[0] += s*c;
    }

  return ret[0];
}

This is successfully vectorized, and the dump from lim2 has:

Moving statement
ret.1__I_lsm.7 = (*ret.1_18)[0];

But for

float foo(int *x, int n, float tx)
{
  float ret[n];

#pragma omp simd
  for (int i = 0; i < n; i++)
    {
      float s, c;
      sincosf( tx * x[i] , &s, &c );
      ret[0] += s*c;
    }

  return ret[0];
}

it still cannot be vectorized. I did some initial debugging and saw that tree-ssa-loop-im.c has

  if (nonpure_call_p (stmt))
    {
      maybe_never = true;
      outermost = NULL;
    }

So there is no chance of store motion for any later statement in such a block.

As a comparison, this can also be vectorized with a simd clone:

float foo(int *x, int n, float tx)
{
  float ret[n];

#pragma omp simd
  for (int i = 0; i < n; i++)
    {
      float s, c;
      s = c = sinf( tx * x[i]);
      ret[0] += s*c;
    }

  return ret[0];
}
[Bug tree-optimization/98176] New: Loop invariant memory could not be hoisted when nonpure_call in loop body
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98176

Bug ID: 98176
Summary: Loop invariant memory could not be hoisted when nonpure_call in loop body
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: wwwhhhyyy333 at gmail dot com
Target Milestone: ---

For testcase

#include <math.h>

void foo(float *x, float tx, float *ret, int n)
{
#pragma omp simd
  for (int i = 0; i < n; i++)
    {
      float s,c;
      sincosf(x[i] * tx, &s, &c);
      *ret += s * c;
    }
}

It cannot be vectorized with -Ofast -fopenmp-simd -std=c++11

https://gcc.godbolt.org/z/ba77az

With the store hoisted manually, it can be vectorized with a simd clone:

void foo(float *x, float tx, float *ret, int n)
{
  float tmp = 0.0f;
#pragma omp simd
  for (int i = 0; i < n; i++)
    {
      float s,c;
      sincosf(x[i] * tx, &s, &c);
      tmp += s*c;
    }
  *ret += tmp;
}

https://gcc.godbolt.org/z/bea17x

Is it possible for LIM to perform store motion in a case like this?
[Bug target/97231] Missing FSF copyright notes for some x86 intrinsic headers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97231 --- Comment #1 from Hongyu Wang --- Created attachment 49280 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49280=edit A patch
[Bug target/97231] New: Missing FSF copyright notes for some x86 intrinsic headers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97231

Bug ID: 97231
Summary: Missing FSF copyright notes for some x86 intrinsic headers
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: wwwhhhyyy333 at gmail dot com
Target Milestone: ---

Many x86 intrinsic header files don't have the FSF copyright notice:

amxbf16intrin.h
amxint8intrin.h
amxtileintrin.h
avx512vp2intersectintrin.h
avx512vp2intersectvlintrin.h
pconfigintrin.h
tsxldtrkintrin.h
wbnoinvdintrin.h