[Bug target/67325] Optimize shift (aka subreg) of load to simple load

2024-05-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67325

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org
 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #7 from Hongtao Liu  ---
Fixed in GCC15.

[Bug target/115146] [15 Regression] Incorrect 8-byte vectorization: psrlw/psraw confusion

2024-05-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115146

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #13 from Hongtao Liu  ---
Fixed.

[Bug target/115161] highway-1.0.7 miscompilation of _mm_cvttps_epi32(): invalid result assumed

2024-05-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #25 from Hongtao Liu  ---
(In reply to Jakub Jelinek from comment #17)
> I don't think the cost of using UNSPEC would be significant if the backend
> tried to constant fold more target builtins.  Anyway, with the proposed
> changes perhaps you could keep using FIX/UNSIGNED_FIX for flag_trapping_math
> case even for the intrinsics and use UNSPECs only for !flag_trapping_math.

OK, we'll refactor all {V,}CVTT* instructions to use UNSPEC instead of
FIX/UNSIGNED_FIX.

[Bug target/114148] gcc.target/i386/pr106010-7b.c FAILs

2024-05-24 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114148

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #7 from Hongtao Liu  ---
Fixed in GCC15.

[Bug target/114148] gcc.target/i386/pr106010-7b.c FAILs

2024-05-23 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114148

--- Comment #4 from Hongtao Liu  ---
(In reply to r...@cebitec.uni-bielefeld.de from comment #3)
> To investigate further, I've added comparison functions to a reduced
> version of pr106010-7b.c, with
> 
> void
> cmp_epi8 (_Complex unsigned char* a, _Complex unsigned char* b)
> {
>   for (int i = 0; i != N; i++)
> if (a[i] != b[i])
>   {
>   printf ("cmp_epi8: i = %d: a[i].r = %x a[i].i = %x b[i].r = %x b[i].i =
> %x\n",
>   i, __real__ a[i], __imag__ a[i], __real__ b[i], __imag__ b[i]);
>   }
> }
> 
> This shows (when using _Complex unsigned char since on Solaris, char is
> signed by default)
> 
> cmp_epi8: i = 76: a[i].r = 0 a[i].i = 0 b[i].r = 88 b[i].i = 33
> cmp_epi8: i = 77: a[i].r = 0 a[i].i = 0 b[i].r = 6 b[i].i = 8
> cmp_epi8: i = 80: a[i].r = 0 a[i].i = 0 b[i].r = 3 b[i].i = 0
> 
> I've also noticed that the test result depends on the implementation of
> malloc used:
> 
> * With Solaris libc malloc, libmalloc, and watchmalloc, the test aborts.
> 
> * However, when using one of libbsdmalloc, libmapmalloc, libmtmalloc, or
>   libumem, the test works.
> 
> However, ISTM that the test is simply bogus: in avx_test
> 
> * epi8_src, epi8_dst are allocated with malloc, buffer contents undefined
> 
> * epi8_dst is cleared
> 
> * epi8_dst is initialized from p_init
> 
> * in foo_epi8, epi8_src[0] (an undefined value) is copied into first N
>   elements of epi8_dst
> 
> * epi8_dst is compared to epi8_src (with the remaining members of epi8_src
>   still undefined)
Oops, does the patch below fix the testcase on Solaris/x86?

   memcpy (pd_src, p_init, 2 * N * sizeof (double));
-  memcpy (ps_dst, p_init, 2 * N * sizeof (float));
-  memcpy (epi64_dst, p_init, 2 * N * sizeof (long long));
-  memcpy (epi32_dst, p_init, 2 * N * sizeof (int));
-  memcpy (epi16_dst, p_init, 2 * N * sizeof (short));
-  memcpy (epi8_dst, p_init, 2 * N * sizeof (char));
+  memcpy (ps_src, p_init, 2 * N * sizeof (float));
+  memcpy (epi64_src, p_init, 2 * N * sizeof (long long));
+  memcpy (epi32_src, p_init, 2 * N * sizeof (int));
+  memcpy (epi16_src, p_init, 2 * N * sizeof (short));
+  memcpy (epi8_src, p_init, 2 * N * sizeof (char));

> 
> Why on earth would the rest of epi8_dst and epi8_src be identical if
> epi8_src was never initialized?
My guess: epi8_src happens to be all zero, and epi8_dst is set to epi8_src[0] by foo_epi8.

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-22 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #16 from Hongtao Liu  ---

> 
> That said, this change really won't help the backend which supposedly should
> have the same behavior regardless of -fno-trapping-math, because in that
> case it is the value
> of the result (which is unspecified by the standards) rather than whether an
> exception is triggered or not.
First, I agree with you: they are two separate issues.
What I proposed is just an attempt to find a balance that avoids rewriting all
cvtt* instructions with UNSPEC, because that would lose a lot of optimization
opportunities.
If the folding can be restricted under flag_trapping_math, at least those
intrinsics are fine at -O2/-O3.
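
For illustration, a minimal sketch (not the reporter's highway code) of the kind of
intrinsic whose compile-time folding is at issue; on x86 the hardware truncating
conversion returns the "integer indefinite" value 0x80000000 for out-of-range lanes,
so folding it to any other constant changes the observable result at -O2/-O3:

#include <immintrin.h>
#include <stdio.h>

int
main (void)
{
  /* 2^31 is representable as float but does not fit in int32, so the
     hardware _mm_cvttps_epi32 yields 0x80000000 in every lane.  */
  __m128 v = _mm_set1_ps (2147483648.0f);
  __m128i r = _mm_cvttps_epi32 (v);
  int out[4];
  _mm_storeu_si128 ((__m128i *) out, r);
  printf ("%d\n", out[0]);   /* -2147483648 when not mis-folded */
  return 0;
}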

[Bug target/115069] [14/15 regression] 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform

2024-05-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

Hongtao Liu  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #23 from Hongtao Liu  ---
Fixed in GCC15 and GCC14.2

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #11 from Hongtao Liu  ---
(In reply to Jakub Jelinek from comment #10)
> Any of the floating point to integer intrinsics if they have out of range
> value (haven't checked whether floating point to unsigned intrinsic is a
> problem too or not).
> No matter if it is float or double (dunno if _Float16 too, or __bf16), and
> no matter if it is scalar intrinsic (ss/sd etc.) or vector and how many
> vector elements.
> But, this isn't really a regression, GCC has always behaved that way, the
> only thing that actually changed is that perhaps we can constant fold more
> than we used to do in the past.
> When not using intrinsics, IMNSHO we should keep doing what we did before.

Can we restrict them under flag_trapping_math?

i.e.

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 53f54d1d392..b7a770dad60 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -2256,14 +2256,25 @@ simplify_const_unary_operation (enum rtx_code code,
machine_mode mode,
   switch (code)
{
case FIX:
+ /* According to IEEE standard, for conversions from floating point to
+integer. When a NaN or infinite operand cannot be represented in
+the destination format and this cannot otherwise be indicated, the
+invalid operation exception shall be signaled. When a numeric
+operand would convert to an integer outside the range of the
+destination format, the invalid operation exception shall be
+signaled if this situation cannot otherwise be indicated.  */
  if (REAL_VALUE_ISNAN (*x))
-   return const0_rtx;
+   return flag_trapping_math ? NULL_RTX : const0_rtx;
+
+ if (REAL_VALUE_ISINF (*x) && flag_trapping_math)
+   return NULL_RTX;

  /* Test against the signed upper bound.  */
  wmax = wi::max_value (width, SIGNED);
  real_from_integer (&t, VOIDmode, wmax, SIGNED);
  if (real_less (&t, x))
-   return immed_wide_int_const (wmax, mode);
+   return (flag_trapping_math
+   ? NULL_RTX : immed_wide_int_const (wmax, mode));

  /* Test against the signed lower bound.  */
  wmin = wi::min_value (width, SIGNED);
@@ -2276,13 +2287,17 @@ simplify_const_unary_operation (enum rtx_code code,
machine_mode mode,

case UNSIGNED_FIX:
  if (REAL_VALUE_ISNAN (*x) || REAL_VALUE_NEGATIVE (*x))
-   return const0_rtx;
+   return flag_trapping_math ? NULL_RTX : const0_rtx;
+
+ if (REAL_VALUE_ISINF (*x) && flag_trapping_math)
+   return NULL_RTX;

  /* Test against the unsigned upper bound.  */
  wmax = wi::max_value (width, UNSIGNED);
  real_from_integer (&t, VOIDmode, wmax, UNSIGNED);
  if (real_less (&t, x))
-   return immed_wide_int_const (wmax, mode);
+   return (flag_trapping_math
+   ? NULL_RTX : immed_wide_int_const (wmax, mode));

  return immed_wide_int_const (real_to_integer (x, , width),

[Bug target/114427] [x86] vec_pack_truncv8si/v4si can be optimized with pblendw instead of pand for AVX2 target

2024-05-20 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114427

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #2 from Hongtao Liu  ---
Fixed in GCC15.

[Bug rtl-optimization/115021] [14/15 regression] unnecessary spill for vpternlog

2024-05-20 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021

--- Comment #4 from Hongtao Liu  ---
(In reply to Hu Lin from comment #3)
> I found compiler allocates mem to the third source register of vpternlog in
> IRA after commit f55cdce3f8dd8503e080e35be59c5f5390f6d95e. And it cause the
> generate code will be 
> 
>   8 .cfi_startproc
>   9 movl$4, %eax
>  10 vpsraw  $5, %xmm0, %xmm2
>  11 vpbroadcastb%eax, %xmm1
>  12 movl$7, %eax
>  13 vpbroadcastb%eax, %xmm3
>  14 vmovdqa %xmm1, %xmm0
>  15 vpternlogd  $120, %xmm3, %xmm2, %xmm0
>  16 vmovdqa %xmm3, -24(%rsp)
>  17 vpsubb  %xmm1, %xmm0, %xmm0
>  18 ret
> 
> And 6a67fdcb3f0cc8be47b49ddd246d0c50c3770800 changes the vector type from
> v16qi to v4si, leading to movv4si can't combine with the vpternlog in
> postreload, so the result is what you see now.

To clarify: the extra spill is caused by r14-4944-gf55cdce3f8dd85;
r14-7026-g6a67fdcb3f0cc8 only causes an extra mov instruction (which is not a
big deal).

[Bug target/115069] [14/15 regression] 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform

2024-05-20 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

--- Comment #16 from Hongtao Liu  ---
> Should we also run a SPEC on with -O2 -mtune=generic -march=x86-64-v3 to see
> if there is any surprise?

Sure; I guess there will be no surprise.

[Bug target/115069] [14/15 regression] 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform

2024-05-20 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

--- Comment #14 from Hongtao Liu  ---
(In reply to Uroš Bizjak from comment #13)
> (In reply to Haochen Jiang from comment #12)
> > (In reply to Hongtao Liu from comment #11)
> > > (In reply to Haochen Jiang from comment #10)
> > > > A patch like Comment 8 could definitely solve the problem. But I need to
> > > > test more benchmarks to see if there is surprise.
> > > > 
> > > > But, yes, as Uros said in Comment 9, maybe there is a chance we could 
> > > > do it
> > > > better.
> > > 
> > > Could you add "arch=skylake-avx512" to target_clones and try disable whole
> > > ix86_expand_vecop_qihi2 to see if there's any performance improvement?
> > > For x86, cross-lane permutation(truncation) is not very efficient(3-4 
> > > cycles
> > > for both vpermq and vpmovwb).
> > 
> > When I disable/enable ix86_expand_vecop_qihi2 with arch=skylake-avx512 on
> > trunk, there is no performance regression comparing to GCC13 + avx2.
> > 
> > It seems that the regression only happens when GCC14 + avx2.
> 
> This is what the patch in Comment #8 prevents. skylake-avx512 enables
> TARGET_AVX512BW, so VPMOVB is emitted instead of problematic VPERMQ.
Yes, the patch looks good to me.

[Bug target/115146] [15 Regression] Incorrect 8-byte vectorization: psrlw/psraw confusion

2024-05-19 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115146

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #10 from Hongtao Liu  ---
(In reply to Uroš Bizjak from comment #8)
> (In reply to Levy Hsu from comment #5)
> > case E_V16QImode:
> >   mode = V8HImode;
> >   gen_shr = gen_vlshrv8hi3;
> >   gen_shl = gen_vashlv8hi3;
> 
> Hm, why vector-by-vector shift here? Should there be a call to gen_lshrv8hi3
> and gen_lshrv8hi3 instead?

I think it's a typo; they should be gen_lshrv8hi3 and gen_ashlv8hi3.
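
For reference, a C-level sketch (illustrative only, not the PR's testcase) of the
difference between the two kinds of named patterns: the lshr/ashl expanders take one
scalar shift count for all lanes, while the vlshr/vashl expanders shift each lane by
the count in the corresponding lane of a second vector.

typedef unsigned short v8hi __attribute__ ((vector_size (16)));

/* Typically expanded through lshrv8hi3: a single scalar count.  */
v8hi shr_scalar (v8hi a) { return a >> 3; }

/* Typically expanded through vlshrv8hi3: per-lane counts.  */
v8hi shr_vector (v8hi a, v8hi c) { return a >> c; }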

[Bug target/115069] [14/15 regression] 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform

2024-05-19 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

--- Comment #11 from Hongtao Liu  ---
(In reply to Haochen Jiang from comment #10)
> A patch like Comment 8 could definitely solve the problem. But I need to
> test more benchmarks to see if there is surprise.
> 
> But, yes, as Uros said in Comment 9, maybe there is a chance we could do it
> better.

Could you add "arch=skylake-avx512" to target_clones and try disable whole
ix86_expand_vecop_qihi2 to see if there's any performance improvement?
For x86, cross-lane permutation(truncation) is not very efficient(3-4 cycles
for both vpermq and vpmovwb).

[Bug target/115069] [14/15 regression] 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform

2024-05-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

--- Comment #5 from Hongtao Liu  ---
(In reply to Krzysztof Kanas from comment #4)
> I bisected the issue and it seems that commit
> 0368fc54bc11f15bfa0ed9913fd0017815dfaa5d introduces regression.

I guess the real guilty commit is 

commit 52ff3f7b863da1011b73c0ab3b11f6c78b6451c7
Author: Uros Bizjak 
Date:   Thu May 25 19:40:26 2023 +0200

i386: Use 2x-wider modes when emulating QImode vector instructions

Rewrite ix86_expand_vecop_qihi2 to expand to 2x-wider (e.g. V16QI ->
V16HImode)
instructions when available.  Currently, the compiler generates following
assembly for V16QImode multiplication (-mavx2):

[Bug target/115116] New: [x86] rtx_cost is overestimated for big size memory.

2024-05-15 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115116

Bug ID: 115116
   Summary: [x86] rtx_cost is overestimated for big size memory.
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

typedef char v16qi __attribute__((vector_size(16)));


v16qi
__attribute__((noipa))
foo (v16qi a)
{
  v16qi c = __extension__(v16qi)
{ 0x1,0x2,0x3,0x4,0x5,0x6,0x7,0x8,
  0x8,0x7,0x6,0x5,0x4,0x3,0x2,0x1 };
  return a * c;
}

with -O2 -march=x86-64-v4

.cfi_startproc
vpmovzxbw   .LC0(%rip), %ymm1
vpmovzxbw   %xmm0, %ymm0
vpmullw %ymm1, %ymm0, %ymm0
vpmovwb %ymm0, %xmm0
vzeroupper

but it can be optimized to 

.cfi_startproc
vpmovzxbw   %xmm0, %ymm0
vpmullw .LC0(%rip), %ymm0, %ymm0
vpmovwb %ymm0, %xmm0
vzeroupper

but failed due to cost comparison

.cfi_startproc
vpmovzxbw   %xmm0, %ymm0
vpmullw .LC0(%rip), %ymm0, %ymm0
vpmovwb %ymm0, %xmm0
vzeroupper

Successfully matched this instruction:
(set (reg:V16HI 104)
(mem/u/c:V16HI (symbol_ref/u:DI ("*.LC1") [flags 0x2]) [0  S32 A256]))
rejecting combination of insns 6 and 10
original costs 9 + 4 = 13
replacement cost 17

For bigger modes, rtx_cost uses factor = GET_MODE_SIZE / UNITS_PER_WORD and
returns cost = factor * COSTS_N_INSNS (1); that's too much for 256/512-bit
vectors, which are probably loaded/stored with a single SSE register.
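
A worked instance of that estimate, assuming x86-64 where UNITS_PER_WORD is 8 bytes:

  V16HI load of *.LC1:  GET_MODE_SIZE = 32 bytes
  factor = 32 / 8 = 4
  cost   = factor * COSTS_N_INSNS (1) = COSTS_N_INSNS (4)

So the single 256-bit load from the constant pool is charged like four word-sized
moves, which is presumably why the replacement cost (17) above exceeds the original
9 + 4 and combine rejects keeping the memory operand.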

[Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb

2024-05-15 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #6 from Hongtao Liu  ---
Fixed in GCC15.

[Bug target/115115] [12/13/14/15 Regression] highway-1.0.7 wrong _mm_cvttps_epi32() constant fold

2024-05-15 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115115

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #4 from Hongtao Liu  ---
It looks the same as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100927#c4

[Bug middle-end/115101] New: [wrong code] with -O1 -floop-nest-optimize for gcc.dg/graphite/interchange-8.c

2024-05-15 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115101

Bug ID: 115101
   Summary: [wrong code] with -O1 -floop-nest-optimize for
gcc.dg/graphite/interchange-8.c
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

While working on tuning cunrolli, I found that if cunrolli is disabled, the
testcase fails.
It can be reproduced with trunk at -O1 -floop-nest-optimize.

[Bug target/101017] ICE: Segmentation fault, convert_memory_address_addr_space_1 with vector_size(32) and target_clone arch=core-avx2/default

2024-05-13 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101017

Hongtao Liu  changed:

   What|Removed |Added

 CC||haochen.jiang at intel dot com

--- Comment #7 from Hongtao Liu  ---

> For arch=core-avx2, function returns in reg(cfun->returns_struct), but for
> "default" and resolver function, it returns in memory(cfun->returns_struct
> == 1)
> 
> The mismatch between resolver and dispatched function caused the ICE.

A similar issue exists for

#include <immintrin.h>
__attribute__((target_clones ("avx512f", "default")))
__m512d foo(__m512d a, __m512d b)
{
  c = a + b;
}

With -O2, under avx512f it returns in a register; by default it returns in memory.

[Bug target/114987] [14/15 Regression] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms

2024-05-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

--- Comment #6 from Hongtao Liu  ---
> I tried to move "vmovdqa %xmm1,0xd0(%rsp)" before "vmovdqa %xmm0,0xe0(%rsp)"
> and rebuilt the binary and it will save half the regression.

 57.93 │200:   vaddps   0xc0(%rsp),%ymm3,%ymm5
 11.11 │   vaddps   0xe0(%rsp),%ymm2,%ymm6
...
  3.22 │   vmovdqa  %xmm1,0xc0(%rsp)
   │   vmovdqa  %xmm5,0xd0(%rsp)
  3.52 │   vmovdqa  %xmm0,0xe0(%rsp)  
   │   vmovdqa  %xmm6,0xf0(%rsp)   

I guess there are specific patterns in the SKX microarchitecture for STLF; the main
difference is the instruction order of those xmm stores.

From the compiler side, the worthwhile thing to do is PR107916.

[Bug rtl-optimization/115021] New: [14/15 regression] unnecessary spill for vpternlog

2024-05-09 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021

Bug ID: 115021
   Summary: [14/15 regression] unnecessary spill for vpternlog
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

typedef signed char v16qi __attribute__ ((__vector_size__ (16)));
 v16qi foo (v16qi x) { return x >> 5; }

with -march=x86-64-v4 -O2,  GCC 13.2 generates

foo(signed char __vector(16)):
mov eax, 4
vpsraw  xmm2, xmm0, 5
vpbroadcastbxmm1, eax
mov eax, 7
vpbroadcastbxmm3, eax
vmovdqa xmm0, xmm1
vpternlogd  xmm0, xmm2, xmm3, 120
vpsubb  xmm0, xmm0, xmm1
ret

GCC 14.1 generates

foo(signed char __vector(16)):
mov eax, 67372036
vpsraw  xmm2, xmm0, 5
vpbroadcastdxmm1, eax
mov eax, 117901063
vpbroadcastdxmm3, eax
vmovdqa xmm0, xmm1
vmovdqa XMMWORD PTR [rsp-24], xmm3
vpternlogd  xmm0, xmm2, XMMWORD PTR [rsp-24], 120
vpsubb  xmm0, xmm0, xmm1
ret

There's extra spill.

[Bug sanitizer/84508] Load of misaligned address using _mm_load_sd

2024-05-09 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org
 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #20 from Hongtao Liu  ---
Fixed in GCC15.

[Bug target/113090] Suboptimal vector permuation for 64-bit vector.

2024-05-07 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113090

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #2 from Hongtao Liu  ---
Fixed in GCC15

[Bug target/113079] [x86] Fails to generate dot_prod instructions for 64-bit vector.

2024-05-07 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113079

Hongtao Liu  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Hongtao Liu  ---
Fixed in GCC15.

[Bug target/114943] X86 AVX2: inefficient code generated to convert SIMD Vectors

2024-05-05 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114943

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #1 from Hongtao Liu  ---
For convert3, we already have a patch and will post it soon.
For convert, the current loop vectorizer has a limitation that it keeps the same
vector length while vectorizing, thus generating extra packing/unpacking
instructions compared to convert2. But there's no such limitation in the BB
vectorizer, so with -O3 -march=x86-64-v3 -fno-tree-loop-vectorize, convert is as
good as convert2.

[Bug libgcc/114907] __trunchfbf2 should be renamed to __extendhfbf2

2024-05-05 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114907

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #7 from Hongtao Liu  ---
(In reply to Jakub Jelinek from comment #6)
> Neither of the modes is a subset or superset of the other, so truncate vs.
> extend makes no sense.  The choice was arbitrary.

It sounds to me like we need to define an alias:
strong_alias (__trunchfbf2, __extendhfbf2)
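
For illustration, a self-contained sketch of the alias mechanism being suggested;
the identifiers below are made up for the example, while libgcc itself would use its
strong_alias macro on the real __trunchfbf2 symbol:

/* bf16_from_hf_impl gets a second linker-visible name, the way
   strong_alias (__trunchfbf2, __extendhfbf2) would give __trunchfbf2 the
   additional name __extendhfbf2.  */
float bf16_from_hf_impl (float x) { return x; }
float bf16_from_hf_alias (float x) __attribute__ ((alias ("bf16_from_hf_impl")));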

[Bug tree-optimization/114883] [14/15 Regression] 521.wrf_r ICE with -O2 -march=sapphirerapids -fvect-cost-model=cheap

2024-04-29 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883

--- Comment #10 from Hongtao Liu  ---
(In reply to Jakub Jelinek from comment #9)
> Created attachment 58073 [details]
> gcc14-pr114883.patch
> 
> Full untested patch.

This fixes the 521.wrf_r ICE and passes runtime validation.

[Bug tree-optimization/114883] [14/15 Regression] 521.wrf_r ICE with -O2 -march=sapphirerapids -fvect-cost-model=cheap

2024-04-29 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883

--- Comment #5 from Hongtao Liu  ---

(In reply to Hongtao Liu from comment #4)
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index a6cf0a5546c..ae6abe00f3e 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -8505,7 +8505,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>  {
>gcc_assert (code == IFN_COND_ADD || code == IFN_COND_SUB
>   || code == IFN_COND_MUL || code == IFN_COND_AND
> - || code == IFN_COND_IOR || code == IFN_COND_XOR);
> + || code == IFN_COND_IOR || code == IFN_COND_XOR
> + || code == IFN_COND_MIN);
>gcc_assert (op.num_ops == 4
>   && (op.ops[reduc_index]
>   == op.ops[internal_fn_else_index ((internal_fn)
> code)]));
> 
> Could fix the ICE.

Generate code as:

74  vect__38.89_332 = {_289, _295, _301, _307, _313, _319, _325, _331};
475  vect__39.90_333 = vect__36.85_228 * vect__38.88_283;
476  vect__39.90_334 = vect__36.85_229 * vect__38.89_332;
477  vect_tinv_80.91_335 = .FMA (vect__33.78_194, vect__32.75_211,
vect__39.90_333);
478  vect_tinv_80.91_336 = .FMA (vect__33.79_107, vect__32.75_207,
vect__39.90_334);
479  mask__217.92_338 = vect_tinv_80.91_335 > {
9.99974752427078783512115478515625e-7,
9.99974752427078783512115478515625e-7,
9.99974752427078783512115478515625e-7, 9.99974752\
   427078783512115478515625e-7, 9.99974752427078783512115478515625e-7,
9.99974752427078783512115478515625e-7,
9.99974752427078783512115478515625e-7, 9.999747524270787835121154\
   78515625e-7 };
480  mask__217.92_339 = vect_tinv_80.91_336 > {
9.99974752427078783512115478515625e-7,
9.99974752427078783512115478515625e-7,
9.99974752427078783512115478515625e-7, 9.99974752\
   427078783512115478515625e-7, 9.99974752427078783512115478515625e-7,
9.99974752427078783512115478515625e-7,
9.99974752427078783512115478515625e-7, 9.999747524270787835121154\
   78515625e-7 };
481  vect_dtt_84.93_342 = .COND_RDIV (mask__217.92_338, { 1.0e+0, 1.0e+0,
1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0 }, vect_tinv_80.91_335, { 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 });
482  vect_dtt_84.93_343 = .COND_RDIV (mask__217.92_339, { 1.0e+0, 1.0e+0,
1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0 }, vect_tinv_80.91_336, { 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 });
483  vect_M.94_344 = .COND_MIN (mask__217.92_338, vect_dtmin_119.64_128,
vect_dtt_84.93_342, vect_dtt_84.93_342);
484  vect_M.94_345 = .COND_MIN (mask__217.92_339, vect_dtt_84.93_343,
vect_M.94_344, vect_dtt_84.93_343);
485  ivtmp.102_164 = ivtmp.102_119 + 128;
486  ivtmp.105_196 = ivtmp.105_193 + 128;
487  ivtmp.109_136 = ivtmp.109_177 + 128;
488  if (_118 == ivtmp.102_164)
489goto ; [36.35%]
490  else
491goto ; [63.65%]
492
493   [local count: 54066899]:
494  _347 = .REDUC_MIN (vect_M.94_345);

[Bug tree-optimization/114883] [14/15 Regression] 521.wrf_r ICE with -O2 -march=sapphirerapids -fvect-cost-model=cheap

2024-04-29 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883

--- Comment #4 from Hongtao Liu  ---
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index a6cf0a5546c..ae6abe00f3e 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8505,7 +8505,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 {
   gcc_assert (code == IFN_COND_ADD || code == IFN_COND_SUB
  || code == IFN_COND_MUL || code == IFN_COND_AND
- || code == IFN_COND_IOR || code == IFN_COND_XOR);
+ || code == IFN_COND_IOR || code == IFN_COND_XOR
+ || code == IFN_COND_MIN);
   gcc_assert (op.num_ops == 4
  && (op.ops[reduc_index]
  == op.ops[internal_fn_else_index ((internal_fn) code)]));

Could fix the ICE.

[Bug tree-optimization/114883] [14/15 Regression] 521.wrf_r ICE with -O2 -march=sapphirerapids -fvect-cost-model=cheap

2024-04-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883

--- Comment #3 from Hongtao Liu  ---
Created attachment 58066
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58066&action=edit
reproduced testcase

gfortran -O2 -march=x86-64-v4 -fvect-cost-model=cheap.

[Bug tree-optimization/114883] [14/15 Regression] 521.wrf_r ICE with -O2 -march=sapphirerapids -fvect-cost-model=cheap

2024-04-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883

--- Comment #2 from Hongtao Liu  ---
(In reply to Andrew Pinski from comment #1)
> Can you reduce the fortran code down for the ICE? It should not be hard, you
> can use delta even.

Let me try.

[Bug tree-optimization/114883] New: 521.wrf_r ICE with -O2 -march=sapphirerapids -fvect-cost-model=cheap

2024-04-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883

Bug ID: 114883
   Summary: 521.wrf_r ICE with -O2 -march=sapphirerapids
-fvect-cost-model=cheap
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

during GIMPLE pass: vect
dump file: module_cam_mp_ndrop.fppized.f90.179t.vect
module_cam_mp_ndrop.fppized.f90:33:27:

   33 |   subroutine dropmixnuc(lchnk, ncol, ncldwtr,tendnd, temp,omega,  &
  |   ^
internal compiler error: in vect_transform_reduction, at tree-vect-loop.cc:8506
0x8c8009 vect_transform_reduction(_loop_vec_info*, _stmt_vec_info*,
gimple_stmt_iterator*, gimple**, _slp_tree*)
/iusers/liuhongt/work/gcc-14/gcc/tree-vect-loop.cc:8506
0x2959895 vect_transform_stmt(vec_info*, _stmt_vec_info*,
gimple_stmt_iterator*, _slp_tree*, _slp_instance*)
/iusers/liuhongt/work/gcc-14/gcc/tree-vect-stmts.cc:13447
0x185a31a vect_transform_loop_stmt
/iusers/liuhongt/work/gcc-14/gcc/tree-vect-loop.cc:11561
0x18794e2 vect_transform_loop(_loop_vec_info*, gimple*)
/iusers/liuhongt/work/gcc-14/gcc/tree-vect-loop.cc:12087
0x18c3544 vect_transform_loops
/iusers/liuhongt/work/gcc-14/gcc/tree-vectorizer.cc:1006
0x18c3bc3 try_vectorize_loop_1
/iusers/liuhongt/work/gcc-14/gcc/tree-vectorizer.cc:1152
0x18c3bc3 try_vectorize_loop
/iusers/liuhongt/work/gcc-14/gcc/tree-vectorizer.cc:1182
0x18c4224 execute
/iusers/liuhongt/work/gcc-14/gcc/tree-vectorizer.cc:1298
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See  for instructions.


The vect dump below shows vectorization of the COND_MIN reduction that ICEs.

 9014  MEM  [(real(kind=8) *)vectp.2627_3771] =
vect__458.2626_3767;
 9015  vect_tinv_1357.2629_3773 = vect__453.2615_3744 + vect__458.2626_3766;
 9016  vect_tinv_1357.2629_3774 = vect__453.2615_3745 + vect__458.2626_3767;
 9017  tinv_1357 = _453 + _458;
 9018  mask__2039.2630_3776 = vect_vec_iv_.2598_3718 == vect_cst__3775;
 9019  _2039 = k.864_1980 == prephitmp_3150;
 9020  mask_patt_3658.2631_3777 = [vec_unpack_lo_expr] mask__2039.2630_3776;
 9021  mask_patt_3658.2631_3778 = [vec_unpack_hi_expr] mask__2039.2630_3776;
 9022  vect_patt_3659.2632_3781 = .COND_ADD (mask_patt_3658.2631_3777,
vect_tinv_1357.2629_3773, vect_cst__3779, vect_cst__3780);
 9023  vect_patt_3659.2632_3782 = .COND_ADD (mask_patt_3658.2631_3778,
vect_tinv_1357.2629_3774, vect_cst__3779, vect_cst__3780);
 9024  tinv_1766 = 0.0;
 9025  mask_patt_3660.2633_3783 = [vec_unpack_lo_expr] mask__2039.2630_3776;
 9026  mask_patt_3660.2633_3784 = [vec_unpack_hi_expr] mask__2039.2630_3776;
 9027  vect_patt_3661.2634_3786 = .COND_ADD (mask_patt_3660.2633_3783,
vect_patt_3659.2632_3781, vect_cst__3785, vect_tinv_1357.2629_3773);
 9028  vect_patt_3661.2634_3787 = .COND_ADD (mask_patt_3660.2633_3784,
vect_patt_3659.2632_3782, vect_cst__3785, vect_tinv_1357.2629_3774);
 9029  tinv_1359 = 0.0;
 9030  mask__2017.2635_3789 = vect_patt_3661.2634_3786 > vect_cst__3788;
 9031  mask__2017.2635_3790 = vect_patt_3661.2634_3787 > vect_cst__3788;
 9032  _2017 = tinv_1359 > 9.99974752427078783512115478515625e-7;
 9033  vect_dtt_1360.2636_3793 = .COND_RDIV (mask__2017.2635_3789,
vect_cst__3791, vect_patt_3661.2634_3786, vect_cst__3792);
 9034  vect_dtt_1360.2636_3794 = .COND_RDIV (mask__2017.2635_3790,
vect_cst__3791, vect_patt_3661.2634_3787, vect_cst__3792);
 9035  dtt_1360 = 0.0;
 9036  M.287_1361 = .COND_MIN (_2017, dtt_1360, dtmin_1992, dtmin_1992);
 9037  _459 = k.864_1980 + 1;
 9038  vectp.2602_3727 = vectp.2602_3729 + 32;
 9039  vectp.2606_3733 = vectp.2606_3735 + 32;
 9040  vectp.2611_3740 = vectp.2611_3742 + 32;
 9041  vectp.2616_3747 = vectp.2616_3749 + 32;
 9042  vectp.2618_3752 = vectp.2618_3754 + 32;
 9043  vectp.2627_3769 = vectp.2627_3771 + 32;
 9044  if (_459 > prephitmp_3150)
 9045goto ; [11.00%]
 9046  else
 9047goto ; [89.00%]
 9048


Part of the source code in module_cam_mp_ndrop.fppized.f90: the loop that ICEs.

 630 do k=1,pver
 631km1=max0(k-1,1)
 632ekkp(k)=zn(k)*ekk(k)*zs(k)
 633ekkm(k)=zn(k)*ekk(k-1)*zs(km1)
 634tinv=ekkp(k)+ekkm(k)
 635
 636if(k.eq.pver)tinv=tinv+surfratemax
 637! rce-comment -- tinv is the sum of all first-order-loss-rates
 638!for the layer.  for most layers, the activation loss rate
 639!(for interstitial particles) is accounted for by the loss by
 640!turb-transfer to the layer above.
 641!k=pver is special, and the loss rate for activation within
 642!the layer must be added to tinv.  if not, the time step
 643!can be too big, and explmix can produce negative values.
 

[Bug target/110621] x86_64: Test gcc.target/i386/pr105354-2.c fails with -fstack-protector

2024-04-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110621

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Hongtao Liu  ---
Fixed.

[Bug target/85048] [missed optimization] vector conversions

2024-04-22 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

--- Comment #16 from Hongtao Liu  ---
(In reply to Matthias Kretz (Vir) from comment #15)
> So it seems that if at least one of the vector builtins involved in the
> expression is 512 bits GCC needs to locally increase prefer-vector-width to
> 512? Or, more generally:
> 
> prefer-vector-width = max(prefer-vector-width, 8 * sizeof(operands)..., 8 *
> sizeof(return-value))
> 
> The reason to default to 256 bits is to avoid zmm register usage altogether
> (clock-down). But if the surrounding code already uses zmm registers that
> motivation is moot.
> 
> Also, I think this shouldn't be considered auto-vectorization but rather
> pattern recognition (recognizing a __builtin_convertvector).

The related question is "should GCC set prefer-vector-width=512" when 512-bit
intrinsics are used. There may be situations where users don't want the compiler to
generate zmm code except for those 512-bit intrinsics in their program, i.e. the hot
loop is written with 512-bit intrinsics for performance, but everywhere else they
would prefer no zmm usage.

[Bug target/85048] [missed optimization] vector conversions

2024-04-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #14 from Hongtao Liu  ---
(In reply to Matthias Kretz (Vir) from comment #13)
> Should I open a new PR for the remaining ((u)int64, 16) <-> (float, 16)
> conversions?
> 
> https://godbolt.org/z/x3xPMYKj3
> 
> Note that __builtin_convertvector produces the code we want.
> 

With -mprefer-vector-width=512, GCC produces the same code.
The default tuning for -march=skylake-avx512 is -mprefer-vector-width=256.
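
A sketch of the remaining conversion in question (vector typedefs assumed, not
Matthias' exact code): with -march=skylake-avx512 -mprefer-vector-width=512 GCC can
keep this in zmm registers, while the default prefer-vector-width=256 splits it into
ymm halves.

typedef long long v16di __attribute__ ((vector_size (128)));
typedef float     v16sf __attribute__ ((vector_size (64)));

/* 16-lane int64 -> float conversion; __builtin_convertvector already
   produces the desired code, per comment #13.  */
v16sf
cvt (v16di x)
{
  return __builtin_convertvector (x, v16sf);
}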

[Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.

2024-04-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

--- Comment #7 from Hongtao Liu  ---
(In reply to Hongtao Liu from comment #4)
> (In reply to Hongtao Liu from comment #3)
> > Looks like ix86_vect_estimate_reg_pressure doesn't work here, taking a look.
> 
> Oh, ix86_vect_estimate_reg_pressure is only for loop, BB vectorizer only use
> ix86_builtin_vectorization_cost, but not add_stmt_cost/finish_cost.

Oh, the CTOR comes from the source code, not from the vectorizer.
Then why aren't the loads from offset moved to just before their consumers (the
loads from array), so that the live ranges of those values can be shortened? (The
loads from array are moved to just before the CTOR insns.)

[Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.

2024-04-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

--- Comment #4 from Hongtao Liu  ---
(In reply to Hongtao Liu from comment #3)
> Looks like ix86_vect_estimate_reg_pressure doesn't work here, taking a look.

Oh, ix86_vect_estimate_reg_pressure is only for loops; the BB vectorizer only uses
ix86_builtin_vectorization_cost, not add_stmt_cost/finish_cost.

[Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.

2024-04-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #3 from Hongtao Liu  ---
Looks like ix86_vect_estimate_reg_pressure doesn't work here, taking a look.

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #16 from Hongtao Liu  ---

> 
> 4952  /* See if a MEM has already been loaded with a widening operation;
> 4953 if it has, we can use a subreg of that.  Many CISC machines
> 4954 also have such operations, but this is only likely to be
> 4955 beneficial on these machines.  */

Oh, it's the pre-reload cse_insn, not postreload gcse.

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #15 from Hongtao Liu  ---
> I don't see this as problematic. IIRC, there was a discussion in the past
> that a couple (two?) memory accesses from the same location close to each
> other can be faster (so, -O2, not -Os) than preloading the value to the
> register first.
At least for memory in a vector mode, it's better to preload the value into a
register first.
> 
> In contrast, the example from the Comment #11 already has the correct value
> in %eax, so there is no need to reload it again from memory, even in a
> narrower mode.

So the problem is why cse can't handle the same memory accessed in a narrower mode;
maybe it's because there's a zero_extend in the first load. cse looks like it can
handle the simple wider-mode memory case.

4952  /* See if a MEM has already been loaded with a widening operation;
4953 if it has, we can use a subreg of that.  Many CISC machines
4954 also have such operations, but this is only likely to be
4955 beneficial on these machines.  */

[Bug middle-end/110027] [11/12/13/14 regression] Stack objects with extended alignments (vectors etc) misaligned on detect_stack_use_after_return

2024-04-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027

--- Comment #19 from Hongtao Liu  ---
(In reply to Jakub Jelinek from comment #17)
> Both of the posted patches are incorrect, this needs to be fixed in
> asan_emit_stack_protection, account for the different offsets[0] which
> happens when a stack pointer guard is created.
> I'll deal with it tomorrow.

It seems to me that the only offending place is the one I've modified; are there
other places where align_frame_offset (ASAN_RED_ZONE_SIZE) is also added?

Also, your patch adds a gcc_assert for offset[0], which suggests to me there was
an assumption that offset[0] should be a multiple of alignb, thus making my
patch more reasonable?

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #12 from Hongtao Liu  ---
short a;
short c;
short d;
void
foo (short b, short f)
{
  c = b + a;
  d = f + a;
}

foo(short, short):
addwa(%rip), %di
addwa(%rip), %si
movw%di, c(%rip)
movw%si, d(%rip)
ret

This one has been bad since GCC 10.1, and there's no subreg involved. The problem
is that if the operand is used by more than one insn, and they all support a
separate m constraint, mem_cost is quite small (just 1, while the reg move cost is
2), and this makes the RA more inclined to propagate the memory operand across
insns. I guess the RA assumes a separate m means the insn only supports
memory_operand?

 961  if (op_class == NO_REGS)
 962/* Although we don't need insn to reload from
 963   memory, still accessing memory is usually more
 964   expensive than a register.  */
 965pp->mem_cost = frequency;
 966  else

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #11 from Hongtao Liu  ---
unsigned v;
long long v2;
char foo ()
{
v2 = v;
return v;
}

This is related to *movqi_internal, and the codegen has been worse since GCC 8.1:

foo:
movlv(%rip), %eax
movq%rax, v2(%rip)
movzbl  v(%rip), %eax
ret

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #9 from Hongtao Liu  ---

> 
> It looks that different modes of memory read confuse LRA to not CSE the read.
> 
> IMO, if the preloaded value is later accessed in different modes, LRA should
> leave it. Alternatively, LRA should CSE memory accesses in different modes.

(insn 7 6 12 2 (set (reg:HI 101 [ _5 ])
(subreg:HI (reg:SI 98 [ v1.0_1 ]) 0)) "test.c":6:12 86
{*movhi_internal}
 (expr_list:REG_DEAD (reg:SI 98 [ v1.0_1 ])
(nil)))

Maybe we should reduce the cost of a simple move instruction (with a subreg?) when
calculating total_cost, since it will probably be eliminated by later RTL
optimization.

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #5 from Hongtao Liu  ---
> My experience is memory cost for the operand with rm or separate r, m is
> different which impacts RA decision.
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595573.html

If I change operands[1] alternative 2 from m to rm, then the RA makes the perfect decision.

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #4 from Hongtao Liu  ---
(In reply to Uroš Bizjak from comment #3)
> (In reply to Jakub Jelinek from comment #2)
> > This changed with r12-5584-gca5667e867252db3c8642ee90f55427149cd92b6
> 
> Strange, if I revert the constraints to the previous setting with: 
> 
> --cut here--
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 10ae3113ae8..262dd25a8e0 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -2870,9 +2870,9 @@ (define_peephole2
>  
>  (define_insn "*movhi_internal"
>[(set (match_operand:HI 0 "nonimmediate_operand"
> -"=r,r,r,m ,*k,*k ,r ,m ,*k ,?r,?*v,*Yv,*v,*v,jm,m")
> +"=r,r,r,m ,*k,*k ,*r ,*m ,*k ,?r,?v,*Yv,*v,*v,*jm,*m")
> (match_operand:HI 1 "general_operand"
> -"r ,n,m,rn,r ,*km,*k,*k,CBC,*v,r  ,C  ,*v,m ,*x,*v"))]
> +"r ,n,m,rn,*r ,*km,*k,*k,CBC,v,r  ,C  ,v,m ,x,v"))]
>"!(MEM_P (operands[0]) && MEM_P (operands[1]))
> && ix86_hardreg_mov_ok (operands[0], operands[1])"
>  {
> --cut here--
> 
> I still get:
> 
> movlv1(%rip), %eax  # 6 [c=6 l=6]  *zero_extendsidi2/3
> movq%rax, v2(%rip)  # 16[c=4 l=7]  *movdi_internal/5
> movzwl  v1(%rip), %eax  # 7 [c=5 l=7]  *movhi_internal/2

My experience is that the memory cost for an operand with rm versus separate r, m
constraints is different, which impacts the RA decision.

https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595573.html

[Bug tree-optimization/66862] OpenMP SIMD does not work (use SIMD instructions) on conditional code

2024-04-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66862

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Hongtao Liu  ---
Fixed since GCC 8 with avx512bw and avx512vl.
Without avx512bw, x86 doesn't support packed int16 mask{load,store} and can't
vectorize the loop.
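
A minimal sketch (not the PR's exact code) of the kind of conditional int16 loop
involved; with -O2 -fopenmp-simd -mavx512vl -mavx512bw the store under the condition
can be implemented with an AVX512 masked store on packed int16, which is what makes
the loop vectorizable:

void
cond_copy (short *restrict a, const short *restrict b, int n)
{
  #pragma omp simd
  for (int i = 0; i < n; i++)
    if (b[i] > 0)
      a[i] = b[i];
}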

[Bug target/113288] [i386] Missing #define for -mavx10.1-256 and -mavx10.1-512

2024-04-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113288

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Hongtao Liu  ---
.

[Bug target/114544] [x86] stv should transform (subreg DI (V1TI) 8) as (vec_select:DI (V2DI) (const_int 1))

2024-04-07 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544

--- Comment #3 from Hongtao Liu  ---
 <__umodti3>:
...

 37  58:   66 48 0f 6e c7  movq   %rdi,%xmm0
 38  5d:   66 48 0f 6e d6  movq   %rsi,%xmm2
 39  62:   66 0f 6c c2 punpcklqdq %xmm2,%xmm0
 40  66:   0f 29 44 24 f0  movaps %xmm0,-0x10(%rsp)
 41  6b:   48 8b 44 24 f0  mov-0x10(%rsp),%rax
 42  70:   48 8b 54 24 f8  mov-0x8(%rsp),%rdx
 43  75:   5b  pop%rbx
 44  76:   c3  ret

Looks like the misoptimization is also in __umodti3.

[Bug target/114570] New: GCC doesn't perform good loop invariant code motion for very long vector operations.

2024-04-03 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114570

Bug ID: 114570
   Summary: GCC doesn't perform good loop invariant code motion
for very long vector operations.
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

typedef float v128_32 __attribute__((vector_size (128 * 4), aligned(2048)));
v128_32
foo (v128_32 a, v128_32 b, v128_32 c, int n)
{
for (int i = 0; i != 2048; i++)
{
a = a / c;
a = a / b;
}
return a;
}

   [local count: 1063004408]:
  # a_13 = PHI 
  # ivtmp_2 = PHI 
  # DEBUG i => NULL
  # DEBUG a => NULL
  # DEBUG BEGIN_STMT
  _14 = BIT_FIELD_REF ;
  _15 = BIT_FIELD_REF ;
  _10 = _14 / _15;
  _11 = BIT_FIELD_REF ;
  _12 = BIT_FIELD_REF ;
  _16 = _11 / _12;
  _17 = BIT_FIELD_REF ;
  _18 = BIT_FIELD_REF ;
  _19 = _17 / _18;
  _20 = BIT_FIELD_REF ;
  _21 = BIT_FIELD_REF ;
  _22 = _20 / _21;
  _23 = BIT_FIELD_REF ;
  _24 = BIT_FIELD_REF ;
  _25 = _23 / _24;
  _26 = BIT_FIELD_REF ;
  _27 = BIT_FIELD_REF ;
  _28 = _26 / _27;
  _29 = BIT_FIELD_REF ;
  _30 = BIT_FIELD_REF ;
  _31 = _29 / _30;
  _32 = BIT_FIELD_REF ;
  _33 = BIT_FIELD_REF ;
  _34 = _32 / _33;
  _35 = BIT_FIELD_REF ;
  _36 = BIT_FIELD_REF ;
  _37 = _35 / _36;
  _38 = BIT_FIELD_REF ;
  _39 = BIT_FIELD_REF ;
  _40 = _38 / _39;
  _41 = BIT_FIELD_REF ;
  _42 = BIT_FIELD_REF ;
  _43 = _41 / _42;
  _44 = BIT_FIELD_REF ;
  _45 = BIT_FIELD_REF ;
  _46 = _44 / _45;
  _47 = BIT_FIELD_REF ;
  _48 = BIT_FIELD_REF ;
  _49 = _47 / _48;
  _50 = BIT_FIELD_REF ;
  _51 = BIT_FIELD_REF ;
  _52 = _50 / _51;
  _53 = BIT_FIELD_REF ;
  _54 = BIT_FIELD_REF ;
  _55 = _53 / _54;
  _56 = BIT_FIELD_REF ;
  _57 = BIT_FIELD_REF ;
  _58 = _56 / _57;
  # DEBUG a => {_10, _16, _19, _22, _25, _28, _31, _34, _37, _40, _43, _46,
_49, _52, _55, _58}
  # DEBUG BEGIN_STMT
  _59 = BIT_FIELD_REF ;
  _60 = _10 / _59;
  _61 = BIT_FIELD_REF ;
  _62 = _16 / _61;
  _63 = BIT_FIELD_REF ;
  _64 = _19 / _63;
  _65 = BIT_FIELD_REF ;
  _66 = _22 / _65;
  _67 = BIT_FIELD_REF ;
  _68 = _25 / _67;
  _69 = BIT_FIELD_REF ;
  _70 = _28 / _69;
  _71 = BIT_FIELD_REF ;
  _72 = _31 / _71;
  _73 = BIT_FIELD_REF ;
  _74 = _34 / _73;
  _75 = BIT_FIELD_REF ;
  _76 = _37 / _75;
  _77 = BIT_FIELD_REF ;
  _78 = _40 / _77;
  _79 = BIT_FIELD_REF ;
  _80 = _43 / _79;
  _81 = BIT_FIELD_REF ;
  _82 = _46 / _81;
  _83 = BIT_FIELD_REF ;
  _84 = _49 / _83;
  _85 = BIT_FIELD_REF ;
  _86 = _52 / _85;
  _87 = BIT_FIELD_REF ;
  _88 = _55 / _87;
  _89 = BIT_FIELD_REF ;
  _90 = _58 / _89;
  a_9 = {_60, _62, _64, _66, _68, _70, _72, _74, _76, _78, _80, _82, _84, _86,
_88, _90};
  # DEBUG a => a_9
  # DEBUG BEGIN_STMT
  # DEBUG i => NULL
  # DEBUG a => a_9
  # DEBUG BEGIN_STMT
  ivtmp_1 = ivtmp_2 + 4294967295;
  if (ivtmp_1 != 0)
goto ; [98.99%]
  else
goto ; [1.01%]

Ideally, those BIT_FIELD_REF can be hoisted out and 
# a_13 = PHI  can be optimized with those 256-bit vectors.

we finally generate

foo:
pushq   %rbp
movq%rdi, %rax
movl$2048, %edx
movq%rsp, %rbp
subq$408, %rsp
leaq-120(%rsp), %r8
.L2:
vmovaps 16(%rbp), %ymm15
vmovaps 48(%rbp), %ymm14
movq%r8, %rsi
vdivps  1040(%rbp), %ymm15, %ymm15
vmovaps 80(%rbp), %ymm13
vmovaps 112(%rbp), %ymm12
vdivps  528(%rbp), %ymm15, %ymm15
vdivps  1072(%rbp), %ymm14, %ymm14
vmovaps 144(%rbp), %ymm11
vmovaps 176(%rbp), %ymm10
vdivps  560(%rbp), %ymm14, %ymm14
vdivps  1104(%rbp), %ymm13, %ymm13
vmovaps 208(%rbp), %ymm9
vmovaps 240(%rbp), %ymm8
vdivps  592(%rbp), %ymm13, %ymm13
vdivps  1136(%rbp), %ymm12, %ymm12
vmovaps 272(%rbp), %ymm7
vmovaps 304(%rbp), %ymm6
vdivps  624(%rbp), %ymm12, %ymm12
vdivps  1168(%rbp), %ymm11, %ymm11
vmovaps 336(%rbp), %ymm5
vdivps  656(%rbp), %ymm11, %ymm11
vdivps  1200(%rbp), %ymm10, %ymm10
vdivps  1232(%rbp), %ymm9, %ymm9
vdivps  688(%rbp), %ymm10, %ymm10
vdivps  720(%rbp), %ymm9, %ymm9
vdivps  1264(%rbp), %ymm8, %ymm8
vdivps  1296(%rbp), %ymm7, %ymm7
vdivps  752(%rbp), %ymm8, %ymm8
vdivps  784(%rbp), %ymm7, %ymm7
vdivps  1328(%rbp), %ymm6, %ymm6
movl$64, %ecx
vdivps  816(%rbp), %ymm6, %ymm6
leaq16(%rbp), %rdi
vdivps  1360(%rbp), %ymm5, %ymm5
vdivps  848(%rbp), %ymm5, %ymm5
vmovaps 368(%rbp), %ymm4
vmovaps 400(%rbp), %ymm3
vdivps  1392(%rbp), %ymm4, %ymm4
vdivps  1424(%rbp), %ymm3, %ymm3
vmovaps 432(%rbp), %ymm2
vmovaps 464(%rbp), %ymm1
vdivps  880(%rbp), %ymm4, %ymm4
vdivps  912(%rbp), %ymm3, %ymm3
vmovaps 

[Bug rtl-optimization/114556] New: weird loop unrolling when there's attribute aligned in side the loop

2024-04-02 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114556

Bug ID: 114556
   Summary: weird loop unrolling when there's attribute aligned in
side the loop
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

v32qi
z (void* pa, void* pb, void* pc)
{
v32qi __attribute__((aligned(64))) a;
v32qi __attribute__((aligned(64))) b;
v32qi __attribute__((aligned(64))) c;
__builtin_memcpy (, pa, sizeof (a));
__builtin_memcpy (, pb, sizeof (a));
__builtin_memcpy (, pc, sizeof (a));
#pragma GCC unroll 8
for (int i = 0; i != 2048; i++)
  a += b;
  return a;
}

-O2 -mavx2, we have 

z:
vmovdqu (%rsi), %ymm1
vpaddb  (%rdi), %ymm1, %ymm0
movl$2041, %eax
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
jmp .L2
.L3:
vpaddb  %ymm0, %ymm1, %ymm0
subl$8, %eax
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
.L2:
vpaddb  %ymm0, %ymm1, %ymm0
cmpl$1, %eax
jne .L3
ret

But shouldn't it be better as

z:
vmovdqu (%rsi), %ymm1
vmovdqu (%rdi), %ymm0
movl$2048, %eax
.L2:
vpaddb  %ymm1, %ymm0, %ymm0
vpaddb  %ymm1, %ymm0, %ymm0
vpaddb  %ymm1, %ymm0, %ymm0
vpaddb  %ymm1, %ymm0, %ymm0
vpaddb  %ymm1, %ymm0, %ymm0
vpaddb  %ymm1, %ymm0, %ymm0
vpaddb  %ymm1, %ymm0, %ymm0
vpaddb  %ymm1, %ymm0, %ymm0
subl$8, %eax
jne .L2
ret

[Bug target/114544] [x86] stv should transform (subreg DI (V1TI) 8) as (vec_select:DI (V2DI) (const_int 1))

2024-04-01 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544

--- Comment #2 from Hongtao Liu  ---
Also for 
void
foo2 (v128_t* a, v128_t* b)
{
   c = (*a & *b)+ *b;
}

(insn 9 8 10 2 (set (reg:V1TI 108 [ _3 ])
(and:V1TI (reg:V1TI 99 [ _2 ])
(mem:V1TI (reg:DI 113) [1 *a_6(D)+0 S16 A128])))
"/app/example.c":49:12 7100 {andv1ti3}
 (expr_list:REG_DEAD (reg:DI 113)
(nil)))
(insn 10 9 13 2 (parallel [
(set (reg:TI 109 [ _11 ])
(plus:TI (subreg:TI (reg:V1TI 108 [ _3 ]) 0)
(subreg:TI (reg:V1TI 99 [ _2 ]) 0)))
(clobber (reg:CC 17 flags))
]) "/app/example.c":49:17 256 {*addti3_doubleword}
 (expr_list:REG_DEAD (reg:V1TI 108 [ _3 ])
(expr_list:REG_DEAD (reg:V1TI 99 [ _2 ])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)

Since V1TImode can only be allocated to SSE_REGS, reload uses the stack for
(subreg:TI (reg:V1TI 108 [ _3 ]) 0), since the latter only supports GENERAL_REGS.

[Bug target/114544] [x86] stv should transform (subreg DI (V1TI) 8) as (vec_select:DI (V2DI) (const_int 1))

2024-04-01 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544

--- Comment #1 from Hongtao Liu  ---
20590;; Turn SImode or DImode extraction from arbitrary SSE/AVX/AVX512F
20591;; vector modes into vec_extract*.
20592(define_split
20593  [(set (match_operand:SWI48x 0 "nonimmediate_operand")
20594(subreg:SWI48x (match_operand 1 "register_operand") 0))]
20595  "can_create_pseudo_p ()
20596   && REG_P (operands[1])
20597   && VECTOR_MODE_P (GET_MODE (operands[1]))
20598   && ((TARGET_SSE && GET_MODE_SIZE (GET_MODE (operands[1])) == 16)
20599   || (TARGET_AVX && GET_MODE_SIZE (GET_MODE (operands[1])) == 32)
20600   || (TARGET_AVX512F && TARGET_EVEX512
20601   && GET_MODE_SIZE (GET_MODE (operands[1])) == 64))
20602   && (mode == SImode || TARGET_64BIT || MEM_P (operands[0]))"
20603  [(set (match_dup 0) (vec_select:SWI48x (match_dup 1)
20604 (parallel [(const_int 0)])))]
20605{
20606  rtx tmp;

We need to do something similar.

[Bug target/114544] New: [x86] stv should transform (subreg DI (V1TI) 8) as (vec_select:DI (V2DI) (const_int 1))

2024-04-01 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544

Bug ID: 114544
   Summary: [x86] stv should transform (subreg DI (V1TI) 8) as
(vec_select:DI (V2DI) (const_int 1))
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

typedef __uint128_t v128_t __attribute__((vector_size(16)));

v128_t c;


v128_t
foo1 (v128_t *a, v128_t *b)
{
c =  (*a >> 1 & *b) / (__extension__(v128_t){(__int128_t)0x3 << 120
| (__int128_t)0x3 << 112
| (__int128_t)0x3 << 104
| (__int128_t)0x3 << 96
| (__int128_t)0x3 << 88
| (__int128_t)0x3 << 80
| (__int128_t)0x3 << 72
| (__int128_t)0x3 << 64
| (__int128_t)0x3 << 56
| (__int128_t)0x3 << 48
| (__int128_t)0x3 << 40
| (__int128_t)0x3 << 32
| (__int128_t)0x3 << 24
| (__int128_t)0x3 << 16
| (__int128_t)0x3 << 8
| (__int128_t)0x3 << 0});
}


stv generates

(insn 32 11 35 2 (set (reg:DI 124 [ _4 ])
(subreg:DI (reg:V1TI 111 [ _4 ]) 0)) "/app/example.c":28:25 84
{*movdi_internal}
 (nil))
(insn 35 32 12 2 (set (reg:DI 127 [+8 ])
(subreg:DI (reg:V1TI 111 [ _4 ]) 8)) "/app/example.c":28:25 84
{*movdi_internal}
 (expr_list:REG_DEAD (reg:V1TI 111 [ _4 ])

(subreg:DI (reg:V1TI 111 [ _4 ]) 8) makes reload spill.


foo1:
movabsq $217020518514230019, %rdx # 57  [c=1 l=10] 
*movdi_internal/4
subq$24, %rsp   # 59  [c=4 l=4] 
pro_epilogue_adjust_stack_add_di/0
vmovdqa (%rdi), %xmm0 # 8   [c=9 l=4]  movv1ti_internal/3
movq%rdx, %rcx  # 58[c=4 l=3]  *movdi_internal/3
vpsrldq $8, %xmm0, %xmm1  # 42[c=4 l=5]  sse2_lshrv1ti3/1
vpsrlq  $1, %xmm0, %xmm0# 45[c=4 l=5]  lshrv2di3/1
vpsllq  $63, %xmm1, %xmm1   # 46  [c=4 l=5]  ashlv2di3/1
vpor%xmm1, %xmm0, %xmm0 # 47  [c=4 l=4]  *iorv2di3/1
vpand   (%rsi), %xmm0, %xmm2  # 10[c=13 l=4]  andv1ti3/1
vmovdqa %xmm2, (%rsp) # 52  [c=4 l=5]  movv1ti_internal/4
movq(%rsp), %rdi# 56[c=5 l=4]  *movdi_internal/3
movq8(%rsp), %rsi   # 35  [c=9 l=5]  *movdi_internal/3
call__udivti3   # 19  [c=13 l=5]  *call_value
vmovq   %rax, %xmm3   # 53  [c=4 l=5]  *movdi_internal/20
vpinsrq $1, %rdx, %xmm3, %xmm0# 23[c=4 l=6]  vec_concatv2di/2
vmovdqa %xmm0, c(%rip)# 25[c=4 l=8]  movv1ti_internal/4
addq$24, %rsp   # 62  [c=4 l=4] 
pro_epilogue_adjust_stack_add_di/0
ret   # 63[c=0 l=1]  simple_return_internal

[Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb

2024-03-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

--- Comment #3 from Hongtao Liu  ---
(In reply to Andrew Pinski from comment #1)
> Confirmed.
> 
> Note non sign bit can be improved too:
> ```
I assume you're talking about broadcasting from an immediate versus loading directly
from the constant pool. GCC chooses the former; with -Os we can also generate the
latter. According to a microbenchmark, the former is better. I also tried disabling
broadcasting from an immediate and testing with stress-ng vecmath; the performance is
similar.

[Bug target/114514] New: v16qi >> 7 can be optimized with vpcmpgtb

2024-03-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

Bug ID: 114514
   Summary: v16qi >> 7 can be optimized with vpcmpgtb
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

v16qi
foo2 (v16qi a, v16qi b)
{
return a >> 7;
}

it can be optimized with
vpxor   xmm1, xmm1, xmm1
vpcmpgtbxmm0, xmm1, xmm0
ret

currently we generate(emulated with v16hi)

movl$16843009, %eax
vpsraw  $7, %xmm0, %xmm0
vmovd   %eax, %xmm1
vpbroadcastd%xmm1, %xmm1
vpandn  %xmm1, %xmm0, %xmm0
vpsubb  %xmm1, %xmm0, %xmm0
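
For reference, a minimal intrinsics sketch of the suggested form (hypothetical
helper name, 128-bit SSE2 intrinsics): an element-wise arithmetic >> 7 on v16qi
is just the per-byte sign mask, which vpcmpgtb against zero produces directly.

#include <immintrin.h>

/* Each byte becomes 0xff if negative and 0x00 otherwise -- the same result
   as an element-wise arithmetic >> 7 on v16qi.  */
__m128i
sra7_epi8 (__m128i a)
{
  return _mm_cmpgt_epi8 (_mm_setzero_si128 (), a);
}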

[Bug tree-optimization/114471] [14 regression] ICE when building liblc3-1.0.4 with -fno-vect-cost-model -march=x86-64-v4

2024-03-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114471

--- Comment #6 from Hongtao Liu  ---
(In reply to Hongtao Liu from comment #5)
> Maybe we should always use kmask under AVX512, currently only >= 128-bits
> vector of vector _Float16 use kmask, < 128 bits vector still use vector mask.
> 
and we need to support vec_cmp/vcond_mask for 64/32/16-bit vectors.
For the testcase, there's no kmask used at all, so why doesn't x86-64-v3 issue
an error?

[Bug tree-optimization/114471] [14 regression] ICE when building liblc3-1.0.4 with -fno-vect-cost-model -march=x86-64-v4

2024-03-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114471

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #5 from Hongtao Liu  ---
Maybe we should always use kmask under AVX512; currently only >= 128-bit vectors
of _Float16 use kmask, while < 128-bit vectors still use a vector mask.

24628  /* Scalar mask case.  */
24629  if ((TARGET_AVX512F && TARGET_EVEX512 && vector_size == 64)
24630  || (TARGET_AVX512VL && (vector_size == 32 || vector_size == 16))
24631  /* AVX512FP16 only supports vector comparison
24632 to kmask for _Float16.  */
24633  || (TARGET_AVX512VL && TARGET_AVX512FP16
24634  && GET_MODE_INNER (data_mode) == E_HFmode))
24635{
24636  if (elem_size == 4
24637  || elem_size == 8
24638  || (TARGET_AVX512BW && (elem_size == 1 || elem_size == 2)))
24639return smallest_int_mode_for_size (nunits);
24640}
24641
24642  scalar_int_mode elem_mode
24643= smallest_int_mode_for_size (elem_size * BITS_PER_UNIT);
24644
24645  gcc_assert (elem_size * nunits == vector_size);
24646
24647  return mode_for_vector (elem_mode, nunits);

[Bug target/114429] [x86] (neg a) ashifrt>> 31 can be optimized to a > 0.

2024-03-22 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114429

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID

--- Comment #3 from Hongtao Liu  ---
Then invalid.

[Bug target/114429] [x86] (neg a) ashifrt>> 31 can be optimized to a > 0.

2024-03-22 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114429

--- Comment #2 from Hongtao Liu  ---
(In reply to Hongtao Liu from comment #1)
> when x is INT_MIN, I assume -x is UD, so compiler can do anything.
> otherwise, (-x) >> 31 is just x > 0.
> From rtl view. neg of INT_MIN is assumed to 0 after it's truncated.

Wait, is -INT_MIN truncated to INT_MIN? If that's the case, we can't do the
optimization at the RTL level.

[Bug target/114429] [x86] (neg a) ashifrt>> 31 can be optimized to a > 0.

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114429

Hongtao Liu  changed:

   What|Removed |Added

 Target||x86_64-*-* i?86-*-*

--- Comment #1 from Hongtao Liu  ---
when x is INT_MIN, I assume -x is UD, so the compiler can do anything.
Otherwise, (-x) >> 31 is just x > 0.
From the RTL view, neg of INT_MIN is assumed to be 0 after it's truncated.
(neg:m x)
(ss_neg:m x)
(us_neg:m x)
These two expressions represent the negation (subtraction from zero) of the
value represented by x, carried out in mode m. They differ in the behavior on
overflow of integer modes. In the case of neg, the negation of the operand may
be a number not representable in mode m, in which case it is truncated to m.
ss_neg and us_neg ensure that an out-of-bounds result saturates to the maximum
or minimum signed or unsigned value.

so we can optimize (neg a)>>31 to a>0.
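
A minimal scalar sketch of that identity (not from the PR; assumes 32-bit int
and arithmetic right shift of negative values, which C leaves
implementation-defined):

#include <assert.h>
#include <limits.h>

int main (void)
{
  /* For any x != INT_MIN, (-x) >> 31 equals -(x > 0), i.e. the 0/-1 mask
     a vector compare x > 0 would produce.  */
  for (long long v = INT_MIN + 1; v <= INT_MAX; v += 65537)
    {
      int x = (int) v;
      assert (((-x) >> 31) == -(x > 0));
    }
  return 0;
}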

[Bug target/114429] New: [x86] (neg a) ashifrt>> 31 can be optimized to a > 0.

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114429

Bug ID: 114429
   Summary: [x86] (neg a) ashifrt>> 31 can be optimized to a > 0.
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

typedef unsigned char uint8_t;
uint8_t x264_clip_uint8( int x )
{
return x&(~255) ? (-x)>>31 : x;
}

void
foo (int* a, int* __restrict b, int n)
{
for (int i = 0; i != 8; i++)
  b[i] = x264_clip_uint8 (a[i]);
}

gcc -O2 -march=x86-64-v3 -S


foo(int*, int*, int):
..
mov eax, 255
vpxor   xmm0, xmm0, xmm0
vmovd   xmm1, eax
vpbroadcastdymm1, xmm1
vmovdqu ymm2, YMMWORD PTR [rdi]
vpminud ymm3, ymm2, ymm1
vpsubd  ymm0, ymm0, ymm2
vmovdqa YMMWORD PTR [rsp-32], ymm3
vpsrad  ymm0, ymm0, 31
vpcmpeqdymm3, ymm2, YMMWORD PTR [rsp-32]
vpblendvb   ymm0, ymm0, ymm2, ymm3
vpand   ymm1, ymm1, ymm0
vmovdqu YMMWORD PTR [rsi], ymm1


It can be better with

mov eax, 255
vmovd   xmm1, eax
vpxor xmm0, xmm0, xmm0. 
vpbroadcastdymm1, xmm1
vmovdqu ymm2, YMMWORD PTR [rdi]
vpminud ymm3, ymm2, ymm1
vmovdqa YMMWORD PTR [rsp-32], ymm3
vcmpgtps  ymm0, ymm2, ymm0
vpcmpeqdymm3, ymm2, YMMWORD PTR [rsp-32]
vpblendvb   ymm0, ymm0, ymm2, ymm3
vpand   ymm1, ymm1, ymm0
vmovdqu YMMWORD PTR [rsi], ymm1

[Bug target/114428] New: [x86] psrad xmm, xmm, 16 and pand xmm, const_vector (0xffff x4) can be optimized to psrld

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114428

Bug ID: 114428
   Summary: [x86] psrad xmm, xmm, 16 and pand xmm, const_vector
(0xffff x4) can be optimized to psrld
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

typedef unsigned short uint16_t;
typedef short int16_t;

#define QUANT_ONE( coef, mf, f )\
{ \
if( (coef) > 0 ) \
(coef) = (f + (coef)) * (mf) >> 16; \
else \
(coef) = - ((f - (coef)) * (mf) >> 16); \
nz |= (coef); \
}

int quant_4x4( int16_t dct[16], uint16_t mf[16], uint16_t bias[16] )
{
int nz = 0;
for( int i = 0; i < 16; i++ )
QUANT_ONE( dct[i], mf[i], bias[i] );
return !!nz;
}


gcc -O2 -march=x86-64-v3 -S

mov edx, 65535
vmovd   xmm4, edx
vpbroadcastdymm4, xmm4
...
vpsrad  ymm2, ymm2, 16
vpsrad  ymm6, ymm6, 16
vpsrad  ymm0, ymm0, 16
vpand   ymm2, ymm4, ymm2
vpsrad  ymm1, ymm1, 16
vpand   ymm6, ymm4, ymm6
vpand   ymm0, ymm4, ymm0
vpand   ymm4, ymm4, ymm1
vpackusdw   ymm2, ymm2, ymm6
vpackusdw   ymm0, ymm0, ymm4
vpermq  ymm2, ymm2, 216
vpermq  ymm0, ymm0, 216
...

It can be optimized to the code below.

vpsrld  ymm2, ymm2, 16
vpsrld  ymm6, ymm6, 16
vpsrld  ymm0, ymm0, 16
vpsrld  ymm1, ymm1, 16
vpackusdw   ymm2, ymm2, ymm6
vpackusdw   ymm0, ymm0, ymm4
vpermq  ymm2, ymm2, 216
vpermq  ymm0, ymm0, 216

The optimization opportunity is exposed after vec_pack_trunc_expr is expanded
to vpand + vpackusdw.
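
As an aside, a scalar sketch (not from the PR) of why dropping the pand is
safe: masking the arithmetic-shift result with 0xffff gives the same value as
a logical shift by 16, assuming arithmetic right shift of negative values.

#include <assert.h>

int main (void)
{
  for (long long v = -2147483648LL; v <= 2147483647LL; v += 40503)
    {
      int x = (int) v;
      /* (x >> 16) & 0xffff  ==  (unsigned) x >> 16 for 32-bit x.  */
      assert (((x >> 16) & 0xffff) == (int) ((unsigned) x >> 16));
    }
  return 0;
}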

[Bug target/114427] New: [x86] vec_pack_truncv8si/v4si can be optimized with pblendw instead of pand for AVX2 target

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114427

Bug ID: 114427
   Summary: [x86] vec_pack_truncv8si/v4si can be optimized with
pblendw instead of pand for AVX2 target
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

void
foo (int* a, short* __restrict b, int* c)
{
for (int i = 0; i != 8; i++)
  b[i] = c[i] + a[i];
}

gcc -O2 -march=x86-64-v3 -S

mov eax, 65535
vmovd   xmm0, eax
vpbroadcastdxmm0, xmm0
vpand   xmm2, xmm0, XMMWORD PTR [rdi+16]
vpand   xmm1, xmm0, XMMWORD PTR [rdi]
vpackusdw   xmm1, xmm1, xmm2
vpand   xmm2, xmm0, XMMWORD PTR [rdx]
vpand   xmm0, xmm0, XMMWORD PTR [rdx+16]
vpackusdw   xmm0, xmm2, xmm0
vpaddw  xmm0, xmm1, xmm0
vmovdqu XMMWORD PTR [rsi], xmm0


It can be better with the code below,

vpxor   %xmm0, %xmm0, %xmm0
vpblendw$85, 16(%rdi), %xmm0, %xmm2
vpblendw$85, (%rdi), %xmm0, %xmm1
vpackusdw   %xmm2, %xmm1, %xmm1
vpblendw$85, (%rdx), %xmm0, %xmm2
vpblendw$85, 16(%rdx), %xmm0, %xmm0
vpackusdw   %xmm0, %xmm2, %xmm0
vpaddw  %xmm0, %xmm1, %xmm0
vmovdqu %xmm0, (%rsi)

Currently, we're using (const_vector:v4si (const_int 0xffff) x4) as the mask to
clear the upper 16 bits, but pblendw with a zero vector can also be used, and a
zero vector is much cheaper to materialize than
(const_vector:v4si (const_int 0xffff) x4), which currently needs:

mov eax, 65535
vmovd   xmm0, eax
vpbroadcastdxmm0, xmm0

pblendw has the same latency as pand, but could be a little bit worse from a
throughput view (0.33->0.5 on ADL P-core, same on Zen4).
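
For illustration only, a minimal intrinsics sketch of the two forms
(hypothetical helper names; 128-bit SSE4.1 intrinsics for brevity, while the
report is about the 256-bit AVX2 variants):

#include <immintrin.h>

/* pand with a materialized 0xffff x4 constant.  */
__m128i
clear_high16_pand (__m128i v)
{
  return _mm_and_si128 (v, _mm_set1_epi32 (0xffff));
}

/* pblendw against a zero vector: keep the even (low) words and zero the odd
   (high) words of each dword.  */
__m128i
clear_high16_pblendw (__m128i v)
{
  return _mm_blend_epi16 (v, _mm_setzero_si128 (), 0xaa);
}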

[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

Hongtao Liu  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #25 from Hongtao Liu  ---
Fixed in GCC14 and GCC13.3.

[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

--- Comment #20 from Hongtao Liu  ---
(In reply to JuzheZhong from comment #19)
> I think it's better to add pr114396.c into vect testsuite instead of x86
> target test since it's the bug not only happens on x86.

Sure, there are no target-specific intrinsics in the testcase; I'll move it to
the vect testsuite.

[Bug rtl-optimization/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080

--- Comment #9 from Hongtao Liu  ---

> If we were to expose that vpxor before postreload we'd likely CSE but
> we have
> 
> 5: xmm0:V4SI=const_vector
>   REG_EQUIV const_vector
> 6: [`b']=xmm0:V4SI
> 7: xmm0:V8HI=const_vector
>   REG_EQUIV const_vector
> 8: [`a']=xmm0:V8HI
> 
> until the very end.  But since we have the same mode size on the xmm0
> sets CSE could easily handle (integral) constants by hashing/comparing
> on their byte representation rather than by using the RTX structure.
> OTOH as we mostly have special constants allowed in the IL like this
> treating all-zeros and all-ones specially might be good enough ...

We only handle scalar code; I guess we could do something similar, maybe:
1. iterate over vector modes with the same vector length?
2. iterate over vector modes with the same component mode but a bigger vector
length?

But that will miss the v8hi/v8si pxor case. Another alternative is to
canonicalize const_vector with a scalar mode, i.e. v4si -> TI, v8si -> OI,
v16si -> XI; then we can just query with TI/OI/XImode?


4873  /* See if we have a CONST_INT that is already in a register in a
4874 wider mode.  */
4875
4876  if (src_const && src_related == 0 && CONST_INT_P (src_const)
4877  && is_int_mode (mode, &int_mode)
4878  && GET_MODE_PRECISION (int_mode) < BITS_PER_WORD)
4879{
4880  opt_scalar_int_mode wider_mode_iter;
4881  FOR_EACH_WIDER_MODE (wider_mode_iter, int_mode)
4882{
4883  scalar_int_mode wider_mode = wider_mode_iter.require ();
4884  if (GET_MODE_PRECISION (wider_mode) > BITS_PER_WORD)
4885break;
4886
4887  struct table_elt *const_elt
4888= lookup (src_const, HASH (src_const, wider_mode),
wider_mode);
4889
4890  if (const_elt == 0)
4891continue;
4892
4893  for (const_elt = const_elt->first_same_value;
4894   const_elt; const_elt = const_elt->next_same_value)
4895if (REG_P (const_elt->exp))
4896  {
4897src_related = gen_lowpart (int_mode, const_elt->exp);
4898break;
4899  }
4900
4901  if (src_related != 0)
4902break;
4903}
4904}

[Bug rtl-optimization/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #7 from Hongtao Liu  ---
Another simple case is 

typedef int v4si __attribute__((vector_size(16)));
typedef short v8hi __attribute__((vector_size(16)));

v8hi a;
v4si b;
void
foo ()
{
   b = __extension__(v4si){0, 0, 0, 0};
   a = __extension__(v8hi){0, 0, 0, 0, 0, 0, 0, 0};
}

GCC generates 2 pxor

foo():
vpxor   xmm0, xmm0, xmm0
vmovdqa XMMWORD PTR b[rip], xmm0
vpxor   xmm0, xmm0, xmm0
vmovdqa XMMWORD PTR a[rip], xmm0
ret

[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321

2024-03-20 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

--- Comment #17 from Hongtao Liu  ---
> > 
> > The to_mpz args look like they could be mixing signs as well:
> > 
I tried the code below; it looks like mixing signs works well.
Debugging shows step_expr is -5 and signed.

short a = 0xF;
short b[16];
unsigned short ua = 0xF;
unsigned short ub[16];

int main() {
  for (int e = 0; e < 9; e += 1)
b[e] = a *= 0x5;
  __builtin_printf("decimal: %d\n", a);
  __builtin_printf("hex: %X\n", a);

  for (int e = 0; e < 9; e += 1)
b[e] = a *= -5;
  __builtin_printf("decimal: %d\n", a);
  __builtin_printf("hex: %X\n", a);

  for (int e = 0; e < 9; e += 1)
ub[e] = ua *= 0x5;
  __builtin_printf("decimal: %d\n", ua);
  __builtin_printf("hex: %X\n", ua);

  for (int e = 0; e < 9; e += 1)
ub[e] = ua *= -5;
  __builtin_printf("decimal: %d\n", ua);
  __builtin_printf("hex: %X\n", ua);

}

[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321

2024-03-20 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

Hongtao Liu  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED

--- Comment #16 from Hongtao Liu  ---
Mine.

[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321

2024-03-20 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

--- Comment #15 from Hongtao Liu  ---
(In reply to Richard Biener from comment #9)
> (In reply to Robin Dapp from comment #8)
> > No fallout on x86 or aarch64.
> > 
> > Of course using false instead of TYPE_SIGN (utype) is also possible and
> > maybe clearer?
> 
> Well, wi::from_mpz doesn't take a sign argument.  It's comment says
> 
> /* Returns X converted to TYPE.  If WRAP is true, then out-of-range
>values of VAL will be wrapped; otherwise, they will be set to the
>appropriate minimum or maximum TYPE bound.  */
> wide_int
> wi::from_mpz (const_tree type, mpz_t x, bool wrap)
> 
> I'm not sure if we really want saturating behavior here, so 'true' is
> more correct?  Note if we want an unsigned result we should pass utype here,
> that might be the bug?  So
> 
> begin = wi::from_mpz (utype, res, true);
> 
> ?
Yes, it should be.
> 
> The to_mpz args look like they could be mixing signs as well:
> 
> case vect_step_op_mul:
>   {
> tree utype = unsigned_type_for (type);
> init_expr = gimple_convert (stmts, utype, init_expr);
> wide_int skipn = wi::to_wide (skip_niters);
> wide_int begin = wi::to_wide (step_expr);
> auto_mpz base, exp, mod, res;
> wi::to_mpz (begin, base, TYPE_SIGN (type));
> 
> TYPE_SIGN (step_expr)?
step_expr should have the same type as init_expr.
> 
> wi::to_mpz (skipn, exp, UNSIGNED);
> 
> TYPE_SIGN (skip_niters) (which should be UNSIGNED I guess)?
skipn must be a positive value, so I assume UNSIGNED/SIGNED doesn't make any
difference here.

[Bug tree-optimization/67683] Missed vectorization: shifts of an induction variable

2024-03-18 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67683

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #6 from Hongtao Liu  ---
(In reply to Andrew Pinski from comment #5)
> /app/example.cpp:5:20: note:   vect_is_simple_use: operand # RANGE [irange]
> short unsigned int [0, 2047][3294, 3294][6589, 6589][13179, 13179][26359,
> 26359][52719, 52719]
> val_16 = PHI , type of def: induction
> 
> We detect it now.
> But then it is still not vectorized ...

We don't know how to peel for a variable niter; there could be UD if we peel it
like val(epilogue) = val >> (max / vf) * vf.
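
An illustrative scalar sketch of the problem (hypothetical helper, not the
vectorizer's actual code; 16-bit IV as in the PR's testcase): computing the
epilogue start value of a right-shifted induction variable needs a shift count
of (niter / vf) * vf, which is undefined once it reaches the bit width.

unsigned short
epilogue_start (unsigned short val, unsigned niter, unsigned vf)
{
  unsigned k = (niter / vf) * vf;   /* iterations done by the vector loop */
  /* The guard on k is exactly what makes this peeling awkward.  */
  return k < 16 ? (unsigned short) (val >> k) : 0;
}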

[Bug middle-end/114347] wrong constant folding when casting __bf16 to int

2024-03-18 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114347

--- Comment #9 from Hongtao Liu  ---
(In reply to Richard Biener from comment #7)
> (In reply to Jakub Jelinek from comment #6)
> > You can use -fexcess-precision=16 if you don't want treating _Float16 and
> > __bf16 as having excess precision.  With excess precision, I think the above
> > behavior is correct.
> > You'd need (int) (__bf16) 257.0bf16 to get 256 even with excess precision.
> 
> Ah, -fexcess-precision=16 doesn't seem to be documented though (how does
> this influence long double handling then?)

Oh, I forgot to add that in invoke.texi.

-fexcess-precision=16 doesn't impact types with precision > 16. And it's not
compatible with -mfpmath=387.

[Bug target/114334] [14 Regression] ICE: in extract_insn, at recog.cc:2812 (unrecognizable insn and:HF?) with lroundf16() and -ffast-math -mavx512fp16

2024-03-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114334

Hongtao Liu  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from Hongtao Liu  ---
Fixed in GCC14.

[Bug tree-optimization/66862] OpenMP SIMD does not work (use SIMD instructions) on conditional code

2024-03-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66862

--- Comment #5 from Hongtao Liu  ---
> Now, it seems AVX512BW (and AVX512VL in some cases) has the needed
> instructions,
> in particular VMOVDQU{8,16}, but it is not reflected in maskload and
> maskstore expanders.  CCing Kyrill and Uros on this.

With -mavx512bw and -mavx512vl, the loop has been vectorized since GCC 8.1.

[Bug target/114334] [14 Regression] ICE: in extract_insn, at recog.cc:2812 (unrecognizable insn and:HF?) with lroundf16() and -ffast-math -mavx512fp16

2024-03-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114334

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-03-15
 CC||liuhongt at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #1 from Hongtao Liu  ---
Mine

[Bug target/110027] [11/12/13/14 regression] Misaligned vector store on detect_stack_use_after_return

2024-03-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027

--- Comment #15 from Hongtao Liu  ---
A patch is posted at
https://gcc.gnu.org/pipermail/gcc-patches/2024-March/647604.html

[Bug target/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g04520645038

2024-03-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822

Hongtao Liu  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #17 from Hongtao Liu  ---
I forgot to add the PR to my commit; it's solved by r14-9459-g618e34d56cc38e and
backported to r13-8438-gbdbcfbfcf59138 and r12-10214-ga861f940efffae.

[Bug target/110027] [11/12/13/14 regression] Misaligned vector store on detect_stack_use_after_return

2024-03-12 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027

--- Comment #14 from Hongtao Liu  ---
diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
index 0de299c62e3..92062378d8e 100644
--- a/gcc/cfgexpand.cc
+++ b/gcc/cfgexpand.cc
@@ -1214,7 +1214,7 @@ expand_stack_vars (bool (*pred) (size_t), class
stack_vars_data *data)
{
  if (data->asan_vec.is_empty ())
{
- align_frame_offset (ASAN_RED_ZONE_SIZE);
+ align_frame_offset (MAX (alignb, ASAN_RED_ZONE_SIZE));
  prev_offset = frame_offset.to_constant ();
}
  prev_offset = align_base (prev_offset,


This fixes the issue, but I'm not sure it's the correct way.

[Bug libgcc/111731] [13/14 regression] gcc_assert is hit at libgcc/unwind-dw2-fde.c#L291

2024-03-12 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111731

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #16 from Hongtao Liu  ---
(In reply to Thomas Neumann from comment #15)
> Created attachment 57679 [details]
> fixed patch
> 
> Can you please try the updated patch? I had accidentally dropped an if
> nesting level when trying to adhere to the gcc style, sorry for that.

I'm trying to validate your patch, but it could take some time to set up the
environment.

[Bug target/110027] [11/12/13/14 regression] Misaligned vector store on detect_stack_use_after_return

2024-03-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027

--- Comment #13 from Hongtao Liu  ---
So the stack is like

--- stack top

-32

- (offset -32)

-64 (32 bytes redzone)

- (offset -64)

-128 (64 bytes __m512)

 (offset -128)

 (32-bytes redzone)

---(offset -160)   <--- __asan_stack_malloc_128 tries to allocate a buffer


  /* Emit the prologue sequence.  */
  if (asan_frame_size > 32 && asan_frame_size <= 65536 && pbase
  && param_asan_use_after_return)
{
  use_after_return_class = floor_log2 (asan_frame_size - 1) - 5;
  /* __asan_stack_malloc_N guarantees alignment
 N < 6 ? (64 << N) : 4096 bytes.  */
  if (alignb > (use_after_return_class < 6
? (64U << use_after_return_class) : 4096U))
use_after_return_class = -1;
  else if (alignb > ASAN_RED_ZONE_SIZE && (asan_frame_size & (alignb - 1)))
base_align_bias = ((asan_frame_size + alignb - 1)
   & ~(alignb - HOST_WIDE_INT_1)) - asan_frame_size;
}

  /* Align base if target is STRICT_ALIGNMENT.  */
  if (STRICT_ALIGNMENT)
{
  const HOST_WIDE_INT align
= (GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT) << ASAN_SHADOW_SHIFT;
  base = expand_binop (Pmode, and_optab, base, gen_int_mode (-align,
Pmode),
   NULL_RTX, 1, OPTAB_DIRECT);
}

  if (use_after_return_class == -1 && pbase)
emit_move_insn (pbase, base);

  base = expand_binop (Pmode, add_optab, base,
   gen_int_mode (base_offset - base_align_bias, Pmode),
   NULL_RTX, 1, OPTAB_DIRECT); -- suspicious add

  orig_base = NULL_RTX;
  if (use_after_return_class != -1)
{
  ...
  ret = emit_library_call_value (ret, NULL_RTX, LCT_NORMAL, ptr_mode,
 GEN_INT (asan_frame_size
  + base_align_bias),
 TYPE_MODE (pointer_sized_int_node));
  /* __asan_stack_malloc_[n] returns a pointer to fake stack if succeeded
 and NULL otherwise.  Check RET value is NULL here and jump over the
 BASE reassignment in this case.  Otherwise, reassign BASE to RET.  */
  emit_cmp_and_jump_insns (ret, const0_rtx, EQ, NULL_RTX,
   VOIDmode, 0, lab,
   profile_probability:: very_unlikely ());
  ret = convert_memory_address (Pmode, ret);
  emit_move_insn (base, ret);
  emit_label (lab);
  emit_move_insn (pbase, expand_binop (Pmode, add_optab, base,
   gen_int_mode (base_align_bias
 - base_offset, Pmode),
   NULL_RTX, 1, OPTAB_DIRECT));


base_align_bias is calculated to make (asan_frame_size (128) +
base_align_bias (0)) a multiple of alignb (64), but it doesn't make `base_offset
(160) - base_align_bias (0)` a multiple of 64, so when __asan_stack_malloc_128
returns an address aligned to 64 and we then add (base_offset (160) -
base_align_bias (0)), the result is misaligned.
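
A small arithmetic sketch of that misalignment (the fake-stack address below
is hypothetical; only its 64-byte alignment matters):

#include <stdio.h>

int main (void)
{
  unsigned long long ret = 0x7fff0000c0ULL; /* hypothetical 64-byte aligned fake stack */
  long base_offset = 160, base_align_bias = 0;
  unsigned long long base = ret + base_offset - base_align_bias;
  /* base % 64 == 32 and (base - 128) % 64 == 32, so the __m512 variable at
     base - 128 is only 32-byte aligned.  */
  printf ("%llu %llu\n", base % 64, (base - 128) % 64);
  return 0;
}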

[Bug target/110027] [11/12/13/14 regression] Misaligned vector store on detect_stack_use_after_return

2024-03-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027

--- Comment #12 from Hongtao Liu  ---
(In reply to Sam James from comment #11)
> Calling it a 11..14 regression as we know 14 is bad and 7.5 is OK, but I
> can't test 11/12 on an avx512 machine right now.

I can't reproduce it with GCC 11/12, but I can with GCC 13 for the case in PR114276.

It looks like the codegen is already wrong in .expand; the offending part is
mentioned in #c0.

>Now, if `__asan_option_detect_stack_use_after_return` is 0, the variable at 
>>%rcx-128 is correctly aligned to 64. However, if it is 1, 
>__asan_stack_malloc_1 >returns something aligned to 64 << 1 (as per 
>https://github.com/gcc->mirror/gcc/blob/master/gcc/asan.cc#L1917) and adding 
>160 results in %rcx-128 >being only aligned to 32. And thus the segfault.


;; Function foo (_Z3foov, funcdef_no=14, decl_uid=3962, cgraph_uid=10,
symbol_order=9)

(note 1 0 37 NOTE_INSN_DELETED)
;; basic block 2, loop depth 0, maybe hot
;;  prev block 0, next block 3, flags: (NEW, REACHABLE, RTL, MODIFIED)
;;  pred:   ENTRY (FALLTHRU)
(note 37 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 37 3 2 (parallel [
(set (reg:DI 105)
(plus:DI (reg/f:DI 19 frame)
(const_int -160 [0xff60])))
(clobber (reg:CC 17 flags))
]) "test1.cc":7:12 247 {*adddi_1}
 (nil))
(insn 3 2 4 2 (set (reg:DI 106)
(reg:DI 105)) "test1.cc":7:12 82 {*movdi_internal}
 (nil))
(insn 4 3 5 2 (set (reg:CCZ 17 flags)
(compare:CCZ (mem/c:SI (symbol_ref:DI
("__asan_option_detect_stack_use_after_return") [flags 0x40]  ) [4
__asan_option_detect_stack_use_after_return+0 S4 A32])
(const_int 0 [0]))) "test1.cc":7:12 7 {*cmpsi_ccno_1}
 (nil))
(jump_insn 5 4 93 2 (set (pc)
(if_then_else (eq (reg:CCZ 17 flags)
(const_int 0 [0]))
(label_ref 11)
(pc))) "test1.cc":7:12 995 {*jcc}
 (nil)
 -> 11)
;;  succ:   5
;;  3 (FALLTHRU)

;; basic block 3, loop depth 0, maybe hot
;;  prev block 2, next block 4, flags: (NEW, REACHABLE, RTL, MODIFIED)
;;  pred:   2 (FALLTHRU)
(note 93 5 6 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
(insn 6 93 7 3 (set (reg:DI 5 di)
(const_int 128 [0x80])) "test1.cc":7:12 82 {*movdi_internal}
 (nil))
(call_insn 7 6 8 3 (set (reg:DI 0 ax)
(call (mem:QI (symbol_ref:DI ("__asan_stack_malloc_1") [flags 0x41] 
) [0  S1 A8])
(const_int 0 [0]))) "test1.cc":7:12 1013 {*call_value}
 (expr_list:REG_EH_REGION (const_int -2147483648 [0x8000])
(nil))
(expr_list (use (reg:DI 5 di))
(nil)))
(insn 8 7 9 3 (set (reg:CCZ 17 flags)
(compare:CCZ (reg:DI 0 ax)
(const_int 0 [0]))) "test1.cc":7:12 8 {*cmpdi_ccno_1}
 (nil))
(jump_insn 9 8 94 3 (set (pc)
(if_then_else (eq (reg:CCZ 17 flags)
(const_int 0 [0]))
(label_ref 11)
(pc))) "test1.cc":7:12 995 {*jcc}
 (nil)
 -> 11)
;;  succ:   5
;;  4 (FALLTHRU)
;; basic block 4, loop depth 0, maybe hot
;;  prev block 3, next block 5, flags: (NEW, REACHABLE, RTL, MODIFIED)
;;  pred:   3 (FALLTHRU)
(note 94 9 10 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 10 94 11 4 (set (reg:DI 105)
(reg:DI 0 ax)) "test1.cc":7:12 82 {*movdi_internal}
 (nil))
;;  succ:   5 (FALLTHRU)

[Bug target/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g04520645038

2024-03-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822

--- Comment #16 from Hongtao Liu  ---
(In reply to Uroš Bizjak from comment #11)
> (In reply to Richard Biener from comment #10)
> > The easiest fix would be to refuse applying STV to a insn that
> > can_throw_internal () (that's an insn that has associated EH info).  
> > Updating
> > in this case would require splitting the BB or at least moving the now
> > no longer throwing insn to the next block (along the fallthru edge).
> 
> This would be simply:
> 
> --cut here--
> diff --git a/gcc/config/i386/i386-features.cc
> b/gcc/config/i386/i386-features.cc
> index 1de2a07ed75..90acb33db49 100644
> --- a/gcc/config/i386/i386-features.cc
> +++ b/gcc/config/i386/i386-features.cc
> @@ -437,6 +437,10 @@ scalar_chain::add_insn (bitmap candidates, unsigned int
> insn_uid,
>&& !HARD_REGISTER_P (SET_DEST (def_set)))
>  bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
>  
> +  if (cfun->can_throw_non_call_exceptions
> +  && can_throw_internal (insn))
> +return false;
> +
>/* ???  The following is quadratic since analyze_register_chain
>   iterates over all refs to look for dual-mode regs.  Instead this
>   should be done separately for all regs mentioned in the chain once.  */
> --cut here--
> 
> But I think, we could do better. Adding CC.

It looks like a similar issue to the one we solved in PR89650 with
r9-6543-g12fb7712a8a20f, where we manually split the block after the insn.

[Bug d/114171] [13/14 Regression] gdc -O2 -mavx generates misaligned vmovdqa instruction

2024-02-29 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114171

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org
   Last reconfirmed||2024-3-1

--- Comment #2 from Hongtao Liu  ---
On the RTL level, we get:

(insn 7 6 8 2 (set (reg:CCZ 17 flags)
(compare:CCZ (mem:TI (plus:DI (reg/f:DI 100 [ _5 ])
(const_int 24 [0x18])) [0 MEM[(ucent *)_5 + 24B]+0 S16
A128])
(const_int 0 [0]))) "test.d":15:16 30 {*cmpti_doubleword}
 (nil))

It's 16-byte aligned.

[Bug tree-optimization/114164] simdclone vectorization creates unsupported IL

2024-02-29 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114164

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #3 from Hongtao Liu  ---
(In reply to Jakub Jelinek from comment #2)
> (In reply to Richard Biener from comment #1)
> > I'm not sure who's responsible to reject this, whether the vectorizer can
> > expect there's a way to create the mask arguments when the simdclone is
> > marked usable by the target or whether it has to verify that itself.
> > 
> > This becomes an ICE if we move vector lowering before vectorization.
> 
> Wasn't this valid when VEC_COND_EXPR allowed the comparison directly in the
> operand?
> Or maybe I misremember.  Certainly I believe -mavx -mno-avx2 should be able
> to do
> 256-bit conditional moves of float/double elements.

Here, the mask is v4si, which is 128-bit, and the vector is v4df, which is 256-bit.
Without AVX512, the x86 backend only supports vcond/vcond_mask with the same size
(vcond{,_mask}v4sfv4si or vcond{,_mask}v4dfv4di), but not
vcond{,_mask}v4dfv4si.

BTW, we may get v4di mask from v4si mask by

vshufps xmm1, xmm0, xmm0, 80# xmm1 = xmm0[0,0,1,1]
vshufps xmm0, xmm0, xmm0, 250   # xmm0 = xmm0[2,2,3,3]
vinsertf128 ymm0, ymm1, xmm0, 1


under AVX; under AVX2 we can just use pmovsxdq.
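
A minimal sketch of that AVX2 path (hypothetical helper name; assumes the v4si
mask holds 0/-1 values, so sign extension preserves it; compile with -mavx2):

#include <immintrin.h>

/* Widen a 128-bit v4si mask to a 256-bit v4di mask; vpmovsxdq sign-extends
   each 32-bit lane, turning 0/-1 into 0/-1.  */
__m256i
widen_mask_v4si_to_v4di (__m128i mask32)
{
  return _mm256_cvtepi32_epi64 (mask32);
}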

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #16 from Hongtao Liu  ---

> I'm all for removing the 1/3 for innermost loop handling (in cunroll
> the unrolled loop is then innermost).  I'm more concerned about
> unrolling more than one level which is exactly what's required for
> 454.calculix.

Removing 1/3 for the innermost loop would be sufficient to solve both the issue
in the PR and x264_pixel_var_8x8 from 525.x264_r. I'll try to benchmark that.

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-27 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #14 from Hongtao Liu  ---
(In reply to rguent...@suse.de from comment #13)
> On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> > 
> > --- Comment #11 from Hongtao Liu  ---
> > 
> > >Loop body is likely going to simplify further, this is difficult
> > >to guess, we just decrease the result by 1/3.  */
> > > 
> > 
> > This is introduced by r0-68074-g91a01f21abfe19
> > 
> > /* Estimate number of insns of completely unrolled loop.  We assume
> > +   that the size of the unrolled loop is decreased in the
> > +   following way (the numbers of insns are based on what
> > +   estimate_num_insns returns for appropriate statements):
> > +
> > +   1) exit condition gets removed (2 insns)
> > +   2) increment of the control variable gets removed (2 insns)
> > +   3) All remaining statements are likely to get simplified
> > +  due to constant propagation.  Hard to estimate; just
> > +  as a heuristics we decrease the rest by 1/3.
> > +
> > +   NINSNS is the number of insns in the loop before unrolling.
> > +   NUNROLL is the number of times the loop is unrolled.  */
> > +
> > +static unsigned HOST_WIDE_INT
> > +estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > +unsigned HOST_WIDE_INT nunroll)
> > +{
> > +  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > +  if (unr_insns <= 0)
> > +unr_insns = 1;
> > +  unr_insns *= (nunroll + 1);
> > +
> > +  return unr_insns;
> > +}
> > 
> > And r0-93444-g08f1af2ed022e0 try do it more accurately by marking
> > likely_eliminated stmt and minus that from total insns, But 2 / 3 is still
> > keeped.
> > 
> > +/* Estimate number of insns of completely unrolled loop.
> > +   It is (NUNROLL + 1) * size of loop body with taking into account
> > +   the fact that in last copy everything after exit conditional
> > +   is dead and that some instructions will be eliminated after
> > +   peeling.
> > 
> > -   NINSNS is the number of insns in the loop before unrolling.
> > -   NUNROLL is the number of times the loop is unrolled.  */
> > +   Loop body is likely going to simplify futher, this is difficult
> > +   to guess, we just decrease the result by 1/3.  */
> > 
> >  static unsigned HOST_WIDE_INT
> > -estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > +estimated_unrolled_size (struct loop_size *size,
> >  unsigned HOST_WIDE_INT nunroll)
> >  {
> > -  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > +  HOST_WIDE_INT unr_insns = ((nunroll)
> > +* (HOST_WIDE_INT) (size->overall
> > +   -
> > size->eliminated_by_peeling));
> > +  if (!nunroll)
> > +unr_insns = 0;
> > +  unr_insns += size->last_iteration -
> > size->last_iteration_eliminated_by_peeling;
> > +
> > +  unr_insns = unr_insns * 2 / 3;
> >if (unr_insns <= 0)
> >  unr_insns = 1;
> > -  unr_insns *= (nunroll + 1);
> > 
> > It looks to me 1 / 3 overestimates the instructions that can be optimised 
> > away,
> > especially if we've subtracted eliminated_by_peeling
> 
> Yes, that 1/3 reduction is a bit odd - you could have the same effect
> by increasing the instruction limit by 1/3, but that means it doesn't
> really matter, does it?  It would be interesting to see if increasing
> the limit by 1/3 and removing the above is neutral on SPEC?

Removing the 1/3 reduction gets ~2% improvement for 525.x264_r on SPR with
-march=native -O3, with no big impact on the other integer benchmarks.

The regression comes from the function below: cunrolli unrolls the inner loop,
cunroll unrolls the outer loop, and that causes lots of spills.

typedef unsigned long long uint64_t;
typedef unsigned char uint8_t;
typedef unsigned int uint32_t;
uint64_t x264_pixel_var_8x8(uint8_t *pix, int i_stride )
{
uint32_t sum = 0, sqr = 0;
for( int y = 0; y < 8; y++ )
{
for( int x = 0; x < 8; x++ ) 
{
sum += pix[x]; 
sqr += pix[x] * pix[x]; 
}  
pix += i_stride;   
}   
return sum + ((uint64_t)sqr << 32);
}

[Bug target/114125] Support vcond_mask_qiqi and friends.

2024-02-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114125

Hongtao Liu  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-02-27
 Target||x86_64-*-* i?86-*-*

[Bug target/114125] New: Support vcond_mask_qiqi and friends.

2024-02-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114125

Bug ID: 114125
   Summary: Support vcond_mask_qiqi and friends.
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

Quote from https://gcc.gnu.org/pipermail/gcc-patches/2024-February/646587.html

> On Linux/x86_64,
>
> af66ad89e8169f44db723813662917cf4cbb78fc is the first bad commit
> commit af66ad89e8169f44db723813662917cf4cbb78fc
> Author: Richard Biener 
> Date:   Fri Feb 23 16:06:05 2024 +0100
>
> middle-end/114070 - folding breaking VEC_COND expansion
>
> caused
>
> FAIL: gcc.dg/tree-ssa/andnot-2.c scan-tree-dump-not forwprop3 "_expr"

This shows that the x86 backend is missing vcond_mask_qiqi and friends
(for AVX512 mask modes).  Either that or both expand_vec_cond_expr_p
and all the machinery behind it (ISEL pass, lowering) should handle
pure integer mode VEC_COND_EXPR via bit operations.  I think quite some
targets now implement patterns for these variants, whatever their
boolean vector modes are.

One complication with the change, which was

  (simplify
   (op @3 (vec_cond:s @0 @1 @2))
-  (vec_cond @0 (op! @3 @1) (op! @3 @2
+  (if (TREE_CODE_CLASS (op) != tcc_comparison
+   || types_match (type, TREE_TYPE (@1))
+   || expand_vec_cond_expr_p (type, TREE_TYPE (@0), ERROR_MARK))
+   (vec_cond @0 (op! @3 @1) (op! @3 @2)

is that expand_vec_cond_expr_p can also handle comparison defined
masks, but whether or not we have this isn't visible here so we
can only check whether vcond_mask expansion would work.

We have optimize_vectors_before_lowering_p but we shouldn't even there
turn supported into not supported ops and as said, what's supported or
not cannot be finally decided (if it's only vcond and not vcond_mask
that is supported).  Also optimize_vectors_before_lowering_p is set
for a short time between vectorization and vector lowering and we
definitely do not want to turn supported vectorizer emitted stmts
into ones that we need to lower.  For GCC 15 we should see to move
vector lowering before vectorization (before loop optimization I'd
say) to close this particula hole (and also reliably ICE when the
vectorizer creates unsupported IL).  We also definitely want to
retire vcond expanders (no target I know of supports single-instruction
compare-and-select).

So short term we either live with this regression (the testcase
verifies we perform constant folding to { 0, 0 }), implement
the four missing patterns (qi, hi, si and di missing value mode
vcond_mask patterns) or see to implement generic code for this.

Given precedent I'd tend towards adding the x86 patterns.

Hongtao, can you handle that?

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #11 from Hongtao Liu  ---

>Loop body is likely going to simplify further, this is difficult
>to guess, we just decrease the result by 1/3.  */
> 

This is introduced by r0-68074-g91a01f21abfe19

/* Estimate number of insns of completely unrolled loop.  We assume
+   that the size of the unrolled loop is decreased in the
+   following way (the numbers of insns are based on what
+   estimate_num_insns returns for appropriate statements):
+
+   1) exit condition gets removed (2 insns)
+   2) increment of the control variable gets removed (2 insns)
+   3) All remaining statements are likely to get simplified
+  due to constant propagation.  Hard to estimate; just
+  as a heuristics we decrease the rest by 1/3.
+
+   NINSNS is the number of insns in the loop before unrolling.
+   NUNROLL is the number of times the loop is unrolled.  */
+
+static unsigned HOST_WIDE_INT
+estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
+unsigned HOST_WIDE_INT nunroll)
+{
+  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
+  if (unr_insns <= 0)
+unr_insns = 1;
+  unr_insns *= (nunroll + 1);
+
+  return unr_insns;
+}

And r0-93444-g08f1af2ed022e0 try do it more accurately by marking
likely_eliminated stmt and minus that from total insns, But 2 / 3 is still
keeped.

+/* Estimate number of insns of completely unrolled loop.
+   It is (NUNROLL + 1) * size of loop body with taking into account
+   the fact that in last copy everything after exit conditional
+   is dead and that some instructions will be eliminated after
+   peeling.

-   NINSNS is the number of insns in the loop before unrolling.
-   NUNROLL is the number of times the loop is unrolled.  */
+   Loop body is likely going to simplify futher, this is difficult
+   to guess, we just decrease the result by 1/3.  */

 static unsigned HOST_WIDE_INT
-estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
+estimated_unrolled_size (struct loop_size *size,
 unsigned HOST_WIDE_INT nunroll)
 {
-  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
+  HOST_WIDE_INT unr_insns = ((nunroll)
+* (HOST_WIDE_INT) (size->overall
+   -
size->eliminated_by_peeling));
+  if (!nunroll)
+unr_insns = 0;
+  unr_insns += size->last_iteration -
size->last_iteration_eliminated_by_peeling;
+
+  unr_insns = unr_insns * 2 / 3;
   if (unr_insns <= 0)
 unr_insns = 1;
-  unr_insns *= (nunroll + 1);

It looks to me like 1/3 overestimates the instructions that can be optimised away,
especially given that we've already subtracted eliminated_by_peeling.

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #10 from Hongtao Liu  ---
(In reply to Hongtao Liu from comment #9)
> The original case is a little different from the one in PR.
But the issue is similar, after cunrolli, GCC failed to vectorize the outer
loop.

The interesting thing is that in estimated_unrolled_size the original unr_insns
is 288, which is bigger than param_max_completely_peeled_insns (200), but
unr_insns is decreased by 1/3 due to

   Loop body is likely going to simplify further, this is difficult
   to guess, we just decrease the result by 1/3.  */

In practice, this loop body is not simplified by 1/3 of the instructions.

Considering the unroll factor is 16 and the resulting unr_insns is large (192),
I was wondering if we could add some heuristic to avoid completely unrolling
the loop, because for such a big loop both the loop and BB vectorizers usually
do not perform well.
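
Spelled out as a tiny self-contained check (numbers taken from the estimate
above):

#include <stdio.h>

int main (void)
{
  int size_after_peeling = 288;                 /* estimated unrolled size  */
  int unr_insns = size_after_peeling * 2 / 3;   /* the blanket 1/3 discount */
  /* 192 < param_max_completely_peeled_insns (200), so the loop is completely
     unrolled even though little of it actually simplifies.  */
  printf ("%d\n", unr_insns);
  return 0;
}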

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #9 from Hongtao Liu  ---
The original case is a little different from the one in the PR.
It comes from ggml:

#include <stdint.h>
#include <string.h>

typedef uint16_t ggml_fp16_t;
static float table_f32_f16[1 << 16];

inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
uint16_t s;
    memcpy(&s, &f, sizeof(uint16_t));
return table_f32_f16[s];
}

typedef struct {
ggml_fp16_t d;
ggml_fp16_t m;
uint8_t qh[4];
uint8_t qs[32 / 2];
} block_q5_1;

typedef struct {
float d;
float s;
int8_t qs[32];
} block_q8_1;

void ggml_vec_dot_q5_1_q8_1(const int n, float * restrict s, const void *
restrict vx, const void * restrict vy) {
const int qk = 32;
const int nb = n / qk;

const block_q5_1 * restrict x = vx;
const block_q8_1 * restrict y = vy;

float sumf = 0.0;

for (int i = 0; i < nb; i++) {
uint32_t qh;
        memcpy(&qh, x[i].qh, sizeof(qh));

int sumi = 0;

for (int j = 0; j < qk/2; ++j) {
const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10;
const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10;

const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0;
const int32_t x1 = (x[i].qs[j] >> 4) | xh_1;

sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
}

sumf += (ggml_lookup_fp16_to_fp32(x[i].d)*y[i].d)*sumi +
ggml_lookup_fp16_to_fp32(x[i].m)*y[i].s;
}

*s = sumf;
}

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #11 from Hongtao Liu  ---
(In reply to N Schaeffer from comment #9)
> In addition, optimizing for size with -Os leads to a non-vectorized
> double-loop (51 bytes) while the vectorized loop with vbroadcastsd (produced
> by clang -Os) leads to 40 bytes.
> It is thus also a missed optimization for -Os.

Vectorization is enabled at -O2 but not at -Os.

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #8 from Hongtao Liu  ---
(In reply to Hongtao Liu from comment #7)
> perm_cost is very low in x86 backend, and it maybe ok for 128-bit vectors,
> pshufb/shufps are avaible for most cases.
> But for 256/512-bit vectors, when the permuation is cross-lane, the cost
> could be higher. One solution is increase perm_cost when vector size is more
> than 128 since vperm is most likely used instead of
> vblend/vpblend/vpshuf/vshuf.

Furthermore, if we can get the indices in the backend when calculating the
vec_perm cost, we can check whether the permutation is cross-lane or not and
set the cost more accurately for 256/512-bit vector permutations.

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #7 from Hongtao Liu  ---
perm_cost is very low in the x86 backend, and that may be OK for 128-bit vectors,
since pshufb/shufps are available for most cases.
But for 256/512-bit vectors, when the permutation is cross-lane, the cost could
be higher. One solution is to increase perm_cost when the vector size is more
than 128 bits, since vperm is most likely used instead of
vblend/vpblend/vpshuf/vshuf.

[Bug tree-optimization/109885] gcc does not generate movmskps and testps instructions (clang does)

2024-02-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109885

--- Comment #4 from Hongtao Liu  ---
int sum() {
   int ret = 0;
   for (int i=0; i<8; ++i) ret +=(0==v[i]);
   return ret;
}

int sum2() {
   int ret = 0;
   auto m = v==0;
   for (int i=0; i<8; ++i) ret += m[i];
   return ret;
}

For sum, GCC tries to reduce a {0/1, 0/1, ...} vector; for sum2, it tries
to reduce a {0/-1, 0/-1, ...} vector. But LLVM tries to reduce a {0/1, 0/1, ...}
vector for both sum and sum2. Not sure which is correct.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #57 from Hongtao Liu  ---
> For dg-do run testcases I really think we should avoid those -march=
> options, because it means a lot of other stuff, BMI, LZCNT, ...

Makes sense.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #45 from Hongtao Liu  ---

> > There's do_store_flag to fixup for uses not in branches and
> > do_compare_and_jump for conditional jumps.
> 
> reasonable enough for me.
I mean we only handle it at consumers where the upper bits matter.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #44 from Hongtao Liu  ---
> 
> Note the AND is removed by combine if I add it:
> 
> Successfully matched this instruction:
> (set (reg:CCZ 17 flags)
> (compare:CCZ (and:HI (not:HI (subreg:HI (reg:QI 102 [ tem_3 ]) 0))
> (const_int 15 [0xf]))
> (const_int 0 [0])))
> 
> (*testhi_not)
> 
> -9: {r103:QI=r102:QI&0xf;clobber flags:CC;}
> +  REG_DEAD r99:QI
> +9: NOTE_INSN_DELETED
> +   12: flags:CCZ=cmp(~r102:QI#0&0xf,0)
>REG_DEAD r102:QI
> -  REG_UNUSED flags:CC
> -   12: flags:CCZ=cmp(r103:QI,0xf)
> -  REG_DEAD r103:QI
> 
> and we get
> 
> foo:
> .LFB0:
> .cfi_startproc
> notl%esi
> orl %esi, %edi
> notl%edi
> testb   $15, %dil
> je  .L6
> ret
> 
> which I'm not sure is OK?
> 

Yes, I think it's on purpose:

11508;; Split and;cmp (as optimized by combine) into not;test
11509;; Except when TARGET_BMI provides andn (*andn__ccno).
11510(define_insn_and_split "*test_not"
11511  [(set (reg:CCZ FLAGS_REG)
11512(compare:CCZ
11513  (and:SWI
11514(not:SWI (match_operand:SWI 0 "register_operand"))
11515(match_operand:SWI 1 ""))
11516  (const_int 0)))]
11517  "ix86_pre_reload_split ()
11518   && (!TARGET_BMI || !REG_P (operands[1]))"
11519  "#"
11520  "&& 1"
11521  [(set (match_dup 2) (not:SWI (match_dup 0)))
11522   (set (reg:CCZ FLAGS_REG)
11523(compare:CCZ (and:SWI (match_dup 2) (match_dup 1))
11524 (const_int 0)))]
11525  "operands[2] = gen_reg_rtx (mode);")
11526
11527;; Split and;cmp (as optimized by combine) into andn;cmp $0
11528(define_insn_and_split "*test_not_doubleword"
