from:"Hongtao Liu"

Re: Question about generating vpmovzxbd instruction without using the interfaces in immintrin.h

2024-05-30 Thread Hongtao Liu via Gcc

On Fri, May 31, 2024 at 10:58 AM Hanke Zhang via Gcc  wrote:
>
> Hi,
> I've recently been trying to hand-write code to trigger automatic
> vectorization optimizations in GCC on Intel x86 machines (without
> using the interfaces in immintrin.h), but I'm running into a problem
> where I can't seem to get the concise `vpmovzxbd` or similar
> instructions.
>
> My requirement is to convert 8 `uint8_t` elements to `int32_t` type
> and print the output. If I use the interface (_mm256_cvtepu8_epi32) in
> immintrin.h, the code is as follows:
>
> int immintrin () {
> int size = 1, offset = 3;
> uint8_t* a = malloc(sizeof(char) * size);
>
> __v8si b = (__v8si)_mm256_cvtepu8_epi32(*(__m128i *)(a + offset));
>
> for (int i = 0; i < 8; i++) {
> printf("%d\n", b[i]);
> }
> }
>
> After compiling with -mavx2 -O3, you can get concise and efficient
> instructions. (You can see it here: https://godbolt.org/z/8ojzdav47)
>
> But if I do not use this interface and instead use a for-loop or the
> `__builtin_convertvector` interface provided by GCC, I cannot achieve
> the above effect. The code is as follows:
>
> typedef uint8_t v8qiu __attribute__ ((__vector_size__ (8)));
> int forloop () {
> int size = 1, offset = 3;
> uint8_t* a = malloc(sizeof(char) * size);
>
> v8qiu av = *(v8qiu *)(a + offset);
> __v8si b = {};
> for (int i = 0; i < 8; i++) {
> b[i] = (a + offset)[i];
> }
>
> for (int i = 0; i < 8; i++) {
> printf("%d\n", b[i]);
> }
> }
>
> int builtin_cvt () {
> int size = 1, offset = 3;
> uint8_t* a = malloc(sizeof(char) * size);
>
> v8qiu av = *(v8qiu *)(a + offset);
> __v8si b = __builtin_convertvector(av, __v8si);
>
> for (int i = 0; i < 8; i++) {
> printf("%d\n", b[i]);
> }
> }
>
> The instructions generated by both functions are redundant and
> complex, and are quite difficult to read compared to calling
> `_mm256_cvtepu8_epi32` directly. (You can see it here as well:
> https://godbolt.org/z/8ojzdav47)
>
> What I want to ask is: How should I write the source code to get
> assembly instructions similar to directly calling
> _mm256_cvtepu8_epi32?
>
> Or would it be easier if I modified the GIMPLE directly? But it seems
> that there is no relevant expression or interface directly
> corresponding to `vpmovzxbd` in GIMPLE.
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652484.html
We're working on a patch to optimize __builtin_convertvector; once it
lands, the generated code should be as good as with the Intel intrinsic.
>
> Thanks
> Hanke Zhang
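
For reference, here is a minimal sketch of the __builtin_convertvector
form of the conversion discussed above. The type names u8x8/s32x8 and the
helper widen_u8_to_s32 are made up for illustration. The load goes
through memcpy so that it has no alignment or aliasing requirements, and
the caller supplies the buffer (the quoted test allocates only 1 byte and
then reads 8 or 16 bytes at offset 3, which is out of bounds). Whether
this compiles to a single vpmovzxbd with -mavx2 -O3 depends on the GCC
version; that is not something the sketch can guarantee.

```c
#include <stdint.h>
#include <string.h>

/* Generic GCC vector types: 8 x uint8_t (8 bytes), 8 x int32_t (32 bytes).  */
typedef uint8_t u8x8 __attribute__ ((vector_size (8)));
typedef int32_t s32x8 __attribute__ ((vector_size (32)));

/* Zero-extend the 8 bytes at SRC into the 8 int32_t values at DST.  */
void
widen_u8_to_s32 (const uint8_t *src, int32_t dst[8])
{
  u8x8 in;
  s32x8 out;

  memcpy (&in, src, sizeof in);               /* unaligned-safe load */
  out = __builtin_convertvector (in, s32x8);  /* element-wise uint8 -> int32 */
  memcpy (dst, &out, sizeof out);             /* unaligned-safe store */
}
```

Calling widen_u8_to_s32 (a + offset, result) with a buffer of at least
offset + 8 bytes gives the same values as the for-loop version.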



-- 
BR,
Hongtao

Re: /home/toon/compilers/gcc/libgfortran/generated/matmul_i1.c:1781:1: internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have 'w' (rtx const_int) in vpternlog_redundant_operand_mask,

2023-08-06 Thread Hongtao Liu via Gcc

On Mon, Aug 7, 2023 at 9:38 AM Hongtao Liu  wrote:
>
> On Mon, Aug 7, 2023 at 9:35 AM Hongtao Liu  wrote:
> >
> > On Mon, Aug 7, 2023 at 2:08 AM Toon Moene  wrote:
> > >
> > > Wonder if I am the only one to see this:
> > >
> > > https://gcc.gnu.org/pipermail/gcc-testresults/2023-August/792616.html
> Could you share your GCC configure, I guess
> --enable-checking=yes,rtl,extra is key to reproduce the issue?
I can reproduce it with --with-cpu=native --with-arch=native
--enable-checking=yes,rtl,extra on an AVX512 machine.
So on a non-AVX512 machine, --with-cpu=cascadelake
--with-arch=cascadelake --enable-checking=yes,rtl,extra should be
enough.
> > >
> > > To quote:
> > >
> > > during RTL pass: split1
> > > /home/toon/compilers/gcc/libgfortran/generated/matmul_i1.c: In function
> > > 'matmul_i1_avx512f':
> > > /home/toon/compilers/gcc/libgfortran/generated/matmul_i1.c:1781:1:
> > > internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have
> > > 'w' (rtx const_int) in vpternlog_redundant_operand_mask, at
> > > config/i386/i386.cc:19460
> > >   1781 | }
> > >| ^
> > > during RTL pass: split1
> > > /home/toon/compilers/gcc/libgfortran/generated/matmul_i2.c: In function
> > > 'matmul_i2_avx512f':
> > > /home/toon/compilers/gcc/libgfortran/generated/matmul_i2.c:1781:1:
> > > internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have
> > > 'w' (rtx const_int) in vpternlog_redundant_operand_mask, at
> > > config/i386/i386.cc:19460
> > >   1781 | }
> > >| ^
> > > 0x7a5cc7 rtl_check_failed_type2(rtx_def const*, int, int, int, char
> > > const*, int, char const*)
> > > /home/toon/compilers/gcc/gcc/rtl.cc:761
> > > 0x82bf8d vpternlog_redundant_operand_mask(rtx_def**)
> > > /home/toon/compilers/gcc/gcc/config/i386/i386.cc:19460
> > > 0x1f1295b split_44
> > > /home/toon/compilers/gcc/gcc/config/i386/sse.md:12730
> > > 0x1f1295b split_63
> > > /home/toon/compilers/gcc/gcc/config/i386/sse.md:28428
> > > 0xe7663b try_split(rtx_def*, rtx_insn*, int)
> > > /home/toon/compilers/gcc/gcc/emit-rtl.cc:3800
> > > 0xe76cff try_split(rtx_def*, rtx_insn*, int)
> > > /home/toon/compilers/gcc/gcc/emit-rtl.cc:3972
> > > 0x11b2938 split_insn
> > > /home/toon/compilers/gcc/gcc/recog.cc:3385
> > > 0x11b2eff split_all_insns()
> > > /home/toon/compilers/gcc/gcc/recog.cc:3489
> > > 0x11dd9c8 execute
> > > /home/toon/compilers/gcc/gcc/recog.cc:4413
> > > Please submit a full bug report, with preprocessed source (by using
> > > -freport-bug).
> > > Please include the complete backtrace with any bug report.
> > > See <https://gcc.gnu.org/bugs/> for instructions.
> > > make[3]: *** [Makefile:4584: matmul_i1.lo] Error 1
> > > make[3]: *** Waiting for unfinished jobs
> > >
> > > --
> > > Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
> > > Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
> >
> > Looks like related to
> >
> > https://gcc.gnu.org/g:567d06bb357a39ece865cef67ada44124f227e45
> >
> > commit r14-2999-g567d06bb357a39ece865cef67ada44124f227e45
> > Author: Yan Simonaytes 
> > Date:   Tue Jul 25 20:43:19 2023 +0300
> >
> > i386: eliminate redundant operands of VPTERNLOG
> >
> > As mentioned in PR 110202, GCC may be presented with input where control
> > word of the VPTERNLOG intrinsic implies that some of its operands do not
> > affect the result.  In that case, we can eliminate redundant operands
> > of the instruction by substituting any other operand in their place.
> > This removes false dependencies.
> >
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

Re: /home/toon/compilers/gcc/libgfortran/generated/matmul_i1.c:1781:1: internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have 'w' (rtx const_int) in vpternlog_redundant_operand_mask,

2023-08-06 Thread Hongtao Liu via Gcc

On Mon, Aug 7, 2023 at 9:35 AM Hongtao Liu  wrote:
>
> On Mon, Aug 7, 2023 at 2:08 AM Toon Moene  wrote:
> >
> > Wonder if I am the only one to see this:
> >
> > https://gcc.gnu.org/pipermail/gcc-testresults/2023-August/792616.html
Could you share your GCC configure options? I guess
--enable-checking=yes,rtl,extra is the key to reproducing the issue.
> >
> > To quote:
> >
> > during RTL pass: split1
> > /home/toon/compilers/gcc/libgfortran/generated/matmul_i1.c: In function
> > 'matmul_i1_avx512f':
> > /home/toon/compilers/gcc/libgfortran/generated/matmul_i1.c:1781:1:
> > internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have
> > 'w' (rtx const_int) in vpternlog_redundant_operand_mask, at
> > config/i386/i386.cc:19460
> >   1781 | }
> >| ^
> > during RTL pass: split1
> > /home/toon/compilers/gcc/libgfortran/generated/matmul_i2.c: In function
> > 'matmul_i2_avx512f':
> > /home/toon/compilers/gcc/libgfortran/generated/matmul_i2.c:1781:1:
> > internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have
> > 'w' (rtx const_int) in vpternlog_redundant_operand_mask, at
> > config/i386/i386.cc:19460
> >   1781 | }
> >| ^
> > 0x7a5cc7 rtl_check_failed_type2(rtx_def const*, int, int, int, char
> > const*, int, char const*)
> > /home/toon/compilers/gcc/gcc/rtl.cc:761
> > 0x82bf8d vpternlog_redundant_operand_mask(rtx_def**)
> > /home/toon/compilers/gcc/gcc/config/i386/i386.cc:19460
> > 0x1f1295b split_44
> > /home/toon/compilers/gcc/gcc/config/i386/sse.md:12730
> > 0x1f1295b split_63
> > /home/toon/compilers/gcc/gcc/config/i386/sse.md:28428
> > 0xe7663b try_split(rtx_def*, rtx_insn*, int)
> > /home/toon/compilers/gcc/gcc/emit-rtl.cc:3800
> > 0xe76cff try_split(rtx_def*, rtx_insn*, int)
> > /home/toon/compilers/gcc/gcc/emit-rtl.cc:3972
> > 0x11b2938 split_insn
> > /home/toon/compilers/gcc/gcc/recog.cc:3385
> > 0x11b2eff split_all_insns()
> > /home/toon/compilers/gcc/gcc/recog.cc:3489
> > 0x11dd9c8 execute
> > /home/toon/compilers/gcc/gcc/recog.cc:4413
> > Please submit a full bug report, with preprocessed source (by using
> > -freport-bug).
> > Please include the complete backtrace with any bug report.
> > See <https://gcc.gnu.org/bugs/> for instructions.
> > make[3]: *** [Makefile:4584: matmul_i1.lo] Error 1
> > make[3]: *** Waiting for unfinished jobs
> >
> > --
> > Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
> > Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
>
> Looks like related to
>
> https://gcc.gnu.org/g:567d06bb357a39ece865cef67ada44124f227e45
>
> commit r14-2999-g567d06bb357a39ece865cef67ada44124f227e45
> Author: Yan Simonaytes 
> Date:   Tue Jul 25 20:43:19 2023 +0300
>
> i386: eliminate redundant operands of VPTERNLOG
>
> As mentioned in PR 110202, GCC may be presented with input where control
> word of the VPTERNLOG intrinsic implies that some of its operands do not
> affect the result.  In that case, we can eliminate redundant operands
> of the instruction by substituting any other operand in their place.
> This removes false dependencies.
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

Re: /home/toon/compilers/gcc/libgfortran/generated/matmul_i1.c:1781:1: internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have 'w' (rtx const_int) in vpternlog_redundant_operand_mask,

2023-08-06 Thread Hongtao Liu via Gcc

On Mon, Aug 7, 2023 at 2:08 AM Toon Moene  wrote:
>
> Wonder if I am the only one to see this:
>
> https://gcc.gnu.org/pipermail/gcc-testresults/2023-August/792616.html
>
> To quote:
>
> during RTL pass: split1
> /home/toon/compilers/gcc/libgfortran/generated/matmul_i1.c: In function
> 'matmul_i1_avx512f':
> /home/toon/compilers/gcc/libgfortran/generated/matmul_i1.c:1781:1:
> internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have
> 'w' (rtx const_int) in vpternlog_redundant_operand_mask, at
> config/i386/i386.cc:19460
>   1781 | }
>| ^
> during RTL pass: split1
> /home/toon/compilers/gcc/libgfortran/generated/matmul_i2.c: In function
> 'matmul_i2_avx512f':
> /home/toon/compilers/gcc/libgfortran/generated/matmul_i2.c:1781:1:
> internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have
> 'w' (rtx const_int) in vpternlog_redundant_operand_mask, at
> config/i386/i386.cc:19460
>   1781 | }
>| ^
> 0x7a5cc7 rtl_check_failed_type2(rtx_def const*, int, int, int, char
> const*, int, char const*)
> /home/toon/compilers/gcc/gcc/rtl.cc:761
> 0x82bf8d vpternlog_redundant_operand_mask(rtx_def**)
> /home/toon/compilers/gcc/gcc/config/i386/i386.cc:19460
> 0x1f1295b split_44
> /home/toon/compilers/gcc/gcc/config/i386/sse.md:12730
> 0x1f1295b split_63
> /home/toon/compilers/gcc/gcc/config/i386/sse.md:28428
> 0xe7663b try_split(rtx_def*, rtx_insn*, int)
> /home/toon/compilers/gcc/gcc/emit-rtl.cc:3800
> 0xe76cff try_split(rtx_def*, rtx_insn*, int)
> /home/toon/compilers/gcc/gcc/emit-rtl.cc:3972
> 0x11b2938 split_insn
> /home/toon/compilers/gcc/gcc/recog.cc:3385
> 0x11b2eff split_all_insns()
> /home/toon/compilers/gcc/gcc/recog.cc:3489
> 0x11dd9c8 execute
> /home/toon/compilers/gcc/gcc/recog.cc:4413
> Please submit a full bug report, with preprocessed source (by using
> -freport-bug).
> Please include the complete backtrace with any bug report.
> See <https://gcc.gnu.org/bugs/> for instructions.
> make[3]: *** [Makefile:4584: matmul_i1.lo] Error 1
> make[3]: *** Waiting for unfinished jobs
>
> --
> Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
> Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands

Looks like related to

https://gcc.gnu.org/g:567d06bb357a39ece865cef67ada44124f227e45

commit r14-2999-g567d06bb357a39ece865cef67ada44124f227e45
Author: Yan Simonaytes 
Date:   Tue Jul 25 20:43:19 2023 +0300

i386: eliminate redundant operands of VPTERNLOG

As mentioned in PR 110202, GCC may be presented with input where control
word of the VPTERNLOG intrinsic implies that some of its operands do not
affect the result.  In that case, we can eliminate redundant operands
of the instruction by substituting any other operand in their place.
This removes false dependencies.
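
To illustrate what "the control word implies that some operands do not
affect the result" means, here is a small sketch. It is NOT GCC's
vpternlog_redundant_operand_mask; the function name and the mask layout
(bit 0 for operand A, bit 1 for B, bit 2 for C) are assumptions made for
this example. The 8-bit immediate is treated as a truth table whose bit
index is (a << 2) | (b << 1) | c, so an operand is redundant when
flipping its bit in the index never changes the table entry.

```c
#include <stdint.h>

/* Hypothetical helper, not the GCC implementation.
   Returns a mask of operands the truth table IMM ignores:
   bit 0 = A is redundant, bit 1 = B is redundant, bit 2 = C is redundant.  */
unsigned
ternlog_redundant_mask (uint8_t imm)
{
  unsigned mask = 0;

  /* A selects between the high and the low nibble of the table.  */
  if ((imm >> 4) == (imm & 0x0F))
    mask |= 1;
  /* B selects between the bit pairs {2,3},{6,7} and {0,1},{4,5}.  */
  if (((imm >> 2) & 0x33) == (imm & 0x33))
    mask |= 2;
  /* C selects between the odd and the even table bits.  */
  if (((imm >> 1) & 0x55) == (imm & 0x55))
    mask |= 4;
  return mask;
}
```

For example, 0xF0 just returns A, so B and C are redundant (mask 6),
while 0x96 is a three-way XOR and uses every operand (mask 0).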


-- 
BR,
Hongtao

Re: x86: making better use of vpternlog{d,q}

2023-05-24 Thread Hongtao Liu via Gcc

On Wed, May 24, 2023 at 3:58 PM Jan Beulich via Gcc  wrote:
>
> Hello,
>
> for a couple of years I was meaning to extend the use of these AVX512F
> insns beyond the pretty minimalistic ones there are so far. Now that I've
> got around to at least draft something, I ran into a couple of issues I
> cannot explain. I'd like to start with understanding the unexpected
> effects of a change to an existing insn I have made (reproduced at the
> bottom). I certainly was prepared to observe testsuite failures, but it
> ends up failing tests I didn't expect it would fail, and - upon looking
> at sibling ones - also ends up leaving intact tests which I would expect
> would then need adjustment (because of using the new alternative).
>
> In particular (all mentioned tests are in gcc.target/i386/)
> - avx512f-andn-si-zmm-1.c (and its AVX512VL counterparts) fails because
>   for whatever reason generated code reverts back to using vpbroadcastd,
> - avx512f-andn-di-zmm-1.c, otoh, is unaffected (i.e. continues to use
>   vpandnq with embedded broadcast),
> - avx512f-andn-si-zmm-2.c doesn't use the new 4th insn alternative when
>   at the same time a made-up DI variant of the test (akin to what might
>   be an avx512f-andn-di-zmm-2.c testcase) does.
> IOW: How is SI mode element size different here from DI mode one? Is
> there anything wrong with the 4th alternative I'm adding, or is this
> hinting at some anomaly elsewhere?
__m512i is defined as __v8di; when it's used for _mm512_andnot_epi32,
it's explicitly converted to (__v16si), which creates an extra subreg
that is not needed in the DImode cases.
pass_combine then tries to match the pattern below but fails because of
the condition REG_P (operands[1]) || REG_P (operands[2]): REG_P is false
for a subreg, while register_operand accepts a subreg of a register. So
I think you want register_operand here instead of REG_P.
(set (reg:V16SI 91)
    (and:V16SI (not:V16SI (subreg:V16SI (reg:V8DI 98) 0))
        (vec_duplicate:V16SI (mem:SI (reg:DI 99) [1 *f_3(D)+0 S4 A32]
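
As a side note on the $0x44 immediate used in the new alternative of the
patch quoted below: the sketch here emulates the ternary-logic operation
bit by bit on 64-bit scalars. The function name ternlog64 is made up for
this example, and the index convention (a << 2) | (b << 1) | c is the
usual reading of the immediate, with A being the destination operand. In
Intel operand order "%0, %2, %1, 0x44" that makes B = %2 and C = %1, so
0x44 computes B & ~C, i.e. the and-not of operands 1 and 2, and ignores A.

```c
#include <stdint.h>

/* Bitwise emulation of a ternary-logic operation: for every bit
   position, the three input bits form an index into the 8-bit truth
   table IMM.  Hypothetical helper for illustration only.  */
uint64_t
ternlog64 (uint64_t a, uint64_t b, uint64_t c, uint8_t imm)
{
  uint64_t r = 0;

  for (int i = 0; i < 64; i++)
    {
      unsigned idx = (unsigned) ((((a >> i) & 1) << 2)
                                 | (((b >> i) & 1) << 1)
                                 | ((c >> i) & 1));
      r |= (uint64_t) ((imm >> idx) & 1) << i;
    }
  return r;
}
```

With this, ternlog64 (a, b, c, 0x44) equals b & ~c for any value of a.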


>
> Just to mention it, avx512f-andn-si-zmm-5.c similarly fails
> unexpectedly, but I guess for the same reason (and there aren't AVX512VL
> or DI mode element counterparts thereof).
>
> Jan
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -17019,11 +17019,11 @@
>"TARGET_AVX512F")
>
>  (define_insn "*andnot3"
> -  [(set (match_operand:VI 0 "register_operand" "=x,x,v")
> +  [(set (match_operand:VI 0 "register_operand" "=x,x,v,v")
> (and:VI
> - (not:VI (match_operand:VI 1 "vector_operand" "0,x,v"))
> - (match_operand:VI 2 "bcst_vector_operand" "xBm,xm,vmBr")))]
> -  "TARGET_SSE"
> + (not:VI (match_operand:VI 1 "bcst_vector_operand" "0,x,v,mBr"))
> + (match_operand:VI 2 "bcst_vector_operand" "xBm,xm,vmBr,v")))]
> +  "TARGET_SSE && (REG_P (operands[1]) || REG_P (operands[2]))"
>  {
>char buf[64];
>const char *ops;
> @@ -17090,6 +17090,11 @@
>  case 2:
>ops = "v%s%s\t{%%2, %%1, %%0|%%0, %%1, %%2}";
>break;
> +case 3:
> +  tmp = "pternlog";
> +  ssesuffix = "";
> +  ops = "v%s%s\t{$0x44, %%1, %%2, %%0|%%0, %%2, %%1, $0x44}";
> +  break;
>  default:
>gcc_unreachable ();
>  }
> @@ -17098,7 +17103,7 @@
>output_asm_insn (buf, operands);
>return "";
>  }
> -  [(set_attr "isa" "noavx,avx,avx")
> +  [(set_attr "isa" "noavx,avx,avx,avx512f")
> (set_attr "type" "sselog")
> (set (attr "prefix_data16")
>   (if_then_else
> @@ -17106,7 +17111,7 @@
> (eq_attr "mode" "TI"))
> (const_string "1")
> (const_string "*")))
> -   (set_attr "prefix" "orig,vex,evex")
> +   (set_attr "prefix" "orig,vex,evex,evex")
> (set (attr "mode")
> (cond [(match_test "TARGET_AVX2")
>  (const_string "")
> @@ -17119,7 +17124,11 @@
> (match_test "optimize_function_for_size_p (cfun)"))
>  (const_string "V4SF")
>   ]
> - (const_string "")))])
> + (const_string "")))
> +   (set (attr "enabled")
> +   (if_then_else (eq_attr "alternative" "3")
> + (symbol_ref " == 64 ? TARGET_AVX512F : 
> TARGET_AVX512VL")
> + (const_string "*")))])
>
>  ;; PR target/100711: Split notl; vpbroadcastd; vpand as vpbroadcastd; vpandn
>  (define_split



-- 
BR,
Hongtao

Re: [Intel SPR] Progress of GCC support for Intel SPR features

2022-02-06 Thread Hongtao Liu via Gcc

On Mon, Feb 7, 2022 at 11:16 AM LiYancheng via Gcc  wrote:
>
>
> On 2022/2/7 10:03, Andrew Pinski wrote:
> > On Sun, Feb 6, 2022 at 5:59 PM LiYancheng via Gcc  wrote:
> >> Hello everyone!
> >>
> >> I have some questions to ask:
> >>
> >> 1. How does GCC support Sapphire Rapids CPU now?
> >>
> >> 2. Does GCC 11 fully support all the features of SPR?
> >>   From the release note, it seems that 5g ISA (fp16)/hfni is
> >> not supported yet.
> > It will be included in GCC 12 which should be released in less than 4 
> > months.
> Thank you for your reply!
> >> 3. What is the simulation tool used by GCC to verify SPR characteristics?
> >> Is it open source?
> > Intel is doing the patching to GCC and binutils so I suspect they
> > verify using their internal tools and I highly doubt it is free
> > source.
> >
> >
> > Thanks,
> > Andrew Pinski
> >
> Any suggestions from Intel?
>
You can use Intel SDE (Software Development Emulator); see
https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html.

Please also use GCC 12 (main trunk, not released yet) and binutils
2.38 (main trunk, not released yet).


> Thanks!
>
> yancheng
>
> >> Thanks for all the help,
> >>
> >> yancheng
> >>



-- 
BR,
Hongtao

Re: _Float16-related failures on x86_64-apple-darwin

2021-12-23 Thread Hongtao Liu via Gcc

GCC defines __FLT_EVAL_METHOD__ via

  builtin_define_with_int_value ("__FLT_EVAL_METHOD__",
c_flt_eval_method (true));

and I guess we need to handle cases like this:

   /* GCC only supports one interchange type right now, _Float16.  If
  we're evaluating _Float16 in 16-bit precision, then flt_eval_method
  will be FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16.  */
+  if (x == FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16
+  && x == y)
+return FLT_EVAL_METHOD_PROMOTE_TO_FLOAT;
   if (x == FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16)
 return y;

I'm testing the patch, but it still needs approval from the relevant maintainers.
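
To make the float_t question from the quoted mail concrete, here is a
small sketch of the mapping a <math.h> would have to implement. The
function name float_t_name is made up for this example. The entries for
0, 1 and 2 follow the C standard; the entry for 16 is an assumption that
follows FX's reading in the quoted mail (float stays float when only
_Float16 is affected), not something confirmed in this thread.

```c
#include <stddef.h>

/* Hypothetical helper: name of the type float_t would have for a given
   FLT_EVAL_METHOD value, or NULL if the value is not handled.  */
const char *
float_t_name (int method)
{
  switch (method)
    {
    case 0:
      return "float";        /* evaluate in the semantic type */
    case 16:
      return "float";        /* assumption: same as 0 for float and double */
    case 1:
      return "double";       /* float and double evaluated as double */
    case 2:
      return "long double";  /* everything evaluated as long double */
    default:
      return NULL;           /* e.g. -1, indeterminable */
    }
}
```

A header that only tests for 0, 1 and 2 falls into the NULL case for 16,
which is the kind of situation the darwin math.h failure describes.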

On Fri, Dec 24, 2021 at 7:15 AM FX via Gcc  wrote:
>
> > I’m not sure what the fix should be, either. We could use fixinclude to 
> > make the darwin headers happy, but we don’t really have a macro to provide 
> > the right value. Like a __FLT_EVAL_METHOD_OLDSTYLE__ macro.
> >
> > What should be the float_t and double_t types for FLT_EVAL_METHOD == 16? 
> > float and double, if I understand right?
>
> This is one possibility, assuming I am right about the types:
>
> diff --git a/fixincludes/inclhack.def b/fixincludes/inclhack.def
> index 46e3b8c993a..bea85ef7367 100644
> --- a/fixincludes/inclhack.def
> +++ b/fixincludes/inclhack.def
> @@ -1767,6 +1767,18 @@ fix = {
>  test_text = ""; /* Don't provide this for wrap fixes.  */
>  };
>
> +/*  The darwin headers don't accept __FLT_EVAL_METHOD__ == 16.
> +*/
> +fix = {
> +hackname  = darwin_flt_eval_method;
> +mach  = "*-*-darwin*";
> +files = math.h;
> +select= "^#if __FLT_EVAL_METHOD__ == 0$";
> +c_fix = format;
> +c_fix_arg = "#if __FLT_EVAL_METHOD__ == 0 || __FLT_EVAL_METHOD__ == 16";
> +test_text = "#if __FLT_EVAL_METHOD__ == 0";
> +};
> +
>  /*
>   *  Fix  on Digital UNIX V4.0:
>   *  It contains a prototype for a DEC C internal asm() function,
>
>
> Sucks to have to fix headers… and we certainly can’t fix people’s code that 
> may depend on __FLT_EVAL_METHOD__ have well-defined values. So not convinced 
> this is the right approach.
>
> FX



-- 
BR,
Hongtao

Re: Enable the vectorizer at -O2 for GCC 12

2021-09-06 Thread Hongtao Liu via Gcc

On Wed, Sep 1, 2021 at 7:24 PM Tamar Christina via Gcc  wrote:
>
> -- edit, added list back in --
>
> Just to add some AArch64 numbers for Spec2017 we see 2.1% overall Geomean 
> improvements (all from x264 as expected) with no real regressions (everything 
> within variance) and only a 0.06% binary size increase overall (of which x264 
> grew 0.15%) using the very cheap cost model.
>
> So we'd be quite keen on this as well.
>
> Cheers,
> Tamar
>
> > -Original Message-
> > From: Gcc  On Behalf
> > Of Florian Weimer via Gcc
> > Sent: Monday, August 30, 2021 2:05 PM
> > To: gcc@gcc.gnu.org
> > Cc: ja...@redhat.com; Richard Earnshaw ;
> > Segher Boessenkool ; Richard Sandiford
> > ; premachandra.malla...@amd.com;
> > Hongtao Liu 
> > Subject: Enable the vectorizer at -O2 for GCC 12
> >
> > There has been a discussion, both off-list and on the gcc-help mailing list
> > (“Why vectorization didn't turn on by -O2”, spread across several months),
> > about enabling the auto-vectorizer at -O2, similar to what Clang does.
> >
> > I think the review concluded that the very cheap cost model should be used
> > for that.
> >
> > Are there any remaining blockers?
> >
> > Thanks,
> > Florian
>

A patch has been posted at [1] to enable auto-vectorization at -O2 with
the very-cheap cost model.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578877.html
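
For anyone who wants to try the combination on an existing compiler, a
loop like the one below is the kind of candidate the very-cheap model is
meant to accept: the trip count is a compile-time constant that is a
multiple of common vector lengths, and restrict removes the need for a
runtime alias check. The function name and N are made up for this
example, and whether a given GCC actually vectorizes it depends on the
version and target. It can be checked with something like
gcc -O2 -ftree-vectorize -fvect-cost-model=very-cheap -fopt-info-vec.

```c
#define N 1024

/* Element-wise sum of two arrays of N floats.  The restrict qualifiers
   tell the compiler the arrays do not overlap, so no runtime versioning
   for aliasing is needed.  */
void
add_arrays (float *restrict dst, const float *restrict a,
            const float *restrict b)
{
  for (int i = 0; i < N; i++)
    dst[i] = a[i] + b[i];
}
```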

-- 
BR,
Hongtao

Re: Enable the vectorizer at -O2 for GCC 12

2021-08-30 Thread Hongtao Liu via Gcc

On Tue, Aug 31, 2021 at 11:11 AM Kewen.Lin via Gcc  wrote:
>
> on 2021/8/30 10:11 PM, Bill Schmidt wrote:
> > On 8/30/21 8:04 AM, Florian Weimer wrote:
> >> There has been a discussion, both off-list and on the gcc-help mailing
> >> list (“Why vectorization didn't turn on by -O2”, spread across several
> >> months), about enabling the auto-vectorizer at -O2, similar to what
> >> Clang does.
> >>
> >> I think the review concluded that the very cheap cost model should be
> >> used for that.
> >>
> >> Are there any remaining blockers?
> >
> > Hi Florian,
> >
> > I don't think I'd characterize it as having blockers, but we are continuing 
> > to investigate small performance issues that arise with very-cheap, 
> > including some things that regressed in GCC 12.  Kewen Lin is leading that 
> > effort.  Kewen, do you feel we have any major remaining concerns with this 
> > plan?
> >
>
> Hi Florian & Bill,
>
> There are some small performance issues like PR101944 and PR102054, and
> still two degraded bmks (P9 520.omnetpp_r -2.41% and P8 526.blender_r
> -1.31%) to be investigated/clarified, but since their performance numbers
> with separated loop and slp vectorization options look neutral, they are
> very likely noises.  IMHO I don't think they are/will be blockers.
>
> So I think it's good to turn this on by default for Power.
The Intel side is also willing to enable -O2 vectorization after
measuring the performance impact on SPEC2017 and EEMBC.
Meanwhile we are investigating PR101908/PR101909/PR101910/PR92740,
which report that -O2 vectorization regresses additional benchmarks on
znver and kabylake.
>
> BR,
> Kewen



-- 
BR,
Hongtao

Re: Why vectorization didn't turn on by -O2

2021-08-05 Thread Hongtao Liu via Gcc

On Thu, Aug 5, 2021 at 5:20 AM Segher Boessenkool
 wrote:
>
> On Wed, Aug 04, 2021 at 11:22:53AM +0100, Richard Sandiford wrote:
> > Segher Boessenkool  writes:
> > > On Wed, Aug 04, 2021 at 10:10:36AM +0100, Richard Sandiford wrote:
> > >> Richard Biener  writes:
> > >> > Alternatively only enable loop vectorization at -O2 (the above checks
> > >> > flag_tree_slp_vectorize as well).  At least the cost model kind
> > >> > does not have any influence on BB vectorization, that is, we get the
> > >> > same pros and cons as we do for -O3.
> > >>
> > >> Yeah, but a lot of the loop vector cost model choice is about controlling
> > >> code size growth and avoiding excessive runtime versioning tests.
> > >
> > > Both of those depend a lot on the target, and target-specific conditions
> > > as well (which CPU model is selected for example).  Can we factor that
> > > in somehow?  Maybe we need some target hook that returns the expected
> > > percentage code growth for vectorising a given loop, for example, and
> > > -O2 vs. -O3 then selects what percentage is acceptable.
> > >
> > >> BB SLP
> > >> should be a win on both code size and performance (barring significant
> > >> target costing issues).
> > >
> > > Yeah -- but this could use a similar hook as well (just a straightline
> > > piece of code instead of a loop).
> >
> > I think anything like that should be driven by motivating use cases.
> > It's not something that we can easily decide in the abstract.
> >
> > The results so far with using very-cheap at -O2 have been promising,
> > so I don't think new hooks should block that becoming the default.
>
> Right, but it wouldn't hurt to think a sec if we are on the right path
> forward.  It's is crystal clear that to make good decisions about what
> and how to vectorise you need to take *some* target characteristics into
> account, and that will have to happen sooner rather than later.
>
> This was all in reply to
>
> > >> Yeah, but a lot of the loop vector cost model choice is about controlling
> > >> code size growth and avoiding excessive runtime versioning tests.
>
> It was not meant to hold up these patches :-)
>
> > >> PR100089 was an exception because we ended up keeping unvectorised
> > >> scalar code that would never have existed otherwise.  BB SLP proper
> > >> shouldn't have that problem.
> > >
> > > It also is a tiny piece of code.  There will always be tiny examples
> > > that are much worse (or much better) than average.
> >
> > Yeah, what makes PR100089 important isn't IMO the test itself, but the
> > underlying problem that the PR exposed.  Enabling this “BB SLP in loop
> > vectorisation” code can lead to the generation of scalar COND_EXPRs even
> > though we know that ifcvt doesn't have a proper cost model for deciding
> > whether scalar COND_EXPRs are a win.
> >
> > Introducing scalar COND_EXPRs at -O3 is arguably an acceptable risk
> > (although still dubious), but I think it's something we need to avoid
> > for -O2, even if that means losing the optimisation.
>
> Yeah -- -O2 should almost always do the right thing, while -O3 can do
> bad things more often, it just has to be better "on average".
>
>
> Segher

Moving this thread to gcc-patches and gcc.

-- 
BR,
Hongtao

Re: Suboptimal code generated for __builtin_ceil on AMD64 without SSE4.1

2021-08-05 Thread Hongtao Liu via Gcc

Could you file a bugzilla for that?
https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc

On Thu, Aug 5, 2021 at 3:34 PM Stefan Kanthak  wrote:
>
> Hi,
>
> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> following code (17 instructions using 78 bytes, plus 6 quadwords
> using 48 bytes) for __builtin_ceil() when -msse4.1 is NOT given:
>
> .text
>0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
> 4: R_X86_64_PC32.rdata
>8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
> c: R_X86_64_PC32.rdata
>   10:   66 0f 28 d8 movapd %xmm0, %xmm3
>   14:   66 0f 28 c8 movapd %xmm0, %xmm1
>   18:   66 0f 54 da andpd  %xmm2, %xmm3
>   1c:   66 0f 2e e3 ucomisd %xmm3, %xmm4
>   20:   76 2b   jbe4d <_ceil+0x4d>
>   22:   f2 48 0f 2c c0  cvttsd2si %xmm0, %rax
>   27:   66 0f ef db pxor   %xmm3, %xmm3
>   2b:   f2 0f 10 25 20 00 00 00 movsd  0x20(%rip), %xmm4
> 2f: R_X86_64_PC32   .rdata
>   33:   66 0f 55 d1 andnpd %xmm1, %xmm2
>   37:   f2 48 0f 2a d8  cvtsi2sd %rax, %xmm3
>   3c:   f2 0f c2 c3 06  cmpnlesd %xmm3, %xmm0
>   41:   66 0f 54 c4 andpd  %xmm4, %xmm0
>   45:   f2 0f 58 c3 addsd  %xmm3, %xmm0
>   49:   66 0f 56 c2 orpd   %xmm2, %xmm0
>   4d:   c3  retq
>
> .rdata
> .align 8
>0:   00 00 00 00 .LC0:   .quad  0x1.0p52
> 00 00 30 43
> 00 00 00 00
> 00 00 00 00
> .align 16
>   10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
> ff ff ff 7f
>   18:   00 00 00 00 .quad  0.0
> 00 00 00 00
> .align 8
>   20:   00 00 00 00 .LC2:   .quad  0x1.0p0
> 00 00 f0 3f
> 00 00 00 00
> 00 00 00 00
> .end
>
> JFTR: in the best case, the memory accesses cost several cycles,
>   while in the worst case they yield a page fault!
>
>
> Properly optimized, faster and shorter code, using just 15 instructions
> in 65 bytes, WITHOUT superfluous constants, thus avoiding costly memory
> accesses and saving at least 32 bytes, follows:
>
>   .intel_syntax
>   .equBIAS, 1023
>   .text
>0:   f2 48 0f 2c c0cvttsd2si rax, xmm0  # rax = trunc(argument)
>5:   48 f7 d8  neg rax
> # jz  .L0  # argument zero?
>8:   70 36 jo  .L0  # argument indefinite?
># argument overflows 
> 64-bit integer?
>a:   48 f7 d8  neg rax
>d:   f2 48 0f 2a c8cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>   12:   48 a1 00 00 00mov rax, BIAS << 52
>   19:   00 00 00 f0 3f
>   1c:   66 48 0f 6e d0movqxmm2, rax# xmm2 = 0x1.0p0
>   21:   f2 0f 10 d8   movsd   xmm3, xmm0   # xmm3 = argument
>   25:   f2 0f c2 d9 02cmplesd xmm3, xmm1   # xmm3 = (argument <= 
> trunc(argument)) ? ~0L : 0L
>   2a:   66 0f 55 da   andnpd  xmm3, xmm2   # xmm3 = (argument <= 
> trunc(argument)) ? 0.0 : 1.0
>   2e:   f2 0f 58 d9   addsd   xmm3, xmm1   # xmm3 = (argument > 
> trunc(argument)) ? 1.0 : 0.0
>#  + trunc(argument)
>#  = ceil(argument)
>   32:   66 0f 73 d0 3fpsrlq   xmm0, 63
>   37:   66 0f 73 f0 3fpsllq   xmm0, 63 # xmm0 = (argument & -0.0) 
> ? -0.0 : 0.0
>   3c:   66 0f 56 c3   orpdxmm0, xmm3   # xmm0 = ceil(argument)
>   40:   c3  .L0:  ret
>   .end
>
> regards
> Stefan
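
For readers who do not want to decode the listings, the GCC-generated
sequence quoted above roughly corresponds to the C below: skip values
whose magnitude is at least 2^52 (they are already integral, infinite or
NaN), truncate via an integer conversion, add 1.0 if the truncated value
is below the argument, and finally put the argument's sign bit back so
that negative zero survives. This is a sketch of the idea under those
assumptions, not the expander's actual code; the name ceil_sketch is
made up.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of ceil() without SSE4.1 rounding insns.  */
double
ceil_sketch (double x)
{
  uint64_t xbits, tbits;
  double t;

  /* |x| >= 2^52, infinity or NaN: nothing to round.  */
  if (!(x > -0x1p52 && x < 0x1p52))
    return x;

  t = (double) (long long) x;   /* truncate toward zero */
  if (t < x)
    t += 1.0;                   /* adjust upward for positive fractions */

  /* Copy the sign bit of x so that -0.5 and -0.0 yield -0.0.  */
  memcpy (&xbits, &x, sizeof xbits);
  memcpy (&tbits, &t, sizeof tbits);
  tbits |= xbits & 0x8000000000000000ULL;
  memcpy (&t, &tbits, sizeof t);
  return t;
}
```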



-- 
BR,
Hongtao

Re: How to detect user uses -masm=intel?

2021-07-28 Thread Hongtao Liu via Gcc

On Thu, Jul 29, 2021 at 10:49 AM unlvsur unlvsur via Gcc
 wrote:
>
> What I mean is that what macro GCC sets when it compiles -masm=intel
>
>
> int main()
> {
> #ifdef /* __INTEL_ASM */
> printf("intel");
> #else
> printf("at&t");
> #endif
> }
I don't fully understand what you're looking for; you probably want
ASSEMBLER_DIALECT.

Excerpt from i386.c:
---
void
ix86_print_operand (FILE *file, rtx x, int code)
{
  if (code)
    {
      switch (code)
        {
        case 'A':
          switch (ASSEMBLER_DIALECT)
            {
            case ASM_ATT:
              putc ('*', file);
              break;

            case ASM_INTEL:
              /* Intel syntax. For absolute addresses, registers should not
                 be surrounded by braces.  */
              if (!REG_P (x))
                {
                  putc ('[', file);
                  ix86_print_operand (file, x, 0);
                  putc (']', file);
                  return;
                }
              break;
--
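
Note that ASSEMBLER_DIALECT is internal to the compiler and not visible
to user code. For user code, the dialect alternatives mentioned in the
quoted reply below are the portable route: the same inline asm template
carries both spellings, and GCC picks one according to -masm=. A minimal
sketch (the function name add_u32 is made up; the non-x86 fallback is
only there so the example builds everywhere):

```c
/* Adds B to A.  On x86 this goes through inline asm written with
   {att|intel} dialect alternatives, so it assembles with both
   -masm=att and -masm=intel; elsewhere it falls back to plain C.  */
unsigned
add_u32 (unsigned a, unsigned b)
{
#if defined (__x86_64__) || defined (__i386__)
  __asm__ ("add{l}\t{%1, %0|%0, %1}" : "+r" (a) : "r" (b) : "cc");
#else
  a += b;
#endif
  return a;
}
```

With -masm=att the template expands to "addl %1, %0", and with
-masm=intel to "add %0, %1".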

> Sent from Mail for Windows 10
>
> From: Andrew Pinski
> Sent: Wednesday, July 28, 2021 21:43
> To: unlvsur unlvsur
> Cc: gcc@gcc.gnu.org
> Subject: Re: How to detect user uses -masm=intel?
>
> On Wed, Jul 28, 2021 at 6:41 PM unlvsur unlvsur via Gcc  
> wrote:
> >
> > Any GCC macro that can tell the code it is using the intel format’s 
> > assembly instead of at&t??
>
> Inside the inline-asm you can use the alternative.
> Like this:
> cmp{b}\t{%1, %h0|%h0, %1}
>
> This is how GCC implements this inside too.
>
> Thanks,
> Andrew
>
> >
> > Sent from Mail for Windows 10
> >
>


-- 
BR,
Hongtao

Re: [Questions] Is there any bit in gimple/rtl to indicate this IR support fast-math or not?

2021-07-14 Thread Hongtao Liu via Gcc

On Wed, Jul 14, 2021 at 4:17 PM Richard Biener
 wrote:
>
> On Wed, Jul 14, 2021 at 10:11 AM Hongtao Liu  wrote:
> >
> > On Wed, Jul 14, 2021 at 3:49 PM Matthias Kretz  wrote:
> > >
> > > On Wednesday, 14 July 2021 09:39:42 CEST Richard Biener wrote:
> > > > -ffast-math decomposes to quite some flag_* and those generally are not
> > > > reflected into the IL but can be different per function (and then
> > > > prevent inlining).
> > >
> > > Is there any chance the "and then prevent inlining" can be eliminated? 
> > > Because
> > > then I could write my own fast class in C++, marking all operators 
> > > with
> > > __attribute__((optimize("-Ofast")))...
> > >
> > > > There's one "related" IL feature used by the Fortran frontend - 
> > > > PAREN_EXPR
> > > > prevents association across it.  So for Fortran (when not
> > > > -fno-protect-parens which is enabled by -Ofast), (a + b) - b cannot be
> > > > optimized to a.  Eventually this could be used to wrap intrinsic results
> > > > since most of the issues in the end require association.  Note 
> > > > PAREN_EXPR
> > > > isn't exposed to the C family frontends but we could of course add a
> > > > builtin-like thing for this _Noassoc (  ) or so.  Note PAREN_EXPR
> > After a simple grep, I see PAREN_EXPR is expanded to the common RTL
> > pattern. So it doesn't prevent any reassociation at the RTL level?
>
> We don't perform any FP reassociation on RTL (and yes, the above relies on
-ffast-math implies flag_associative_math, and with that we do have
reassociation on RTL:

  /* Reassociate floating point addition only when the user
     specifies associative math operations.  */
  if (FLOAT_MODE_P (mode)
      && flag_associative_math)
    {
      tem = simplify_associative_operation (code, mode, op0, op1);
      if (tem)
        return tem;
    }
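
To make concrete why this reassociation is gated on
flag_associative_math, here is a small C sketch (the function name is
illustrative only). Evaluated in the order written, with default IEEE
double semantics, (a + b) - b loses a entirely once b is large enough.
A compiler that is allowed to reassociate may simplify the expression
to a instead, which is a different value.

```c
/* Computes (a + b) - b in the order written.  Under strict IEEE
   evaluation the addition may round away the low bits of a.  */
static double
add_then_sub (double a, double b)
{
  double sum = a + b;   /* for a = 1.0, b = 1e30 this rounds to b */
  return sum - b;       /* ... so the difference is 0.0, not a */
}
```

Without -ffast-math, add_then_sub (1.0, 1e30) yields 0.0; with
associative math enabled the compiler is free to return 1.0.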


> this).  We're also expanding rint() to x + 2**52 - 2**52 (ix86_expand_rint) 
> even
> with -ffast-math so we do rely on RTL optimizations not cancelling the +-.
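
A rough C model of the x + 2**52 - 2**52 idea, for readers who have not
seen it. This is only a sketch of the trick, not the code that
ix86_expand_rint emits: the name rint_via_two52 is made up, the sign
handling is simplified (it does not preserve -0.0), and it assumes the
default round-to-nearest mode. The volatile temporary stands in for
"the optimizer must not cancel the +-".

```c
/* Hypothetical model of rounding to integer via the 2**52 trick.  */
static double
rint_via_two52 (double x)
{
  const double two52 = 4503599627370496.0;   /* 2**52 */
  double xa = x < 0 ? -x : x;                /* |x| */

  /* Values >= 2**52 are already integral; NaN also takes this path.  */
  if (!(xa < two52))
    return x;

  /* Adding 2**52 pushes the fraction bits out of the mantissa, so the
     hardware rounding does the work.  volatile keeps the add and the
     subtract from being folded together.  */
  volatile double t = xa + two52;
  xa = t - two52;

  return x < 0 ? -xa : xa;                   /* restore the sign */
}
```

If the addition and subtraction were cancelled algebraically, the
function would return x unchanged, which is why the expansion depends
on RTL not reassociating here.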
>
> Richard.
>
> >
> > > > survives -Ofast so it's the frontends that would need to choose to emit 
> > > > or
> > > > not emit it (or always emit it).
> > >
> > > Interesting. I want that builtin in C++. Currently I use inline asm to 
> > > achieve
> > > a similar effect. But the inline asm hammer is really too big for the 
> > > problem.
> > >
> > >
> > > --
> > > ──
> > >  Dr. Matthias Kretz   https://mattkretz.github.io
> > >  GSI Helmholtz Centre for Heavy Ion Research   https://gsi.de
> > >  std::experimental::simd  https://github.com/VcDevel/std-simd
> > > ──
> >
> >
> >
> > --
> > BR,
> > Hongtao



-- 
BR,
Hongtao

Re: [Questions] Is there any bit in gimple/rtl to indicate this IR support fast-math or not?

2021-07-14 Thread Hongtao Liu via Gcc

On Wed, Jul 14, 2021 at 3:49 PM Matthias Kretz  wrote:
>
> On Wednesday, 14 July 2021 09:39:42 CEST Richard Biener wrote:
> > -ffast-math decomposes to quite some flag_* and those generally are not
> > reflected into the IL but can be different per function (and then
> > prevent inlining).
>
> Is there any chance the "and then prevent inlining" can be eliminated? Because
> then I could write my own fast class in C++, marking all operators with
> __attribute__((optimize("-Ofast")))...
>
> > There's one "related" IL feature used by the Fortran frontend - PAREN_EXPR
> > prevents association across it.  So for Fortran (when not
> > -fno-protect-parens which is enabled by -Ofast), (a + b) - b cannot be
> > optimized to a.  Eventually this could be used to wrap intrinsic results
> > since most of the issues in the end require association.  Note PAREN_EXPR
> > isn't exposed to the C family frontends but we could of course add a
> > builtin-like thing for this _Noassoc (  ) or so.  Note PAREN_EXPR
After a simple grep, I see PAREN_EXPR is expanded to the common RTL
pattern. So it doesn't prevent any reassociation at the RTL level?

> > survives -Ofast so it's the frontends that would need to choose to emit or
> > not emit it (or always emit it).
>
> Interesting. I want that builtin in C++. Currently I use inline asm to achieve
> a similar effect. But the inline asm hammer is really too big for the problem.
>
>
> --
> ──
>  Dr. Matthias Kretz   https://mattkretz.github.io
>  GSI Helmholtz Centre for Heavy Ion Research   https://gsi.de
>  std::experimental::simd  https://github.com/VcDevel/std-simd
> ──



-- 
BR,
Hongtao

Re: [Questions] Is there any bit in gimple/rtl to indicate this IR support fast-math or not?

2021-07-14 Thread Hongtao Liu via Gcc

On Wed, Jul 14, 2021 at 2:39 PM Matthias Kretz  wrote:
>
> On Wednesday, 14 July 2021 07:18:29 CEST Hongtao Liu via Gcc-help wrote:
> > On Wed, Jul 14, 2021 at 1:15 PM Hongtao Liu  wrote:
> > > Hi:
> > >   The original problem was that some users wanted the cmdline option
> > >
> > > -ffast-math not to act on intrinsic production code.
>
> This sounds like the users want intrinsics to map *directly* to the
Thanks for the reply.
I think the users want mixed usage of fast-math and no-fast-math.
> corresponding instruction. If that's the case such users should use inline
> assembly, IMHO. If you compile a TU with -ffast-math then *all* floating-point
> operations are affected. Yes, more control over where to use fast-math and the
> ability to mix fast-math and no-fast-math without risking ODR violations would
> be great. But that's a larger issue, and one that would ideally be solved in
> WG14/WG21.
Hmm, I guess it would need a lot of work.
>
> FWIW, this is what I'd do, i.e. turn off fast-math for the function in
> question:
> https://godbolt.org/z/3cKq5hT1o
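
One way to spell "turn off fast-math for the function in question" is
the optimize attribute, which this thread already mentions in its
optimize("-Ofast") form. The sketch below is an assumption about what
such code can look like, not a copy of what is behind the link. The
STRICT_FP macro and the function name are made up, and the example uses
scalars instead of __m256d so it builds without -mavx2. Whether the
attribute fully restores strict semantics may depend on the GCC
version, so the generated code is worth checking.

```c
#if defined (__GNUC__) && !defined (__clang__)
/* GCC only: request -fno-fast-math semantics for a single function.  */
# define STRICT_FP __attribute__ ((optimize ("no-fast-math")))
#else
/* Other compilers: no per-function control in this sketch.  */
# define STRICT_FP
#endif

/* Hypothetical scalar version of the add/sub/sub chain from the
   original question.  */
STRICT_FP static double
add_sub_sub (double a, double b, double c, double d)
{
  double tmp = a + b;
  tmp = tmp - c;
  tmp = tmp - d;
  return tmp;
}
```

As noted elsewhere in this thread, functions whose floating-point flags
differ from the caller's are generally not inlined, so this approach
can cost performance for small intrinsic-like wrappers.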
>
> --
> ──
>  Dr. Matthias Kretz   https://mattkretz.github.io
>  GSI Helmholtz Centre for Heavy Ion Research   https://gsi.de
>  std::experimental::simd  https://github.com/VcDevel/std-simd
> ──



-- 
BR,
Hongtao

Re: [Questions] Is there any bit in gimple/rtl to indicate this IR support fast-math or not?

2021-07-13 Thread Hongtao Liu via Gcc

On Wed, Jul 14, 2021 at 1:15 PM Hongtao Liu  wrote:
>
> Hi:
>   The original problem was that some users wanted the cmdline option
> -ffast-math not to act on intrinsic production code. .i.e for codes
> like
>
> #include
> __m256d
> foo2 (__m256d a, __m256d b, __m256d c, __m256d d)
> {
> __m256d tmp = _mm256_add_pd (a, b);
> tmp = _mm256_sub_pd (tmp, c);
> tmp = _mm256_sub_pd (tmp, d);
> return tmp;
> }
>
> compiled with -O2 -mavx2 -ffast-math, users expected codes generated like
>
> vaddpd ymm0, ymm0, ymm1
> vsubpd ymm0, ymm0, ymm2
> vsubpd ymm0, ymm0, ymm3
>
> but not
>
> vsubpd ymm1, ymm1, ymm2
> vsubpd ymm0, ymm0, ymm3
> vaddpd ymm0, ymm1, ymm0
>
>
> For the LLVM side, there're mechanisms like
> #pragma float_control( precise, on, push)
> ...(intrinsics definition)..
> #pragma float_control(pop)
>
> When intrinsics are inlined, their IRs will be marked with
> "no-fast-math", and even if the caller is compiled with -ffast-math,
> reassociation only happens to those IRs which are not marked with
> "no-fast-math". It seems to be more flexible to support fast math
> control of a region(inside a function).
Testcase
https://godbolt.org/z/9cYMGGWPG
>
> Does GCC have a similar mechanism?
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

[Questions] Is there any bit in gimple/rtl to indicate this IR support fast-math or not?

2021-07-13 Thread Hongtao Liu via Gcc

Hi:
  The original problem was that some users wanted the cmdline option
-ffast-math not to act on intrinsic production code, i.e. for code
like

#include <immintrin.h>
__m256d
foo2 (__m256d a, __m256d b, __m256d c, __m256d d)
{
  __m256d tmp = _mm256_add_pd (a, b);
  tmp = _mm256_sub_pd (tmp, c);
  tmp = _mm256_sub_pd (tmp, d);
  return tmp;
}

compiled with -O2 -mavx2 -ffast-math, users expected generated code like

vaddpd ymm0, ymm0, ymm1
vsubpd ymm0, ymm0, ymm2
vsubpd ymm0, ymm0, ymm3

but not

vsubpd ymm1, ymm1, ymm2
vsubpd ymm0, ymm0, ymm3
vaddpd ymm0, ymm1, ymm0


For the LLVM side, there're mechanisms like
#pragma float_control( precise, on, push)
...(intrinsics definition)..
#pragma float_control(pop)

When intrinsics are inlined, their IRs will be marked with
"no-fast-math", and even if the caller is compiled with -ffast-math,
reassociation only happens to those IRs which are not marked with
"no-fast-math". It seems to be more flexible to support fast math
control of a region(inside a function).
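
A minimal sketch of how such a region can look in a header, assuming
Clang's float_control pragma as described above. The function
precise_add_sub is a made-up stand-in for an intrinsic definition, and
the #ifdef guards are only there so that other compilers skip the
pragma instead of warning about it.

```c
#ifdef __clang__
/* Clang: use precise FP semantics for the definitions that follow and
   save the previous state.  */
# pragma float_control(precise, on, push)
#endif

/* Hypothetical stand-in for an intrinsic's definition.  */
static inline double
precise_add_sub (double a, double b, double c)
{
  return (a + b) - c;
}

#ifdef __clang__
/* Restore the float_control state that was active before the push.  */
# pragma float_control(pop)
#endif
```

Code outside the push/pop pair keeps whatever the command line asked
for, which is the per-region flexibility described above.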

Does GCC have a similar mechanism?


-- 
BR,
Hongtao

Re: Hongtao Liu as x86 vectorization maintainer

2021-06-22 Thread Hongtao Liu via Gcc

On Tue, Jun 22, 2021 at 3:58 PM Jakub Jelinek via Gcc  wrote:
>
> On Mon, Jun 21, 2021 at 02:49:56AM +, Liu, Hongtao via Gcc wrote:
> > >-Original Message-
> > >From: Jason Merrill 
> > >Sent: Monday, June 21, 2021 10:07 AM
> > >To: Liu, Hongtao 
> > >Cc: gcc Mailing List ; Marek Polacek 
> > >Subject: Hongtao Liu as x86 vectorization maintainer
> > >
> > >I am pleased to announce that the GCC Steering Committee has appointed
> > >Hongtao Liu as maintainer of the i386 vector extensions in GCC.
> > >
> > >Hongtao, please update your listing in the MAINTAINERS file.
> >
> > Updated, thanks.
>
> Congrats.
>
> You should also remove your Write After Approval entry, otherwise
> Running .../gcc/testsuite/gcc.src/maintainers.exp ...
> Redundant in write approval: Hongtao Liu
> FAIL: maintainers-verify.sh
> test fails.
>
Thanks for the reminder, updated.
> Jakub
>


-- 
BR,
Hongtao

Re: Help with PR97872

2020-12-09 Thread Hongtao Liu via Gcc

It seems better with your PR97872 fix on i386.

cat test.c

typedef char v16qi __attribute__ ((vector_size(16)));
v16qi f1(v16qi a, v16qi b) {
return (a & b) != 0;
}

before

f1(char __vector(16), char __vector(16)):
pand %xmm1, %xmm0
pxor %xmm1, %xmm1
pcmpeqb %xmm1, %xmm0
pcmpeqd %xmm1, %xmm1
pandn %xmm1, %xmm0
ret

After the pr97872 fix

f1(char __vector(16), char __vector(16)):
pand xmm0, xmm1
pxor xmm1, xmm1
pcmpeqb xmm0, xmm1
pcmpeqb xmm0, xmm1
ret
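
For reference, a small C sketch of the lane semantics the test relies
on, using GCC's vector extensions. It assumes signed char lanes (the
test above uses plain char) and a made-up function name. A vector
comparison already yields -1 in lanes where it holds and 0 elsewhere,
which is why a following "mask ? -1 : 0" select is redundant.

```c
typedef signed char v16qi __attribute__ ((vector_size (16)));

/* Same shape as f1 in test.c: each result lane is -1 where (a & b) is
   non-zero and 0 where it is zero.  */
static v16qi
nonzero_and_mask (v16qi a, v16qi b)
{
  return (a & b) != 0;
}
```

Individual lanes can be inspected with ordinary subscripting, e.g.
nonzero_and_mask (a, b)[0].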

On Wed, Dec 9, 2020 at 7:47 PM Prathamesh Kulkarni
 wrote:
>
> On Tue, 8 Dec 2020 at 14:36, Prathamesh Kulkarni
>  wrote:
> >
> > On Mon, 7 Dec 2020 at 17:37, Hongtao Liu  wrote:
> > >
> > > On Mon, Dec 7, 2020 at 7:11 PM Prathamesh Kulkarni
> > >  wrote:
> > > >
> > > > On Mon, 7 Dec 2020 at 16:15, Hongtao Liu  wrote:
> > > > >
> > > > > On Mon, Dec 7, 2020 at 5:47 PM Richard Biener  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, 7 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > >
> > > > > > > On Mon, 7 Dec 2020 at 13:01, Richard Biener  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Mon, 7 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > >
> > > > > > > > > On Fri, 4 Dec 2020 at 17:18, Richard Biener 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, 4 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > > > >
> > > > > > > > > > > On Thu, 3 Dec 2020 at 16:35, Richard Biener 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, 3 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, 1 Dec 2020 at 16:39, Richard Biener 
> > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, 1 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > For the test mentioned in PR, I was trying to see 
> > > > > > > > > > > > > > > if we could do
> > > > > > > > > > > > > > > specialized expansion for vcond in target when 
> > > > > > > > > > > > > > > operands are -1 and 0.
> > > > > > > > > > > > > > > arm_expand_vcond gets the following operands:
> > > > > > > > > > > > > > > (reg:V8QI 113 [ _2 ])
> > > > > > > > > > > > > > > (reg:V8QI 117)
> > > > > > > > > > > > > > > (reg:V8QI 118)
> > > > > > > > > > > > > > > (lt (reg/v:V8QI 115 [ a ])
> > > > > > > > > > > > > > > (reg/v:V8QI 116 [ b ]))
> > > > > > > > > > > > > > > (reg/v:V8QI 115 [ a ])
> > > > > > > > > > > > > > > (reg/v:V8QI 116 [ b ])
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > where r117 and r118 are set to vector constants 
> > > > > > > > > > > > > > > -1 and 0 respectively.
> > > > > > > > > > > > > > > However, I am not sure if there's a way to check 
> > > > > > > > > > > > > > > if the register is
> > > > > > > > > > > > > > > constant during expansion time (since we don't 
> > > > > > > > > > > > > > > have df analysis yet) ?
> > > > >
> > > > > It seems to me that all you need to do is relax the predicates of op1
> > > > > and op2 in vcondmn to accept const0_rtx and constm1_rtx. I haven't
> > > > > debugged it, but I see that vcondmn in neon.md only accepts
> > > > > s_register_operand.
>

Re: Help with PR97872

2020-12-07 Thread Hongtao Liu via Gcc

On Mon, Dec 7, 2020 at 7:11 PM Prathamesh Kulkarni
 wrote:
>
> On Mon, 7 Dec 2020 at 16:15, Hongtao Liu  wrote:
> >
> > On Mon, Dec 7, 2020 at 5:47 PM Richard Biener  wrote:
> > >
> > > On Mon, 7 Dec 2020, Prathamesh Kulkarni wrote:
> > >
> > > > On Mon, 7 Dec 2020 at 13:01, Richard Biener  wrote:
> > > > >
> > > > > On Mon, 7 Dec 2020, Prathamesh Kulkarni wrote:
> > > > >
> > > > > > On Fri, 4 Dec 2020 at 17:18, Richard Biener  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Fri, 4 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > >
> > > > > > > > On Thu, 3 Dec 2020 at 16:35, Richard Biener  
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Thu, 3 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > > >
> > > > > > > > > > On Tue, 1 Dec 2020 at 16:39, Richard Biener 
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Tue, 1 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi,
> > > > > > > > > > > > For the test mentioned in PR, I was trying to see if we 
> > > > > > > > > > > > could do
> > > > > > > > > > > > specialized expansion for vcond in target when operands 
> > > > > > > > > > > > are -1 and 0.
> > > > > > > > > > > > arm_expand_vcond gets the following operands:
> > > > > > > > > > > > (reg:V8QI 113 [ _2 ])
> > > > > > > > > > > > (reg:V8QI 117)
> > > > > > > > > > > > (reg:V8QI 118)
> > > > > > > > > > > > (lt (reg/v:V8QI 115 [ a ])
> > > > > > > > > > > > (reg/v:V8QI 116 [ b ]))
> > > > > > > > > > > > (reg/v:V8QI 115 [ a ])
> > > > > > > > > > > > (reg/v:V8QI 116 [ b ])
> > > > > > > > > > > >
> > > > > > > > > > > > where r117 and r118 are set to vector constants -1 and 
> > > > > > > > > > > > 0 respectively.
> > > > > > > > > > > > However, I am not sure if there's a way to check if the 
> > > > > > > > > > > > register is
> > > > > > > > > > > > constant during expansion time (since we don't have df 
> > > > > > > > > > > > analysis yet) ?
> >
> > It seems to me that all you need to do is relax the predicates of op1
> > and op2 in vcondmn to accept const0_rtx and constm1_rtx. I haven't
> > debugged it, but I see that vcondmn in neon.md only accepts
> > s_register_operand.
> >
> > (define_expand "vcond"
> >   [(set (match_operand:VDQW 0 "s_register_operand")
> > (if_then_else:VDQW
> >   (match_operator 3 "comparison_operator"
> > [(match_operand:VDQW 4 "s_register_operand")
> >  (match_operand:VDQW 5 "reg_or_zero_operand")])
> >   (match_operand:VDQW 1 "s_register_operand")
> >   (match_operand:VDQW 2 "s_register_operand")))]
> >   "TARGET_NEON && (! || flag_unsafe_math_optimizations)"
> > {
> >   arm_expand_vcond (operands, mode);
> >   DONE;
> > })
> >
> > in sse.md it's defined as
> > (define_expand "vcondu"
> >   [(set (match_operand:V_512 0 "register_operand")
> > (if_then_else:V_512
> >   (match_operator 3 ""
> > [(match_operand:VI_AVX512BW 4 "nonimmediate_operand")
> >  (match_operand:VI_AVX512BW 5 "nonimmediate_operand")])
> >   (match_operand:V_512 1 "general_operand")
> >   (match_operand:V_512 2 "general_operand")))]
> >   "TARGET_AVX512F
> >&& (GET_MODE_NUNITS (mode)
> >== GET_MODE_NUNITS (mode))"
> > {
> >   bool ok = ix86_expand_int_vcond (operands);
> >   gcc_assert (ok);
> >   DONE;

Re: Help with PR97872

2020-12-07 Thread Hongtao Liu via Gcc

On Mon, Dec 7, 2020 at 5:47 PM Richard Biener  wrote:
>
> On Mon, 7 Dec 2020, Prathamesh Kulkarni wrote:
>
> > On Mon, 7 Dec 2020 at 13:01, Richard Biener  wrote:
> > >
> > > On Mon, 7 Dec 2020, Prathamesh Kulkarni wrote:
> > >
> > > > On Fri, 4 Dec 2020 at 17:18, Richard Biener  wrote:
> > > > >
> > > > > On Fri, 4 Dec 2020, Prathamesh Kulkarni wrote:
> > > > >
> > > > > > On Thu, 3 Dec 2020 at 16:35, Richard Biener  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Thu, 3 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > >
> > > > > > > > On Tue, 1 Dec 2020 at 16:39, Richard Biener  
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Tue, 1 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > > For the test mentioned in PR, I was trying to see if we 
> > > > > > > > > > could do
> > > > > > > > > > specialized expansion for vcond in target when operands are 
> > > > > > > > > > -1 and 0.
> > > > > > > > > > arm_expand_vcond gets the following operands:
> > > > > > > > > > (reg:V8QI 113 [ _2 ])
> > > > > > > > > > (reg:V8QI 117)
> > > > > > > > > > (reg:V8QI 118)
> > > > > > > > > > (lt (reg/v:V8QI 115 [ a ])
> > > > > > > > > > (reg/v:V8QI 116 [ b ]))
> > > > > > > > > > (reg/v:V8QI 115 [ a ])
> > > > > > > > > > (reg/v:V8QI 116 [ b ])
> > > > > > > > > >
> > > > > > > > > > where r117 and r118 are set to vector constants -1 and 0 
> > > > > > > > > > respectively.
> > > > > > > > > > However, I am not sure if there's a way to check if the 
> > > > > > > > > > register is
> > > > > > > > > > constant during expansion time (since we don't have df 
> > > > > > > > > > analysis yet) ?

It seems to me that all you need to do is relax the predicates of op1
and op2 in vcondmn to accept const0_rtx and constm1_rtx. I haven't
debugged it, but I see that vcondmn in neon.md only accepts
s_register_operand.

(define_expand "vcond"
  [(set (match_operand:VDQW 0 "s_register_operand")
        (if_then_else:VDQW
          (match_operator 3 "comparison_operator"
            [(match_operand:VDQW 4 "s_register_operand")
             (match_operand:VDQW 5 "reg_or_zero_operand")])
          (match_operand:VDQW 1 "s_register_operand")
          (match_operand:VDQW 2 "s_register_operand")))]
  "TARGET_NEON && (! || flag_unsafe_math_optimizations)"
{
  arm_expand_vcond (operands, mode);
  DONE;
})

in sse.md it's defined as
(define_expand "vcondu"
  [(set (match_operand:V_512 0 "register_operand")
        (if_then_else:V_512
          (match_operator 3 ""
            [(match_operand:VI_AVX512BW 4 "nonimmediate_operand")
             (match_operand:VI_AVX512BW 5 "nonimmediate_operand")])
          (match_operand:V_512 1 "general_operand")
          (match_operand:V_512 2 "general_operand")))]
  "TARGET_AVX512F
   && (GET_MODE_NUNITS (mode)
       == GET_MODE_NUNITS (mode))"
{
  bool ok = ix86_expand_int_vcond (operands);
  gcc_assert (ok);
  DONE;
})

then we can get operands[1] and operands[2] as

(gdb) p debug_rtx (operands[1])
 (const_vector:V16QI [
(const_int -1 [0x]) repeated x16
])
(gdb) p debug_rtx (operands[2])
(reg:V16QI 82 [ _2 ])
(const_vector:V16QI [
(const_int 0 [0]) repeated x16
])

> > > > > > > > > >
> > > > > > > > > > Alternatively, should we add a target hook that returns 
> > > > > > > > > > true if the
> > > > > > > > > > result of vector comparison is set to all-ones or 
> > > > > > > > > > all-zeros, and then
> > > > > > > > > > use this hook in gimple ISEL to effectively turn 
> > > > > > > > > > VEC_COND_EXPR into nop ?
> > > > > > > > >
> > > > > > > > > Would everything match-up for a .VEC_CMP IFN producing a 
> > > > > > > > > non-mask
> > > > > > > > > vector type?  ISEL could special case the a ? -1 : 0 case 
> > > > > > > > > this way.
> > > > > > > > I think the vec_cmp pattern matches but it produces a masked 
> > > > > > > > vector type.
> > > > > > > > In the attached patch, I simply replaced:
> > > > > > > > _1 = a < b
> > > > > > > > x = _1 ? -1 : 0
> > > > > > > > with
> > > > > > > > x = view_convert_expr<_1>
> > > > > > > >
> > > > > > > > For the test-case, isel generates:
> > > > > > > >   vector(8)  _1;
> > > > > > > >   vector(8) signed char _2;
> > > > > > > >   uint8x8_t _5;
> > > > > > > >
> > > > > > > >[local count: 1073741824]:
> > > > > > > >   _1 = a_3(D) < b_4(D);
> > > > > > > >   _2 = VIEW_CONVERT_EXPR(_1);
> > > > > > > >   _5 = VIEW_CONVERT_EXPR(_2);
> > > > > > > >   return _5;
> > > > > > > >
> > > > > > > > and results in desired code-gen:
> > > > > > > > f1:
> > > > > > > > vcgt.s8 d0, d1, d0
> > > > > > > > bx  lr
> > > > > > > >
> > > > > > > > Altho I guess, we should remove the redundant conversions 
> > > > > > > > during isel itself ?
> > > > > > > > and result in:
> > > > > > > > _1 = a_3(D) < b_4(D)
> > > > > > > > _5 = VIEW_CONVERT_EXPR(_1)
> > > > > > > >
> > > > > > > > (Patch is lightly tested with only vect.exp)
> > > > > > >
> > > > >

Re: [IMPORTANT] ChangeLog related changes

2020-05-25 Thread Hongtao Liu via Gcc

Great, thanks!

On Tue, May 26, 2020 at 2:08 PM Martin Liška  wrote:
>
> On 5/26/20 7:22 AM, Hongtao Liu via Gcc wrote:
> > i commit a separate patch alone only for ChangeLog files, should i revert 
> > it?
>
> Hello.
>
> I've just done it.
>
> Martin



-- 
BR,
Hongtao

Re: [IMPORTANT] ChangeLog related changes

2020-05-25 Thread Hongtao Liu via Gcc

On Tue, May 26, 2020 at 6:49 AM Jakub Jelinek via Gcc-patches
 wrote:
>
> Hi!
>
> I've turned the strict mode of Martin Liška's hook changes,
> which means that from now on no commits to the trunk or release branches
> should be changing any ChangeLog files together with the other files,
> ChangeLog entry should be solely in the commit message.
> The DATESTAMP bumping script will be updating the ChangeLog files for you.
Oh, no wonder my patch was rejected by the git hook with the error message
---
ChangeLog files, DATESTAMP, BASE-VER and DEV-PHASE can be modified
only separately from other files
---
> If somebody makes a mistake in that, please wait 24 hours (at least until
I committed a separate patch only for ChangeLog files; should I revert it?
> after 00:16 UTC after your commit) so that the script will create the
> ChangeLog entries, and afterwards it can be fixed by adjusting the ChangeLog
> files.  But you can only touch the ChangeLog files in that case (and
> shouldn't write a ChangeLog entry for that in the commit message).
>
> If anything goes wrong, please let me, other RMs and Martin Liška know.
>
> Jakub
>

-- 
BR,
Hongtao

Re: [9/10 Regression] [PR87833] Intel MIC (emulated) offloading still broken (was: GCC 9.0.1 Status Report (2019-04-25))

2019-05-07 Thread Hongtao Liu

On Tue, Apr 30, 2019 at 7:31 PM Jakub Jelinek  wrote:
>
> On Tue, Apr 30, 2019 at 01:02:40PM +0200, Thomas Schwinge wrote:
> > Hi Jakub!
> >
> > On Tue, 30 Apr 2019 12:56:52 +0200, Jakub Jelinek  wrote:
> > > On Tue, Apr 30, 2019 at 12:47:54PM +0200, Thomas Schwinge wrote:
> > > > Email to  apparently is no longer gets delivered.
> > > > Is there anyone else from Intel who'd take over maintenance?
> > >
> > > As your patch is to LTO option handling, I think you want a review from
> > > Honza.
> >
> > Well, I'm actually not asking for review of the WIP patch, but rather
> > looking for someone to take on ownership/maintenance of the functionality
> > of Intel MIC offloading.
>
> That would be indeed greatly appreciated.
>
> Jakub

I don't know this guy ilya.ver...@intel.com.
Do you know him/her, H.J?

-- 
BR,
Hongtao
