Re: Ping: [PATCH v2] Analyze niter for until-wrap condition [PR101145]

2021-08-24 Thread Bin.Cheng via Gcc-patches
On Wed, Aug 25, 2021 at 11:26 AM guojiufu  wrote:
>
> On 2021-08-16 09:33, Bin.Cheng wrote:
> > On Wed, Aug 4, 2021 at 10:42 AM guojiufu 
> > wrote:
> >>
> ...
> >> >> diff --git a/gcc/testsuite/gcc.dg/vect/pr101145.inc
> >> >> b/gcc/testsuite/gcc.dg/vect/pr101145.inc
> >> >> new file mode 100644
> >> >> index 000..6eed3fa8aca
> >> >> --- /dev/null
> >> >> +++ b/gcc/testsuite/gcc.dg/vect/pr101145.inc
> >> >> @@ -0,0 +1,63 @@
> >> >> +TYPE __attribute__ ((noinline))
> >> >> +foo_sign (int *__restrict__ a, int *__restrict__ b, TYPE l, TYPE n)
> >> >> +{
> >> >> +  for (l = L_BASE; n < l; l += C)
> >> >> +*a++ = *b++ + 1;
> >> >> +  return l;
> >> >> +}
> >> >> +
> >> >> +TYPE __attribute__ ((noinline))
> >> >> +bar_sign (int *__restrict__ a, int *__restrict__ b, TYPE l, TYPE n)
> >> >> +{
> >> >> +  for (l = L_BASE_DOWN; l < n; l -= C)
> > I noticed that both L_BASE and L_BASE_DOWN are defined as l, which
> > makes this test a bit confusing.  Could you clean the use of l, for
> > example, by using an auto var for the loop index invariable?
> > Otherwise the patch looks good to me.  Thanks very much for the work.
>
> Hi,
>
> Sorry for bothering you here.
> I feel this would be an approval (with the comment) already :)
>
> With the code changed to make it a little clearer, as:
>    TYPE i;
>    for (i = l; n < i; i += C)
>
> it may be ok to commit the patch to the trunk, right?
Yes please.  Thanks again for working on this.

Thanks,
bin
>
> BR,
> Jiufu
>
> >
> > Thanks,
> > bin
> >> >> +*a++ = *b++ + 1;
> >> >> +  return l;
> >> >> +}
> >> >> +
> >> >> +int __attribute__ ((noinline)) neq (int a, int b) { return a != b; }
> >> >> +
> >> >> +int a[1000], b[1000];
> >> >> +int fail;
> >> >> +
> >> >> +int
> ...
> >> >> diff --git a/gcc/testsuite/gcc.dg/vect/pr101145_1.c
> >> >> b/gcc/testsuite/gcc.dg/vect/pr101145_1.c
> >> >> new file mode 100644
> >> >> index 000..94f6b99b893
> >> >> --- /dev/null
> >> >> +++ b/gcc/testsuite/gcc.dg/vect/pr101145_1.c
> >> >> @@ -0,0 +1,15 @@
> >> >> +/* { dg-require-effective-target vect_int } */
> >> >> +/* { dg-options "-O3 -fdump-tree-vect-details" } */
> >> >> +#define TYPE signed char
> >> >> +#define MIN -128
> >> >> +#define MAX 127
> >> >> +#define N_BASE (MAX - 32)
> >> >> +#define N_BASE_DOWN (MIN + 32)
> >> >> +
> >> >> +#define C 3
> >> >> +#define L_BASE l
> >> >> +#define L_BASE_DOWN l
> >> >> +


Re: [PATCH] rs6000: Make some BIFs vectorized on P10

2021-08-24 Thread Kewen.Lin via Gcc-patches
on 2021/8/25 6:14 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Fri, Aug 13, 2021 at 10:34:46AM +0800, Kewen.Lin wrote:
>> on 2021/8/12 11:10 PM, Segher Boessenkool wrote:
 +  && VECTOR_UNIT_ALTIVEC_OR_VSX_P (in_vmode))
 +{
 +  machine_mode exp_mode = DImode;
 +  machine_mode exp_vmode = V2DImode;
 +  enum rs6000_builtins vname = RS6000_BUILTIN_COUNT;
>>>
>>> "name"?  This should be "bif" or similar?
>>
>> Updated with name.
> 
> No, I meant "name" has no meaning other than it is wrong here :-)
> 
> It is an enum for which builtin to use here.  It has nothing to do with
> a name.  So it could be "enum rs6000_builtins bif" or whatever you want;
> short variable names are *good*, for many reasons, but they should not
> egregiously lie :-)
> 

Oops, sorry for the misunderstanding, will update it with "bif".

 +/* { dg-do run } */
 +/* { dg-require-effective-target lp64 } */
>>>
>>> Same here.  I suppose this uses builtins that do not exist on 32-bit?
>>
>> Yeah, those bifs which are guarded with lp64 in their cases are only
>> supported in a 64-bit environment.
> 
> It is a pity we cannot use "powerpc64" here (that selector does not test
> what you would/could/should hope it tests...  Maybe someone can fix it
> some day?  The only real blocker to that is fixing up the current users
> of it, the rest is easy).
> 

If I got it right, there is only one test case using this selector:

gcc/testsuite/gcc.target/powerpc/darwin-longlong.c

The selector check looks interesting to me: it uses the special option
"-mcpu=G5" and seems to exclude all "aix" targets (I didn't verify that yet).

I guess it would still take some effort to redirect those existing
cases that should use the new "powerpc64_ok" instead of "lp64"?

> Expanding a bit...  You would expect (well, I do!  I did patches
> expecting this several times) this to mean "powerpc64_ok", but it in
> fact means "powerpc64_hw".  Maybe we should have selectors with those
> two names, and get rid of the current "powerpc64"?
> 

Yeah, it sounds good to have those two names, just like some existing ones.

 +#define CHECK(name)   
 \
 +  __attribute__ ((optimize (1))) void check_##name () 
 \
>>>
>>> What is the attribute for, btw?  It seems fragile, but perhaps I do not
>>> understand the intention.
>>
>> It's to stop the compiler from optimizing the check functions with
>> vectorization, since the test point is to compare the results between the
>> scalar and vectorized versions.
> 
> So, add a comment for this as well please.
> 
> In general, in testcases you can do the dirtiest things, no problems at
> all, just document what you do why :-)
> 

OK, will add.

>> Thanks, v2 has been attached by addressing Bill's and your comments.  :)
> 
> Looks good.  Just fix that "name" thing, and it is okay for trunk.
> Thanks!
> 

Thanks for the review!

BR,
Kewen


Re: [PATCH] rs6000: Make some BIFs vectorized on P10

2021-08-24 Thread Kewen.Lin via Gcc-patches
on 2021/8/25 5:56 AM, Segher Boessenkool wrote:
> On Fri, Aug 13, 2021 at 11:18:46AM +0800, Kewen.Lin wrote:
>> on 2021/8/12 11:51 PM, Segher Boessenkool wrote:
>>> It is a bad idea to initialise things unnecessary: it hinders many
>>> optimisations, but much more importantly, it silences warnings without
>>> fixing the problem.
>>
>> OK, I've made it uninitialized in v2. :-)  I believe the context here is simple
>> and the uninit-ed var detector can easily catch and warn the bad thing in future.
> 
> And those warnings generally are for "MAY BE used uninitialised",
> anyway.  They will warn :-)
> 
> (When the warning says "IS used uninitialised" the compiler should be
> sure about that!)
> 
>> Sorry for chasing dead ends, I don't follow how it can hinder optimizations here,
>> IIUC it would be optimized as a dead store here?
> 
> When the compiler is not sure if something needs initialisation or not
> it cannot remove actually superfluous initialisation.  Such cases are
> always too complicated code, so that should be fixed, not silenced :-)
> 

aha, you meant complicated code, got it.  :)

>> As to the warning: although there is no warning, I'd expect it to cause an
>> ICE, since the init-ed bif name isn't reasonable for generation.  Wouldn't
>> that be better than a warning?  Sometimes we don't have a proper value for
>> initialization, and I agree it's better to just leave it be, but IMHO that
>> isn't the case here.  :)
> 
> ICEing is always wrong.  A user should never see an ICE (not counting
> "sorry"s as ICEs here -- not that those are good, but they tell the user
> exactly what is going on).
> 

Yeah, but here I was expecting the ICE to happen when GCC developers are
testing the newly added bif support.  :)


BR,
Kewen


Re: [PATCH v2] rs6000: Add vec_unpacku_{hi,lo}_v4si

2021-08-24 Thread Kewen.Lin via Gcc-patches
on 2021/8/24 9:02 PM, Segher Boessenkool wrote:
> Hi Ke Wen,
> 
> On Mon, Aug 09, 2021 at 10:53:00AM +0800, Kewen.Lin wrote:
>> on 2021/8/6 9:10 PM, Bill Schmidt wrote:
>>> On 8/4/21 9:06 PM, Kewen.Lin wrote:
 The existing vec_unpacku_{hi,lo} supports emulated unsigned
 unpacking for short and char but misses the support for int.
 This patch adds the support for vec_unpacku_{hi,lo}_v4si.
> 
>>  * config/rs6000/altivec.md (vec_unpacku_hi_v16qi): Remove.
>>  (vec_unpacku_hi_v8hi): Likewise.
>>  (vec_unpacku_lo_v16qi): Likewise.
>>  (vec_unpacku_lo_v8hi): Likewise.
>>  (vec_unpacku_hi_): New define_expand.
>>  (vec_unpacku_lo_): Likewise.
> 
>> -(define_expand "vec_unpacku_hi_v16qi"
>> -  [(set (match_operand:V8HI 0 "register_operand" "=v")
>> -(unspec:V8HI [(match_operand:V16QI 1 "register_operand" "v")]
>> - UNSPEC_VUPKHUB))]
>> -  "TARGET_ALTIVEC"  
>> -{  
>> -  rtx vzero = gen_reg_rtx (V8HImode);
>> -  rtx mask = gen_reg_rtx (V16QImode);
>> -  rtvec v = rtvec_alloc (16);
>> -  bool be = BYTES_BIG_ENDIAN;
>> -   
>> -  emit_insn (gen_altivec_vspltish (vzero, const0_rtx));
>> -   
>> -  RTVEC_ELT (v,  0) = gen_rtx_CONST_INT (QImode, be ? 16 :  7);
>> -  RTVEC_ELT (v,  1) = gen_rtx_CONST_INT (QImode, be ?  0 : 16);
>> -  RTVEC_ELT (v,  2) = gen_rtx_CONST_INT (QImode, be ? 16 :  6);
>> -  RTVEC_ELT (v,  3) = gen_rtx_CONST_INT (QImode, be ?  1 : 16);
>> -  RTVEC_ELT (v,  4) = gen_rtx_CONST_INT (QImode, be ? 16 :  5);
>> -  RTVEC_ELT (v,  5) = gen_rtx_CONST_INT (QImode, be ?  2 : 16);
>> -  RTVEC_ELT (v,  6) = gen_rtx_CONST_INT (QImode, be ? 16 :  4);
>> -  RTVEC_ELT (v,  7) = gen_rtx_CONST_INT (QImode, be ?  3 : 16);
>> -  RTVEC_ELT (v,  8) = gen_rtx_CONST_INT (QImode, be ? 16 :  3);
>> -  RTVEC_ELT (v,  9) = gen_rtx_CONST_INT (QImode, be ?  4 : 16);
>> -  RTVEC_ELT (v, 10) = gen_rtx_CONST_INT (QImode, be ? 16 :  2);
>> -  RTVEC_ELT (v, 11) = gen_rtx_CONST_INT (QImode, be ?  5 : 16);
>> -  RTVEC_ELT (v, 12) = gen_rtx_CONST_INT (QImode, be ? 16 :  1);
>> -  RTVEC_ELT (v, 13) = gen_rtx_CONST_INT (QImode, be ?  6 : 16);
>> -  RTVEC_ELT (v, 14) = gen_rtx_CONST_INT (QImode, be ? 16 :  0);
>> -  RTVEC_ELT (v, 15) = gen_rtx_CONST_INT (QImode, be ?  7 : 16);
>> -
>> -  emit_insn (gen_vec_initv16qiqi (mask, gen_rtx_PARALLEL (V16QImode, v)));
>> -  emit_insn (gen_vperm_v16qiv8hi (operands[0], operands[1], vzero, mask));
>> -  DONE;
>> -})
> 
> So I wonder if all this still generates good code.  The unspecs cannot
> be optimised properly, the RTL can (in principle, anyway: it is possible
> it makes more opportunities to use unpack etc. insns invisible than that
> it helps over unspec).  This needs to be tested, and the usual idioms
> need testcases, is that what you add here?  (/me reads on...)
> 

Yeah, for the existing char/short, it generates better code with vector
merge high/low instead of a permutation, saving the cost of the
permutation control vector (space in the constant area as well as the
cost to initialize it in the prologue).  The iterator-based rewrite
makes it concise and also adds the missing "int" support.  The
associated test cases verify the newly generated assembly and the
runtime result.

>> +  if (BYTES_BIG_ENDIAN)
>> +emit_insn (gen_altivec_vmrgh (res, vzero, op1));
>> +  else
>> +emit_insn (gen_altivec_vmrgl (res, op1, vzero));
> 
> Ah, so it is *not* using unspecs?  Excellent.
> 
> Okay for trunk.  Thank you!
> 

Thanks for the review!  Committed in r12-3134.


BR,
Kewen


Re: Ping: [PATCH v2] Analyze niter for until-wrap condition [PR101145]

2021-08-24 Thread guojiufu via Gcc-patches

On 2021-08-16 09:33, Bin.Cheng wrote:
On Wed, Aug 4, 2021 at 10:42 AM guojiufu  
wrote:



...

>> diff --git a/gcc/testsuite/gcc.dg/vect/pr101145.inc
>> b/gcc/testsuite/gcc.dg/vect/pr101145.inc
>> new file mode 100644
>> index 000..6eed3fa8aca
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/pr101145.inc
>> @@ -0,0 +1,63 @@
>> +TYPE __attribute__ ((noinline))
>> +foo_sign (int *__restrict__ a, int *__restrict__ b, TYPE l, TYPE n)
>> +{
>> +  for (l = L_BASE; n < l; l += C)
>> +*a++ = *b++ + 1;
>> +  return l;
>> +}
>> +
>> +TYPE __attribute__ ((noinline))
>> +bar_sign (int *__restrict__ a, int *__restrict__ b, TYPE l, TYPE n)
>> +{
>> +  for (l = L_BASE_DOWN; l < n; l -= C)

I noticed that both L_BASE and L_BASE_DOWN are defined as l, which
makes this test a bit confusing.  Could you clean the use of l, for
example, by using an auto var for the loop index invariable?
Otherwise the patch looks good to me.  Thanks very much for the work.


Hi,

Sorry for bothering you here.
I feel this would be an approval (with the comment) already :)

With the code changed to make it a little clearer, as:
  TYPE i;
  for (i = l; n < i; i += C)

it may be ok to commit the patch to the trunk, right?

BR,
Jiufu



Thanks,
bin

>> +*a++ = *b++ + 1;
>> +  return l;
>> +}
>> +
>> +int __attribute__ ((noinline)) neq (int a, int b) { return a != b; }
>> +
>> +int a[1000], b[1000];
>> +int fail;
>> +
>> +int

...

>> diff --git a/gcc/testsuite/gcc.dg/vect/pr101145_1.c
>> b/gcc/testsuite/gcc.dg/vect/pr101145_1.c
>> new file mode 100644
>> index 000..94f6b99b893
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/pr101145_1.c
>> @@ -0,0 +1,15 @@
>> +/* { dg-require-effective-target vect_int } */
>> +/* { dg-options "-O3 -fdump-tree-vect-details" } */
>> +#define TYPE signed char
>> +#define MIN -128
>> +#define MAX 127
>> +#define N_BASE (MAX - 32)
>> +#define N_BASE_DOWN (MIN + 32)
>> +
>> +#define C 3
>> +#define L_BASE l
>> +#define L_BASE_DOWN l
>> +


[PATCH] Adjust testcases to avoid new failures brought by r12-3108 when compiled with -march=cascadelake.

2021-08-24 Thread liuhongt via Gcc-patches
  Pushed to trunk as an obvious fix.

gcc/testsuite/ChangeLog:

PR target/101989
* gcc.target/i386/avx2-shiftqihi-constant-1.c: Add -mno-avx512f.
* gcc.target/i386/sse2-shiftqihi-constant-1.c: Add -mno-avx.
---
 gcc/testsuite/gcc.target/i386/avx2-shiftqihi-constant-1.c | 2 +-
 gcc/testsuite/gcc.target/i386/sse2-shiftqihi-constant-1.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/testsuite/gcc.target/i386/avx2-shiftqihi-constant-1.c 
b/gcc/testsuite/gcc.target/i386/avx2-shiftqihi-constant-1.c
index 72065039581..801f570decc 100644
--- a/gcc/testsuite/gcc.target/i386/avx2-shiftqihi-constant-1.c
+++ b/gcc/testsuite/gcc.target/i386/avx2-shiftqihi-constant-1.c
@@ -1,6 +1,6 @@
 /* PR target/95524 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -mavx2" } */
+/* { dg-options "-O2 -mavx2 -mno-avx512f" } */
 /* { dg-final { scan-assembler-times "vpand\[^\n\]*%ymm" 3 } }  */
 typedef char v32qi  __attribute__ ((vector_size (32)));
 typedef unsigned char v32uqi  __attribute__ ((vector_size (32)));
diff --git a/gcc/testsuite/gcc.target/i386/sse2-shiftqihi-constant-1.c 
b/gcc/testsuite/gcc.target/i386/sse2-shiftqihi-constant-1.c
index f1c68cb2972..015450f8219 100644
--- a/gcc/testsuite/gcc.target/i386/sse2-shiftqihi-constant-1.c
+++ b/gcc/testsuite/gcc.target/i386/sse2-shiftqihi-constant-1.c
@@ -1,6 +1,6 @@
 /* PR target/95524 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -msse2" } */
+/* { dg-options "-O2 -msse2 -mno-avx" } */
 /* { dg-final { scan-assembler-times "pand\[^\n\]*%xmm" 3 { xfail *-*-* } } } */
 typedef char v16qi  __attribute__ ((vector_size (16)));
 typedef unsigned char v16uqi  __attribute__ ((vector_size (16)));
-- 
2.27.0



Re: [PATCH] [i386] Optimize (a & b) | (c & ~b) to vpternlog instruction.

2021-08-24 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 24, 2021 at 9:11 PM Bernhard Reutner-Fischer
 wrote:
>
> On Tue, 24 Aug 2021 17:53:27 +0800
> Hongtao Liu via Gcc-patches  wrote:
>
> > On Tue, Aug 24, 2021 at 9:36 AM liuhongt  wrote:
> > >
> > > Also optimize below 3 forms to vpternlog, op1, op2, op3 are
> > > register_operand or unary_p as (not reg)
>
> > > gcc/ChangeLog:
> > >
> > > PR target/101989
> > > * config/i386/i386-protos.h
> > > (ix86_strip_reg_or_notreg_operand): New declare.
>
> "New declaration."
>
> > > * config/i386/i386.c (ix86_rtx_costs): Define cost for
> > > UNSPEC_VTERNLOG.
>
> I do not see a considerable amount of VTERNLOG in the docs I have here.
> Is there a P missing in vPternlog?
The output assembly is vpternlog, but the internal pattern name is
originally vternlog (it is not clear why it is not called vpternlog;
perhaps it abbreviates "vector ternary logic").  I added the new
define_insn_and_split just to keep in line with the original name.
>
> > > (ix86_strip_reg_or_notreg_operand): New function.
> > Pushed to trunk after changing ix86_strip_reg_or_notreg_operand to a macro;
> > a function call seems too inefficient for such a simple unary strip.
> > > * config/i386/predicates.md (reg_or_notreg_operand): New
> > > predicate.
> > > * config/i386/sse.md (*_vternlog_all): New 
> > > define_insn.
> > > (*_vternlog_1): New pre_reload
> > > define_insn_and_split.
> > > (*_vternlog_2): Ditto.
> > > (*_vternlog_3): Ditto.
>
> at least the above 3 insn_and_split do have a 'p' in the md.
> thanks,
> > > (any_logic1,any_logic2): New code iterator.
> > > (logic_op): New code attribute.
> > > (ternlogsuffix): Extend to VNxDF and VNxSF.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > PR target/101989
> > > * gcc.target/i386/pr101989-1.c: New test.
> > > * gcc.target/i386/pr101989-2.c: New test.
> > > * gcc.target/i386/avx512bw-shiftqihi-constant-1.c: Adjust 
> > > testcase.
> > > ---
> > >  gcc/config/i386/i386-protos.h |   1 +
> > >  gcc/config/i386/i386.c|  13 +
> > >  gcc/config/i386/predicates.md |   7 +
> > >  gcc/config/i386/sse.md| 234 ++
> > >  .../i386/avx512bw-shiftqihi-constant-1.c  |   4 +-
> > >  gcc/testsuite/gcc.target/i386/pr101989-1.c|  51 
> > >  gcc/testsuite/gcc.target/i386/pr101989-2.c| 102 
> > >  7 files changed, 410 insertions(+), 2 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101989-1.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101989-2.c
> > >
> > > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> > > index 2fd13074c81..2bdaadcf4f3 100644
> > > --- a/gcc/config/i386/i386-protos.h
> > > +++ b/gcc/config/i386/i386-protos.h
> > > @@ -60,6 +60,7 @@ extern rtx standard_80387_constant_rtx (int);
> > >  extern int standard_sse_constant_p (rtx, machine_mode);
> > >  extern const char *standard_sse_constant_opcode (rtx_insn *, rtx *);
> > >  extern bool ix86_standard_x87sse_constant_load_p (const rtx_insn *, rtx);
> > > +extern rtx ix86_strip_reg_or_notreg_operand (rtx);
> > >  extern bool ix86_pre_reload_split (void);
> > >  extern bool symbolic_reference_mentioned_p (rtx);
> > >  extern bool extended_reg_mentioned_p (rtx);
> > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> > > index 46844fab08f..a69225ccc81 100644
> > > --- a/gcc/config/i386/i386.c
> > > +++ b/gcc/config/i386/i386.c
> > > @@ -5236,6 +5236,14 @@ ix86_standard_x87sse_constant_load_p (const 
> > > rtx_insn *insn, rtx dst)
> > >return true;
> > >  }
> > >
> > > +/* If OP is a unary operation (e.g. NOT), return its operand;
> > > +   otherwise return OP itself.  */
> > > +rtx
> > > +ix86_strip_reg_or_notreg_operand (rtx op)
> > > +{
> > > +  return UNARY_P (op) ? XEXP (op, 0) : op;
> > > +}
> > > +
> > >  /* Predicate for pre-reload splitters with associated instructions,
> > > which can match any time before the split1 pass (usually combine),
> > > then are unconditionally split in that pass and should not be
> > > @@ -20544,6 +20552,11 @@ ix86_rtx_costs (rtx x, machine_mode mode, int 
> > > outer_code_i, int opno,
> > >  case UNSPEC:
> > >if (XINT (x, 1) == UNSPEC_TP)
> > > *total = 0;
> > > +  else if (XINT(x, 1) == UNSPEC_VTERNLOG)
> > > +   {
> > > + *total = cost->sse_op;
> > > + return true;
> > > +   }
> > >return false;
> > >
> > >  case VEC_SELECT:
> > > diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> > > index 9321f332ef9..df5acb425d4 100644
> > > --- a/gcc/config/i386/predicates.md
> > > +++ b/gcc/config/i386/predicates.md
> > > @@ -1044,6 +1044,13 @@ (define_predicate "reg_or_pm1_operand"
> > > (ior (match_test "op == const1_rtx")
> > >  (match_test "op 

Re: Ping: [PATCH] diagnostics: Support for -finput-charset [PR93067]

2021-08-24 Thread Lewis Hyatt via Gcc-patches
On Tue, Aug 24, 2021 at 6:51 PM David Malcolm  wrote:
>
> On Tue, 2021-08-24 at 08:17 -0400, Lewis Hyatt wrote:
> > Hello-
> >
> > I thought it might be a good time to check on this patch please?
> > Thanks!
> > https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576449.html
> >
> > -Lewis
>
> I went through that latest version of the patch and have no further
> suggestions - I like the changes you made to incorporate the changes I
> had made to input.c.
>
> The latest version of the patch is OK for trunk.
>
> It might be an idea to rebase it and retest it before pushing it, to
> make sure nothing significant has changed in the last few weeks.
>
> Thanks for your work on this, and sorry again for the delay in
> reviewing it.
>
> Dave
>
>

OK great, thanks for your time. I will push after retesting.

BTW, do you think it would be worthwhile to work on the other half of
encoding support, i.e. translating from UTF-8 to the user's locale,
when outputting diagnostics? I have probably 90% of a patch that does
this, however it complexifies things a bit and I am not sure if it is
really worth the trouble. What is rather manageable (that my patch in
progress does now) is to replace non-translatable characters with
something like UCN escapes. What is not so easy, is to do this and
preserve the alignment of carets and label lines and such... this
requires making the display width of a character also
locale-dependent, which concept doesn't exist currently. Adding that
feels like a lot of complication for what would be a little-used
feature... Anyway, if you think a patch that does the translation
without preserving the alignment would be useful, I could finish it up
and send it. Otherwise I was kinda inclined to forget about it.
Thanks!

-Lewis


Re: Ping: [PATCH] diagnostics: Support for -finput-charset [PR93067]

2021-08-24 Thread David Malcolm via Gcc-patches
On Tue, 2021-08-24 at 08:17 -0400, Lewis Hyatt wrote:
> Hello-
> 
> I thought it might be a good time to check on this patch please?
> Thanks!
> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576449.html
> 
> -Lewis

I went through that latest version of the patch and have no further
suggestions - I like the changes you made to incorporate the changes I
had made to input.c.

The latest version of the patch is OK for trunk.

It might be an idea to rebase it and retest it before pushing it, to
make sure nothing significant has changed in the last few weeks.

Thanks for your work on this, and sorry again for the delay in
reviewing it.

Dave




Re: [PATCH, rs6000] Disable gimple fold for float or double vec_minmax when fast-math is not set

2021-08-24 Thread Segher Boessenkool
Hi!

On Tue, Aug 24, 2021 at 03:04:26PM -0500, Bill Schmidt wrote:
> On 8/24/21 3:52 AM, HAO CHEN GUI wrote:
> Thanks for this patch!  In the future, if you can put your ChangeLog and 
> patch inline in your post, it makes it easier to review.  (Otherwise we 
> have to manually copy it into our response and manipulate it to look 
> quoted, etc.)

It is encoded even, making it impossible to easily apply the patch, etc.

> >diff --git a/gcc/config/rs6000/rs6000-call.c 
> >b/gcc/config/rs6000/rs6000-call.c index b4e13af4dc6..90527734ceb 
> >100644 --- a/gcc/config/rs6000/rs6000-call.c +++ 
> >b/gcc/config/rs6000/rs6000-call.c @@ -12159,6 +12159,11 @@ 
> >rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi) return true; /* 
> >flavors of vec_min. */ case VSX_BUILTIN_XVMINDP: + case 

format=flawed :-(


Segher


[PATCH] AIX SYSTEM_IMPLICIT_EXTERN_C

2021-08-24 Thread David Edelsohn via Gcc-patches
AIX 7.3 system headers are C++ safe and GCC no longer needs to define
SYSTEM_IMPLICIT_EXTERN_C for AIX 7.3.  This patch moves the definition
from aix.h to the individual OS-level configuration files and does not
define the macro for AIX 7.3.

The patch also corrects the definition of TARGET_AIX_VERSION to 73.

Bootstrapped on powerpc-ibm-aix7.2.3.0 and powerpc-ibm-aix7.3.0.0.

Thanks, David

gcc/ChangeLog:
* config/rs6000/aix.h (SYSTEM_IMPLICIT_EXTERN_C): Delete.
* config/rs6000/aix71.h (SYSTEM_IMPLICIT_EXTERN_C): Define.
* config/rs6000/aix72.h (SYSTEM_IMPLICIT_EXTERN_C): Define.
* config/rs6000/aix73.h (TARGET_AIX_VERSION): Increase to 73.

diff --git a/gcc/config/rs6000/aix.h b/gcc/config/rs6000/aix.h
index 662785cc7db..0f4d8cb2dc8 100644
--- a/gcc/config/rs6000/aix.h
+++ b/gcc/config/rs6000/aix.h
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-3.0-or-later
 /* Definitions of target machine for GNU compiler,
for IBM RS/6000 POWER running AIX.
Copyright (C) 2000-2021 Free Software Foundation, Inc.
@@ -23,9 +24,6 @@
 #undef  TARGET_AIX
 #define TARGET_AIX 1

-/* System headers are not C++-aware.  */
-#define SYSTEM_IMPLICIT_EXTERN_C 1
-
 /* Linux64.h wants to redefine TARGET_AIX based on -m64, but it can't be used
in the #if conditional in options-default.h, so provide another macro.  */
 #undef  TARGET_AIX_OS
diff --git a/gcc/config/rs6000/aix71.h b/gcc/config/rs6000/aix71.h
index 38cfa9e158a..1bc1560c496 100644
--- a/gcc/config/rs6000/aix71.h
+++ b/gcc/config/rs6000/aix71.h
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-3.0-or-later
 /* Definitions of target machine for GNU compiler,
for IBM RS/6000 POWER running AIX V7.1.
Copyright (C) 2002-2021 Free Software Foundation, Inc.
@@ -268,6 +269,9 @@ extern long long int atoll(const char *);
 #define SET_CMODEL(opt) do {} while (0)
 #endif

+/* System headers are not C++-aware.  */
+#define SYSTEM_IMPLICIT_EXTERN_C 1
+
 /* This target defines SUPPORTS_WEAK and TARGET_ASM_NAMED_SECTION,
but does not have crtbegin/end.  */

diff --git a/gcc/config/rs6000/aix72.h b/gcc/config/rs6000/aix72.h
index a497a7d8541..cca64f14f3a 100644
--- a/gcc/config/rs6000/aix72.h
+++ b/gcc/config/rs6000/aix72.h
@@ -270,6 +270,9 @@ extern long long int atoll(const char *);
 #define SET_CMODEL(opt) do {} while (0)
 #endif

+/* System headers are not C++-aware.  */
+#define SYSTEM_IMPLICIT_EXTERN_C 1
+
 /* This target defines SUPPORTS_WEAK and TARGET_ASM_NAMED_SECTION,
but does not have crtbegin/end.  */

diff --git a/gcc/config/rs6000/aix73.h b/gcc/config/rs6000/aix73.h
index c707c7e76b6..f0ca1a55e5d 100644
--- a/gcc/config/rs6000/aix73.h
+++ b/gcc/config/rs6000/aix73.h
@@ -274,7 +274,7 @@ extern long long int atoll(const char *);
 /* This target defines SUPPORTS_WEAK and TARGET_ASM_NAMED_SECTION,
but does not have crtbegin/end.  */

-#define TARGET_AIX_VERSION 72
+#define TARGET_AIX_VERSION 73

 /* AIX 7.2 supports DWARF3+ debugging.  */
 #define DWARF2_DEBUGGING_INFO 1


Re: [PATCH] rs6000: Make some BIFs vectorized on P10

2021-08-24 Thread Segher Boessenkool
Hi!

On Fri, Aug 13, 2021 at 10:34:46AM +0800, Kewen.Lin wrote:
> on 2021/8/12 11:10 PM, Segher Boessenkool wrote:
> >> +  && VECTOR_UNIT_ALTIVEC_OR_VSX_P (in_vmode))
> >> +{
> >> +  machine_mode exp_mode = DImode;
> >> +  machine_mode exp_vmode = V2DImode;
> >> +  enum rs6000_builtins vname = RS6000_BUILTIN_COUNT;
> > 
> > "name"?  This should be "bif" or similar?
> 
> Updated with name.

No, I meant "name" has no meaning other than it is wrong here :-)

It is an enum for which builtin to use here.  It has nothing to do with
a name.  So it could be "enum rs6000_builtins bif" or whatever you want;
short variable names are *good*, for many reasons, but they should not
egregiously lie :-)

> >> +/* { dg-do run } */
> >> +/* { dg-require-effective-target lp64 } */
> > 
> > Same here.  I suppose this uses builtins that do not exist on 32-bit?
> 
> Yeah, those bifs which are guarded with lp64 in their cases are only
> supported in a 64-bit environment.

It is a pity we cannot use "powerpc64" here (that selector does not test
what you would/could/should hope it tests...  Maybe someone can fix it
some day?  The only real blocker to that is fixing up the current users
of it, the rest is easy).

Expanding a bit...  You would expect (well, I do!  I did patches
expecting this several times) this to mean "powerpc64_ok", but it in
fact means "powerpc64_hw".  Maybe we should have selectors with those
two names, and get rid of the current "powerpc64"?

> >> +#define CHECK(name)   
> >> \
> >> +  __attribute__ ((optimize (1))) void check_##name () 
> >> \
> > 
> > What is the attribute for, btw?  It seems fragile, but perhaps I do not
> > understand the intention.
> 
> It's to stop the compiler from optimizing the check functions with
> vectorization, since the test point is to compare the results between the
> scalar and vectorized versions.

So, add a comment for this as well please.

In general, in testcases you can do the dirtiest things, no problems at
all, just document what you do why :-)

> Thanks, v2 has been attached by addressing Bill's and your comments.  :)

Looks good.  Just fix that "name" thing, and it is okay for trunk.
Thanks!


Segher


Re: [PATCH] rs6000: Make some BIFs vectorized on P10

2021-08-24 Thread Segher Boessenkool
On Fri, Aug 13, 2021 at 11:18:46AM +0800, Kewen.Lin wrote:
> on 2021/8/12 11:51 PM, Segher Boessenkool wrote:
> > It is a bad idea to initialise things unnecessary: it hinders many
> > optimisations, but much more importantly, it silences warnings without
> > fixing the problem.
> 
> OK, I've made it uninitialized in v2. :-)  I believe the context here is simple
> and the uninit-ed var detector can easily catch and warn the bad thing in future.

And those warnings generally are for "MAY BE used uninitialised",
anyway.  They will warn :-)

(When the warning says "IS used uninitialised" the compiler should be
sure about that!)

> Sorry for chasing dead ends, I don't follow how it can hinder optimizations here,
> IIUC it would be optimized as a dead store here?

When the compiler is not sure if something needs initialisation or not
it cannot remove actually superfluous initialisation.  Such cases are
always too complicated code, so that should be fixed, not silenced :-)

> As to the warning: although there is no warning, I'd expect it to cause an
> ICE, since the init-ed bif name isn't reasonable for generation.  Wouldn't
> that be better than a warning?  Sometimes we don't have a proper value for
> initialization, and I agree it's better to just leave it be, but IMHO that
> isn't the case here.  :)

ICEing is always wrong.  A user should never see an ICE (not counting
"sorry"s as ICEs here -- not that those are good, but they tell the user
exactly what is going on).


Segher


[Committed] PR middle-end/102031: Fix typo/mistake in simplify_truncation patch

2021-08-24 Thread Roger Sayle

My apologies again.  My patch to simplify truncations of SUBREGs in
simplify-rtx.c contained an error where I'd accidentally compared
against a mode instead of the precision of that mode.  Grr!  It even
survived regression testing on two platforms.  Fixed below, and
committed as obvious, after a full "make bootstrap" and "make -k check"
on x86_64-pc-linux-gnu with no new regressions.


2021-08-24  Roger Sayle  

gcc/ChangeLog
PR middle-end/102031
* simplify-rtx.c (simplify_truncation): When comparing precisions
use "subreg_prec" variable, not "subreg_mode".

Roger
--

diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 8eea9fb..c81e27e 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -841,7 +841,7 @@ simplify_context::simplify_truncation (machine_mode mode, 
rtx op,
{
  unsigned int int_op_prec = GET_MODE_PRECISION (int_op_mode);
  unsigned int subreg_prec = GET_MODE_PRECISION (subreg_mode);
- if (int_op_prec > subreg_mode)
+ if (int_op_prec > subreg_prec)
{
  if (int_mode == subreg_mode)
return SUBREG_REG (op);
@@ -851,7 +851,7 @@ simplify_context::simplify_truncation (machine_mode mode, 
rtx op,
}
  /* Simplification of (truncate:A (subreg:B X:C 0)) where
 A is narrower than B and B is narrower than C.  */
- else if (int_op_prec < subreg_mode
+ else if (int_op_prec < subreg_prec
   && GET_MODE_PRECISION (int_mode) < int_op_prec)
return simplify_gen_unary (TRUNCATE, int_mode,
   SUBREG_REG (op), subreg_mode);


Re: [PATCH] Change illegitimate constant into memref of constant pool in change_zero_ext.

2021-08-24 Thread Segher Boessenkool
Hi!

On Tue, Aug 24, 2021 at 04:55:30PM +0800, liuhongt wrote:
>   This patch extends change_zero_ext to move an illegitimate constant
> into the constant pool; this will enable simplification of the below:

It should be in a separate function.  recog_for_combine will call both.
But not both for the same RTL!  This is important.  Originally, combine
tried only one thing for every combination of input instructions it got.
Every extra attempt causes quite a bit more garbage (not to mention it
takes time as well, recog isn't super cheap), so we should try to not
make recog_for_combine exponential in the variants it tries.

And of course the function name should always be descriptive of what a
function does :-)

>  change_zero_ext (rtx pat)
> @@ -11417,6 +11418,23 @@ change_zero_ext (rtx pat)
>  {
>rtx x = **iter;
>scalar_int_mode mode, inner_mode;
> +  machine_mode const_mode = GET_MODE (x);
> +
> +  /* Change illegitimate constant into memref of constant pool.  */
> +  if (CONSTANT_P (x)
> +   && !const_vec_duplicate_p (x)

This is x86-specific?  It makes no sense in generic code, anyway.  Or if
it does it needs a big fat comment :-)

> +   && const_mode != BLKmode
> +   && GET_CODE (x) != HIGH
> +   && GET_MODE_SIZE (const_mode).is_constant ()
> +   && !targetm.legitimate_constant_p (const_mode, x)
> +   && !targetm.cannot_force_const_mem (const_mode, x))

You should only test that it did not recog, and then force a constant
to memory.  You do not want to do this for every constant (rotate by 42
won't likely match better if you force 42 to memory) so some
sophistication will help here, but please do not make it target-
specific.

> + {
> +   x = force_const_mem (GET_MODE (x), x);

That mode is const_mode.


Ideally you will have some example where it benefits some other target,
too.  Running recog twice for a big fraction of all combine attempts,
for no benefit at all, is not a good idea.  The zext* thing is there
because combine *itself* creates a lot of extra zext*, whether those
exist for the target or not.  So this isn't obvious precedent (and that
wouldn't mean it is a good idea anyway ;-) )


Segher


[PATCH] PR fortran/93834 - [9/10/11/12 Regression] ICE in trans_caf_is_present, at fortran/trans-intrinsic.c:8469

2021-08-24 Thread Harald Anlauf via Gcc-patches
Dear Fortranners,

here's a pretty obvious one: we didn't properly check the arguments
for intrinsics when these had to be ALLOCATABLE and in the case that
argument was a coarray object.  Simple solution: just reuse a check
that was used for pointer etc.

Regtested on x86_64-pc-linux-gnu.  OK for mainline / backports?

Thanks,
Harald


Fortran - extend allocatable_check to coarrays

gcc/fortran/ChangeLog:

PR fortran/93834
* check.c (allocatable_check): A coindexed array element is not an
allocatable object.

gcc/testsuite/ChangeLog:

PR fortran/93834
* gfortran.dg/coarray_allocated.f90: New test.

diff --git a/gcc/fortran/check.c b/gcc/fortran/check.c
index 851af1b30dc..a293a7b9592 100644
--- a/gcc/fortran/check.c
+++ b/gcc/fortran/check.c
@@ -983,6 +983,14 @@ allocatable_check (gfc_expr *e, int n)
   return false;
 }

+  if (attr.codimension && gfc_is_coindexed (e))
+{
+  gfc_error ("%qs argument of %qs intrinsic at %L shall not be "
		 "coindexed", gfc_current_intrinsic_arg[n]->name,
		 gfc_current_intrinsic, &e->where);
+  return false;
+}
+
   return true;
 }

diff --git a/gcc/testsuite/gfortran.dg/coarray_allocated.f90 b/gcc/testsuite/gfortran.dg/coarray_allocated.f90
new file mode 100644
index 000..70e865c94ac
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/coarray_allocated.f90
@@ -0,0 +1,9 @@
+! { dg-do compile }
+! { dg-options "-fcoarray=lib" }
+! PR fortran/93834 - ICE in trans_caf_is_present
+
+program p
+  integer, allocatable :: a[:]
+  print *, allocated (a)
+  print *, allocated (a[1]) ! { dg-error "shall not be coindexed" }
+end


Re: [PATCH, rs6000] Disable gimple fold for float or double vec_minmax when fast-math is not set

2021-08-24 Thread Bill Schmidt via Gcc-patches

Hi Hao Chen,

On 8/24/21 3:52 AM, HAO CHEN GUI wrote:

Hi

     The patch disables gimple fold for float or double vec_min/max
builtin when fast-math is not set. Two test cases are added to verify
the patch.

     The attachments are the patch diff and change log file.

     Bootstrapped and tested on powerpc64le-linux with no regressions. Is
this okay for trunk? Any recommendations? Thanks a lot.

Thanks for this patch!  In the future, if you can put your ChangeLog and 
patch inline in your post, it makes it easier to review.  (Otherwise we 
have to manually copy it into our response and manipulate it to look 
quoted, etc.)


Your ChangeLog isn't formatted correctly.  It should look like this:

2021-08-24  Hao Chen Gui  

gcc/
* config/rs6000/rs6000-call.c (rs6000_gimple_fold_builtin): Modify the
VSX_BUILTIN_XVMINDP, ALTIVEC_BUILTIN_VMINFP, VSX_BUILTIN_XVMAXDP, and
ALTIVEC_BUILTIN_VMAXFP expansions.

gcc/testsuite/
* gcc.target/powerpc/vec-minmax-1.c: New test.
* gcc.target/powerpc/vec-minmax-2.c: Likewise.

You forgot the committer/timestamp line and the ChangeLog location 
lines.  (The headers like "gcc/" ensure that the automated processing 
will record your entries in the ChangeLog at the correct location in the 
source tree.)  Note also that the colon ":" always follows the ending 
parenthesis when there's a function name listed.  Please review 
https://gcc.gnu.org/codingconventions.html#ChangeLogs.


diff --git a/gcc/config/rs6000/rs6000-call.c b/gcc/config/rs6000/rs6000-call.c
index b4e13af4dc6..90527734ceb 100644
--- a/gcc/config/rs6000/rs6000-call.c
+++ b/gcc/config/rs6000/rs6000-call.c
@@ -12159,6 +12159,11 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
       return true;
     /* flavors of vec_min.  */
     case VSX_BUILTIN_XVMINDP:
+    case ALTIVEC_BUILTIN_VMINFP:
+      if (!flag_finite_math_only || flag_signed_zeros)
+	return false;
+      /* Fall through to MIN_EXPR.  */
+      gcc_fallthrough ();
     case P8V_BUILTIN_VMINSD:
     case P8V_BUILTIN_VMINUD:
     case ALTIVEC_BUILTIN_VMINSB:
@@ -12167,7 +12172,6 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
     case ALTIVEC_BUILTIN_VMINUB:
     case ALTIVEC_BUILTIN_VMINUH:
     case ALTIVEC_BUILTIN_VMINUW:
-    case ALTIVEC_BUILTIN_VMINFP:
       arg0 = gimple_call_arg (stmt, 0);
       arg1 = gimple_call_arg (stmt, 1);
       lhs = gimple_call_lhs (stmt);
@@ -12177,6 +12181,11 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
       return true;
     /* flavors of vec_max.  */
     case VSX_BUILTIN_XVMAXDP:
+    case ALTIVEC_BUILTIN_VMAXFP:
+      if (!flag_finite_math_only || flag_signed_zeros)
+	return false;
+      /* Fall through to MAX_EXPR.  */
+      gcc_fallthrough ();
     case P8V_BUILTIN_VMAXSD:
     case P8V_BUILTIN_VMAXUD:
     case ALTIVEC_BUILTIN_VMAXSB:
@@ -12185,7 +12194,6 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
     case ALTIVEC_BUILTIN_VMAXUB:
     case ALTIVEC_BUILTIN_VMAXUH:
     case ALTIVEC_BUILTIN_VMAXUW:
-    case ALTIVEC_BUILTIN_VMAXFP:
       arg0 = gimple_call_arg (stmt, 0);
       arg1 = gimple_call_arg (stmt, 1);
       lhs = gimple_call_lhs (stmt);
diff --git a/gcc/testsuite/gcc.target/powerpc/vec-minmax-1.c b/gcc/testsuite/gcc.target/powerpc/vec-minmax-1.c
new file mode 100644
index 000..9782d1b9308
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/vec-minmax-1.c
@@ -0,0 +1,51 @@
+/* { dg-do compile { target { powerpc64le-*-* } } } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9" } */
+/* { dg-final { scan-assembler-times {\mxvmax[ds]p\M} 2 } } */
+/* { dg-final { scan-assembler-times {\mxvmin[ds]p\M} 2 } } */


This is pedantic, but...  You want exactly one each of xvmaxdp, xvmaxsp, 
xvmindp, and xvminsp,
so please replace this with four lines with { scan-assembler-times {...} 1 }.  
Thanks. :-)

Otherwise this looks fine to me.  I can't approve, but recommend the 
maintainers approve with
that changed.

Thanks!
Bill
+
+/* This test verifies that float or double vec_min/max are bound to
+   xv[min|max][d|s]p instructions when fast-math is not set.  */
+
+#include <altivec.h>
+
+#ifdef _BIG_ENDIAN
+  const int PREF_D = 0;
+#else
+  const int PREF_D = 1;
+#endif
+
+double vmaxd (double a, double b)
+{
+  vector double va = vec_promote (a, PREF_D);
+  vector double vb = vec_promote (b, PREF_D);
+  return vec_extract (vec_max (va, vb), PREF_D);
+}
+
+double vmind (double a, double b)
+{
+  vector double va = vec_promote (a, PREF_D);
+  vector double vb = vec_promote (b, PREF_D);
+  return vec_extract (vec_min (va, vb), PREF_D);
+}
+
+#ifdef _BIG_ENDIAN
+  const int PREF_F = 0;
+#else
+  const int PREF_F = 3;
+#endif
+
+float vmaxf (float a, float b)
+{
+  vector float va = vec_promote (a, PREF_F);
+  vector float vb = vec_promote (b, PREF_F);
+  return vec_extract (vec_max (va, vb), PREF_F);
+}
+
+float vminf (float a, float b)
+{
+  vector float va = vec_promote (a, PREF_F);
+  vector float vb = vec_promote (b, PREF_F);
+  return vec_extract (vec_min (va, vb), PREF_F);
+}
diff --git

Re: [PATCH] rs6000: Make some BIFs vectorized on P10

2021-08-24 Thread Bill Schmidt via Gcc-patches

Hi Kewen,

Sorry this sat in my queue for so long.  It looks like you addressed all 
of our concerns, so LGTM -- recommend maintainers approve.


Thanks!
Bill

On 8/12/21 9:34 PM, Kewen.Lin wrote:

Hi Segher,

Thanks for the review!

on 2021/8/12 下午11:10, Segher Boessenkool wrote:

Hi!

On Wed, Aug 11, 2021 at 02:56:11PM +0800, Kewen.Lin wrote:

* config/rs6000/rs6000.c (rs6000_builtin_md_vectorized_function): Add
support for some built-in functions vectorized on Power10.

Say which, not "some" please?


Done.


+  machine_mode in_vmode = TYPE_MODE (type_in);
+  machine_mode out_vmode = TYPE_MODE (type_out);
+
+  /* Power10 supported vectorized built-in functions.  */
+  if (TARGET_POWER10
+  && in_vmode == out_vmode
+  && VECTOR_UNIT_ALTIVEC_OR_VSX_P (in_vmode))
+{
+  machine_mode exp_mode = DImode;
+  machine_mode exp_vmode = V2DImode;
+  enum rs6000_builtins vname = RS6000_BUILTIN_COUNT;

"name"?  This should be "bif" or similar?


Updated with name.


+  switch (fn)
+   {
+   case MISC_BUILTIN_DIVWE:
+   case MISC_BUILTIN_DIVWEU:
+ exp_mode = SImode;
+ exp_vmode = V4SImode;
+ if (fn == MISC_BUILTIN_DIVWE)
+   vname = P10V_BUILTIN_DIVES_V4SI;
+ else
+   vname = P10V_BUILTIN_DIVEU_V4SI;
+ break;
+   case MISC_BUILTIN_DIVDE:
+   case MISC_BUILTIN_DIVDEU:
+ if (fn == MISC_BUILTIN_DIVDE)
+   vname = P10V_BUILTIN_DIVES_V2DI;
+ else
+   vname = P10V_BUILTIN_DIVEU_V2DI;
+ break;

All of the above should not be builtin functions really, they are all
simple arithmetic :-(  They should not be UNSPECs either, on RTL level.
They can and should be optimised in real code as well.  Oh well.


--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/dive-vectorize-2.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target lp64 } */

Please add a comment what this is needed for?  "We scan for dive*d" is
enough, but without anything, it takes time to figure this out.


Done, same for below requests on lp64 commentary.


--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/dive-vectorize-run-2.c
@@ -0,0 +1,53 @@
+/* { dg-do run } */
+/* { dg-require-effective-target lp64 } */

Same here.  I suppose this uses builtins that do not exist on 32-bit?


Yeah, those bifs which are guarded with lp64 in their cases are only
supported on 64-bit environment.


--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p10-bifs-vectorize-run-1.c
@@ -0,0 +1,45 @@
+/* { dg-do run } */
+/* { dg-require-effective-target lp64 } */

And another.


+#define CHECK(name)   \
+  __attribute__ ((optimize (1))) void check_##name () \

What is the attribute for, btw?  It seems fragile, but perhaps I do not
understand the intention.



It's to stop compiler from optimizing check functions with vectorization,
since the test point is to compare the results between scalar and vectorized
version.


Okay for trunk with those lp64 things improved.  Thanks!


Thanks, v2 has been attached by addressing Bill's and your comments.  :)


BR,
Kewen
-
gcc/ChangeLog:

* config/rs6000/rs6000.c (rs6000_builtin_md_vectorized_function): Add
support for built-in functions MISC_BUILTIN_DIVWE, MISC_BUILTIN_DIVWEU,
MISC_BUILTIN_DIVDE, MISC_BUILTIN_DIVDEU, P10_BUILTIN_CFUGED,
P10_BUILTIN_CNTLZDM, P10_BUILTIN_CNTTZDM, P10_BUILTIN_PDEPD and
P10_BUILTIN_PEXTD on Power10.


[r12-3108 Regression] FAIL: gcc.target/i386/sse2-shiftqihi-constant-1.c scan-assembler-times pxor[^\n]*%xmm 1 on Linux/x86_64

2021-08-24 Thread sunil.k.pandey via Gcc-patches
On Linux/x86_64,

6ddb30f941a44bd528904558673ab35394565f08 is the first bad commit
commit 6ddb30f941a44bd528904558673ab35394565f08
Author: liuhongt 
Date:   Fri Aug 20 15:30:40 2021 +0800

Optimize (a & b) | (c & ~b) to vpternlog instruction.

caused

FAIL: gcc.target/i386/avx2-shiftqihi-constant-1.c scan-assembler-times vpand[^\n]*%ymm 3
FAIL: gcc.target/i386/avx2-shiftqihi-constant-1.c scan-assembler-times vpxor[^\n]*%ymm 1
FAIL: gcc.target/i386/sse2-shiftqihi-constant-1.c scan-assembler-times pxor[^\n]*%xmm 1

with GCC configured with



To reproduce:

$ cd {build_dir}/gcc && make check RUNTESTFLAGS="i386.exp=gcc.target/i386/avx2-shiftqihi-constant-1.c --target_board='unix{-m32\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check RUNTESTFLAGS="i386.exp=gcc.target/i386/avx2-shiftqihi-constant-1.c --target_board='unix{-m64\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check RUNTESTFLAGS="i386.exp=gcc.target/i386/sse2-shiftqihi-constant-1.c --target_board='unix{-m32\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check RUNTESTFLAGS="i386.exp=gcc.target/i386/sse2-shiftqihi-constant-1.c --target_board='unix{-m64\ -march=cascadelake}'"

(Please do not reply to this email, for question about this report, contact me 
at skpgkp2 at gmail dot com)


[committed] wwwdocs: Fix grammar in gcc-11 release notes

2021-08-24 Thread Jonathan Wakely via Gcc-patches
Committed as obvious.

commit 6404392bcf74d2af7d171cc1df9b5c001d2218f8
Author: Jonathan Wakely 
Date:   Tue Aug 24 18:06:25 2021 +0100

Fix grammar in gcc-11 release notes

diff --git a/htdocs/gcc-11/changes.html b/htdocs/gcc-11/changes.html
index b8bb2e69..6dec8856 100644
--- a/htdocs/gcc-11/changes.html
+++ b/htdocs/gcc-11/changes.html
@@ -148,8 +148,8 @@ You may also want to check out our
   This can produce up to 25% more compact debug information
   compared to earlier versions.
 
-  To take full advantage of DWARF version 5 GCC needs to be build
-  against binutils version 2.35.2 or higher.  When GCC is build
+  To take full advantage of DWARF version 5 GCC needs to be built
+  against binutils version 2.35.2 or higher.  When GCC is built
   against earlier versions of binutils GCC will still emit DWARF
   version 5 for most debuginfo data, but will generate version 4
   debug line tables (even when explicitly given -gdwarf-5).


Re: [PATCH, V2 2/3] targhooks: New target hook for CTF/BTF debug info emission

2021-08-24 Thread Indu Bhagat via Gcc-patches

On 8/18/21 12:00 AM, Richard Biener wrote:

On Tue, Aug 17, 2021 at 7:26 PM Indu Bhagat  wrote:


On 8/17/21 1:04 AM, Richard Biener wrote:

On Mon, Aug 16, 2021 at 7:39 PM Indu Bhagat  wrote:


On 8/10/21 4:54 AM, Richard Biener wrote:

On Thu, Aug 5, 2021 at 2:52 AM Indu Bhagat via Gcc-patches
 wrote:


This patch adds a new target hook to detect if the CTF container can allow the
emission of CTF/BTF debug info at DWARF debug info early finish time. Some
backends, e.g., BPF when generating code for CO-RE usecase, may need to emit
the CTF/BTF debug info sections around the time when late DWARF debug is
finalized (dwarf2out_finish).


Without looking at the dwarf2out.c usage in the next patch - I think
the CTF part
should be always emitted from dwarf2out_early_finish, the "hooks" should somehow
arrange for the alternate output specific data to be preserved until
dwarf2out_finish
time so the late BTF data can be emitted from there.

Lumping everything together now just makes it harder to see what info
is required
to persist and thus make LTO support more intrusive than necessary.


In principle, I agree the approach to split generate/emit CTF/BTF like
you mention is ideal.  But, the BTF CO-RE relocations format is such
that the .BTF section cannot be finalized until .BTF.ext contents are
all fully known (David Faust summarizes this issue in the other thread
"[PATCH, V2 3/3] dwarf2out: Emit BTF in dwarf2out_finish for BPF CO-RE
usecase".)

In summary, the .BTF.ext section refers to strings in the .BTF section.
These strings are added at the time the CO-RE relocations are added.
Recall that the .BTF section's header has information about the .BTF
string table start offset and length. So, this means the "CTF part" (or
the .BTF section) cannot simply be emitted in the dwarf2out_early_finish
because it's not ready yet. If it is still unclear, please let me know.

My judgement here is that the BTF format itself is not amenable to split
early/late emission like DWARF. BTF has no linker support yet either.


But are the strings used for the CO-RE relocations not all present already?
Or does the "CTF part" have only "foo", "bar" and "baz" while the CO-RE
part wants to output sth like "foo->bar.baz" (which IMHO would be quite
stupid also for size purposes)?



Yes, the latter ("foo->bar.baz") is closer to what the format does for
CO-RE relocations!


That said, fix the format.

Alternatively hand the CO-RE part its own string table (what's the fuss
with re-using the CTF string table if there's nothing to share ...)



BTF and .BTF.ext formats are specified already by implementations in the
kernel, libbpf, and LLVM. For that matter, I should add BPF CO-RE to the
mix and say that BPF CO-RE capability _and_ .BTF/.BTF.ext debug formats
have been defined already by the BPF kernel developers/associated
entities. At this time, we as GCC developers simply extending the BPF
backend/BTF generation support in GCC, cannot fix the format. That ship
has sailed.


Hmm, well.  How about emitting .BTF.ext.string from GCC and have the linker
merge the .BTF.ext.string section with the CTF string section then?  You can't
really say "the ship has sailed" if I read the CTF webpage - there seems to be
many format changes planned.

Well.  Guess that was it from my side on the topic of ranting about the
not well thought out debug format ;)

Richard.


Hello Richard,

As we clarified in this thread, BTF/CO-RE format cannot be changed. What 
are your thoughts on this patch set now ? Is this OK ?


Thanks
Indu


Thanks for reviewing and voicing your concerns.
Indu



Richard.


Re: [PATCH v2] x86: Allow CONST_VECTOR for vector load in combine

2021-08-24 Thread H.J. Lu via Gcc-patches
On Tue, Aug 24, 2021 at 9:16 AM Segher Boessenkool
 wrote:
>
> On Tue, Aug 24, 2021 at 09:57:52AM +0800, Hongtao Liu wrote:
> > Trying 5 -> 7:
> > 5: r85:V4SF=[`*.LC0']
> >   REG_EQUAL const_vector
> > 7: r84:V4SF=vec_select(vec_concat(r85:V4SF,r85:V4SF),parallel)
> >   REG_DEAD r85:V4SF
> >   REG_EQUAL const_vector
> > Failed to match this instruction:
> > (set (reg:V4SF 84)
> > (const_vector:V4SF [
> > (const_double:SF 3.0e+0 [0x0.cp+2])
> > (const_double:SF 2.0e+0 [0x0.8p+2])
> > (const_double:SF 4.0e+0 [0x0.8p+3])
> > (const_double:SF 1.0e+0 [0x0.8p+1])
> > ]))
> >
> > (insn 5 2 7 2 (set (reg:V4SF 85)
> > (mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0  S16
> > A128])) 
> > "/export/users/liuhongt/install/git_trunk_master_native/lib/gcc/x86_64-pc-linux-gnu/12.0.0/include/xmmintrin.h":746:19
> > 1600 {movv4sf_internal}
> >  (expr_list:REG_EQUAL (const_vector:V4SF [
> > (const_double:SF 4.0e+0 [0x0.8p+3])
> > (const_double:SF 3.0e+0 [0x0.cp+2])
> > (const_double:SF 2.0e+0 [0x0.8p+2])
> > (const_double:SF 1.0e+0 [0x0.8p+1])
> > ])
> > (nil)))
> > (insn 7 5 11 2 (set (reg:V4SF 84)
> > (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 85)
> > (reg:V4SF 85))
> > (parallel [
> > (const_int 1 [0x1])
> > (const_int 2 [0x2])
> > (const_int 4 [0x4])
> > (const_int 7 [0x7])
> > ])))
> > "/export/users/liuhongt/install/git_trunk_master_native/lib/gcc/x86_64-pc-linux-gnu/12.0.0/include/xmmintrin.h":746:19
> > 3015 {sse_shufps_v4sf}
> >  (expr_list:REG_DEAD (reg:V4SF 85)
> > (expr_list:REG_EQUAL (const_vector:V4SF [
> > (const_double:SF 3.0e+0 [0x0.cp+2])
> > (const_double:SF 2.0e+0 [0x0.8p+2])
> > (const_double:SF 4.0e+0 [0x0.8p+3])
> > (const_double:SF 1.0e+0 [0x0.8p+1])
> > ])
> > (nil
> >
> > I think pass_combine should be extended to force illegitimate constant
> > to constant pool and recog load insn again, It looks like a general
> > optimization that better not do it in the backend.
>
> Patches welcome.  You should do this like change_zero_ext is done, and
> perhaps make sure you do not introduce new is_just_move insns that can
> make 2->2 combinations do the wrong thing.
>
> Also somehow make this not take exponential time?  It looks like this
> should only be done in cases where change_zero_ext is not, and the
> reverse, so this will work fine with a little attention to detail.
>

The combine patch is here:

https://gcc.gnu.org/pipermail/gcc-patches/2021-August/578017.html

Thanks.

-- 
H.J.


Re: [PATCH 08/34] rs6000: Add Power9 builtins

2021-08-24 Thread Bill Schmidt via Gcc-patches

On 8/24/21 10:38 AM, Segher Boessenkool wrote:

Hi!

On Tue, Aug 24, 2021 at 09:20:09AM -0500, Bill Schmidt wrote:

On 8/23/21 4:40 PM, Segher Boessenkool wrote:

On Thu, Jul 29, 2021 at 08:30:55AM -0500, Bill Schmidt wrote:

+; These things need some review to see whether they really require
+; MASK_POWERPC64.  For xsxexpdp, this seems to be fine for 32-bit,
+; because the result will always fit in 32 bits and the return
+; value is SImode; but the pattern currently requires TARGET_64BIT.

That is wrong then?  It should never have TARGET_64BIT if it isn't
addressing memory (or the like).  Did you just typo this?

Not a typo... I was referring to the condition in the following:

;; VSX Scalar Extract Exponent Double-Precision
(define_insn "xsxexpdp"
   [(set (match_operand:DI 0 "register_operand" "=r")
 (unspec:DI [(match_operand:DF 1 "vsx_register_operand" "wa")]
  UNSPEC_VSX_SXEXPDP))]
   "TARGET_P9_VECTOR && TARGET_64BIT"
   "xsxexpdp %0,%x1"
   [(set_attr "type" "integer")])

That looks wrong.  It should be TARGET_POWERPC64 afaics.


+; On the other hand, xsxsigdp has a result that doesn't fit in
+; 32 bits, and the return value is DImode, so it seems that
+; TARGET_64BIT (actually TARGET_POWERPC64) is justified.  TBD. 

Because xsxsigdp needs it, it makes sense to have it for xsxexpdp as
well, or we would get a weird holey API.

Both should have TARGET_POWERPC64 (and the underlying patterns as well
of course, we don't like ICEs so much).


Yes, the enablement support I've added uses TARGET_POWERPC64.  I think 
we need a separate patch to fix the patterns in vsx.md. I'll take a note 
on that.


Thanks!
Bill



Segher


Re: [PATCH v2] x86: Allow CONST_VECTOR for vector load in combine

2021-08-24 Thread Segher Boessenkool
On Tue, Aug 24, 2021 at 09:57:52AM +0800, Hongtao Liu wrote:
> Trying 5 -> 7:
> 5: r85:V4SF=[`*.LC0']
>   REG_EQUAL const_vector
> 7: r84:V4SF=vec_select(vec_concat(r85:V4SF,r85:V4SF),parallel)
>   REG_DEAD r85:V4SF
>   REG_EQUAL const_vector
> Failed to match this instruction:
> (set (reg:V4SF 84)
> (const_vector:V4SF [
> (const_double:SF 3.0e+0 [0x0.cp+2])
> (const_double:SF 2.0e+0 [0x0.8p+2])
> (const_double:SF 4.0e+0 [0x0.8p+3])
> (const_double:SF 1.0e+0 [0x0.8p+1])
> ]))
> 
> (insn 5 2 7 2 (set (reg:V4SF 85)
> (mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0  S16
> A128])) 
> "/export/users/liuhongt/install/git_trunk_master_native/lib/gcc/x86_64-pc-linux-gnu/12.0.0/include/xmmintrin.h":746:19
> 1600 {movv4sf_internal}
>  (expr_list:REG_EQUAL (const_vector:V4SF [
> (const_double:SF 4.0e+0 [0x0.8p+3])
> (const_double:SF 3.0e+0 [0x0.cp+2])
> (const_double:SF 2.0e+0 [0x0.8p+2])
> (const_double:SF 1.0e+0 [0x0.8p+1])
> ])
> (nil)))
> (insn 7 5 11 2 (set (reg:V4SF 84)
> (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 85)
> (reg:V4SF 85))
> (parallel [
> (const_int 1 [0x1])
> (const_int 2 [0x2])
> (const_int 4 [0x4])
> (const_int 7 [0x7])
> ])))
> "/export/users/liuhongt/install/git_trunk_master_native/lib/gcc/x86_64-pc-linux-gnu/12.0.0/include/xmmintrin.h":746:19
> 3015 {sse_shufps_v4sf}
>  (expr_list:REG_DEAD (reg:V4SF 85)
> (expr_list:REG_EQUAL (const_vector:V4SF [
> (const_double:SF 3.0e+0 [0x0.cp+2])
> (const_double:SF 2.0e+0 [0x0.8p+2])
> (const_double:SF 4.0e+0 [0x0.8p+3])
> (const_double:SF 1.0e+0 [0x0.8p+1])
> ])
> (nil
> 
> I think pass_combine should be extended to force illegitimate constant
> to constant pool and recog load insn again, It looks like a general
> optimization that better not do it in the backend.

Patches welcome.  You should do this like change_zero_ext is done, and
perhaps make sure you do not introduce new is_just_move insns that can
make 2->2 combinations do the wrong thing.

Also somehow make this not take exponential time?  It looks like this
should only be done in cases where change_zero_ext is not, and the
reverse, so this will work fine with a little attention to detail.

gl;hf,


Segher


Re: [PATCH] nvptx: Add a __PTX_ISA__ predefined macro based on target ISA.

2021-08-24 Thread Tom de Vries via Gcc-patches
On 8/20/21 12:54 AM, Roger Sayle wrote:
> 
> This patch adds a __PTX_ISA__ predefined macro to the nvptx backend that
> allows code to check the compute model being targeted by the compiler.

Hi Roger,

The naming __PTX_ISA__ is consistent with the naming of -misa=sm_30/sm_35.

The -misa=sm_30/sm_35 naming was very unfortunate given that the ptx
format actually defines an ISA version which gcc now accepts using
-mptx=3.1/6.3.

We really should have had something like:
- -march=sm_30/sm_35
- -mptx-isa=3.1/6.3
but I suppose it's too late to change that now.

Having said that, the __PTX_ISA__ name very much suggests that it's the
ptx ISA version, which, as explained above, it's not.  Sigh.

We could go for __PTX_ARCH__ instead, but it would be very
counterintuitive to have -misa=sm_30/sm_35 set this.

So I propose __PTX_SM__ instead.  [ I also considered __PTX_ISA_SM__ but
if we ever decide to change the name of the switch then that doesn't
make sense anymore. ]

> This is equivalent to the __CUDA_ARCH__ macro defined by CUDA's nvcc
> compiler, but to avoid causing problems for source code that checks
> for that compiler, this macro uses GCC's nomenclature; it's easy
> enough for users to "#define __CUDA_ARCH__ __PTX_ISA__", but I'm
> also happy to modify this patch to define __CUDA_ARCH__ if that's
> the preference of the nvptx backend maintainers.
> 

I agree with this approach.  The definition of the __CUDA_ARCH__ macro
in the cuda documentation is nvcc-specific, so let's not define it.

> What might have been a four line patch is actually a little more
> complicated, as this patch takes the opportunity to upgrade the
> nvptx backend to use the now preferred nvptx-c.c idiom.
> 

Ack.  You could split it up, but not strictly necessary.

> This patch has been tested with a cross-compiler from
> x86_64-pc-linux-gnu to nvptx-none, and tested with
> "make -k check" with no new failures.  This feature is
> useful for implementing clock() on nvptx in newlib.
> 

I see, thanks for working on that.

> Ok for mainline?

OK with name of predefined macro updated to __PTX_SM__ .

Thanks,
- Tom

> 2021-08-19  Roger Sayle  
> 
> gcc/ChangeLog
>   * config.gcc (nvptx-*-*): Define {c,c++}_target_objs.
>   * config/nvptx/nvptx-protos.h (nvptx_cpu_cpp_builtins): Prototype.
>   * config/nvptx/nvptx.h (TARGET_CPU_CPP_BUILTINS): Implement with
>   a call to the new nvptx_cpu_cpp_builtins function in nvptx-c.c.
>   * config/nvptx/t-nvptx (nvptx-c.o): New rule.
>   * config/nvptx/nvptx-c.c: New source file.
>   (nvptx_cpu_cpp_builtins): Move implementation here.
> 
> Roger
> --
> Roger Sayle
> NextMove Software
> Cambridge, UK
> 


Re: [PATCH] dwarf: Multi-register CFI address support.

2021-08-24 Thread Hafiz Abid Qadeer
Ping.

On 22/07/2021 11:58, Hafiz Abid Qadeer wrote:
> Ping.
> 
> On 13/06/2021 14:27, Hafiz Abid Qadeer wrote:
>> Add support for architectures such as AMD GCN, in which the pointer size is
>> larger than the register size.  This allows the CFI information to include
>> multi-register locations for the stack pointer, frame pointer, and return
>> address.
>>
>> This patch was originally posted by Andrew Stubbs in
>> https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552873.html
>>
>> It has now been re-worked according to the review comments. It does not use
>> DW_OP_piece or DW_OP_LLVM_piece_end. Instead it uses
>> DW_OP_bregx/DW_OP_shl/DW_OP_bregx/DW_OP_plus to build the CFA from multiple
>> consecutive registers. Here is how .debug_frame looks before and after this
>> patch:
>>
>> $ cat factorial.c
>> int factorial(int n) {
>>   if (n == 0) return 1;
>>   return n * factorial (n - 1);
>> }
>>
>> $ amdgcn-amdhsa-gcc -g factorial.c -O0 -c -o fac.o
>> $ llvm-dwarfdump -debug-frame fac.o
>>
>> *** without this patch (edited for brevity)***
>>
>>  0014  CIE
>>
>>   DW_CFA_def_cfa: reg48 +0
>>   DW_CFA_register: reg16 reg50
>>
>> 0018 002c  FDE cie= pc=...01ac
>>   DW_CFA_advance_loc4: 96
>>   DW_CFA_offset: reg46 0
>>   DW_CFA_offset: reg47 4
>>   DW_CFA_offset: reg50 8
>>   DW_CFA_offset: reg51 12
>>   DW_CFA_offset: reg16 8
>>   DW_CFA_advance_loc4: 4
>>   DW_CFA_def_cfa_sf: reg46 -16
>>
>> *** with this patch (edited for brevity)***
>>
>>  0024  CIE
>>
>>   DW_CFA_def_cfa_expression: DW_OP_bregx SGPR49+0, DW_OP_const1u 0x20, DW_OP_shl, DW_OP_bregx SGPR48+0, DW_OP_plus
>>   DW_CFA_expression: reg16 DW_OP_bregx SGPR51+0, DW_OP_const1u 0x20, DW_OP_shl, DW_OP_bregx SGPR50+0, DW_OP_plus
>>
>> 0028 003c  FDE cie= pc=...01ac
>>   DW_CFA_advance_loc4: 96
>>   DW_CFA_offset: reg46 0
>>   DW_CFA_offset: reg47 4
>>   DW_CFA_offset: reg50 8
>>   DW_CFA_offset: reg51 12
>>   DW_CFA_offset: reg16 8
>>   DW_CFA_advance_loc4: 4
>>   DW_CFA_def_cfa_expression: DW_OP_bregx SGPR47+0, DW_OP_const1u 0x20, DW_OP_shl, DW_OP_bregx SGPR46+0, DW_OP_plus, DW_OP_lit16, DW_OP_minus
>>
>> gcc/ChangeLog:
>>
>>  * dwarf2cfi.c (dw_stack_pointer_regnum): Change type to struct cfa_reg.
>>  (dw_frame_pointer_regnum): Likewise.
>>  (new_cfi_row): Use set_by_dwreg.
>>  (get_cfa_from_loc_descr): Use set_by_dwreg.  Support register spans.
>>  handle DW_OP_bregx with DW_OP_breg{0-31}. Support DW_OP_lit*,
>>  DW_OP_const*, DW_OP_minus, DW_OP_shl and DW_OP_plus.
>>  (lookup_cfa_1): Use set_by_dwreg.
>>  (def_cfa_0): Update for cfa_reg and support register spans.
>>  (reg_save): Change sreg parameter to struct cfa_reg.  Support register
>>  spans.
>>  (dwf_cfa_reg): New function.
>>  (dwarf2out_flush_queued_reg_saves): Use dwf_cfa_reg instead of
>>  dwf_regno.
>>  (dwarf2out_frame_debug_def_cfa): Likewise.
>>  (dwarf2out_frame_debug_adjust_cfa): Likewise.
>>  (dwarf2out_frame_debug_cfa_offset): Likewise.  Update reg_save usage.
>>  (dwarf2out_frame_debug_cfa_register): Likewise.
>>  (dwarf2out_frame_debug_expr): Likewise.
>>  (create_pseudo_cfg): Use set_by_dwreg.
>>  (initial_return_save): Use set_by_dwreg and dwf_cfa_reg,
>>  (create_cie_data): Use dwf_cfa_reg.
>>  (execute_dwarf2_frame): Use dwf_cfa_reg.
>>  (dump_cfi_row): Use set_by_dwreg.
>>  * dwarf2out.c (build_span_loc, build_breg_loc): New function.
>>  (build_cfa_loc): Support register spans.
>>  (build_cfa_aligned_loc): Update cfa_reg usage.
>>  (convert_cfa_to_fb_loc_list): Use set_by_dwreg.
>>  * dwarf2out.h (struct cfa_reg): New type.
>>  (struct dw_cfa_location): Use struct cfa_reg.
>>  (build_span_loc): New prototype.
>>  * gengtype.c (main): Accept poly_uint16_pod type.
>> ---
>>  gcc/dwarf2cfi.c | 260 
>>  gcc/dwarf2out.c |  55 +-
>>  gcc/dwarf2out.h |  37 ++-
>>  gcc/gengtype.c  |   1 +
>>  4 files changed, 283 insertions(+), 70 deletions(-)
>>
>> diff --git a/gcc/dwarf2cfi.c b/gcc/dwarf2cfi.c
>> index c27ac1960b0..5aacdcd094a 100644
>> --- a/gcc/dwarf2cfi.c
>> +++ b/gcc/dwarf2cfi.c
>> @@ -229,8 +229,8 @@ static vec queued_reg_saves;
>>  static bool any_cfis_emitted;
>>  
>>  /* Short-hand for commonly used register numbers.  */
>> -static unsigned dw_stack_pointer_regnum;
>> -static unsigned dw_frame_pointer_regnum;
>> +static struct cfa_reg dw_stack_pointer_regnum;
>> +static struct cfa_reg dw_frame_pointer_regnum;
>>  
>>  /* Hook used by __throw.  */
>>  
>> @@ -430,7 +430,7 @@ new_cfi_row (void)
>>  {
>>dw_cfi_row *row = ggc_cleared_alloc ();
>>  
>> -  row->cfa.reg = INVALID_REGNUM;
>> +  row->cfa.reg.set_by_dwreg (INVALID_REGNUM);
>>  
>>return row;
>>  }
>> @@ -538,7 +538,7 @@ get_cfa_from_loc_descr (dw_cfa_location *cfa, struct 
>> dw_loc_descr_node *loc)

Re: [PATCH 08/34] rs6000: Add Power9 builtins

2021-08-24 Thread Segher Boessenkool
Hi!

On Tue, Aug 24, 2021 at 09:20:09AM -0500, Bill Schmidt wrote:
> On 8/23/21 4:40 PM, Segher Boessenkool wrote:
> >On Thu, Jul 29, 2021 at 08:30:55AM -0500, Bill Schmidt wrote:
> >>+; These things need some review to see whether they really require
> >>+; MASK_POWERPC64.  For xsxexpdp, this seems to be fine for 32-bit,
> >>+; because the result will always fit in 32 bits and the return
> >>+; value is SImode; but the pattern currently requires TARGET_64BIT.
> >That is wrong then?  It should never have TARGET_64BIT if it isn't
> >addressing memory (or the like).  Did you just typo this?
> 
> Not a typo... I was referring to the condition in the following:
> 
> ;; VSX Scalar Extract Exponent Double-Precision
> (define_insn "xsxexpdp"
>   [(set (match_operand:DI 0 "register_operand" "=r")
> (unspec:DI [(match_operand:DF 1 "vsx_register_operand" "wa")]
>  UNSPEC_VSX_SXEXPDP))]
>   "TARGET_P9_VECTOR && TARGET_64BIT"
>   "xsxexpdp %0,%x1"
>   [(set_attr "type" "integer")])

That looks wrong.  It should be TARGET_POWERPC64 afaics.

> >>+; On the other hand, xsxsigdp has a result that doesn't fit in
> >>+; 32 bits, and the return value is DImode, so it seems that
> >>+; TARGET_64BIT (actually TARGET_POWERPC64) is justified.  TBD. 
> >Because xsxsigdp needs it, it makes sense to have it for xsxexpdp as
> >well, or we would get a weird holey API.

Both should have TARGET_POWERPC64 (and the underlying patterns as well
of course, we don't like ICEs so much).


Segher


Re: [PATCH] c++: Fix unnecessary error when top-level cv-qualifiers is dropped [PR101783]

2021-08-24 Thread Jonathan Wakely via Gcc-patches
>PR c++/101387
>
>gcc/cp/ChangeLog:
>PR c++/101387
>* tree.c (cp_build_qualified_type_real): Excluding typedef from error
>
>gcc/testsuite/ChangeLog:
>PR c++/101387
>* g++.dg/parse/pr101783.C: New test.

This is the wrong PR number.

How has this patch been tested? Have you bootstrapped the compiler and
run the full testsuite?





Re: [PATCH] i386: Add peephole for lea and zero extend [PR 101716]

2021-08-24 Thread Hongyu Wang via Gcc-patches
Hi Uros,

Sorry for the late update. I have tried adjusting the combine pass but
found it is not easy to modify shift const, so I came up with an
alternative solution with your patch. It matches the non-canonical
zero-extend in ix86_decompose_address and adjust ix86_rtx_cost to
combine below pattern

(set (reg:DI 85)
   (and:DI (ashift:DI (reg:DI 87)
   (const_int 1 [0x1]))
   (const_int 4294967294 [0xfffffffe])))

Survived bootstrap and regtest on x86-64-linux. Ok for master?

Uros Bizjak wrote on Mon, Aug 16, 2021 at 5:26 PM:

>
> On Mon, Aug 16, 2021 at 11:18 AM Hongyu Wang  wrote:
> >
> > > So, the question is if the combine pass really needs to zero-extend
> > > with 0xfffffffe, the left shift << 1 guarantees zero in the LSB, so
> > > 0xffffffff should be better and in line with canonical zero-extension
> > > RTX.
> >
> > The shift mask is generated in simplify_shift_const_1:
> >
> > mask_rtx = gen_int_mode (nonzero_bits (varop, int_varop_mode),
> >  int_result_mode);
> > rtx count_rtx = gen_int_shift_amount (int_result_mode, count);
> > mask_rtx
> >   = simplify_const_binary_operation (code, int_result_mode,
> >  mask_rtx, count_rtx);
> >
> > Can we adjust the count for ashift if nonzero_bits overlaps it?
> >
> > > Also, ix86_decompose_address accepts ASHIFT RTX when ASHIFT is
> > > embedded in the PLUS chain, but naked ASHIFT is rejected (c.f. the
> > > call in ix86_legitimate_address_p) for some (historic?) reason. It
> > > looks to me that this restriction is not necessary, since
> > > ix86_legitimize_address can canonicalize ASHIFT RTXes without
> > > problems. The attached patch that survives bootstrap and regtest can
> > > help in your case.
> >
> > We have a split to transform ashift to mult, I'm afraid it could not
> > help this issue.
>
> If you want existing *lea to accept ASHIFT RTX, it uses
> address_no_seg_operand predicate which uses address_operand predicate,
> which calls ix86_legitimate_address_p, which ATM rejects ASHIFT RTXes.
>
> Uros.
From 4bcebb985439867d12f2038e97c72baaf092ffbf Mon Sep 17 00:00:00 2001
From: Hongyu Wang 
Date: Tue, 17 Aug 2021 16:53:46 +0800
Subject: [PATCH] i386: Optimize lea with zero-extend. [PR 101716]

For the ASHIFT + ZERO_EXTEND pattern, the combine pass fails to
match it to lea since combine generates a non-canonical
zero-extend. Adjust the predicate and cost model to allow
combining to lea.

gcc/ChangeLog:

	PR target/101716
	* config/i386/i386.c (ix86_live_on_entry): Adjust comment.
	(ix86_decompose_address): Remove retval check for ASHIFT,
	allow non-canonical zero extend if AND mask covers ASHIFT
	count.
	(ix86_legitimate_address_p): Adjust condition for decompose.
	(ix86_rtx_costs): Adjust cost for lea with non-canonical
	zero-extend.

	Co-Authored by: Uros Bizjak 

gcc/testsuite/ChangeLog:

	PR target/101716
	* gcc.target/i386/pr101716.c: New test.
---
 gcc/config/i386/i386.c   | 36 
 gcc/testsuite/gcc.target/i386/pr101716.c | 11 
 2 files changed, 41 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101716.c

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 5bff131f6d9..a997fc04004 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -10018,8 +10018,7 @@ ix86_live_on_entry (bitmap regs)
 
 /* Extract the parts of an RTL expression that is a valid memory address
for an instruction.  Return 0 if the structure of the address is
-   grossly off.  Return -1 if the address contains ASHIFT, so it is not
-   strictly valid, but still used for computing length of lea instruction.  */
+   grossly off.  */
 
 int
 ix86_decompose_address (rtx addr, struct ix86_address *out)
@@ -10029,7 +10028,6 @@ ix86_decompose_address (rtx addr, struct ix86_address *out)
   HOST_WIDE_INT scale = 1;
   rtx scale_rtx = NULL_RTX;
   rtx tmp;
-  int retval = 1;
   addr_space_t seg = ADDR_SPACE_GENERIC;
 
   /* Allow zero-extended SImode addresses,
@@ -10053,6 +10051,27 @@ ix86_decompose_address (rtx addr, struct ix86_address *out)
 	  if (CONST_INT_P (addr))
 	return 0;
 	}
+  else if (GET_CODE (addr) == AND)
+	{
+	  /* For ASHIFT inside AND, combine will not generate
+	 canonical zero-extend. Merge mask for AND and shift_count
+	 to check if it is canonical zero-extend.  */
+	  tmp = XEXP (addr, 0);
+	  rtx mask = XEXP (addr, 1);
+	  if (tmp && GET_CODE(tmp) == ASHIFT)
+	{
+	  rtx shift_val = XEXP (tmp, 1);
+	  if (CONST_INT_P (mask) && CONST_INT_P (shift_val)
+		  && (((unsigned HOST_WIDE_INT) INTVAL(mask)
+		  | (HOST_WIDE_INT_1U << (INTVAL(shift_val) - 1)))
+		  == 0xffffffff))
+		{
+		  addr = lowpart_subreg (SImode, XEXP (addr, 0),
+	 DImode);
+		}
+	}
+
+	}
 }
 
   /* Allow SImode subregs of DImode addresses,
@@ -10179,7 +10198,6 @@ ix86_decompose_address (rtx addr, struct ix86_address *out)
   if ((unsigned HOST_WIDE_INT) scale > 3)
 	return 0;
   scale = 

Re: [committed] libstdc++: Add std::is_layout_compatible trait for C++20

2021-08-24 Thread Jonathan Wakely via Gcc-patches

On 24/08/21 16:13 +0100, Jonathan Wakely wrote:

Signed-off-by: Jonathan Wakely 

libstdc++-v3/ChangeLog:

* include/std/type_traits (is_layout_compatible): Define.
(is_corresponding_member): Define.
* include/std/version (__cpp_lib_is_layout_compatible): Define.
* testsuite/20_util/is_layout_compatible/is_corresponding_member.cc:
New test.
* testsuite/20_util/is_layout_compatible/value.cc: New test.
* testsuite/20_util/is_layout_compatible/version.cc: New test.
* testsuite/20_util/is_pointer_interconvertible/with_class.cc:
New test.
* testsuite/23_containers/span/layout_compat.cc: Do not use real
std::is_layout_compatible trait if available.



And the doc patch, also pushed to trunk.


commit 6d692ef43b2b3368c92c3fb757c7884fc94ee627
Author: Jonathan Wakely 
Date:   Tue Aug 24 16:15:48 2021

libstdc++: Update C++20 status table for layout-compatibility traits

Signed-off-by: Jonathan Wakely 

libstdc++-v3/ChangeLog:

* doc/xml/manual/status_cxx2020.xml: Update table.
* doc/html/manual/status.html: Regenerate.

diff --git a/libstdc++-v3/doc/xml/manual/status_cxx2020.xml b/libstdc++-v3/doc/xml/manual/status_cxx2020.xml
index a729ddd3ada..26c882907f3 100644
--- a/libstdc++-v3/doc/xml/manual/status_cxx2020.xml
+++ b/libstdc++-v3/doc/xml/manual/status_cxx2020.xml
@@ -294,13 +294,12 @@ or any notes about the implementation.
 
 
 
-  
Layout-compatibility and pointer-interconvertibility traits 
   
 http://www.w3.org/1999/xlink; xlink:href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0466r5.pdf;>
 P0466R5 
   
-   
+   12 
   
 
  __cpp_lib_is_layout_compatible = 201907L 


[committed] libstdc++: Add std::is_layout_compatible trait for C++20

2021-08-24 Thread Jonathan Wakely via Gcc-patches
Signed-off-by: Jonathan Wakely 

libstdc++-v3/ChangeLog:

* include/std/type_traits (is_layout_compatible): Define.
(is_corresponding_member): Define.
* include/std/version (__cpp_lib_is_layout_compatible): Define.
* testsuite/20_util/is_layout_compatible/is_corresponding_member.cc:
New test.
* testsuite/20_util/is_layout_compatible/value.cc: New test.
* testsuite/20_util/is_layout_compatible/version.cc: New test.
* testsuite/20_util/is_pointer_interconvertible/with_class.cc:
New test.
* testsuite/23_containers/span/layout_compat.cc: Do not use real
std::is_layout_compatible trait if available.

Tested powerpc64le-linux. Committed to trunk.

commit 037ef219b27c26d4c125368e685a89da7f8cc701
Author: Jonathan Wakely 
Date:   Tue Aug 24 14:42:37 2021

libstdc++: Add std::is_layout_compatible trait for C++20

Signed-off-by: Jonathan Wakely 

libstdc++-v3/ChangeLog:

* include/std/type_traits (is_layout_compatible): Define.
(is_corresponding_member): Define.
* include/std/version (__cpp_lib_is_layout_compatible): Define.
* testsuite/20_util/is_layout_compatible/is_corresponding_member.cc:
New test.
* testsuite/20_util/is_layout_compatible/value.cc: New test.
* testsuite/20_util/is_layout_compatible/version.cc: New test.
* testsuite/20_util/is_pointer_interconvertible/with_class.cc:
New test.
* testsuite/23_containers/span/layout_compat.cc: Do not use real
std::is_layout_compatible trait if available.

diff --git a/libstdc++-v3/include/std/type_traits 
b/libstdc++-v3/include/std/type_traits
index 15718000800..a0010d960b2 100644
--- a/libstdc++-v3/include/std/type_traits
+++ b/libstdc++-v3/include/std/type_traits
@@ -3414,6 +3414,31 @@ template
 inline constexpr bool is_unbounded_array_v
   = is_unbounded_array<_Tp>::value;
 
+#if __has_builtin(__is_layout_compatible)
+
+  /// @since C++20
+  template<typename _Tp, typename _Up>
+struct is_layout_compatible
+: bool_constant<__is_layout_compatible(_Tp, _Up)>
+{ };
+
+  /// @ingroup variable_templates
+  /// @since C++20
+  template<typename _Tp, typename _Up>
+constexpr bool is_layout_compatible_v
+  = __is_layout_compatible(_Tp, _Up);
+
+#if __has_builtin(__builtin_is_corresponding_member)
+#define __cpp_lib_is_layout_compatible 201907L
+
+  /// @since C++20
+  template<typename _S1, typename _M1, typename _S2, typename _M2>
+constexpr bool
+is_corresponding_member(_M1 _S1::*__m1, _M2 _S2::*__m2) noexcept
+{ return __builtin_is_corresponding_member(__m1, __m2); }
+#endif
+#endif
+
 #if __has_builtin(__is_pointer_interconvertible_base_of)
   /// True if `_Derived` is standard-layout and has a base class of type 
`_Base`
   /// @since C++20
diff --git a/libstdc++-v3/include/std/version b/libstdc++-v3/include/std/version
index 925f27704c4..70d573bb517 100644
--- a/libstdc++-v3/include/std/version
+++ b/libstdc++-v3/include/std/version
@@ -236,6 +236,10 @@
 #ifdef _GLIBCXX_HAS_GTHREADS
 # define __cpp_lib_jthread 201911L
 #endif
+#if __has_builtin(__is_layout_compatible) \
+  && __has_builtin(__builtin_is_corresponding_member)
+# define __cpp_lib_is_layout_compatible 201907L
+#endif
 #if __has_builtin(__is_pointer_interconvertible_base_of) \
  && __has_builtin(__builtin_is_pointer_interconvertible_with_class)
 # define __cpp_lib_is_pointer_interconvertible 201907L
diff --git 
a/libstdc++-v3/testsuite/20_util/is_layout_compatible/is_corresponding_member.cc
 
b/libstdc++-v3/testsuite/20_util/is_layout_compatible/is_corresponding_member.cc
new file mode 100644
index 000..69b359aa1d5
--- /dev/null
+++ 
b/libstdc++-v3/testsuite/20_util/is_layout_compatible/is_corresponding_member.cc
@@ -0,0 +1,19 @@
+// { dg-options "-std=gnu++20" }
+// { dg-do compile { target c++20 } }
+#include <type_traits>
+
+using std::is_corresponding_member;
+
+struct A { int a; };
+struct B { int b; };
+struct C: public A, public B { };  // not a standard-layout class
+
+static_assert( is_corresponding_member( &A::a, &B::b ) );
+// Succeeds because arguments have types int A::* and int B::*
+
+constexpr int C::*a = &A::a;
+constexpr int C::*b = &B::b;
+static_assert( ! is_corresponding_member( a, b ) );
+// Not corresponding members, because arguments both have type int C::*
+
+static_assert( noexcept(!is_corresponding_member(a, b)) );
diff --git a/libstdc++-v3/testsuite/20_util/is_layout_compatible/value.cc 
b/libstdc++-v3/testsuite/20_util/is_layout_compatible/value.cc
new file mode 100644
index 000..7686b34fc5a
--- /dev/null
+++ b/libstdc++-v3/testsuite/20_util/is_layout_compatible/value.cc
@@ -0,0 +1,56 @@
+// { dg-options "-std=gnu++20" }
+// { dg-do compile { target c++20 } }
+#include <type_traits>
+
+#ifndef __cpp_lib_is_layout_compatible
+# error "Feature test macro for is_layout_compatible is missing in 
"
+#elif __cpp_lib_is_layout_compatible < 201907L
+# error "Feature test macro for is_layout_compatible has wrong value in 
"
+#endif

Re: [GCC-11] [PATCH 0/5] Finish and general-regs-only

2021-08-24 Thread H.J. Lu via Gcc-patches
On Sun, Aug 15, 2021 at 11:11 PM Richard Biener
 wrote:
>
> On Fri, Aug 13, 2021 at 3:51 PM H.J. Lu  wrote:
> >
> >  and target("general-regs-only") function attribute
> > were added to GCC 11.  But their implementations are incomplete.  I'd
> > like to backport the following patches to GCC 11 branch to finish them.
>
> Fine with me if x86 maintainers do not disagree (also see one comment I have
> on the -mwait adding patch).

Hi Uros, Honza,

Do you have any comments?  The updated -mwait patch with LTO_minor_version
bump is at:

https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577471.html

Thanks.

H.J.
> > H.J. Lu (5):
> >   x86: Add -mmwait for -mgeneral-regs-only
> >   x86: Use crc32 target option for CRC32 intrinsics
> >   x86: Remove OPTION_MASK_ISA_SSE4_2 from CRC32 _builtin functions
> >   x86: Enable the GPR only instructions for -mgeneral-regs-only
> >   : Add pragma GCC target("general-regs-only")
> >
> >  gcc/common/config/i386/i386-common.c   |  45 ++-
> >  gcc/config.gcc |   6 +-
> >  gcc/config/i386/i386-builtin.def   |   8 +-
> >  gcc/config/i386/i386-builtins.c|   4 +-
> >  gcc/config/i386/i386-c.c   |   2 +
> >  gcc/config/i386/i386-options.c |  12 +
> >  gcc/config/i386/i386.c |   6 +-
> >  gcc/config/i386/i386.h |   2 +
> >  gcc/config/i386/i386.md|   4 +-
> >  gcc/config/i386/i386.opt   |   4 +
> >  gcc/config/i386/ia32intrin.h   |  42 ++-
> >  gcc/config/i386/mwaitintrin.h  |  52 +++
> >  gcc/config/i386/pmmintrin.h|  13 +-
> >  gcc/config/i386/serializeintrin.h  |   7 +-
> >  gcc/config/i386/sse.md |   4 +-
> >  gcc/config/i386/x86gprintrin.h |  13 +
> >  gcc/doc/extend.texi|   5 +
> >  gcc/doc/invoke.texi|   8 +-
> >  gcc/testsuite/gcc.target/i386/crc32-6.c|  13 +
> >  gcc/testsuite/gcc.target/i386/monitor-2.c  |  27 ++
> >  gcc/testsuite/gcc.target/i386/pr101492-1.c |  10 +
> >  gcc/testsuite/gcc.target/i386/pr101492-2.c |  10 +
> >  gcc/testsuite/gcc.target/i386/pr101492-3.c |  10 +
> >  gcc/testsuite/gcc.target/i386/pr101492-4.c |  12 +
> >  gcc/testsuite/gcc.target/i386/pr99744-3.c  |  13 +
> >  gcc/testsuite/gcc.target/i386/pr99744-4.c  | 357 +
> >  gcc/testsuite/gcc.target/i386/pr99744-5.c  |  25 ++
> >  gcc/testsuite/gcc.target/i386/pr99744-6.c  |  23 ++
> >  gcc/testsuite/gcc.target/i386/pr99744-7.c  |  12 +
> >  gcc/testsuite/gcc.target/i386/pr99744-8.c  |  13 +
> >  30 files changed, 717 insertions(+), 45 deletions(-)
> >  create mode 100644 gcc/config/i386/mwaitintrin.h
> >  create mode 100644 gcc/testsuite/gcc.target/i386/crc32-6.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/monitor-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101492-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101492-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101492-3.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101492-4.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr99744-3.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr99744-4.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr99744-5.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr99744-6.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr99744-7.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr99744-8.c
> >
> > --
> > 2.31.1
> >



--
H.J.


Re: [PATCH 09/34] rs6000: Add more type nodes to support builtin processing

2021-08-24 Thread Bill Schmidt via Gcc-patches



On 8/23/21 5:15 PM, Segher Boessenkool wrote:

On Thu, Jul 29, 2021 at 08:30:56AM -0500, Bill Schmidt wrote:

* config/rs6000/rs6000-call.c (rs6000_init_builtins): Initialize
various pointer type nodes.
* config/rs6000/rs6000.h (rs6000_builtin_type_index): Add enum
values for various pointer types.
(ptr_V16QI_type_node): New macro.

[ ... ]


(ptr_long_long_unsigned_type_node): New macro.



+  ptr_long_integer_type_node
+= build_pointer_type
+   (build_qualified_type (long_integer_type_internal_node,
+  TYPE_QUAL_CONST));
+
+  ptr_long_unsigned_type_node
+= build_pointer_type
+   (build_qualified_type (long_unsigned_type_internal_node,
+  TYPE_QUAL_CONST));

This isn't correct formatting either.  Just use a temp variable?  Long
names and function calls do not mix, moreso with our coding conventions.

   tree t = build_qualified_type (long_unsigned_type_internal_node,
 TYPE_QUAL_CONST));
   ptr_long_unsigned_type_node = build_pointer_type (t);

Good choice, will do.

+  if (dfloat64_type_node)
+ptr_dfloat64_type_node
+  = build_pointer_type (build_qualified_type (dfloat64_type_internal_node,

You might want to use a block to make this a little more readable / less
surprising.  Okay either way.

Yep.  Will use a temp variable again and that will force the block.

@@ -2517,6 +2558,47 @@ enum rs6000_builtin_type_index
  #define vector_pair_type_node  
(rs6000_builtin_types[RS6000_BTI_vector_pair])
  #define vector_quad_type_node  
(rs6000_builtin_types[RS6000_BTI_vector_quad])
  #define pcvoid_type_node   
(rs6000_builtin_types[RS6000_BTI_const_ptr_void])
+#define ptr_V16QI_type_node 
(rs6000_builtin_types[RS6000_BTI_ptr_V16QI])

Not new of course, but those outer parens are pointless.  In macros
write extra parens around uses of parameters, and nowhere else.

Okay for trunk with the formatting fixed.  Thanks!


Thanks for the review!

Bill



Segher


Re: [PATCH 08/34] rs6000: Add Power9 builtins

2021-08-24 Thread Bill Schmidt via Gcc-patches

On 8/23/21 4:40 PM, Segher Boessenkool wrote:

On Thu, Jul 29, 2021 at 08:30:55AM -0500, Bill Schmidt wrote:

2021-06-15  Bill Schmidt  
* config/rs6000/rs6000-builtin-new.def: Add power9-vector, power9,
and power9-64 stanzas.
+; These things need some review to see whether they really require
+; MASK_POWERPC64.  For xsxexpdp, this seems to be fine for 32-bit,
+; because the result will always fit in 32 bits and the return
+; value is SImode; but the pattern currently requires TARGET_64BIT.

That is wrong then?  It should never have TARGET_64BIT if it isn't
addressing memory (or the like).  Did you just typo this?


Not a typo... I was referring to the condition in the following:

;; VSX Scalar Extract Exponent Double-Precision
(define_insn "xsxexpdp"
  [(set (match_operand:DI 0 "register_operand" "=r")
(unspec:DI [(match_operand:DF 1 "vsx_register_operand" "wa")]
 UNSPEC_VSX_SXEXPDP))]
  "TARGET_P9_VECTOR && TARGET_64BIT"
  "xsxexpdp %0,%x1"
  [(set_attr "type" "integer")])


+; On the other hand, xsxsigdp has a result that doesn't fit in
+; 32 bits, and the return value is DImode, so it seems that
+; TARGET_64BIT (actually TARGET_POWERPC64) is justified.  TBD. 

Because xsxsigdp needs it, it makes sense to have it for xsxexpdp as
well, or we would get a weird holey API.


OK.  Based on this, I think I will just remove the comments here.

Thanks very much for the review!

Bill



Okay for trunk (with the typo fixed if it is one).  Thanks!


Segher


[COMMITTED] Add transitive operations to the relation oracle.

2021-08-24 Thread Andrew MacLeod via Gcc-patches

This patch adds transitive relations to the oracle.

When a relation is registered with the oracle, it searches back thru the 
dominator tree for other relations which may provide a transitive 
relation and registers those. It also considers any active equivalences 
during the search.  With this, we can eliminate this call to kill() in evrp:


  if (a == x && w == z)
 if (x > y)
   if (y > z)
      {
       if (a <= w)
     kill ();
      }

I added a new test case to test various code paths, and I had to adjust 
gcc.dg/predict-1.c to disable evrp because we now eliminate one of the 
branches in the testcase that the prediction engine was looking for data on:


  for (i = 0; i < bound; i++)
    {
  if (i > bound)    // Branch is now eliminated
    global += bar (i);

Bootstraps on x86_64-pc-linux-gnu with no regressions. pushed.

Andrew

commit 675a3e40567e1d0dd6d7e7be3efab74b22731415
Author: Andrew MacLeod 
Date:   Wed Aug 18 16:36:19 2021 -0400

Add transitive operations to the relation oracle.

When registering relations in the oracle, search for other relations which
imply new transitive relations.

gcc/
* value-relation.cc (rr_transitive_table): New.
(relation_transitive): New.
(value_relation::swap): Remove.
(value_relation::apply_transitive): New.
(relation_oracle::relation_oracle): Allocate a new tmp bitmap.
(relation_oracle::register_relation): Call register_transitives.
(relation_oracle::register_transitives): New.
* value-relation.h (relation_oracle): Add new temporary bitmap and
methods.

gcc/testsuite/
* gcc.dg/predict-1.c: Disable evrp.
* gcc.dg/tree-ssa/evrp-trans.c: New.

diff --git a/gcc/testsuite/gcc.dg/predict-1.c b/gcc/testsuite/gcc.dg/predict-1.c
index 9e5605a2e84..d2e753e624e 100644
--- a/gcc/testsuite/gcc.dg/predict-1.c
+++ b/gcc/testsuite/gcc.dg/predict-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-profile_estimate" } */
+/* { dg-options "-O2 -fdump-tree-profile_estimate --disable-tree-evrp" } */
 
 extern int global;
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/evrp-trans.c b/gcc/testsuite/gcc.dg/tree-ssa/evrp-trans.c
new file mode 100644
index 000..8ee8e3c3f42
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/evrp-trans.c
@@ -0,0 +1,144 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-evrp" } */
+
+/* Simple tests to make sure transitives are working. */
+void keep();
+void kill();
+
+void
+f1 (int x, int y, int z)
+{
+  if (x > y)
+if (y > z)
+  {
+	if (x > z)
+	  keep ();
+	else
+	  kill ();
+  }
+}
+
+void
+f2 (int w, int x, int y, int z)
+{  
+  // Test one equivalence.
+  if (w == z)
+if (x > y)
+  if (y > z)
+	{
+	  if (x > w)
+	keep ();
+	  else
+	kill ();
+	}
+}
+
+void
+f3 (int a, int w, int x, int y, int z)
+{  
+  // Test two equivalences.
+  if (a == x)
+if (w == z)
+  if (x > y)
+	if (y > z)
+	  {
+	if (a > w)
+	  keep ();
+	else
+	  kill ();
+	  }
+}
+
+void
+f4 (int x, int y, int z)
+{
+  // test X > Y >= Z
+  if (x > y)
+if (y >= z)
+  {
+if (x > z)
+  keep ();
+else
+  kill ();
+  }
+}
+void
+f5 (int x, int y, int z)
+{
+  // test X >= Y > Z
+  if (x >= y)
+if (y > z)
+  {
+if (x > z)
+  keep ();
+else
+  kill ();
+  }
+}
+
+void
+f6 (int x, int y, int z)
+{
+  // test X >= Y >= Z
+  if (x >= y)
+if (y >= z)
+  {
+if (x > z)
+  keep ();
+else if (x == z)
+	  keep ();
+ else
+  kill ();
+  }
+}
+
+void
+f7 (int x, int y, int z)
+{
+  // test Y <= X , Z <= Y
+  if (y <= x)
+if (z <= y)
+  {
+if (x > z)
+  keep ();
+else if (x == z)
+	  keep ();
+	else
+  kill ();
+  }
+}
+
+void
+f8 (int x, int y, int z)
+{
+  // test X >= Y, Z <= Y
+  if (x >= y)
+if (z <= y)
+  {
+if (x > z)
+  keep ();
+else if (x == z)
+	  keep ();
+	else
+  kill ();
+  }
+}
+
+void
+f9 (int x, int y, int z)
+{
+  // test Y <= X   Y >= Z
+  if (y <= x)
+if (y >= z)
+  {
+if (x > z)
+  keep ();
+else if (x == z)
+	  keep ();
+else
+  kill ();
+  }
+}
+
+/* { dg-final { scan-tree-dump-not "kill" "evrp" } }  */
+/* { dg-final { scan-tree-dump-times "keep" 13 "evrp"} } */
diff --git a/gcc/value-relation.cc b/gcc/value-relation.cc
index bcfe388acf1..8edd98b612a 100644
--- a/gcc/value-relation.cc
+++ b/gcc/value-relation.cc
@@ -112,7 +112,7 @@ relation_kind rr_intersect_table[VREL_COUNT][VREL_COUNT] = {
   { NE_EXPR, LT_EXPR, LT_EXPR, GT_EXPR, GT_EXPR, VREL_EMPTY, VREL_EMPTY, NE_EXPR } };
 
 
-// Intersect relation R! with relation R2 and return the resulting relation.
+// Intersect relation R1 with relation R2 and return the resulting relation.

[committed] libstdc++: Fix mismatched class-key tags

2021-08-24 Thread Jonathan Wakely via Gcc-patches
Clang warns about this, but GCC doesn't (see PR c++/102036).

Signed-off-by: Jonathan Wakely 

libstdc++-v3/ChangeLog:

* src/c++11/cxx11-shim_facets.cc: Fix mismatched class-key in
explicit instantiation definitions.

Tested powerpc64le-linux. Committed to trunk.

commit d8b7282ea27e02f687272cb8ea5f66ca900f1582
Author: Jonathan Wakely 
Date:   Tue Aug 24 12:31:06 2021

libstdc++: Fix mismatched class-key tags

Clang warns about this, but GCC doesn't (see PR c++/102036).

Signed-off-by: Jonathan Wakely 

libstdc++-v3/ChangeLog:

* src/c++11/cxx11-shim_facets.cc: Fix mismatched class-key in
explicit instantiation definitions.

diff --git a/libstdc++-v3/src/c++11/cxx11-shim_facets.cc 
b/libstdc++-v3/src/c++11/cxx11-shim_facets.cc
index 3aa085b8aa7..ba87740d57e 100644
--- a/libstdc++-v3/src/c++11/cxx11-shim_facets.cc
+++ b/libstdc++-v3/src/c++11/cxx11-shim_facets.cc
@@ -469,21 +469,21 @@ namespace __facet_shims
}
   };
 
-template class numpunct_shim<char>;
-template class collate_shim<char>;
-template class moneypunct_shim<char, true>;
-template class moneypunct_shim<char, false>;
-template class money_get_shim<char>;
-template class money_put_shim<char>;
-template class messages_shim<char>;
+template struct numpunct_shim<char>;
+template struct collate_shim<char>;
+template struct moneypunct_shim<char, true>;
+template struct moneypunct_shim<char, false>;
+template struct money_get_shim<char>;
+template struct money_put_shim<char>;
+template struct messages_shim<char>;
 #ifdef _GLIBCXX_USE_WCHAR_T
-template class numpunct_shim<wchar_t>;
-template class collate_shim<wchar_t>;
-template class moneypunct_shim<wchar_t, true>;
-template class moneypunct_shim<wchar_t, false>;
-template class money_get_shim<wchar_t>;
-template class money_put_shim<wchar_t>;
-template class messages_shim<wchar_t>;
+template struct numpunct_shim<wchar_t>;
+template struct collate_shim<wchar_t>;
+template struct moneypunct_shim<wchar_t, true>;
+template struct moneypunct_shim<wchar_t, false>;
+template struct money_get_shim<wchar_t>;
+template struct money_put_shim<wchar_t>;
+template struct messages_shim<wchar_t>;
 #endif
 
 template


Re: [PATCH] [i386] Optimize (a & b) | (c & ~b) to vpternlog instruction.

2021-08-24 Thread Bernhard Reutner-Fischer via Gcc-patches
On Tue, 24 Aug 2021 17:53:27 +0800
Hongtao Liu via Gcc-patches  wrote:

> On Tue, Aug 24, 2021 at 9:36 AM liuhongt  wrote:
> >
> > Also optimize below 3 forms to vpternlog, op1, op2, op3 are
> > register_operand or unary_p as (not reg)

> > gcc/ChangeLog:
> >
> > PR target/101989
> > * config/i386/i386-protos.h
> > (ix86_strip_reg_or_notreg_operand): New declare.

"New declaration."

> > * config/i386/i386.c (ix86_rtx_costs): Define cost for
> > UNSPEC_VTERNLOG.

I do not see a considerable amount of VTERNLOG in the docs I have here.
Is there a P missing in vPternlog?

> > (ix86_strip_reg_or_notreg_operand): New function.  
> Pushed to trunk after changing ix86_strip_reg_or_notreg_operand to a macro;
> a function call seems too inefficient for such a simple unary strip.
> > * config/i386/predicates.md (reg_or_notreg_operand): New
> > predicate.
> > * config/i386/sse.md (*_vternlog_all): New 
> > define_insn.
> > (*_vternlog_1): New pre_reload
> > define_insn_and_split.
> > (*_vternlog_2): Ditto.
> > (*_vternlog_3): Ditto.

at least the above 3 insn_and_split do have a 'p' in the md.
thanks,
> > (any_logic1,any_logic2): New code iterator.
> > (logic_op): New code attribute.
> > (ternlogsuffix): Extend to VNxDF and VNxSF.
> >
> > gcc/testsuite/ChangeLog:
> >
> > PR target/101989
> > * gcc.target/i386/pr101989-1.c: New test.
> > * gcc.target/i386/pr101989-2.c: New test.
> > * gcc.target/i386/avx512bw-shiftqihi-constant-1.c: Adjust testcase.
> > ---
> >  gcc/config/i386/i386-protos.h |   1 +
> >  gcc/config/i386/i386.c|  13 +
> >  gcc/config/i386/predicates.md |   7 +
> >  gcc/config/i386/sse.md| 234 ++
> >  .../i386/avx512bw-shiftqihi-constant-1.c  |   4 +-
> >  gcc/testsuite/gcc.target/i386/pr101989-1.c|  51 
> >  gcc/testsuite/gcc.target/i386/pr101989-2.c| 102 
> >  7 files changed, 410 insertions(+), 2 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101989-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101989-2.c
> >
> > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> > index 2fd13074c81..2bdaadcf4f3 100644
> > --- a/gcc/config/i386/i386-protos.h
> > +++ b/gcc/config/i386/i386-protos.h
> > @@ -60,6 +60,7 @@ extern rtx standard_80387_constant_rtx (int);
> >  extern int standard_sse_constant_p (rtx, machine_mode);
> >  extern const char *standard_sse_constant_opcode (rtx_insn *, rtx *);
> >  extern bool ix86_standard_x87sse_constant_load_p (const rtx_insn *, rtx);
> > +extern rtx ix86_strip_reg_or_notreg_operand (rtx);
> >  extern bool ix86_pre_reload_split (void);
> >  extern bool symbolic_reference_mentioned_p (rtx);
> >  extern bool extended_reg_mentioned_p (rtx);
> > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> > index 46844fab08f..a69225ccc81 100644
> > --- a/gcc/config/i386/i386.c
> > +++ b/gcc/config/i386/i386.c
> > @@ -5236,6 +5236,14 @@ ix86_standard_x87sse_constant_load_p (const rtx_insn 
> > *insn, rtx dst)
> >return true;
> >  }
> >
> > +/* Strip any unary operator (such as NOT) from OP and return the
> > +   underlying operand.  */
> > +rtx
> > +ix86_strip_reg_or_notreg_operand (rtx op)
> > +{
> > +  return UNARY_P (op) ? XEXP (op, 0) : op;
> > +}
> > +
> >  /* Predicate for pre-reload splitters with associated instructions,
> > which can match any time before the split1 pass (usually combine),
> > then are unconditionally split in that pass and should not be
> > @@ -20544,6 +20552,11 @@ ix86_rtx_costs (rtx x, machine_mode mode, int 
> > outer_code_i, int opno,
> >  case UNSPEC:
> >if (XINT (x, 1) == UNSPEC_TP)
> > *total = 0;
> > +  else if (XINT(x, 1) == UNSPEC_VTERNLOG)
> > +   {
> > + *total = cost->sse_op;
> > + return true;
> > +   }
> >return false;
> >
> >  case VEC_SELECT:
> > diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> > index 9321f332ef9..df5acb425d4 100644
> > --- a/gcc/config/i386/predicates.md
> > +++ b/gcc/config/i386/predicates.md
> > @@ -1044,6 +1044,13 @@ (define_predicate "reg_or_pm1_operand"
> > (ior (match_test "op == const1_rtx")
> >  (match_test "op == constm1_rtx")
> >
> > +;; True for registers, or (not: registers).  Used to optimize 3-operand
> > +;; bitwise operation.
> > +(define_predicate "reg_or_notreg_operand"
> > +  (ior (match_operand 0 "register_operand")
> > +   (and (match_code "not")
> > +   (match_test "register_operand (XEXP (op, 0), mode)"
> > +
> >  ;; True if OP is acceptable as operand of DImode shift expander.
> >  (define_predicate "shiftdi_operand"
> >(if_then_else (match_test "TARGET_64BIT")
> > diff --git a/gcc/config/i386/sse.md 

Re: [PATCH] Optimize macro: make it more predictable

2021-08-24 Thread Martin Liška

On 8/24/21 14:13, Richard Biener wrote:

On Thu, Jul 1, 2021 at 3:13 PM Martin Liška  wrote:


On 10/23/20 1:47 PM, Martin Liška wrote:

Hey.


Hello.

I deferred the patch to GCC 12. Since then, I have been working on the options
machinery and feel more familiar with the option handling. So ...



This is a follow-up of the discussion that happened in thread about 
no_stack_protector
attribute: https://gcc.gnu.org/pipermail/gcc-patches/2020-May/545916.html

The current optimize attribute works in the following way:
- 1) we take current global_options as base
- 2) maybe_default_options is called for the currently selected optimization 
level, which
   means all rules in default_options_table are executed
- 3) attribute values are applied (via decode_options)

So step 2) is problematic: in the case of -O2 -fno-omit-frame-pointer and
__attribute__((optimize("-fno-stack-protector"))),
one basically ends up with -O2 -fno-stack-protector, because -fomit-frame-pointer
is the default:
  /* -O1 and -Og optimizations.  */
  { OPT_LEVELS_1_PLUS, OPT_fomit_frame_pointer, NULL, 1 },

With my patch, the optimize attribute really behaves the same as appending the
attribute value to the command line. So far so good. We should also reflect that
in the documentation entry, which is quite vague right now:


^^^ all these are still valid arguments, plus I'm adding a new test-case that 
tests that.


Hey.


There is also handle_common_deferred_options that's not called, so any
option processed there should
probably be exempt from being set/unset in the optimize attribute?


Looking at the handled options, they have all Defer type and not Optimization.
Thus we should be fine.





"""
The optimize attribute is used to specify that a function is to be compiled 
with different optimization options than specified on the command line.
"""


I addressed that with documentation changes, should be more clear to users. 
Moreover, I noticed that we declare 'optimize' attribute
as something not for a production use:

"The optimize attribute should be used for debugging purposes only. It is not 
suitable in production code."

Are we sure about the statement? I know that e.g. glibc uses that.


Well, given we're changing behavior now that warning looks valid ;)


Yeah! True.


I'll also note that

"The optimize attribute arguments of a function behave
as if they were added to the command line options."

is still likely untrue, the global state init is complicated ;)


Sure, but the situation should be much closer to it :) Do you have a better 
wording?






and we may want to handle -Ox in the attribute in a special way. I guess many 
macro/pragma users expect that

-O2 -ftree-vectorize and __attribute__((optimize(1))) will end with -O1 and not
with -ftree-vectorize -O1 ?


This is my older suggestion and it will likely make things even more complicated.
So ...



As implemented your patch seems to turn it into -ftree-vectorize -O1.


Yes.


IIRC multiple optimize attributes apply
ontop of each other, and it makes sense to me that optimize (2),
optimize ("tree-vectorize") behaves the same
as optimize (2, "tree-vectorize").  I'm not sure this is still the
case after your patch?  Also consider

#pragma GCC optimize ("tree-vectorize")
void foo () { ...}

#pragma GCC optimize ("tree-loop-distribution")
void bar () {... }

I'd expect bar to have both vectorization and loop distribution
enabled? (note I didn't use push/pop here)


Yes, yes and yes. I'm going to verify it.




The situation with 'target' attribute is different. When parsing the attribute, 
we intentionally drop all existing target flags:

$ cat -n gcc/config/i386/i386-options.c
...
1245if (opt == IX86_FUNCTION_SPECIFIC_ARCH)
1246  {
1247/* If arch= is set,  clear all bits in 
x_ix86_isa_flags,
1248   except for ISA_64BIT, ABI_64, ABI_X32, and CODE16
1249   and all bits in x_ix86_isa_flags2.  */
1250opts->x_ix86_isa_flags &= (OPTION_MASK_ISA_64BIT
1251   | OPTION_MASK_ABI_64
1252   | OPTION_MASK_ABI_X32
1253   | OPTION_MASK_CODE16);
1254opts->x_ix86_isa_flags_explicit &= 
(OPTION_MASK_ISA_64BIT
1255| 
OPTION_MASK_ABI_64
1256| 
OPTION_MASK_ABI_X32
1257| 
OPTION_MASK_CODE16);
1258opts->x_ix86_isa_flags2 = 0;
1259opts->x_ix86_isa_flags2_explicit = 0;
1260  }

That seems logical because target attribute is used for e.g. ifunc 
multi-versioning and one needs
to be sure all existing ISA flags are dropped. However, I noticed clang behaves 

Re: [PATCH v2] rs6000: Add vec_unpacku_{hi,lo}_v4si

2021-08-24 Thread Segher Boessenkool
Hi Ke Wen,

On Mon, Aug 09, 2021 at 10:53:00AM +0800, Kewen.Lin wrote:
> on 2021/8/6 下午9:10, Bill Schmidt wrote:
> > On 8/4/21 9:06 PM, Kewen.Lin wrote:
> >> The existing vec_unpacku_{hi,lo} supports emulated unsigned
> >> unpacking for short and char but misses the support for int.
> >> This patch adds the support for vec_unpacku_{hi,lo}_v4si.

>   * config/rs6000/altivec.md (vec_unpacku_hi_v16qi): Remove.
>   (vec_unpacku_hi_v8hi): Likewise.
>   (vec_unpacku_lo_v16qi): Likewise.
>   (vec_unpacku_lo_v8hi): Likewise.
>   (vec_unpacku_hi_): New define_expand.
>   (vec_unpacku_lo_): Likewise.

> -(define_expand "vec_unpacku_hi_v16qi"
> -  [(set (match_operand:V8HI 0 "register_operand" "=v")
> -(unspec:V8HI [(match_operand:V16QI 1 "register_operand" "v")]
> - UNSPEC_VUPKHUB))]
> -  "TARGET_ALTIVEC"  
> -{  
> -  rtx vzero = gen_reg_rtx (V8HImode);
> -  rtx mask = gen_reg_rtx (V16QImode);
> -  rtvec v = rtvec_alloc (16);
> -  bool be = BYTES_BIG_ENDIAN;
> -   
> -  emit_insn (gen_altivec_vspltish (vzero, const0_rtx));
> -   
> -  RTVEC_ELT (v,  0) = gen_rtx_CONST_INT (QImode, be ? 16 :  7);
> -  RTVEC_ELT (v,  1) = gen_rtx_CONST_INT (QImode, be ?  0 : 16);
> -  RTVEC_ELT (v,  2) = gen_rtx_CONST_INT (QImode, be ? 16 :  6);
> -  RTVEC_ELT (v,  3) = gen_rtx_CONST_INT (QImode, be ?  1 : 16);
> -  RTVEC_ELT (v,  4) = gen_rtx_CONST_INT (QImode, be ? 16 :  5);
> -  RTVEC_ELT (v,  5) = gen_rtx_CONST_INT (QImode, be ?  2 : 16);
> -  RTVEC_ELT (v,  6) = gen_rtx_CONST_INT (QImode, be ? 16 :  4);
> -  RTVEC_ELT (v,  7) = gen_rtx_CONST_INT (QImode, be ?  3 : 16);
> -  RTVEC_ELT (v,  8) = gen_rtx_CONST_INT (QImode, be ? 16 :  3);
> -  RTVEC_ELT (v,  9) = gen_rtx_CONST_INT (QImode, be ?  4 : 16);
> -  RTVEC_ELT (v, 10) = gen_rtx_CONST_INT (QImode, be ? 16 :  2);
> -  RTVEC_ELT (v, 11) = gen_rtx_CONST_INT (QImode, be ?  5 : 16);
> -  RTVEC_ELT (v, 12) = gen_rtx_CONST_INT (QImode, be ? 16 :  1);
> -  RTVEC_ELT (v, 13) = gen_rtx_CONST_INT (QImode, be ?  6 : 16);
> -  RTVEC_ELT (v, 14) = gen_rtx_CONST_INT (QImode, be ? 16 :  0);
> -  RTVEC_ELT (v, 15) = gen_rtx_CONST_INT (QImode, be ?  7 : 16);
> -
> -  emit_insn (gen_vec_initv16qiqi (mask, gen_rtx_PARALLEL (V16QImode, v)));
> -  emit_insn (gen_vperm_v16qiv8hi (operands[0], operands[1], vzero, mask));
> -  DONE;
> -})

So I wonder if all this still generates good code.  The unspecs cannot
be optimised properly; the RTL can (in principle, anyway: it is possible
this makes more opportunities to use unpack etc. insns invisible than
it helps over unspec).  This needs to be tested, and the usual idioms
need testcases; is that what you add here?  (/me reads on...)

> +  if (BYTES_BIG_ENDIAN)
> +emit_insn (gen_altivec_vmrgh (res, vzero, op1));
> +  else
> +emit_insn (gen_altivec_vmrgl (res, op1, vzero));

Ah, so it is *not* using unspecs?  Excellent.

Okay for trunk.  Thank you!


Segher


Ping: [PATCH] diagnostics: Support for -finput-charset [PR93067]

2021-08-24 Thread Lewis Hyatt via Gcc-patches
Hello-

I thought it might be a good time to check on this patch please? Thanks!
https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576449.html

-Lewis

On Fri, Jul 30, 2021 at 4:13 PM Lewis Hyatt  wrote:
>
> On Fri, Jan 29, 2021 at 10:46:30AM -0500, Lewis Hyatt wrote:
> > On Tue, Jan 26, 2021 at 04:02:52PM -0500, David Malcolm wrote:
> > > On Fri, 2020-12-18 at 18:03 -0500, Lewis Hyatt wrote:
> > > > Hello-
> > > >
> > > > The attached patch addresses PR93067:
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067#c0
> > > >
> > > > This is similar to the patch I posted last year on the PR, with some
> > > tweaks
> > > > to make it a little simpler. Recapping some of the commentary on the
> > > PR:
> > > >
> > > > When source lines are needed for diagnostics output, they are
> > > retrieved from
> > > > the source file by the fcache infrastructure in input.c, since libcpp
> > > has
> > > > generally already forgotten them (plus not all front ends are using
> > > > libcpp). This infrastructure does not read the files in the same way
> > > as
> > > > libcpp does; in particular, it does not translate the encoding as
> > > requested
> > > > by -finput-charset, and it does not strip a UTF-8 byte-order mark if
> > > > present. The patch adds this ability. My thinking in deciding how to
> > > do it
> > > > was the following:
> > > >
> > > > - Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer
> > > still,
> > > >   so this patch should try hard not to introduce any worse
> > > performance
> > > >   unless these things are needed.
> > > >
> > > > - It is desirable to reuse libcpp's encoding infrastructure from
> > > charset.c
> > > >   rather than repeat it in input.c. (Notably, libcpp uses iconv but
> > > it also
> > > >   has hand-coded routines for certain charsets to make sure they are
> > > >   available.)
> > > >
> > > > - There is a performance degradation required in order to make use of
> > > libcpp
> > > >   directly, because the input.c infrastructure only reads as much of
> > > the
> > > >   source file as necessary, whereas libcpp interfaces as-is require
> > > to read
> > > >   the entire file into memory.
> > > >
> > > > - It can't be quite as simple as just "only delegate to libcpp if
> > > >   -finput-charset was specified", because the stripping of the UTF-8
> > > BOM has
> > > >   to happen with or without this option.
> > > >
> > > > - So it seemed a reasonable compromise to me, if -finput-charset is
> > > >   specified, then use libcpp to convert the file, otherwise, strip
> > > the BOM
> > > >   in input.c and then process the file the same way it is done now.
> > > There's
> > > >   a little bit of leakage of charset logic from libcpp this way (for
> > > the
> > > >   BOM), but it seems worthwhile, since otherwise, diagnostics would
> > > always
> > > >   be reading the entire file into memory, which is not a cost paid
> > > >   currently.
> > >
> > > Thanks for the patch; sorry about the delay in reviewing it.
> > >
> >
> > Thanks for the comments! Here is an updated patch that addresses your
> > feedback, plus some responses inline below.
> >
> > Bootstrap + regtest all languages was done on x86-64 GNU/Linux. All tests
> > the same before and after, plus 6 new PASS.
> >
> > FAIL 85 85
> > PASS 479191 479197
> > UNSUPPORTED 13664 13664
> > UNTESTED 129 129
> > XFAIL 2292 2292
> > XPASS 30 30
> >
> >
> > > This mostly seems good to me.
> > >
> > > One aspect I'm not quite convinced about is the
> > > input_cpp_context.in_use flag.  The input.c machinery is used by
> > > diagnostics, and so could be used by middle-end warnings for frontends
> > > that don't use libcpp.  Presumably we'd still want to remove the UTF-8
> > > BOM for those, and do encoding fixups if necessary - is it more a case
> > > of initializing things to express what the expected input charset is?
> > > (since that is part of the cpp_options)
> > >
> > > c.opt has:
> > >   finput-charset=
> > >   C ObjC C++ ObjC++ Joined RejectNegative
> > >   -finput-charset=Specify the default character set for
> > > source files.
> > >
> > > I believe that D and Go are the two frontends that don't use libcpp for
> > > parsing.  I believe Go source is required to be UTF-8 (unsurprisingly
> > > considering the heritage of both).  I don't know what source encodings
> > > D supports.
> > >
> >
> > For this patch I was rather singularly focused on libcpp, so I looked
> > deeper at the other frontends now. It seems to me that there are basically
> > two questions to answer, and the three frontend styles answer this pair in
> > three different ways.
> >
> > Q1: What is the input charset?
> > A1:
> >
> > libcpp: Whatever was passed to -finput-charset (note, for Fortran,
> > -finput-charset is not supported though.)
> >
> > go: Assume UTF-8.
> >
> > D: UTF-8, UTF-16, or UTF-32 (the latter two in either
> >endianness); determined by inspecting the first bytes of the file.
> >
> > Q2: How should a 

Re: [PATCH] Optimize macro: make it more predictable

2021-08-24 Thread Richard Biener via Gcc-patches
On Thu, Jul 1, 2021 at 3:13 PM Martin Liška  wrote:
>
> On 10/23/20 1:47 PM, Martin Liška wrote:
> > Hey.
>
> Hello.
>
> I deferred the patch to GCC 12. Since then, I have been working on the options
> and feel more familiar with the option handling. So ...
>
> >
> > This is a follow-up of the discussion that happened in thread about 
> > no_stack_protector
> > attribute: https://gcc.gnu.org/pipermail/gcc-patches/2020-May/545916.html
> >
> > The current optimize attribute works in the following way:
> > - 1) we take current global_options as base
> > - 2) maybe_default_options is called for the currently selected 
> > optimization level, which
> >   means all rules in default_options_table are executed
> > - 3) attribute values are applied (via decode_options)
> >
> > So step 2) is problematic: in the case of -O2 -fno-omit-frame-pointer and 
> > __attribute__((optimize("-fno-stack-protector"))),
> > one basically ends up with -O2 -fno-stack-protector, because 
> > -fomit-frame-pointer is the default:
> >  /* -O1 and -Og optimizations.  */
> >  { OPT_LEVELS_1_PLUS, OPT_fomit_frame_pointer, NULL, 1 },
> >
> > With my patch, the optimize attribute really behaves the same as appending 
> > the attribute value to the command line. So far so good. We should also 
> > reflect that in the documentation entry, which is quite
> > vague right now:
>
> ^^^ all these are still valid arguments, plus I'm adding a new test-case that 
> tests that.

There is also handle_common_deferred_options that's not called, so any
option processed there should
probably be exempt from being set/unset in the optimize attribute?

> >
> > """
> > The optimize attribute is used to specify that a function is to be compiled 
> > with different optimization options than specified on the command line.
> > """
>
> I addressed that with documentation changes, should be more clear to users. 
> Moreover, I noticed that we declare 'optimize' attribute
> as something not for a production use:
>
> "The optimize attribute should be used for debugging purposes only. It is not 
> suitable in production code."
>
> Are we sure about the statement? I know that e.g. glibc uses that.

Well, given we're changing behavior now that warning looks valid ;)
I'll also note that

"The optimize attribute arguments of a function behave
as if they were added to the command line options."

is still likely untrue, the global state init is complicated ;)


> >
> > and we may want to handle -Ox in the attribute in a special way. I guess 
> > many macro/pragma users expect that
> >
> > -O2 -ftree-vectorize and __attribute__((optimize(1))) will end with -O1 and 
> > not
> > with -ftree-vectorize -O1 ?

As implemented your patch seems to turn it into -ftree-vectorize -O1.
IIRC multiple optimize attributes apply
ontop of each other, and it makes sense to me that optimize (2),
optimize ("tree-vectorize") behaves the same
as optimize (2, "tree-vectorize").  I'm not sure this is still the
case after your patch?  Also consider

#pragma GCC optimize ("tree-vectorize")
void foo () { ...}

#pragma GCC optimize ("tree-loop-distribution")
void bar () {... }

I'd expect bar to have both vectorization and loop distribution
enabled? (note I didn't use push/pop here)

> The situation with 'target' attribute is different. When parsing the 
> attribute, we intentionally drop all existing target flags:
>
> $ cat -n gcc/config/i386/i386-options.c
> ...
>1245if (opt == IX86_FUNCTION_SPECIFIC_ARCH)
>1246  {
>1247/* If arch= is set,  clear all bits in 
> x_ix86_isa_flags,
>1248   except for ISA_64BIT, ABI_64, ABI_X32, and 
> CODE16
>1249   and all bits in x_ix86_isa_flags2.  */
>1250opts->x_ix86_isa_flags &= (OPTION_MASK_ISA_64BIT
>1251   | OPTION_MASK_ABI_64
>1252   | OPTION_MASK_ABI_X32
>1253   | OPTION_MASK_CODE16);
>1254opts->x_ix86_isa_flags_explicit &= 
> (OPTION_MASK_ISA_64BIT
>1255| 
> OPTION_MASK_ABI_64
>1256| 
> OPTION_MASK_ABI_X32
>1257| 
> OPTION_MASK_CODE16);
>1258opts->x_ix86_isa_flags2 = 0;
>1259opts->x_ix86_isa_flags2_explicit = 0;
>1260  }
>
> That seems logical because target attribute is used for e.g. ifunc 
> multi-versioning and one needs
> to be sure all existing ISA flags are dropped. However, I noticed clang 
> behaves differently:
>
> $ cat hreset.c
> #pragma GCC target "arch=geode"
> #include 
> void foo(unsigned int eax)
> {
>_hreset (eax);
> }
>
> $ clang hreset.c -mhreset  -c -O2 -m32
> $ gcc hreset.c -mhreset  -c 

Re: [PATCH] tree-optimization/100089 - avoid leaving scalar if-converted code around

2021-08-24 Thread Richard Sandiford via Gcc-patches
Richard Biener via Gcc-patches  writes:
> This avoids leaving scalar if-converted code around for the case
> of BB vectorizing an if-converted loop body when using the very-cheap
> cost model.  In this case we scan not vectorized scalar stmts in
> the basic-block vectorized for COND_EXPRs and force the vectorization
> to be marked as not profitable.
>
> The patch also makes sure to always consider all BB vectorization
> subgraphs together for costing purposes when vectorizing an
> if-converted loop body.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?

LGTM, although you obviously know this code better than I do.

Thanks,
Richard

> Thanks,
> Richard.
>
> 2021-08-24  Richard Biener  
>
>   PR tree-optimization/100089
>   * tree-vectorizer.h (vect_slp_bb): Rename to ...
>   (vect_slp_if_converted_bb): ... this and get the original
>   loop as new argument.
>   * tree-vectorizer.c (try_vectorize_loop_1): Revert previous fix,
>   pass original loop to vect_slp_if_converted_bb.
>   * tree-vect-slp.c (vect_bb_vectorization_profitable_p):
>   If orig_loop was passed scan the not vectorized stmts
>   for COND_EXPRs and force not profitable if found.
>   (vect_slp_region): Pass down all SLP instances to costing
>   if orig_loop was specified.
>   (vect_slp_bbs): Pass through orig_loop.
>   (vect_slp_bb): Rename to ...
>   (vect_slp_if_converted_bb): ... this and get the original
>   loop as new argument.
>   (vect_slp_function): Adjust.
> ---
>  gcc/tree-vect-slp.c   | 70 ++-
>  gcc/tree-vectorizer.c | 20 +++--
>  gcc/tree-vectorizer.h |  2 +-
>  3 files changed, 68 insertions(+), 24 deletions(-)
>
> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> index 3ed5bc1989a..8bfa45772d3 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -5287,7 +5287,8 @@ li_cost_vec_cmp (const void *a_, const void *b_)
>  
>  static bool
>  vect_bb_vectorization_profitable_p (bb_vec_info bb_vinfo,
> - vec<slp_instance> slp_instances)
> + vec<slp_instance> slp_instances,
> + loop_p orig_loop)
>  {
>slp_instance instance;
>int i;
> @@ -5324,6 +5325,30 @@ vect_bb_vectorization_profitable_p (bb_vec_info 
> bb_vinfo,
>vector_costs.safe_splice (instance->cost_vec);
>instance->cost_vec.release ();
>  }
> +  /* When we're vectorizing an if-converted loop body with the
> + very-cheap cost model make sure we vectorized all if-converted
> + code.  */
> +  bool force_not_profitable = false;
> +  if (orig_loop && flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP)
> +{
> +  gcc_assert (bb_vinfo->bbs.length () == 1);
> +  for (gimple_stmt_iterator gsi = gsi_start_bb (bb_vinfo->bbs[0]);
> +!gsi_end_p (gsi); gsi_next (&gsi))
> + {
> +   /* The costing above left us with DCEable vectorized scalar
> +  stmts having the visited flag set.  */
> +   if (gimple_visited_p (gsi_stmt (gsi)))
> + continue;
> +
> +   if (gassign *ass = dyn_cast <gassign *> (gsi_stmt (gsi)))
> + if (gimple_assign_rhs_code (ass) == COND_EXPR)
> +   {
> + force_not_profitable = true;
> + break;
> +   }
> + }
> +}
> +
>/* Unset visited flag.  */
>stmt_info_for_cost *cost;
>FOR_EACH_VEC_ELT (scalar_costs, i, cost)
> @@ -5448,9 +5473,14 @@ vect_bb_vectorization_profitable_p (bb_vec_info 
> bb_vinfo,
>return false;
>  }
>  
> +  if (dump_enabled_p () && force_not_profitable)
> +dump_printf_loc (MSG_NOTE, vect_location,
> +  "not profitable because of unprofitable if-converted "
> +  "scalar code\n");
> +
>scalar_costs.release ();
>vector_costs.release ();
> -  return true;
> +  return !force_not_profitable;
>  }
>  
>  /* qsort comparator for lane defs.  */
> @@ -5895,7 +5925,8 @@ vect_slp_analyze_bb_1 (bb_vec_info bb_vinfo, int 
> n_stmts, bool &fatal,
>  
>  static bool
>  vect_slp_region (vec<basic_block> bbs, vec<data_reference_p> datarefs,
> -  vec<int> *dataref_groups, unsigned int n_stmts)
> +  vec<int> *dataref_groups, unsigned int n_stmts,
> +  loop_p orig_loop)
>  {
>bb_vec_info bb_vinfo;
>auto_vector_modes vector_modes;
> @@ -5944,7 +5975,9 @@ vect_slp_region (vec<basic_block> bbs, vec<data_reference_p> datarefs,
> vect_location = instance->location ();
> if (!unlimited_cost_model (NULL)
> && !vect_bb_vectorization_profitable_p
> - (bb_vinfo, instance->subgraph_entries))
> + (bb_vinfo,
> +  orig_loop ? BB_VINFO_SLP_INSTANCES (bb_vinfo)
> +  : instance->subgraph_entries, orig_loop))
>   {
> for (slp_instance inst : instance->subgraph_entries)
>   if (inst->kind == slp_inst_kind_bb_reduc)
> @@ -5965,7 +5998,9 @@ vect_slp_region (vec<basic_block> bbs, 
> 

[PATCH 2/4] ipa-cp: Propagation boost for recursion generated values

2021-08-24 Thread Martin Jambor
Recursive call graph edges, even when they are hot and important for
the compiled program, can never have frequency bigger than one, even
when the actual time savings in the next recursion call are not
realized just once but depend on the depth of recursion.  The current
IPA-CP effect propagation code did not take that into account and just
used the frequency, thus severely underestimating the effect.

This patch artificially boosts values taking part in such calls.  If a
value feeds into itself through a recursive call, the frequency of the
edge is multiplied by a parameter with default value of 6, basically
assuming that the recursion will take place 6 times.  This value can
of course be subject to change.

Moreover, values which do not feed into themselves but which were
generated for a self-recursive call with an arithmetic
pass-function (aka the 548.exchange "hack" which however is generally
applicable for recursive functions which count the recursion depth in
a parameter) have the edge frequency multiplied as many times as there
are generated values in the chain.  In essence, we will assume they
are all useful.

This patch partially fixes the current situation when we fail to
optimize 548.exchange with PGO.  In the benchmark one recursive edge
count overwhelmingly dominates all other counts in the program and so
we fail to perform the first cloning (for the nonrecursive entry call)
because it looks totally insignificant.

gcc/ChangeLog:

2021-07-16  Martin Jambor  

* params.opt (ipa-cp-recursive-freq-factor): New.
* ipa-cp.c (ipcp_value): Switch to inline initialization.  New members
scc_no, self_recursion_generated_level, same_scc and
self_recursion_generated_p.
(ipcp_lattice::add_value): Replaced parameter unlimited with
same_lat_gen_level, use it to determine the limit of values and store
it to the value.
(ipcp_lattice::print): Dump the new fields.
(allocate_and_init_ipcp_value): Take same_lat_gen_level as a new
parameter and store it to the new value.
(self_recursively_generated_p): Removed.
(propagate_vals_across_arith_jfunc): Use self_recursion_generated_p
instead of self_recursively_generated_p, store self generation level
to such values.
(value_topo_info::add_val): Set scc_no.
(value_topo_info::propagate_effects): Multiply frequencies of
recursively feeding values and self generated values by appropriate
new factors.
---
 gcc/ipa-cp.c   | 161 -
 gcc/params.opt |   4 ++
 2 files changed, 84 insertions(+), 81 deletions(-)

diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index 55b9216337f..b987d975793 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -184,30 +184,52 @@ public:
   /* The actual value for the given parameter.  */
   valtype value;
   /* The list of sources from which this value originates.  */
-  ipcp_value_source <valtype> *sources;
+  ipcp_value_source <valtype> *sources = nullptr;
   /* Next pointers in a linked list of all values in a lattice.  */
-  ipcp_value *next;
+  ipcp_value *next = nullptr;
   /* Next pointers in a linked list of values in a strongly connected component
  of values. */
-  ipcp_value *scc_next;
+  ipcp_value *scc_next = nullptr;
   /* Next pointers in a linked list of SCCs of values sorted topologically
  according their sources.  */
-  ipcp_value  *topo_next;
+  ipcp_value  *topo_next = nullptr;
   /* A specialized node created for this value, NULL if none has been (so far)
  created.  */
-  cgraph_node *spec_node;
+  cgraph_node *spec_node = nullptr;
   /* Depth first search number and low link for topological sorting of
  values.  */
-  int dfs, low_link;
+  int dfs = 0;
+  int low_link = 0;
+  /* SCC number to identify values which recursively feed into each other.
+ Values in the same SCC have the same SCC number.  */
+  int scc_no = 0;
+  /* Non zero if the value is generated from another value in the same lattice
+ for a self-recursive call, the actual number is how many times the
+ operation has been performed.  In the unlikely event of the value being
+ present in two chains of self-recursive value generation chains, it is the
+ maximum.  */
+  unsigned self_recursion_generated_level = 0;
   /* True if this value is currently on the topo-sort stack.  */
-  bool on_stack;
-
-  ipcp_value()
-: sources (0), next (0), scc_next (0), topo_next (0),
-  spec_node (0), dfs (0), low_link (0), on_stack (false) {}
+  bool on_stack = false;
 
   void add_source (cgraph_edge *cs, ipcp_value *src_val, int src_idx,
   HOST_WIDE_INT offset);
+
+  /* Return true if both THIS value and O feed into each other.  */
+
+  bool same_scc (const ipcp_value *o)
+  {
+return o->scc_no == scc_no;
+  }
+
+/* Return true if this value has been generated for a self-recursive call as
+   a result of an arithmetic pass-through jump-function acting on a 

[PATCH 3/4] ipa-cp: Fix updating of profile counts and self-gen value evaluation

2021-08-24 Thread Martin Jambor
IPA-CP does not do a reasonable job when it is updating profile counts
after it has created clones of recursive functions.  This patch
addresses that by:

1. Only updating counts for special-context clones.  When a clone is
created for all contexts, the original is going to be dead and the
cgraph machinery has copied counts to the new node which is the right
thing to do.  Therefore updating counts has been moved from
create_specialized_node to decide_about_value and
decide_whether_version_node.

2. The current profile updating code artificially increased the assumed
old count when the sum of counts of incoming edges to both the
original and new node were bigger than the count of the original
node.  This always happened when self-recursive edge from the clone
was also redirected to the clone because both the original edge and
its clone had original high counts.  This clutch was removed and
replaced by the next point.

3. When cloning also redirects a self-recursive edge to the
clone itself, new logic has been added to divide the counts brought by
such recursive edges between the original node and the clone.  This is
impossible to do well without special knowledge about the function and
which non-recursive entry calls are responsible for what portion of
recursion depth, so the approach taken is rather crude.

For non-local nodes which can have unknown callers, the algorithm just
takes half of the counts - we may decide that taking just a third or
some other portion is more reasonable, but I do not think we can
attempt anything more clever.

For local nodes, we detect the case when the original node is never
called (in the training run at least) with another value and if so,
steal all its counts like if it was dead.  If that is not the case, we
try to divide the count brought by recursive edges (or rather not
brought by direct edges) proportionally to the counts brought by
non-recursive edges - but with artificial limits in place so that we
do not take too many or too few, because that was happening with
detrimental effect in mcf_r.

4. When cloning creates extra clones for values brought by a formerly
self-recursive edge with an arithmetic pass-through jump function on
it, such as it does in exchange2_r, all such clones are processed at
once rather than one after another.  The counts of all such nodes are
distributed evenly (modulo even-formerly-non-recursive-edges) and the
whole situation is then fixed up so that the edge counts fit.  This is
what new function update_counts_for_self_gen_clones does.

5. Values brought by a formerly self-recursive edge with an
arithmetic pass-through jump function on it are evaluated by a
heuristic which assumes the vast majority of node counts are the
result of recursive calls, and so we simply divide those by the number
of clones there would be if we created another one.

6. The mechanisms in init_caller_stats and gather_caller_stats and
get_info_about_necessary_edges was enhanced to gather data required
for the above and a missing check not to count dead incoming edges was
also added.

gcc/ChangeLog:

2021-08-23  Martin Jambor  

* ipa-cp.c (struct caller_statistics): New fields rec_count_sum,
n_nonrec_calls and itself, document all fields.
(init_caller_stats): Initialize the above new fields.
(gather_caller_stats): Gather self-recursive counts and calls number.
(get_info_about_necessary_edges): Gather counts of self-recursive and
other edges bringing in the requested value separately.
(dump_profile_updates): Rework to dump info about a single node only.
(lenient_count_portion_handling): New function.
(struct gather_other_count_struct): New type.
(gather_count_of_non_rec_edges): New function.
(struct desc_incoming_count_struct): New type.
(analyze_clone_icoming_counts): New function.
(adjust_clone_incoming_counts): Likewise.
(update_counts_for_self_gen_clones): Likewise.
(update_profiling_info): Rewritten.
(update_specialized_profile): Adjust call to dump_profile_updates.
(create_specialized_node): Do not update profiling info.
(decide_about_value): New parameter self_gen_clones, either push new
clones into it or update their profile counts.  For self-recursively
generated values, use a portion of the node count instead of count
from self-recursive edges to estimate goodness.
(decide_whether_version_node): Gather clones for self-generated values
in a new vector, update their profiles at once at the end.
---
 gcc/ipa-cp.c | 543 +++
 1 file changed, 457 insertions(+), 86 deletions(-)

diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index b987d975793..53cca7aa804 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -701,20 +701,36 @@ ipcp_versionable_function_p (struct cgraph_node *node)
 
 struct caller_statistics
 {
+  /* If requested (see below), 

[PATCH 4/4] ipa-cp: Select saner profile count to base heuristics on

2021-08-24 Thread Martin Jambor
When profile feedback is available, IPA-CP takes the count of the
hottest node and then evaluates all call contexts relative to it.
This means that typically almost no clones for specialized contexts
are ever created because the maximum is some special function, called
from everywhere (that is likely to get inlined anyway) and all the
examined edges look cold compared to it.

This patch changes the selection.  It simply sorts counts of all edges
eligible for cloning in a vector and then picks the count in 90th
percentile (the actual number is configurable via a parameter).

I also tried more complex approaches which were summing the counts and
picking the edge which together with all hotter edges accounted for a
given portion of the total sum of all edge counts.  But it was not
apparent to me that they make more logical sense than the simple
method, and in practice I always also had to ignore a few
percent of the hottest edges with really extreme counts (looking at
bash and python).  And when I had to do that anyway, it seemed simpler
to just "ignore" more and take the first non-ignored count as the
base.

Nevertheless, if people think some more sophisticated method should be
used anyway, I am willing to be persuaded.  But this patch is a clear
improvement over the current situation.

gcc/ChangeLog:

2021-08-23  Martin Jambor  

* params.opt (param_ipa_cp_profile_count_base): New parameter.
* ipa-cp.c (max_count): Replace with base_count, replace all
occurrences too, unless otherwise stated.
(ipcp_cloning_candidate_p): Identify mostly-directly called
functions based on their counts, not max_count.
(compare_edge_profile_counts): New function.
(ipcp_propagate_stage): Instead of setting max_count, find the
appropriate edge count in a sorted vector of counts of eligible
edges and make it the base_count.
---
 gcc/ipa-cp.c   | 82 +-
 gcc/params.opt |  4 +++
 2 files changed, 78 insertions(+), 8 deletions(-)

diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index 53cca7aa804..6ab74f61e83 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -400,9 +400,9 @@ object_allocator<ipcp_value_source<tree> > ipcp_sources_pool
 object_allocator<ipcp_agg_lattice> ipcp_agg_lattice_pool
   ("IPA_CP aggregate lattices");
 
-/* Maximal count found in program.  */
+/* Base count to use in heuristics when using profile feedback.  */
 
-static profile_count max_count;
+static profile_count base_count;
 
 /* Original overall size of the program.  */
 
@@ -809,7 +809,8 @@ ipcp_cloning_candidate_p (struct cgraph_node *node)
   /* When profile is available and function is hot, propagate into it even if
  calls seems cold; constant propagation can improve function's speed
  significantly.  */
-  if (max_count > profile_count::zero ())
+  if (stats.count_sum > profile_count::zero ()
+  && node->count.ipa ().initialized_p ())
 {
   if (stats.count_sum > node->count.ipa ().apply_scale (90, 100))
{
@@ -3310,10 +3311,10 @@ good_cloning_opportunity_p (struct cgraph_node *node, sreal time_benefit,
 
   ipa_node_params *info = ipa_node_params_sum->get (node);
   int eval_threshold = opt_for_fn (node->decl, param_ipa_cp_eval_threshold);
-  if (max_count > profile_count::zero ())
+  if (base_count > profile_count::zero ())
 {
 
-  sreal factor = count_sum.probability_in (max_count).to_sreal ();
+  sreal factor = count_sum.probability_in (base_count).to_sreal ();
   sreal evaluation = (time_benefit * factor) / size_cost;
   evaluation = incorporate_penalties (node, info, evaluation);
   evaluation *= 1000;
@@ -3950,6 +3951,21 @@ value_topo_info<valtype>::propagate_effects ()
 }
 }
 
+/* Callback for qsort to sort counts of all edges.  */
+
+static int
+compare_edge_profile_counts (const void *a, const void *b)
+{
+  const profile_count *cnt1 = (const profile_count *) a;
+  const profile_count *cnt2 = (const profile_count *) b;
+
+  if (*cnt1 < *cnt2)
+return 1;
+  if (*cnt1 > *cnt2)
+return -1;
+  return 0;
+}
+
 
 /* Propagate constants, polymorphic contexts and their effects from the
summaries interprocedurally.  */
@@ -3962,8 +3978,10 @@ ipcp_propagate_stage (class ipa_topo_info *topo)
   if (dump_file)
 fprintf (dump_file, "\n Propagating constants:\n\n");
 
-  max_count = profile_count::uninitialized ();
+  base_count = profile_count::uninitialized ();
 
+  bool compute_count_base = false;
+  unsigned base_count_pos_percent = 0;
   FOR_EACH_DEFINED_FUNCTION (node)
   {
 if (node->has_gimple_body_p ()
@@ -3981,9 +3999,57 @@ ipcp_propagate_stage (class ipa_topo_info *topo)
 ipa_size_summary *s = ipa_size_summaries->get (node);
 if (node->definition && !node->alias && s != NULL)
   overall_size += s->self_size;
-max_count = max_count.max (node->count.ipa ());
+if (node->count.ipa ().initialized_p ())
+  {
+   compute_count_base = true;
+   unsigned pos_percent = 

[PATCH 1/4] cgraph: Do not warn about caller count mismatches of removed functions

2021-08-24 Thread Martin Jambor
To verify other changes in the patch series, I have been searching for
"Invalid sum of caller counts" string in symtab dump but found that
there are false warnings about functions which have their body removed
because they are now unreachable.  Those are of course invalid and so
this patch avoids checking such cgraph_nodes.
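
The check being guarded can be modeled roughly like this (a toy sketch with hypothetical names, not the real cgraph API): a node's IPA profile count should cover the sum of the counts coming in over its call edges, but once the body has been removed the counts are no longer maintained and the check would fire spuriously.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of a call-graph node for the dump-time sanity check.
struct toy_node
{
  uint64_t count;                 // profile count of the node itself
  std::vector<uint64_t> callers;  // counts of incoming call edges
  bool body_removed;              // body dropped as unreachable
};

bool
caller_counts_valid_p (const toy_node &n)
{
  if (n.body_removed)  // the fix: do not check removed bodies at all
    return true;
  uint64_t sum = 0;
  for (uint64_t c : n.callers)
    sum += c;
  return sum <= n.count;
}
```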

gcc/ChangeLog:

2021-08-20  Martin Jambor  

* cgraph.c (cgraph_node::dump): Do not check caller count sums if
the body has been removed.  Remove trailing whitespace.
---
 gcc/cgraph.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/cgraph.c b/gcc/cgraph.c
index 8f3af003f2a..de078653781 100644
--- a/gcc/cgraph.c
+++ b/gcc/cgraph.c
@@ -2236,7 +2236,7 @@ cgraph_node::dump (FILE *f)
 }
   fprintf (f, "\n");
 
-  if (count.ipa ().initialized_p ())
+  if (!body_removed && count.ipa ().initialized_p ())
 {
   bool ok = true;
   bool min = false;
@@ -2245,7 +2245,7 @@ cgraph_node::dump (FILE *f)
   FOR_EACH_ALIAS (this, ref)
if (dyn_cast <cgraph_node *> (ref->referring)->count.initialized_p ())
  sum += dyn_cast <cgraph_node *> (ref->referring)->count.ipa ();
-  
+
   if (inlined_to
  || (symtab->state < EXPANSION
  && ultimate_alias_target () == this && only_called_directly_p ()))
-- 
2.32.0



[PATCH 0/4] IPA-CP profile feedback handling fixes

2021-08-24 Thread Martin Jambor
Hi,

this patch set addresses a number of shortcomings of IPA-CP when it
has profile feedback data at its disposal.  While at this point it is
mostly RFC material because I expect Honza will correct many of the
places where I use a wrong method of profile_count and should be using
some slightly different one, I do want to turn it into material I can
push to master rather quickly.

Most of the changes were motivated by the SPEC 2017 exchange2 benchmark,
which exposes the problems nicely: it is currently 22% slower with profile
feedback, and this patch set fixes that.  Overall, the patch set does not
have any effect on SPEC 2017 FPrate. SPEC 2017 INTrate results, as
quickly gathered on my znver2 desktop overnight (1 run only), are:

PGO only:

  | Benchmark   | Trunk | Rate | Patch |  % | Rate |
  |-+---+--+---++--|
  | 500.perlbench_r |   236 | 6.74 |   239 |  +1.27 | 6.67 |
  | 502.gcc_r   |   160 | 8.85 |   159 |  -0.62 | 8.89 |
  | 505.mcf_r   |   227 | 7.11 |   228 |  +0.44 | 7.08 |
  | 520.omnetpp_r   |   314 | 4.18 |   311 |  -0.96 | 4.21 |
  | 523.xalancbmk_r |   195 | 5.41 |   199 |  +2.05 | 5.32 |
  | 525.x264_r  |   129 | 13.6 |   131 |  +1.55 | 13.4 |
  | 531.deepsjeng_r |   230 | 4.98 |   230 |  +0.00 | 4.98 |
  | 541.leela_r |   353 | 4.70 |   353 |  +0.00 | 4.69 |
  | 548.exchange2_r |   249 | 10.5 |   189 | -24.10 | 13.8 |
  | 557.xz_r|   246 | 4.39 |   248 |  +0.81 | 4.36 |
  |-+---+--+---++--|
  | Geomean |   | 6.53 |   || 6.68 |

I have re-run 523.xalancbmk_r and the regression seems to be noise.

PGO+LTO:

| Benchmark   | Trunk | Rate | Patch |  % |  Rate |
|-+---+--+---++---|
| 500.perlbench_r |   231 | 6.88 |   230 |  -0.43 |  6.93 |
| 502.gcc_r   |   149 | 9.51 |   149 |  +0.00 |  9.53 |
| 505.mcf_r   |   208 | 7.76 |   202 |  -2.88 |  7.98 |
| 520.omnetpp_r   |   282 | 4.64 |   282 |  +0.00 |  4.65 |
| 523.xalancbmk_r |   185 | 5.70 |   188 |  +1.62 |  5.63 |
| 525.x264_r  |   133 | 13.1 |   134 |  +0.75 | 13.00 |
| 531.deepsjeng_r |   190 | 6.04 |   185 |  -2.63 |  6.20 |
| 541.leela_r |   298 | 5.56 |   298 |  +0.00 |  5.57 |
| 548.exchange2_r |   247 | 10.6 |   193 | -21.86 | 13.60 |
| 557.xz_r|   250 | 4.32 |   251 |  +0.40 |  4.31 |
|-+---+--+---++---|
| Geomean |   | 6.97 |   ||  7.18 |

I have re-run 531.deepsjeng_r and 505.mcf_r and while the former
improvement seems to be noise, the latter is consistent and even
explainable by more cloning of spec_qsort, which is the result of the
last patch and saner updates of counts of call graph edges from these
clones.

In both cases the exchange2 improvement is achieved by:

1) The second patch which makes sure that IPA-CP creates a clone for
   the first value, even though the non-recursive edge bringing the
   value is quite cold, because it enables specializing for much
   hotter contexts, and

2) the third patch which changes how values resulting from arithmetic
   jump functions on self-recursive edges are evaluated and then
   modifies the profile count of the whole resulting call graph part.

The final patch is not necessary to address the exchange2 regression.

I have bootstrapped and LTO-profile-bootstrapped and tested the whole
patch series on x86_64-linux without any issues.  As written above,
I'll be happy to address any comments/concerns so that something like
this can be pushed to master soon.

Thanks,

Martin


Martin Jambor (4):
  cgraph: Do not warn about caller count mismatches of removed functions
  ipa-cp: Propagation boost for recursion generated values
  ipa-cp: Fix updating of profile counts and self-gen value evaluation
  ipa-cp: Select saner profile count to base heuristics on

 gcc/cgraph.c   |   4 +-
 gcc/ipa-cp.c   | 786 ++---
 gcc/params.opt |   8 +
 3 files changed, 621 insertions(+), 177 deletions(-)

-- 
2.32.0


Re: Host and offload targets have no common meaning of address spaces (was: [ping] Re-unify 'omp_build_component_ref' and 'oacc_build_component_ref')

2021-08-24 Thread Richard Biener via Gcc-patches
On Tue, Aug 24, 2021 at 12:23 PM Thomas Schwinge
 wrote:
>
> Hi!
>
> On 2021-08-19T22:13:56+0200, I wrote:
> > On 2021-08-16T10:21:04+0200, Jakub Jelinek  wrote:
> >> On Mon, Aug 16, 2021 at 10:08:42AM +0200, Thomas Schwinge wrote:
> > |> Concerning the current 'gcc/omp-low.c:omp_build_component_ref', for the
> > |> current set of offloading testcases, we never see a
> > |> '!ADDR_SPACE_GENERIC_P' there, so the address space handling doesn't seem
> > |> to be necessary there (but also won't do any harm: no-op).
> >>
> >> Are you sure this can't trigger?
> >> Say
> >> extern int __seg_fs a;
> >>
> >> void
> >> foo (void)
> >> {
> >>   #pragma omp parallel private (a)
> >>   a = 2;
> >> }
> >
> > That test case doesn't run into 'omp_build_component_ref' at all,
> > but [I've pushed an altered and extended variant that does],
> > "Add 'libgomp.c/address-space-1.c'".
> >
> > In this case, 'omp_build_component_ref' called via host compilation
> > 'pass_lower_omp', it's the 'field_type' that has 'address-space-1', not
> > 'obj_type', so indeed Kwok's new code is a no-op:
> >
> > (gdb) call debug_tree(field_type)
> >   > type 
> >> I think keeping the qual addr space here is the wrong thing to do,
> >> it should keep the other quals and clear the address space instead,
> >> the whole struct is going to be in generic address space, isn't it?
> >
> > Correct for 'omp_build_component_ref' called via host compilation
> > 'pass_lower_omp'
>
> > However, regarding the former comment -- shouldn't we force generic
> > address space for all 'tree' types read in via LTO streaming for
> > offloading compilation?  I assume that (in the general case) address
> > spaces are never compatible between host and offloading compilation?
> > For [...] "Add 'libgomp.c/address-space-1.c'", propagating the
> > '__seg_fs' address space across the offloading boundary (assuming I did
> > interpret the dumps correctly) doesn't seem to cause any problems
>
> As I found later, actually the 'address-space-1' per host '__seg_fs' does
> cause the "Intel MIC (emulated) offloading execution failure"
> mentioned/XFAILed for 'libgomp.c/address-space-1.c': SIGSEGV, like
> (expected) for host execution.  For GCN offloading target, it maps to
> GCN 'ADDR_SPACE_FLAT' which apparently doesn't cause any ill effects (for
> that simple test case).  The nvptx offloading target doesn't consider
> address spaces at all.
>
> Is the attached "Host and offload targets have no common meaning of
> address spaces" OK to push?
>
>
> Then, is that the way to do this, or should we add in
> 'gcc/tree-streamer-out.c:pack_ts_base_value_fields':
>
> if (lto_stream_offload_p)
>   gcc_assert (ADDR_SPACE_GENERIC_P (TYPE_ADDR_SPACE (expr)));
>
> ..., and elsewhere sanitize this for offloading compilation?  Jakub's
> suggestion above, regarding 'gcc/omp-low.c:omp_build_component_ref':
>
> | I think keeping the qual addr space here is the wrong thing to do,
> | it should keep the other quals and clear the address space instead
>
> But it's not obvious to me that indeed this is the one place where this
> would need to be done?  (It ought to work for
> 'libgomp.c/address-space-1.c', and any other occurrences would run into
> the 'assert', so that ought to be "fine", though?)
>
>
> And, should we have a new hook
> 'void targetm.addr_space.validate (addr_space_t as)' (better name?),
> called via 'gcc/emit-rtl.c:set_mem_attrs' (only? -- assuming this is the
> appropriate canonic function where address space use is observed?), to
> make sure that the requested 'as' is valid for the target?
> 'default_addr_space_validate' would refuse everything but
> 'ADDR_SPACE_GENERIC_P (as)'; this hook would need implementing for all
> handful of targets making use of address spaces (supposedly matching the
> logic how they call 'c_register_addr_space'?).  (The closest existing
> hook seems to be 'targetm.addr_space.diagnose_usage', only defined for
> AVR, and called from "the front ends" (C only).)

Are address-spaces to be used in any way for OpenMP offload code?  That is,
does the OpenMP standard talk about them and how to remap things?  I'd
say I agree that any host address-space should go away when the corresponding
data is offloaded and in case OpenMP allows to specify a target address-space
that would need to be instantiated in a way so the LTO streaming knows about
a mapping from the host to the target representation.

Richard.

>
> Grüße
>  Thomas
>
>
> -
> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
> München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
> Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
> München, HRB 106955


Re: [PATCH] Make sure we're playing with integral modes before call extract_integral_bit_field.

2021-08-24 Thread Richard Biener via Gcc-patches
On Tue, Aug 24, 2021 at 11:38 AM Hongtao Liu  wrote:
>
> On Tue, Aug 24, 2021 at 5:40 PM Hongtao Liu  wrote:
> >
> > On Tue, Aug 17, 2021 at 9:52 AM Hongtao Liu  wrote:
> > >
> > > On Mon, Aug 9, 2021 at 4:34 PM Hongtao Liu  wrote:
> > > >
> > > > On Fri, Aug 6, 2021 at 7:27 PM Richard Biener via Gcc-patches
> > > >  wrote:
> > > > >
> > > > > On Fri, Aug 6, 2021 at 11:05 AM Richard Sandiford
> > > > >  wrote:
> > > > > >
> > > > > > Richard Biener via Gcc-patches  writes:
> > > > > > > On Fri, Aug 6, 2021 at 5:32 AM liuhongt  
> > > > > > > wrote:
> > > > > > >>
> > > > > > >> Hi:
> > > > > > >> ---
> > > > > > >> OK, I think sth is amiss here upthread.  insv/extv do look like 
> > > > > > >> they
> > > > > > >> are designed
> > > > > > >> to work on integer modes (but docs do not say anything about 
> > > > > > >> this here).
> > > > > > >> In fact the caller of extract_bit_field_using_extv is named
> > > > > > >> extract_integral_bit_field.  Of course nothing seems to check 
> > > > > > >> what kind of
> > > > > > >> modes we're dealing with, but we're for example happily doing
> > > > > > >> expand_shift in 'mode'.  In the extract_integral_bit_field call 
> > > > > > >> 'mode' is
> > > > > > >> some integer mode and op0 is HFmode?  From the above I get it's
> > > > > > >> the other way around?  In that case we should wrap the
> > > > > > >> call to extract_integral_bit_field, extracting in an integer 
> > > > > > >> mode with the
> > > > > > >> same size as 'mode' and then converting the result as (subreg:HF 
> > > > > > >> (reg:HI ...)).
> > > > > > >> ---
> > > > > > >>   This is a separate patch as a follow up of upper comments.
> > > > > > >>
> > > > > > >> gcc/ChangeLog:
> > > > > > >>
> > > > > > >> * expmed.c (extract_bit_field_1): Wrap the call to
> > > > > > >> extract_integral_bit_field, extracting in an integer 
> > > > > > >> mode with
> > > > > > >> the same size as 'tmode' and then converting the result
> > > > > > >> as (subreg:tmode (reg:imode)).
> > > > > > >>
> > > > > > >> gcc/testsuite/ChangeLog:
> > > > > > >> * gcc.target/i386/float16-5.c: New test.
> > > > > > >> ---
> > > > > > >>  gcc/expmed.c  | 19 
> > > > > > >> +++
> > > > > > >>  gcc/testsuite/gcc.target/i386/float16-5.c | 12 
> > > > > > >>  2 files changed, 31 insertions(+)
> > > > > > >>  create mode 100644 gcc/testsuite/gcc.target/i386/float16-5.c
> > > > > > >>
> > > > > > >> diff --git a/gcc/expmed.c b/gcc/expmed.c
> > > > > > >> index 3143f38e057..72790693ef0 100644
> > > > > > >> --- a/gcc/expmed.c
> > > > > > >> +++ b/gcc/expmed.c
> > > > > > >> @@ -1850,6 +1850,25 @@ extract_bit_field_1 (rtx str_rtx, 
> > > > > > >> poly_uint64 bitsize, poly_uint64 bitnum,
> > > > > > >>op0_mode = opt_scalar_int_mode ();
> > > > > > >>  }
> > > > > > >>
> > > > > > >> +  /* Make sure we are playing with integral modes.  Pun with 
> > > > > > >> subregs
> > > > > > >> + if we aren't. When tmode is HFmode, op0 is SImode, there 
> > > > > > >> will be ICE
> > > > > > >> + in extract_integral_bit_field.  */
> > > > > > >> +  if (int_mode_for_mode (tmode).exists (&imode)
> > > > > > >
> > > > > > > check !INTEGRAL_MODE_P (tmode) before, that should be slightly
> > > > > > > cheaper.  Then imode should always be != tmode.  Maybe
> > > > > > > even GET_MODE_CLASS (tmode) != MODE_INT since I'm not sure
> > > > > > > how it behaves for composite modes.
> > > > > > >
> > > > > > > Of course the least surprises would happen when we restrict this
> > > > > > > to FLOAT_MODE_P (tmode).
> > > > > > >
> > > > > > > Richard - any preferences?
> > > > > >
> > > > > > If the bug is that extract_integral_bit_field is being called with
> > > > > > a non-integral mode parameter, then it looks odd that we can still
> > > > > > fall through to it without an integral mode (when exists is false).
> > > > > >
> > > > > > If calling extract_integral_bit_field without an integral mode is
> > > > > > a bug then I think we should have:
> > > > > >
> > > > > >   int_mode_for_mode (mode).require ()
> > > > > >
> > > > > > whenever mode is not already 
> > > > > > SCALAR_INT_MODE_P/is_a <scalar_int_mode>.
> > > > > > Ideally we'd make the mode parameter scalar_int_mode too.
> > > > > >
> > > > > > extract_integral_bit_field currently has:
> > > > > >
> > > > > >   /* Find a correspondingly-sized integer field, so we can apply
> > > > > >  shifts and masks to it.  */
> > > > > >   scalar_int_mode int_mode;
> > > > > >   if (!int_mode_for_mode (tmode).exists (&int_mode))
> > > > > > /* If this fails, we should probably push op0 out to memory and 
> > > > > > then
> > > > > >do a load.  */
> > > > > > int_mode = int_mode_for_mode (mode).require ();
> > > > > >
> > > > > > which would seem to be redundant after this change.
> > > > >
> > > > > I'm not sure what exactly the bug is, but extract_integral_bit_field 
> > > > > ends
> > > > > up creating a lowpart 

[PATCH] tree-optimization/100089 - avoid leaving scalar if-converted code around

2021-08-24 Thread Richard Biener via Gcc-patches
This avoids leaving scalar if-converted code around for the case
of BB vectorizing an if-converted loop body when using the very-cheap
cost model.  In this case we scan not vectorized scalar stmts in
the basic-block vectorized for COND_EXPRs and force the vectorization
to be marked as not profitable.
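
For illustration, the kind of loop body this concerns (a hypothetical example, not the testcase from the PR): after if-conversion the branch in such a loop becomes a COND_EXPR, a conditional-select statement with no control flow, and if BB vectorization leaves that statement scalar, the very-cheap model has to reject the transform.

```cpp
#include <cassert>

// Hypothetical if-converted candidate: the conditional below is turned
// into a COND_EXPR by if-conversion, which is profitable only if it is
// subsequently vectorized rather than left as scalar straight-line code.
void
clamp_to_zero (const int *b, int *a, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = b[i] > 0 ? b[i] : 0;  // becomes _t = b[i] > 0 ? b[i] : 0;
}
```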

The patch also makes sure to always consider all BB vectorization
subgraphs together for costing purposes when vectorizing an
if-converted loop body.

Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?

Thanks,
Richard.

2021-08-24  Richard Biener  

PR tree-optimization/100089
* tree-vectorizer.h (vect_slp_bb): Rename to ...
(vect_slp_if_converted_bb): ... this and get the original
loop as new argument.
* tree-vectorizer.c (try_vectorize_loop_1): Revert previous fix,
pass original loop to vect_slp_if_converted_bb.
* tree-vect-slp.c (vect_bb_vectorization_profitable_p):
If orig_loop was passed scan the not vectorized stmts
for COND_EXPRs and force not profitable if found.
(vect_slp_region): Pass down all SLP instances to costing
if orig_loop was specified.
(vect_slp_bbs): Pass through orig_loop.
(vect_slp_bb): Rename to ...
(vect_slp_if_converted_bb): ... this and get the original
loop as new argument.
(vect_slp_function): Adjust.
---
 gcc/tree-vect-slp.c   | 70 ++-
 gcc/tree-vectorizer.c | 20 +++--
 gcc/tree-vectorizer.h |  2 +-
 3 files changed, 68 insertions(+), 24 deletions(-)

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 3ed5bc1989a..8bfa45772d3 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -5287,7 +5287,8 @@ li_cost_vec_cmp (const void *a_, const void *b_)
 
 static bool
 vect_bb_vectorization_profitable_p (bb_vec_info bb_vinfo,
-   vec<slp_instance> slp_instances)
+   vec<slp_instance> slp_instances,
+   loop_p orig_loop)
 {
   slp_instance instance;
   int i;
@@ -5324,6 +5325,30 @@ vect_bb_vectorization_profitable_p (bb_vec_info bb_vinfo,
   vector_costs.safe_splice (instance->cost_vec);
   instance->cost_vec.release ();
 }
+  /* When we're vectorizing an if-converted loop body with the
+ very-cheap cost model make sure we vectorized all if-converted
+ code.  */
+  bool force_not_profitable = false;
+  if (orig_loop && flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP)
+{
+  gcc_assert (bb_vinfo->bbs.length () == 1);
+  for (gimple_stmt_iterator gsi = gsi_start_bb (bb_vinfo->bbs[0]);
+  !gsi_end_p (gsi); gsi_next (&gsi))
+   {
+ /* The costing above left us with DCEable vectorized scalar
+stmts having the visited flag set.  */
+ if (gimple_visited_p (gsi_stmt (gsi)))
+   continue;
+
+ if (gassign *ass = dyn_cast <gassign *> (gsi_stmt (gsi)))
+   if (gimple_assign_rhs_code (ass) == COND_EXPR)
+ {
+   force_not_profitable = true;
+   break;
+ }
+   }
+}
+
   /* Unset visited flag.  */
   stmt_info_for_cost *cost;
   FOR_EACH_VEC_ELT (scalar_costs, i, cost)
@@ -5448,9 +5473,14 @@ vect_bb_vectorization_profitable_p (bb_vec_info bb_vinfo,
   return false;
 }
 
+  if (dump_enabled_p () && force_not_profitable)
+dump_printf_loc (MSG_NOTE, vect_location,
+"not profitable because of unprofitable if-converted "
+"scalar code\n");
+
   scalar_costs.release ();
   vector_costs.release ();
-  return true;
+  return !force_not_profitable;
 }
 
 /* qsort comparator for lane defs.  */
@@ -5895,7 +5925,8 @@ vect_slp_analyze_bb_1 (bb_vec_info bb_vinfo, int n_stmts, bool &fatal,
 
 static bool
vect_slp_region (vec<basic_block> bbs, vec<data_reference_p> datarefs,
-     vec<int> *dataref_groups, unsigned int n_stmts)
+     vec<int> *dataref_groups, unsigned int n_stmts,
+     loop_p orig_loop)
 {
   bb_vec_info bb_vinfo;
   auto_vector_modes vector_modes;
@@ -5944,7 +5975,9 @@ vect_slp_region (vec<basic_block> bbs, vec<data_reference_p> datarefs,
  vect_location = instance->location ();
  if (!unlimited_cost_model (NULL)
  && !vect_bb_vectorization_profitable_p
-   (bb_vinfo, instance->subgraph_entries))
+   (bb_vinfo,
+orig_loop ? BB_VINFO_SLP_INSTANCES (bb_vinfo)
+: instance->subgraph_entries, orig_loop))
{
  for (slp_instance inst : instance->subgraph_entries)
if (inst->kind == slp_inst_kind_bb_reduc)
@@ -5965,7 +5998,9 @@ vect_slp_region (vec<basic_block> bbs, vec<data_reference_p> datarefs,
 "using SLP\n");
  vectorized = true;
 
- vect_schedule_slp (bb_vinfo, instance->subgraph_entries);
+ vect_schedule_slp (bb_vinfo,
+orig_loop ? 

Re: [PATCH] Optimize macro: make it more predictable

2021-08-24 Thread Martin Liška

PING^2

On 8/10/21 17:52, Martin Liška wrote:

PING^1

On 7/1/21 3:13 PM, Martin Liška wrote:

On 10/23/20 1:47 PM, Martin Liška wrote:

Hey.


Hello.

I deferred the patch for GCC 12. Since then, I have messed around with options
and feel more familiar with the option handling. So ...



This is a follow-up of the discussion that happened in thread about 
no_stack_protector
attribute: https://gcc.gnu.org/pipermail/gcc-patches/2020-May/545916.html

The current optimize attribute works in the following way:
- 1) we take current global_options as base
- 2) maybe_default_options is called for the currently selected optimization 
level, which
  means all rules in default_options_table are executed
- 3) attribute values are applied (via decode_options)

So step 2) is problematic: in the case of -O2 -fno-omit-frame-pointer with
__attribute__((optimize("-fno-stack-protector"))),
one basically ends up with -O2 -fno-stack-protector, because
-fomit-frame-pointer is the default at -O1 and above:
 /* -O1 and -Og optimizations.  */
 { OPT_LEVELS_1_PLUS, OPT_fomit_frame_pointer, NULL, 1 },
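
The clobbering in step 2 can be modeled with a toy option store (names are hypothetical, not GCC's option machinery): re-applying the optimization-level defaults on top of the command-line options silently overwrites an explicit negation before the attribute values are applied in step 3.

```cpp
#include <cassert>
#include <map>
#include <string>

using opts = std::map<std::string, bool>;

// Step 2: level defaults, applied unconditionally in the old scheme.
static void
apply_level_defaults (opts &o)
{
  o["omit-frame-pointer"] = true;  // the OPT_LEVELS_1_PLUS default above
}

// Old behavior: base options -> level defaults -> attribute values.
opts
old_optimize_attribute (opts cmdline, const opts &attr)
{
  apply_level_defaults (cmdline);  // step 2: clobbers the explicit negation
  for (const auto &kv : attr)      // step 3: apply attribute values
    cmdline[kv.first] = kv.second;
  return cmdline;
}
```

With -O2 -fno-omit-frame-pointer plus optimize("-fno-stack-protector"), the frame-pointer setting silently flips back to the default, which is exactly the surprise described above.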

With my patch, the optimize attribute really behaves the same as
appending the attribute value to the command line. So far so good. We
should also reflect that in the documentation entry, which is quite
vague right now:


^^^ all these are still valid arguments, plus I'm adding a new test-case that 
tests that.



"""
The optimize attribute is used to specify that a function is to be compiled 
with different optimization options than specified on the command line.
"""


I addressed that with documentation changes; it should be clearer to users now.
Moreover, I noticed that we declare the 'optimize' attribute
as something not for production use:

"The optimize attribute should be used for debugging purposes only. It is not 
suitable in production code."

Are we sure about the statement? I know that e.g. glibc uses that.



and we may want to handle -Ox in the attribute in a special way. I guess many 
macro/pragma users expect that

-O2 -ftree-vectorize and __attribute__((optimize(1))) will end with -O1 and not
with -ftree-vectorize -O1 ?


The situation with 'target' attribute is different. When parsing the attribute, 
we intentionally drop all existing target flags:

$ cat -n gcc/config/i386/i386-options.c
...
   1245    if (opt == IX86_FUNCTION_SPECIFIC_ARCH)
   1246      {
   1247        /* If arch= is set,  clear all bits in x_ix86_isa_flags,
   1248           except for ISA_64BIT, ABI_64, ABI_X32, and CODE16
   1249           and all bits in x_ix86_isa_flags2.  */
   1250        opts->x_ix86_isa_flags &= (OPTION_MASK_ISA_64BIT
   1251                                   | OPTION_MASK_ABI_64
   1252                                   | OPTION_MASK_ABI_X32
   1253                                   | OPTION_MASK_CODE16);
   1254        opts->x_ix86_isa_flags_explicit &= (OPTION_MASK_ISA_64BIT
   1255                                            | OPTION_MASK_ABI_64
   1256                                            | OPTION_MASK_ABI_X32
   1257                                            | OPTION_MASK_CODE16);
   1258        opts->x_ix86_isa_flags2 = 0;
   1259        opts->x_ix86_isa_flags2_explicit = 0;
   1260      }

That seems logical because target attribute is used for e.g. ifunc 
multi-versioning and one needs
to be sure all existing ISA flags are dropped. However, I noticed clang behaves 
differently:

$ cat hreset.c
#pragma GCC target "arch=geode"
#include <immintrin.h>
void foo(unsigned int eax)
{
   _hreset (eax);
}

$ clang hreset.c -mhreset  -c -O2 -m32
$ gcc hreset.c -mhreset  -c -O2 -m32
In file included from /home/marxin/bin/gcc/lib64/gcc/x86_64-pc-linux-gnu/12.0.0/include/x86gprintrin.h:97,
                 from /home/marxin/bin/gcc/lib64/gcc/x86_64-pc-linux-gnu/12.0.0/include/immintrin.h:27,
                 from hreset.c:2:
hreset.c: In function ‘foo’:
/home/marxin/bin/gcc/lib64/gcc/x86_64-pc-linux-gnu/12.0.0/include/hresetintrin.h:39:1: error: inlining failed in call to ‘always_inline’ ‘_hreset’: target specific option mismatch
    39 | _hreset (unsigned int __EAX)
   | ^~~
hreset.c:5:3: note: called from here
 5 |   _hreset (eax);
   |   ^

Anyway, I think the current target attribute handling should be preserved.

Patch can bootstrap on x86_64-linux-gnu and survives regression tests.

Ready to be installed?
Thanks,
Martin



I'm also planning to take a look at the target macro/attribute, I expect 
similar problems:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97469

Thoughts?
Thanks,
Martin








[committed 3/6] arm: Add command-line option for enabling CVE-2021-35465 mitigation [PR102035]

2021-08-24 Thread Richard Earnshaw via Gcc-patches
Add a new option, -mfix-cmse-cve-2021-35465 and document it.  Enable it
automatically for cortex-m33, cortex-m35p and cortex-m55.
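
The diff below uses the common tri-state option pattern, which can be sketched minimally like this (standalone toy code, not the actual arm.c logic): Init(2) marks "not given on the command line", and the override hook resolves it from the CPU quirk bits.

```cpp
#include <cassert>

// Tri-state command-line flag: 0 = -mno-fix-..., 1 = -mfix-...,
// 2 = Init(2), i.e. unset, to be resolved from the selected CPU.
int fix_vlldm = 2;

void
resolve_fix_vlldm (bool cpu_has_quirk_vlldm)
{
  if (fix_vlldm == 2)
    fix_vlldm = cpu_has_quirk_vlldm ? 1 : 0;
}
```

An explicit -mfix-... or -mno-fix-... sets the variable to 1 or 0 up front, so the quirk-based default never overrides the user's choice.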

gcc:
PR target/102035
* config/arm/arm.opt (mfix-cmse-cve-2021-35465): New option.
* doc/invoke.texi (Arm Options): Document it.
* config/arm/arm-cpus.in (quirk_vlldm): New feature bit.
(ALL_QUIRKS): Add quirk_vlldm.
(cortex-m33): Add quirk_vlldm.
(cortex-m35p, cortex-m55): Likewise.
* config/arm/arm.c (arm_option_override): Enable fix_vlldm if
targeting an affected CPU and not explicitly controlled on
the command line.
---
 gcc/config/arm/arm-cpus.in | 9 +++--
 gcc/config/arm/arm.c   | 9 +
 gcc/config/arm/arm.opt | 4 
 gcc/doc/invoke.texi| 9 +
 4 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/gcc/config/arm/arm-cpus.in b/gcc/config/arm/arm-cpus.in
index 249995a6bca..bcc9ebe9fe0 100644
--- a/gcc/config/arm/arm-cpus.in
+++ b/gcc/config/arm/arm-cpus.in
@@ -186,6 +186,9 @@ define feature quirk_armv6kz
 # Cortex-M3 LDRD quirk.
 define feature quirk_cm3_ldrd
 
+# v8-m/v8.1-m VLLDM errata.
+define feature quirk_vlldm
+
 # Don't use .cpu assembly directive
 define feature quirk_no_asmcpu
 
@@ -322,7 +325,7 @@ define implied vfp_base MVE MVE_FP ALL_FP
 # architectures.
 # xscale isn't really a 'quirk', but it isn't an architecture either and we
 # need to ignore it for matching purposes.
-define fgroup ALL_QUIRKS   quirk_no_volatile_ce quirk_armv6kz quirk_cm3_ldrd xscale quirk_no_asmcpu
+define fgroup ALL_QUIRKS   quirk_no_volatile_ce quirk_armv6kz quirk_cm3_ldrd quirk_vlldm xscale quirk_no_asmcpu
 
 define fgroup IGNORE_FOR_MULTILIB cdecp0 cdecp1 cdecp2 cdecp3 cdecp4 cdecp5 cdecp6 cdecp7
 
@@ -1571,6 +1574,7 @@ begin cpu cortex-m33
  architecture armv8-m.main+dsp+fp
  option nofp remove ALL_FP
  option nodsp remove armv7em
+ isa quirk_vlldm
  costs v7m
 end cpu cortex-m33
 
@@ -1580,6 +1584,7 @@ begin cpu cortex-m35p
  architecture armv8-m.main+dsp+fp
  option nofp remove ALL_FP
  option nodsp remove armv7em
+ isa quirk_vlldm
  costs v7m
 end cpu cortex-m35p
 
@@ -1591,7 +1596,7 @@ begin cpu cortex-m55
  option nomve remove mve mve_float
  option nofp remove ALL_FP mve_float
  option nodsp remove MVE mve_float
- isa quirk_no_asmcpu
+ isa quirk_no_asmcpu quirk_vlldm
  costs v7m
  vendor 41
 end cpu cortex-m55
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 11dafc70067..5c929417f93 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -3616,6 +3616,15 @@ arm_option_override (void)
fix_cm3_ldrd = 0;
 }
 
+  /* Enable fix_vlldm by default if required.  */
+  if (fix_vlldm == 2)
+{
+  if (bitmap_bit_p (arm_active_target.isa, isa_bit_quirk_vlldm))
+   fix_vlldm = 1;
+  else
+   fix_vlldm = 0;
+}
+
   /* Hot/Cold partitioning is not currently supported, since we can't
  handle literal pool placement in that case.  */
   if (flag_reorder_blocks_and_partition)
diff --git a/gcc/config/arm/arm.opt b/gcc/config/arm/arm.opt
index 7417b55122a..a7677eeb45c 100644
--- a/gcc/config/arm/arm.opt
+++ b/gcc/config/arm/arm.opt
@@ -268,6 +268,10 @@ Target Var(fix_cm3_ldrd) Init(2)
 Avoid overlapping destination and address registers on LDRD instructions
 that may trigger Cortex-M3 errata.
 
+mfix-cmse-cve-2021-35465
+Target Var(fix_vlldm) Init(2)
+Mitigate issues with VLLDM on some M-profile devices (CVE-2021-35465).
+
 munaligned-access
 Target Var(unaligned_access) Init(2) Save
 Enable unaligned word and halfword accesses to packed data.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index a9d56fecf4e..b8f5d9e1cce 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -808,6 +808,7 @@ Objective-C and Objective-C++ Dialects}.
 -mverbose-cost-dump @gol
 -mpure-code @gol
 -mcmse @gol
+-mfix-cmse-cve-2021-35465 @gol
 -mfdpic}
 
 @emph{AVR Options}
@@ -20743,6 +20744,14 @@ Generate secure code as per the "ARMv8-M Security Extensions: Requirements on
 Development Tools Engineering Specification", which can be found on
 @url{https://developer.arm.com/documentation/ecm0359818/latest/}.
 
+@item -mfix-cmse-cve-2021-35465
+@opindex mfix-cmse-cve-2021-35465
+Mitigate against a potential security issue with the @code{VLLDM} instruction
+in some M-profile devices when using CMSE (CVE-2021-35465).  This option is
+enabled by default when the option @option{-mcpu=} is used with
+@code{cortex-m33}, @code{cortex-m35p} or @code{cortex-m55}.  The option
+@option{-mno-fix-cmse-cve-2021-35465} can be used to disable the mitigation.
+
 @item -mfdpic
 @itemx -mno-fdpic
 @opindex mfdpic
-- 
2.25.1
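The `Init(2)` default on `fix_vlldm` together with the `fix_vlldm == 2` check in `arm_option_override` above is GCC's usual tri-state option idiom: 0 and 1 record an explicit `-mno-fix-...`/`-mfix-...` on the command line, while 2 means "unset", to be resolved from the selected CPU's quirk bits. A minimal sketch of that resolution (the helper name is hypothetical, not part of GCC):

```c
#include <assert.h>
#include <stdbool.h>

/* Tri-state option resolution: 0 = user passed -mno-fix-..., 1 = user
   passed -mfix-..., 2 = option left at its Init(2) default.  Only the
   "unset" state consults the CPU's quirk_vlldm bit.  */
static int
resolve_fix_vlldm (int fix_vlldm, bool cpu_has_quirk_vlldm)
{
  if (fix_vlldm == 2)
    fix_vlldm = cpu_has_quirk_vlldm ? 1 : 0;
  return fix_vlldm;
}
```

An explicit command-line choice always wins; the quirk bit only decides the default.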



[committed 6/6] arm: Add tests for VLLDM mitigation [PR102035]

2021-08-24 Thread Richard Earnshaw via Gcc-patches
New tests for the erratum mitigation.

gcc/testsuite:
PR target/102035
* gcc.target/arm/cmse/mainline/8_1m/soft/cmse-13a.c: New test.
* gcc.target/arm/cmse/mainline/8_1m/soft/cmse-7a.c: Likewise.
* gcc.target/arm/cmse/mainline/8_1m/soft/cmse-8a.c: Likewise.
* gcc.target/arm/cmse/mainline/8_1m/softfp-sp/cmse-7a.c: Likewise.
* gcc.target/arm/cmse/mainline/8_1m/softfp-sp/cmse-8a.c: Likewise.
* gcc.target/arm/cmse/mainline/8_1m/softfp/cmse-13a.c: Likewise.
* gcc.target/arm/cmse/mainline/8_1m/softfp/cmse-7a.c: Likewise.
* gcc.target/arm/cmse/mainline/8_1m/softfp/cmse-8a.c: Likewise.
---
 .../arm/cmse/mainline/8_1m/soft/cmse-13a.c| 31 +++
 .../arm/cmse/mainline/8_1m/soft/cmse-7a.c | 28 +
 .../arm/cmse/mainline/8_1m/soft/cmse-8a.c | 30 ++
 .../cmse/mainline/8_1m/softfp-sp/cmse-7a.c| 27 
 .../cmse/mainline/8_1m/softfp-sp/cmse-8a.c| 29 +
 .../arm/cmse/mainline/8_1m/softfp/cmse-13a.c  | 30 ++
 .../arm/cmse/mainline/8_1m/softfp/cmse-7a.c   | 27 
 .../arm/cmse/mainline/8_1m/softfp/cmse-8a.c   | 29 +
 8 files changed, 231 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/soft/cmse-13a.c
 create mode 100644 gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/soft/cmse-7a.c
 create mode 100644 gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/soft/cmse-8a.c
 create mode 100644 gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/softfp-sp/cmse-7a.c
 create mode 100644 gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/softfp-sp/cmse-8a.c
 create mode 100644 gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/softfp/cmse-13a.c
 create mode 100644 gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/softfp/cmse-7a.c
 create mode 100644 gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/softfp/cmse-8a.c

diff --git a/gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/soft/cmse-13a.c b/gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/soft/cmse-13a.c
new file mode 100644
index 000..553cc7837e1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/soft/cmse-13a.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-mcmse -mfloat-abi=soft -mfix-cmse-cve-2021-35465" }  */
+/* { dg-skip-if "Incompatible float ABI" { *-*-* } { "-mfloat-abi=*" } { "-mfloat-abi=soft" } } */
+
+#include "../../../cmse-13.x"
+
+/* Checks for saving and clearing prior to function call.  */
+/* Shift on the same register as blxns.  */
+/* { dg-final { scan-assembler "lsrs\t(r\[1,4-9\]|r10|fp|ip), \\1, #1.*blxns\t\\1" } } */
+/* { dg-final { scan-assembler "lsls\t(r\[1,4-9\]|r10|fp|ip), \\1, #1.*blxns\t\\1" } } */
+/* { dg-final { scan-assembler-not "mov\tr0, r4" } } */
+/* { dg-final { scan-assembler-not "mov\tr2, r4" } } */
+/* { dg-final { scan-assembler-not "mov\tr3, r4" } } */
+/* { dg-final { scan-assembler "push\t\{r4, r5, r6, r7, r8, r9, r10, fp\}" } } */
+/* { dg-final { scan-assembler "vlstm\tsp" } } */
+/* Check the right registers are cleared and none appears twice.  */
+/* { dg-final { scan-assembler "clrm\t\{(r1, )?(r4, )?(r5, )?(r6, )?(r7, )?(r8, )?(r9, )?(r10, )?(fp, )?(ip, )?APSR\}" } } */
+/* Check that the right number of registers is cleared and thus only one
+   register is missing.  */
+/* { dg-final { scan-assembler "clrm\t\{((r\[1,4-9\]|r10|fp|ip), ){9}APSR\}" } } */
+/* Check that no cleared register is used for blxns.  */
+/* { dg-final { scan-assembler-not "clrm\t\{\[^\}\]\+(r\[1,4-9\]|r10|fp|ip),\[^\}\]\+\}.*blxns\t\\1" } } */
+/* Check for v8.1-m variant of erratum work-around.  */
+/* { dg-final { scan-assembler "vscclrm\t\{vpr\}" } } */
+/* { dg-final { scan-assembler "vlldm\tsp" } } */
+/* { dg-final { scan-assembler "pop\t\{r4, r5, r6, r7, r8, r9, r10, fp\}" } } */
+/* { dg-final { scan-assembler-not "vmov" } } */
+/* { dg-final { scan-assembler-not "vmsr" } } */
+
+/* Now we check that we use the correct intrinsic to call.  */
+/* { dg-final { scan-assembler "blxns" } } */
diff --git a/gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/soft/cmse-7a.c b/gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/soft/cmse-7a.c
new file mode 100644
index 000..ce02fdea643
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/soft/cmse-7a.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-mcmse -mfloat-abi=soft -mfix-cmse-cve-2021-35465" }  */
+/* { dg-skip-if "Incompatible float ABI" { *-*-* } { "-mfloat-abi=*" } { "-mfloat-abi=soft" } } */
+
+#include "../../../cmse-7.x"
+
+/* Checks for saving and clearing prior to function call.  */
+/* Shift on the same register as blxns.  */
+/* { dg-final { scan-assembler "lsrs\t(r\[0-9\]|r10|fp|ip), \\1, #1.*blxns\t\\1" } } */
+/* { dg-final { scan-assembler "lsls\t(r\[0-9\]|r10|fp|ip), \\1, #1.*blxns\t\\1" } } */
+/* { dg-final { scan-assembler "push\t\{r4, r5, r6, r7, r8, r9, r10, fp\}" } } */

[committed 5/6] arm: fix vlldm erratum for Armv8.1-m [PR102035]

2021-08-24 Thread Richard Earnshaw via Gcc-patches
For Armv8.1-m we generate code that emits VLLDM directly and do not
rely on support code in the library, so emit the mitigation directly
as well, when required.  In this case, we can use the compiler options
to determine when to apply the fix and when it is safe to omit it.

gcc:
PR target/102035
* config/arm/arm.md (attribute arch): Add fix_vlldm.
(arch_enabled): Use it.
* config/arm/vfp.md (lazy_store_multiple_insn): Add alternative to
use when erratum mitigation is needed.
---
 gcc/config/arm/arm.md | 11 +--
 gcc/config/arm/vfp.md | 10 +++---
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 064604808cc..5d3f21b91c4 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -132,9 +132,12 @@ (define_attr "length" ""
 ; TARGET_32BIT, "t1" or "t2" to specify a specific Thumb mode.  "v6"
 ; for ARM or Thumb-2 with arm_arch6, and nov6 for ARM without
 ; arm_arch6.  "v6t2" for Thumb-2 with arm_arch6 and "v8mb" for ARMv8-M
-; Baseline.  This attribute is used to compute attribute "enabled",
+; Baseline.  "fix_vlldm" is for fixing the v8-m/v8.1-m VLLDM erratum.
+; This attribute is used to compute attribute "enabled",
 ; use type "any" to enable an alternative in all cases.
-(define_attr "arch" "any,a,t,32,t1,t2,v6,nov6,v6t2,v8mb,iwmmxt,iwmmxt2,armv6_or_vfpv3,neon,mve"
+(define_attr "arch" "any, a, t, 32, t1, t2, v6,nov6, v6t2, \
+v8mb, fix_vlldm, iwmmxt, iwmmxt2, armv6_or_vfpv3, \
+neon, mve"
   (const_string "any"))
 
 (define_attr "arch_enabled" "no,yes"
@@ -177,6 +180,10 @@ (define_attr "arch_enabled" "no,yes"
  (match_test "TARGET_THUMB1 && arm_arch8"))
 (const_string "yes")
 
+(and (eq_attr "arch" "fix_vlldm")
+ (match_test "fix_vlldm"))
+(const_string "yes")
+
 (and (eq_attr "arch" "iwmmxt2")
  (match_test "TARGET_REALLY_IWMMXT2"))
 (const_string "yes")
diff --git a/gcc/config/arm/vfp.md b/gcc/config/arm/vfp.md
index 9961f9389fe..f0030a8c36a 100644
--- a/gcc/config/arm/vfp.md
+++ b/gcc/config/arm/vfp.md
@@ -1720,11 +1720,15 @@ (define_insn "lazy_store_multiple_insn"
 
 (define_insn "lazy_load_multiple_insn"
   [(unspec_volatile
-[(mem:BLK (match_operand:SI 0 "s_register_operand" "rk"))]
+[(mem:BLK (match_operand:SI 0 "s_register_operand" "rk,rk"))]
 VUNSPEC_VLLDM)]
   "use_cmse && reload_completed"
-  "vlldm%?\\t%0"
-  [(set_attr "predicable" "yes")
+  "@
+   vscclrm\\t{vpr}\;vlldm\\t%0
+   vlldm\\t%0"
+  [(set_attr "arch" "fix_vlldm,*")
+   (set_attr "predicable" "no")
+   (set_attr "length" "8,4")
(set_attr "type" "load_4")]
 )
 
-- 
2.25.1



[committed 0/6] arm: mitigation for CVE-2021-35465

2021-08-24 Thread Richard Earnshaw via Gcc-patches
Arm recently disclosed a security-related erratum (CVE-2021-35465) for
some armv8-m and armv8.1-m products relating to use of the VLLDM
instruction during the transition from secure to non-secure state.

This patch implements the recommended software mitigation for this
erratum for use on unfixed silicon products.

The patch series is essentially in two parts.  The first two patches
are really clean-ups that first address a problem with the RTL in the
machine description for VLLDM and VLSTM instructions and then improve
the reliability of testing for the availability of CMSE when running
the test suite.  The remaining patches then implement the mitigation
itself and add some additional tests to the testsuite.

I will also back-port this series to gcc-10 and gcc-11.

R.

Richard Earnshaw (6):
  arm: Fix general issues with patterns for VLLDM and VLSTM
  arm: testsuite: improve detection of CMSE hardware.
  arm: Add command-line option for enabling CVE-2021-35465 mitigation
[PR102035]
  arm: add erratum mitigation to __gnu_cmse_nonsecure_call [PR102035]
  arm: fix vlldm erratum for Armv8.1-m [PR102035]
  arm: Add tests for VLLDM mitigation [PR102035]

 gcc/config/arm/arm-cpus.in|  9 --
 gcc/config/arm/arm.c  |  9 ++
 gcc/config/arm/arm.md | 11 +--
 gcc/config/arm/arm.opt|  4 +++
 gcc/config/arm/vfp.md | 29 ++---
 gcc/doc/invoke.texi   |  9 ++
 .../arm/cmse/mainline/8_1m/soft/cmse-13a.c| 31 +++
 .../arm/cmse/mainline/8_1m/soft/cmse-7a.c | 28 +
 .../arm/cmse/mainline/8_1m/soft/cmse-8a.c | 30 ++
 .../cmse/mainline/8_1m/softfp-sp/cmse-7a.c| 27 
 .../cmse/mainline/8_1m/softfp-sp/cmse-8a.c| 29 +
 .../arm/cmse/mainline/8_1m/softfp/cmse-13a.c  | 30 ++
 .../arm/cmse/mainline/8_1m/softfp/cmse-7a.c   | 27 
 .../arm/cmse/mainline/8_1m/softfp/cmse-8a.c   | 29 +
 gcc/testsuite/lib/target-supports.exp | 15 -
 libgcc/config/arm/cmse_nonsecure_call.S   |  5 +++
 16 files changed, 299 insertions(+), 23 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/soft/cmse-13a.c
 create mode 100644 gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/soft/cmse-7a.c
 create mode 100644 gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/soft/cmse-8a.c
 create mode 100644 gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/softfp-sp/cmse-7a.c
 create mode 100644 gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/softfp-sp/cmse-8a.c
 create mode 100644 gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/softfp/cmse-13a.c
 create mode 100644 gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/softfp/cmse-7a.c
 create mode 100644 gcc/testsuite/gcc.target/arm/cmse/mainline/8_1m/softfp/cmse-8a.c

-- 
2.25.1



[committed 1/6] arm: Fix general issues with patterns for VLLDM and VLSTM

2021-08-24 Thread Richard Earnshaw via Gcc-patches
Both lazy_store_multiple_insn and lazy_load_multiple_insn contain
invalid RTL (eg they contain a post_inc statement outside of a mem).
What's more, the instructions concerned do not modify their input
address register.  We probably got away with this because they are
generated so late in the compilation that no subsequent pass needed to
understand them.  Nevertheless, this could cause problems someday, so
fixed to use a simple legal unspec.

gcc:
* config/arm/vfp.md (lazy_store_multiple_insn): Rewrite as valid RTL.
(lazy_load_multiple_insn): Likewise.
---
 gcc/config/arm/vfp.md | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/gcc/config/arm/vfp.md b/gcc/config/arm/vfp.md
index 93e963696da..9961f9389fe 100644
--- a/gcc/config/arm/vfp.md
+++ b/gcc/config/arm/vfp.md
@@ -1703,12 +1703,15 @@ (define_insn "*clear_vfp_multiple"
(set_attr "type" "mov_reg")]
 )
 
+;; Both this and the next instruction are treated by GCC in the same
+;; way as a blockage pattern.  That's perhaps stronger than it needs
+;; to be, but we do not want accesses to the VFP register bank to be
+;; moved across either instruction.
+
 (define_insn "lazy_store_multiple_insn"
-  [(set (match_operand:SI 0 "s_register_operand" "+")
-   (post_dec:SI (match_dup 0)))
-   (unspec_volatile [(const_int 0)
-(mem:SI (post_dec:SI (match_dup 0)))]
-   VUNSPEC_VLSTM)]
+  [(unspec_volatile
+[(mem:BLK (match_operand:SI 0 "s_register_operand" "rk"))]
+VUNSPEC_VLSTM)]
   "use_cmse && reload_completed"
   "vlstm%?\\t%0"
   [(set_attr "predicable" "yes")
@@ -1716,11 +1719,9 @@ (define_insn "lazy_store_multiple_insn"
 )
 
 (define_insn "lazy_load_multiple_insn"
-  [(set (match_operand:SI 0 "s_register_operand" "+")
-   (post_inc:SI (match_dup 0)))
-   (unspec_volatile:SI [(const_int 0)
-   (mem:SI (match_dup 0))]
-  VUNSPEC_VLLDM)]
+  [(unspec_volatile
+[(mem:BLK (match_operand:SI 0 "s_register_operand" "rk"))]
+VUNSPEC_VLLDM)]
   "use_cmse && reload_completed"
   "vlldm%?\\t%0"
   [(set_attr "predicable" "yes")
-- 
2.25.1



[committed 4/6] arm: add erratum mitigation to __gnu_cmse_nonsecure_call [PR102035]

2021-08-24 Thread Richard Earnshaw via Gcc-patches
Add the recommended erratum mitigation sequence to
__gnu_cmse_nonsecure_call for use on Armv8-m.main devices. Since this
is in the library code we cannot know in advance whether the core we
are running on will be affected by this, so always enable it.
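The mitigation only needs to act when secure floating-point state is active, which the sequence below detects by testing bit 3 (SFPA) of the CONTROL register before the conditional `vmov`. A small C model of just that bit test (the helper name is illustrative, not part of libgcc):

```c
#include <assert.h>
#include <stdbool.h>

/* CONTROL_S bit 3 is SFPA (Secure Floating-Point Active).  The
   "tst r5, #8 / it ne" pair in the mitigation executes the dummy
   vmov only when this bit is set.  */
static bool
sfpa_active (unsigned control)
{
  return (control & 0x8) != 0;
}
```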

libgcc:
PR target/102035
* config/arm/cmse_nonsecure_call.S (__gnu_cmse_nonsecure_call):
Add vlldm erratum work-around.
---
 libgcc/config/arm/cmse_nonsecure_call.S | 5 +
 1 file changed, 5 insertions(+)

diff --git a/libgcc/config/arm/cmse_nonsecure_call.S b/libgcc/config/arm/cmse_nonsecure_call.S
index 00830ade98e..c8e0fbbe665 100644
--- a/libgcc/config/arm/cmse_nonsecure_call.S
+++ b/libgcc/config/arm/cmse_nonsecure_call.S
@@ -102,6 +102,11 @@ blxns  r4
 #ifdef __ARM_PCS_VFP
 	vpop.f64	{d8-d15}
 #else
+	/* VLLDM erratum mitigation sequence.  */
+	mrs	r5, control
+	tst	r5, #8		/* CONTROL_S.SFPA */
+	it	ne
+	.inst.w	0xeeb00a40	/* vmovne s0, s0 */
 	vlldm	sp		/* Lazy restore of d0-d16 and FPSCR.  */
 	add	sp, sp, #0x88	/* Free space used to save floating point registers.  */
 #endif /* __ARM_PCS_VFP */
-- 
2.25.1



[committed 2/6] arm: testsuite: improve detection of CMSE hardware.

2021-08-24 Thread Richard Earnshaw via Gcc-patches
The test for CMSE support being available in hardware currently
relies on the compiler not optimizing away a secure gateway operation.
But even that is suspect, because the SG instruction is just a NOP
on armv8-m implementations that do not support the security extension.

Replace the existing test with a new one that reads and checks
the appropriate hardware feature register (memory mapped).  This has
to be run from secure mode, but that shouldn't matter, because if we
can't do that we can't really test the CMSE extensions anyway.  We
retain the SG instruction to ensure the test can't pass accidentally
if run on pre-armv8-m devices.
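The new check reads the memory-mapped ID_PFR1 register at 0xE000ED44 and examines its Security field, bits [7:4]: a nonzero field means the security extension is implemented, so the test program returns `!(id_pfr1 & 0xf0)` and exits 0 exactly on supported parts. A C model of that decode (function name is illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* ID_PFR1 bits [7:4] form the Security feature field; the new dejagnu
   check exits 0 (success) iff this field is nonzero.  */
static bool
cmse_hw_available (unsigned id_pfr1)
{
  return (id_pfr1 & 0xf0) != 0;
}
```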

gcc/testsuite:
* lib/target-supports.exp (check_effective_target_arm_cmse_hw):
Check the CMSE feature register, rather than relying on the
SG operation causing an execution fault.
---
 gcc/testsuite/lib/target-supports.exp | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 66ce48d7dfd..06f5b1eb54d 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -4878,15 +4878,16 @@ proc check_effective_target_arm_cmse_ok {} {
 
 proc check_effective_target_arm_cmse_hw { } {
 return [check_runtime arm_cmse_hw_available {
-   int __attribute__ ((cmse_nonsecure_entry)) ns_func(void)
-   {
-   return 0;
-   }
int main (void)
{
-   return ns_func();
-   }
-} "-mcmse -Wl,--section-start,.gnu.sgstubs=0x0040"]
+   unsigned id_pfr1;
+   asm ("ldr\t%0, =0xe000ed44\n" \
+"ldr\t%0, [%0]\n" \
+"sg" : "=l" (id_pfr1));
+   /* Exit with code 0 iff security extension is available.  */
+   return !(id_pfr1 & 0xf0);
+   }
+} "-mcmse"]
 }
 # Return 1 if the target supports executing MVE instructions, 0
 # otherwise.
-- 
2.25.1



Re: [PATCH] [i386] Enable avx512 embedded broadcast for vpternlog.

2021-08-24 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 24, 2021 at 6:25 PM liuhongt  wrote:
>
> gcc/ChangeLog:
>
> PR target/101989
> * config/i386/sse.md (_vternlog):
> Enable avx512 embedded broadcast.
> (*_vternlog_all): Ditto.
> (_vternlog_mask): Ditto.
>
> gcc/testsuite/ChangeLog:
>
> PR target/101989
> * gcc.target/i386/pr101989-broadcast-1.c: New test.
Pushed to trunk.
> ---
>  gcc/config/i386/sse.md|  6 ++--
>  .../gcc.target/i386/pr101989-broadcast-1.c| 31 +++
>  2 files changed, 34 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101989-broadcast-1.c
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 40ba4bfab46..3d24ad48cdf 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -10034,7 +10034,7 @@ (define_insn "_vternlog"
> (unspec:VI48_AVX512VL
>   [(match_operand:VI48_AVX512VL 1 "register_operand" "0")
>(match_operand:VI48_AVX512VL 2 "register_operand" "v")
> -  (match_operand:VI48_AVX512VL 3 "nonimmediate_operand" "vm")
> +  (match_operand:VI48_AVX512VL 3 "bcst_vector_operand" "vmBr")
>(match_operand:SI 4 "const_0_to_255_operand")]
>   UNSPEC_VTERNLOG))]
>"TARGET_AVX512F"
> @@ -10048,7 +10048,7 @@ (define_insn "*_vternlog_all"
> (unspec:V
>   [(match_operand:V 1 "register_operand" "0")
>(match_operand:V 2 "register_operand" "v")
> -  (match_operand:V 3 "nonimmediate_operand" "vm")
> +  (match_operand:V 3 "bcst_vector_operand" "vmBr")
>(match_operand:SI 4 "const_0_to_255_operand")]
>   UNSPEC_VTERNLOG))]
>"TARGET_AVX512F"
> @@ -10281,7 +10281,7 @@ (define_insn "_vternlog_mask"
>   (unspec:VI48_AVX512VL
> [(match_operand:VI48_AVX512VL 1 "register_operand" "0")
>  (match_operand:VI48_AVX512VL 2 "register_operand" "v")
> -(match_operand:VI48_AVX512VL 3 "nonimmediate_operand" "vm")
> +(match_operand:VI48_AVX512VL 3 "bcst_vector_operand" "vmBr")
>  (match_operand:SI 4 "const_0_to_255_operand")]
> UNSPEC_VTERNLOG)
>   (match_dup 1)
> diff --git a/gcc/testsuite/gcc.target/i386/pr101989-broadcast-1.c b/gcc/testsuite/gcc.target/i386/pr101989-broadcast-1.c
> new file mode 100644
> index 000..d03d192915f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101989-broadcast-1.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx512vl" } */
> +/* { dg-final { scan-assembler-times "vpternlog" 4 } } */
> +/* { dg-final { scan-assembler-times "\\\{1to4\\\}" 4 } } */
> +#include <immintrin.h>
> +extern long long C;
> +__m256d
> +copysign2_pd(__m256d from, __m256d to) {
> +  __m256i a = _mm256_castpd_si256(from);
> +  __m256d avx_signbit = _mm256_castsi256_pd(_mm256_slli_epi64(_mm256_cmpeq_epi64(a, a), 63));
> +  /* (avx_signbit & from) | (~avx_signbit & to)  */
> +  return _mm256_or_pd(_mm256_and_pd(avx_signbit, from), _mm256_andnot_pd(avx_signbit, to));
> +}
> +
> +__m256i
> +mask_pternlog (__m256i A, __m256i B, __mmask8 U)
> +{
> +  return _mm256_mask_ternarylogic_epi64 (A, U, B, _mm256_set1_epi64x (C) ,202);
> +}
> +
> +__m256i
> +maskz_pternlog (__m256i A, __m256i B, __mmask8 U)
> +{
> +  return _mm256_maskz_ternarylogic_epi64 (U, A, B, _mm256_set1_epi64x (C) ,202);
> +}
> +
> +__m256i
> +none_pternlog (__m256i A, __m256i B)
> +{
> +  return _mm256_ternarylogic_epi64 (A, B, _mm256_set1_epi64x (C) ,202);
> +}
> --
> 2.27.0
>


-- 
BR,
Hongtao
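For readers decoding these tests: the immediate 202 (0xCA) passed to the ternary-logic intrinsics is a truth table. Each destination bit of `vpternlog` selects bit `(a<<2)|(b<<1)|c` of the 8-bit immediate, and 0xCA encodes `a ? b : c`, i.e. `(a & b) | (~a & c)` — which is why the copysign idiom above collapses to a single vpternlog. A scalar model of the instruction's bitwise semantics (illustrative only, not the compiler's implementation):

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise model of VPTERNLOG: for each bit position, form the 3-bit
   index from the corresponding bits of a, b, c and select that bit of
   the 8-bit immediate truth table.  */
static uint64_t
ternlog (uint64_t a, uint64_t b, uint64_t c, uint8_t imm)
{
  uint64_t r = 0;
  for (int i = 0; i < 64; i++)
    {
      unsigned idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1)
		     | ((c >> i) & 1);
      r |= (uint64_t) ((imm >> idx) & 1) << i;
    }
  return r;
}
```

With this model, `ternlog (a, b, c, 0xCA)` matches `(a & b) | (~a & c)` for all inputs.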


[PATCH] [i386] Enable avx512 embedded broadcast for vpternlog.

2021-08-24 Thread liuhongt via Gcc-patches
gcc/ChangeLog:

PR target/101989
* config/i386/sse.md (_vternlog):
Enable avx512 embedded broadcast.
(*_vternlog_all): Ditto.
(_vternlog_mask): Ditto.

gcc/testsuite/ChangeLog:

PR target/101989
* gcc.target/i386/pr101989-broadcast-1.c: New test.
---
 gcc/config/i386/sse.md|  6 ++--
 .../gcc.target/i386/pr101989-broadcast-1.c| 31 +++
 2 files changed, 34 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101989-broadcast-1.c

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 40ba4bfab46..3d24ad48cdf 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -10034,7 +10034,7 @@ (define_insn "_vternlog"
(unspec:VI48_AVX512VL
  [(match_operand:VI48_AVX512VL 1 "register_operand" "0")
   (match_operand:VI48_AVX512VL 2 "register_operand" "v")
-  (match_operand:VI48_AVX512VL 3 "nonimmediate_operand" "vm")
+  (match_operand:VI48_AVX512VL 3 "bcst_vector_operand" "vmBr")
   (match_operand:SI 4 "const_0_to_255_operand")]
  UNSPEC_VTERNLOG))]
   "TARGET_AVX512F"
@@ -10048,7 +10048,7 @@ (define_insn "*_vternlog_all"
(unspec:V
  [(match_operand:V 1 "register_operand" "0")
   (match_operand:V 2 "register_operand" "v")
-  (match_operand:V 3 "nonimmediate_operand" "vm")
+  (match_operand:V 3 "bcst_vector_operand" "vmBr")
   (match_operand:SI 4 "const_0_to_255_operand")]
  UNSPEC_VTERNLOG))]
   "TARGET_AVX512F"
@@ -10281,7 +10281,7 @@ (define_insn "_vternlog_mask"
  (unspec:VI48_AVX512VL
[(match_operand:VI48_AVX512VL 1 "register_operand" "0")
 (match_operand:VI48_AVX512VL 2 "register_operand" "v")
-(match_operand:VI48_AVX512VL 3 "nonimmediate_operand" "vm")
+(match_operand:VI48_AVX512VL 3 "bcst_vector_operand" "vmBr")
 (match_operand:SI 4 "const_0_to_255_operand")]
UNSPEC_VTERNLOG)
  (match_dup 1)
diff --git a/gcc/testsuite/gcc.target/i386/pr101989-broadcast-1.c b/gcc/testsuite/gcc.target/i386/pr101989-broadcast-1.c
new file mode 100644
index 000..d03d192915f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101989-broadcast-1.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512vl" } */
+/* { dg-final { scan-assembler-times "vpternlog" 4 } } */
+/* { dg-final { scan-assembler-times "\\\{1to4\\\}" 4 } } */
+#include <immintrin.h>
+extern long long C;
+__m256d
+copysign2_pd(__m256d from, __m256d to) {
+  __m256i a = _mm256_castpd_si256(from);
+  __m256d avx_signbit = _mm256_castsi256_pd(_mm256_slli_epi64(_mm256_cmpeq_epi64(a, a), 63));
+  /* (avx_signbit & from) | (~avx_signbit & to)  */
+  return _mm256_or_pd(_mm256_and_pd(avx_signbit, from), _mm256_andnot_pd(avx_signbit, to));
+}
+
+__m256i
+mask_pternlog (__m256i A, __m256i B, __mmask8 U)
+{
+  return _mm256_mask_ternarylogic_epi64 (A, U, B, _mm256_set1_epi64x (C) ,202);
+}
+
+__m256i
+maskz_pternlog (__m256i A, __m256i B, __mmask8 U)
+{
+  return _mm256_maskz_ternarylogic_epi64 (U, A, B, _mm256_set1_epi64x (C) ,202);
+}
+
+__m256i
+none_pternlog (__m256i A, __m256i B)
+{
+  return _mm256_ternarylogic_epi64 (A, B, _mm256_set1_epi64x (C) ,202);
+}
-- 
2.27.0



Host and offload targets have no common meaning of address spaces (was: [ping] Re-unify 'omp_build_component_ref' and 'oacc_build_component_ref')

2021-08-24 Thread Thomas Schwinge
Hi!

On 2021-08-19T22:13:56+0200, I wrote:
> On 2021-08-16T10:21:04+0200, Jakub Jelinek  wrote:
>> On Mon, Aug 16, 2021 at 10:08:42AM +0200, Thomas Schwinge wrote:
> |> Concerning the current 'gcc/omp-low.c:omp_build_component_ref', for the
> |> current set of offloading testcases, we never see a
> |> '!ADDR_SPACE_GENERIC_P' there, so the address space handling doesn't seem
> |> to be necessary there (but also won't do any harm: no-op).
>>
>> Are you sure this can't trigger?
>> Say
>> extern int __seg_fs a;
>>
>> void
>> foo (void)
>> {
>>   #pragma omp parallel private (a)
>>   a = 2;
>> }
>
> That test case doesn't run into 'omp_build_component_ref' at all,
> but [I've pushed an altered and extended variant that does],
> "Add 'libgomp.c/address-space-1.c'".
>
> In this case, 'omp_build_component_ref' called via host compilation
> 'pass_lower_omp', it's the 'field_type' that has 'address-space-1', not
> 'obj_type', so indeed Kwok's new code is a no-op:
>
> (gdb) call debug_tree(field_type)
>   type <...>
>
>> I think keeping the qual addr space here is the wrong thing to do,
>> it should keep the other quals and clear the address space instead,
>> the whole struct is going to be in generic address space, isn't it?
>
> Correct for 'omp_build_component_ref' called via host compilation
> 'pass_lower_omp'

> However, regarding the former comment -- shouldn't we force generic
> address space for all 'tree' types read in via LTO streaming for
> offloading compilation?  I assume that (in the general case) address
> spaces are never compatible between host and offloading compilation?
> For [...] "Add 'libgomp.c/address-space-1.c'", propagating the
> '__seg_fs' address space across the offloading boundary (assuming I did
> interpret the dumps correctly) doesn't seem to cause any problems

As I found later, actually the 'address-space-1' per host '__seg_fs' does
cause the "Intel MIC (emulated) offloading execution failure"
mentioned/XFAILed for 'libgomp.c/address-space-1.c': SIGSEGV, like
(expected) for host execution.  For GCN offloading target, it maps to
GCN 'ADDR_SPACE_FLAT' which apparently doesn't cause any ill effects (for
that simple test case).  The nvptx offloading target doesn't consider
address spaces at all.

Is the attached "Host and offload targets have no common meaning of
address spaces" OK to push?


Then, is that the way to do this, or should we add in
'gcc/tree-streamer-out.c:pack_ts_base_value_fields':

if (lto_stream_offload_p)
  gcc_assert (ADDR_SPACE_GENERIC_P (TYPE_ADDR_SPACE (expr)));

..., and elsewhere sanitize this for offloading compilation?  Jakub's
suggestion above, regarding 'gcc/omp-low.c:omp_build_component_ref':

| I think keeping the qual addr space here is the wrong thing to do,
| it should keep the other quals and clear the address space instead

But it's not obvious to me that indeed this is the one place where this
would need to be done?  (It ought to work for
'libgomp.c/address-space-1.c', and any other occurrences would run into
the 'assert', so that ought to be "fine", though?)


And, should we have a new hook
'void targetm.addr_space.validate (addr_space_t as)' (better name?),
called via 'gcc/emit-rtl.c:set_mem_attrs' (only? -- assuming this is the
appropriate canonic function where address space use is observed?), to
make sure that the requested 'as' is valid for the target?
'default_addr_space_validate' would refuse everything but
'ADDR_SPACE_GENERIC_P (as)'; this hook would need implementing for all
handful of targets making use of address spaces (supposedly matching the
logic how they call 'c_register_addr_space'?).  (The closest existing
hook seems to be 'targetm.addr_space.diagnose_usage', only defined for
AVR, and called from "the front ends" (C only).)
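To make the proposal concrete, the default for the suggested hook could look roughly like this. Everything here is hypothetical sketch code for the proposal, not an existing GCC hook:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned char addr_space_t;
#define ADDR_SPACE_GENERIC ((addr_space_t) 0)

/* Hypothetical default for a targetm.addr_space.validate hook: accept
   only the generic address space; a target that registers named
   address spaces (via c_register_addr_space) would override this with
   its own check.  */
static bool
default_addr_space_validate (addr_space_t as)
{
  return as == ADDR_SPACE_GENERIC;
}
```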


Grüße
 Thomas


From e01e06bd17bf2c7cb182d30bed02babc5edfa183 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge 
Date: Tue, 24 Aug 2021 11:14:10 +0200
Subject: [PATCH] Host and offload targets have no common meaning of address
 spaces

	gcc/
	* tree-streamer-out.c (pack_ts_base_value_fields): Don't pack
	'TYPE_ADDR_SPACE' for offloading.
	* tree-streamer-in.c (unpack_ts_base_value_fields): Don't unpack
	'TYPE_ADDR_SPACE' for offloading.
	libgomp/
	* testsuite/libgomp.c/address-space-1.c: Remove 'dg-xfail-run-if'
	for 'offload_device_intel_mic'.
---
 gcc/tree-streamer-in.c| 2 ++
 gcc/tree-streamer-out.c   | 4 +++-
 libgomp/testsuite/libgomp.c/address-space-1.c | 4 
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/gcc/tree-streamer-in.c b/gcc/tree-streamer-in.c
index e0522bf2ac1..acdc48ef09f 100644
--- a/gcc/tree-streamer-in.c
+++ b/gcc/tree-streamer-in.c
@@ -146,7 +146,9 @@ 

Re: [PATCH][v2] Remove --param vect-inner-loop-cost-factor

2021-08-24 Thread Jan Hubicka
> > 
> > I noticed loop-doloop.c uses the _int version and likely_max, maybe you want
> > that here?
> >  
> >   est_niter = get_estimated_loop_iterations_int (loop);
> >   if (est_niter == -1)
> > est_niter = get_likely_max_loop_iterations_int (loop)
> 
> I think those are two different things - get_estimated_loop_iterations_int
> is the average number of iterations while
> get_likely_max_loop_iterations_int is an upper bound.  I'm not sure we
> want to use an upper bound for costing.
> 
> Based on feedback from Honza I'm currently testing the variant below
> which keeps the --param and uses it to cap the estimated number of
> iterations.  That makes the scaling more precise for inner loops that
> don't iterate much but keeps the --param to avoid overflow and to
> keep the present behavior when there's no reliable profile info
> available.

indeed, get_likely_max_loop_iterations_int may be very large.  In some
cases, however, it will give a useful value - for example when a loop
traverses a small array.

So what one can use it for is to cap the --param value.
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index c521b43a47c..cbdd5b407da 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -1519,6 +1519,13 @@ vect_analyze_loop_form (class loop *loop, vec_info_shared *shared)
>stmt_vec_info inner_loop_cond_info
>   = loop_vinfo->lookup_stmt (inner_loop_cond);
>STMT_VINFO_TYPE (inner_loop_cond_info) = loop_exit_ctrl_vec_info_type;
> +  /* If we have an estimate on the number of iterations of the inner
> +  loop use that to limit the scale for costing, otherwise use
> +  --param vect-inner-loop-cost-factor literally.  */
> +  widest_int nit;
> +  if (get_estimated_loop_iterations (loop->inner, &nit))
> + LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo)
> +   = wi::smin (nit, param_vect_inner_loop_cost_factor).to_uhwi ();

  if (get_estimated_loop_iterations (loop->inner, &nit))
    LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo)
      = wi::smin (nit, REG_BR_PROB_BASE /* or other random big cap */).to_uhwi ();
  else if (get_likely_max_loop_iterations (loop->inner, &nit))
    LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo)
      = wi::smin (nit, param_vect_inner_loop_cost_factor).to_uhwi ();
  else
    LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo)
      = param_vect_inner_loop_cost_factor;

I.e. if we really know the number of iterations, we probably want to
weight by it, but we want to cap it to avoid overflows.  I assume if we know
that the trip count is 1 or more we basically do not care about damage
done to the outer loop as long as the inner loop improves?

If we know the max number of iterations and it is smaller than the param,
we want to use it as a cap.

The situation where get_estimated_loop_iterations returns a wrong value
should be rare - basically when the loop was duplicated by the inliner
(or another transform) and it behaves a lot differently than the average
execution of the loop in the train run.  In this case we could also
argue that the loop is not statistically important :)
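The fallback chain sketched above — estimated count capped by a large constant, else likely-max capped by the --param, else the --param alone — can be modeled as follows (hypothetical helper; -1 stands for "no information available"):

```c
#include <assert.h>

/* Model of the proposed inner-loop cost factor selection: prefer the
   estimated trip count (capped by a big constant to avoid overflow),
   then the likely maximum capped by the --param, then the --param
   itself.  */
static long
inner_loop_cost_factor (long est_niter, long likely_max,
			long param_cap, long big_cap)
{
  if (est_niter >= 0)
    return est_niter < big_cap ? est_niter : big_cap;
  if (likely_max >= 0)
    return likely_max < param_cap ? likely_max : param_cap;
  return param_cap;
}
```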

Honza
>  }
>  
>gcc_assert (!loop->aux);
> -- 
> 2.31.1
> 


Re: [PATCH] [i386] Optimize (a & b) | (c & ~b) to vpternlog instruction.

2021-08-24 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 24, 2021 at 9:36 AM liuhongt  wrote:
>
> Also optimize below 3 forms to vpternlog, op1, op2, op3 are
> register_operand or unary_p as (not reg)
>
> A: (any_logic (any_logic op1 op2) op3)
> B: (any_logic (any_logic op1 op2) (any_logic op3 op4)) op3/op4 should
> be equal to op1/op2
> C: (any_logic (any_logic (any_logic:op1 op2) op3) op4) op3/op4 should
> be equal to op1/op2
>
>   Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
>
> gcc/ChangeLog:
>
> PR target/101989
> * config/i386/i386-protos.h
> (ix86_strip_reg_or_notreg_operand): New declare.
> * config/i386/i386.c (ix86_rtx_costs): Define cost for
> UNSPEC_VTERNLOG.
> (ix86_strip_reg_or_notreg_operand): New function.
Pushed to trunk after changing ix86_strip_reg_or_notreg_operand to a macro;
a function call seems too inefficient for such a simple unary strip.
> * config/i386/predicates.md (reg_or_notreg_operand): New
> predicate.
> * config/i386/sse.md (*_vternlog_all): New define_insn.
> (*_vternlog_1): New pre_reload
> define_insn_and_split.
> (*_vternlog_2): Ditto.
> (*_vternlog_3): Ditto.
> (any_logic1,any_logic2): New code iterator.
> (logic_op): New code attribute.
> (ternlogsuffix): Extend to VNxDF and VNxSF.
>
> gcc/testsuite/ChangeLog:
>
> PR target/101989
> * gcc.target/i386/pr101989-1.c: New test.
> * gcc.target/i386/pr101989-2.c: New test.
> * gcc.target/i386/avx512bw-shiftqihi-constant-1.c: Adjust testcase.
> ---
>  gcc/config/i386/i386-protos.h |   1 +
>  gcc/config/i386/i386.c|  13 +
>  gcc/config/i386/predicates.md |   7 +
>  gcc/config/i386/sse.md| 234 ++
>  .../i386/avx512bw-shiftqihi-constant-1.c  |   4 +-
>  gcc/testsuite/gcc.target/i386/pr101989-1.c|  51 
>  gcc/testsuite/gcc.target/i386/pr101989-2.c| 102 
>  7 files changed, 410 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101989-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101989-2.c
>
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index 2fd13074c81..2bdaadcf4f3 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -60,6 +60,7 @@ extern rtx standard_80387_constant_rtx (int);
>  extern int standard_sse_constant_p (rtx, machine_mode);
>  extern const char *standard_sse_constant_opcode (rtx_insn *, rtx *);
>  extern bool ix86_standard_x87sse_constant_load_p (const rtx_insn *, rtx);
> +extern rtx ix86_strip_reg_or_notreg_operand (rtx);
>  extern bool ix86_pre_reload_split (void);
>  extern bool symbolic_reference_mentioned_p (rtx);
>  extern bool extended_reg_mentioned_p (rtx);
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 46844fab08f..a69225ccc81 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -5236,6 +5236,14 @@ ix86_standard_x87sse_constant_load_p (const rtx_insn 
> *insn, rtx dst)
>return true;
>  }
>
> +/* If OP is wrapped in a unary operator (e.g. NOT), return the inner
> +   operand, otherwise return OP itself.  */
> +rtx
> +ix86_strip_reg_or_notreg_operand (rtx op)
> +{
> +  return UNARY_P (op) ? XEXP (op, 0) : op;
> +}
> +
>  /* Predicate for pre-reload splitters with associated instructions,
> which can match any time before the split1 pass (usually combine),
> then are unconditionally split in that pass and should not be
> @@ -20544,6 +20552,11 @@ ix86_rtx_costs (rtx x, machine_mode mode, int 
> outer_code_i, int opno,
>  case UNSPEC:
>if (XINT (x, 1) == UNSPEC_TP)
> *total = 0;
> +  else if (XINT(x, 1) == UNSPEC_VTERNLOG)
> +   {
> + *total = cost->sse_op;
> + return true;
> +   }
>return false;
>
>  case VEC_SELECT:
> diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> index 9321f332ef9..df5acb425d4 100644
> --- a/gcc/config/i386/predicates.md
> +++ b/gcc/config/i386/predicates.md
> @@ -1044,6 +1044,13 @@ (define_predicate "reg_or_pm1_operand"
> (ior (match_test "op == const1_rtx")
>  (match_test "op == constm1_rtx"))))
>
> +;; True for registers, or (not: registers).  Used to optimize 3-operand
> +;; bitwise operations.
> +(define_predicate "reg_or_notreg_operand"
> +  (ior (match_operand 0 "register_operand")
> +   (and (match_code "not")
> +   (match_test "register_operand (XEXP (op, 0), mode)"))))
> +
>  ;; True if OP is acceptable as operand of DImode shift expander.
>  (define_predicate "shiftdi_operand"
>(if_then_else (match_test "TARGET_64BIT")
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 13889687793..0acd749d21c 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -933,7 +933,9 @@ (define_mode_attr iptr
>  ;; Mapping of vector modes to 

Re: [PATCH] Make sure we're playing with integral modes before call extract_integral_bit_field.

2021-08-24 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 24, 2021 at 5:40 PM Hongtao Liu  wrote:
>
> On Tue, Aug 17, 2021 at 9:52 AM Hongtao Liu  wrote:
> >
> > On Mon, Aug 9, 2021 at 4:34 PM Hongtao Liu  wrote:
> > >
> > > On Fri, Aug 6, 2021 at 7:27 PM Richard Biener via Gcc-patches
> > >  wrote:
> > > >
> > > > On Fri, Aug 6, 2021 at 11:05 AM Richard Sandiford
> > > >  wrote:
> > > > >
> > > > > Richard Biener via Gcc-patches  writes:
> > > > > > On Fri, Aug 6, 2021 at 5:32 AM liuhongt  
> > > > > > wrote:
> > > > > >>
> > > > > >> Hi:
> > > > > >> ---
> > > > > >> OK, I think sth is amiss here upthread.  insv/extv do look like 
> > > > > >> they
> > > > > >> are designed
> > > > > >> to work on integer modes (but docs do not say anything about this 
> > > > > >> here).
> > > > > >> In fact the caller of extract_bit_field_using_extv is named
> > > > > >> extract_integral_bit_field.  Of course nothing seems to check what 
> > > > > >> kind of
> > > > > >> modes we're dealing with, but we're for example happily doing
> > > > > >> expand_shift in 'mode'.  In the extract_integral_bit_field call 
> > > > > >> 'mode' is
> > > > > >> some integer mode and op0 is HFmode?  From the above I get it's
> > > > > >> the other way around?  In that case we should wrap the
> > > > > >> call to extract_integral_bit_field, extracting in an integer mode 
> > > > > >> with the
> > > > > >> same size as 'mode' and then converting the result as (subreg:HF 
> > > > > >> (reg:HI ...)).
> > > > > >> ---
> > > > > >>   This is a separate patch as a follow up of upper comments.
> > > > > >>
> > > > > >> gcc/ChangeLog:
> > > > > >>
> > > > > >> * expmed.c (extract_bit_field_1): Wrap the call to
> > > > > >> extract_integral_bit_field, extracting in an integer mode 
> > > > > >> with
> > > > > >> the same size as 'tmode' and then converting the result
> > > > > >> as (subreg:tmode (reg:imode)).
> > > > > >>
> > > > > >> gcc/testsuite/ChangeLog:
> > > > > >> * gcc.target/i386/float16-5.c: New test.
> > > > > >> ---
> > > > > >>  gcc/expmed.c  | 19 +++
> > > > > >>  gcc/testsuite/gcc.target/i386/float16-5.c | 12 
> > > > > >>  2 files changed, 31 insertions(+)
> > > > > >>  create mode 100644 gcc/testsuite/gcc.target/i386/float16-5.c
> > > > > >>
> > > > > >> diff --git a/gcc/expmed.c b/gcc/expmed.c
> > > > > >> index 3143f38e057..72790693ef0 100644
> > > > > >> --- a/gcc/expmed.c
> > > > > >> +++ b/gcc/expmed.c
> > > > > >> @@ -1850,6 +1850,25 @@ extract_bit_field_1 (rtx str_rtx, 
> > > > > >> poly_uint64 bitsize, poly_uint64 bitnum,
> > > > > >>op0_mode = opt_scalar_int_mode ();
> > > > > >>  }
> > > > > >>
> > > > > >> +  /* Make sure we are playing with integral modes.  Pun with 
> > > > > >> subregs
> > > > > >> + if we aren't. When tmode is HFmode, op0 is SImode, there 
> > > > > >> will be ICE
> > > > > >> + in extract_integral_bit_field.  */
> > > > > >> +  if (int_mode_for_mode (tmode).exists ()
> > > > > >
> > > > > > check !INTEGRAL_MODE_P (tmode) before, that should be slightly
> > > > > > cheaper.  Then imode should always be != tmode.  Maybe
> > > > > > even GET_MODE_CLASS (tmode) != MODE_INT since I'm not sure
> > > > > > how it behaves for composite modes.
> > > > > >
> > > > > > Of course the least surprises would happen when we restrict this
> > > > > > to FLOAT_MODE_P (tmode).
> > > > > >
> > > > > > Richard - any preferences?
> > > > >
> > > > > If the bug is that extract_integral_bit_field is being called with
> > > > > a non-integral mode parameter, then it looks odd that we can still
> > > > > fall through to it without an integral mode (when exists is false).
> > > > >
> > > > > If calling extract_integral_bit_field without an integral mode is
> > > > > a bug then I think we should have:
> > > > >
> > > > >   int_mode_for_mode (mode).require ()
> > > > >
> > > > > whenever mode is not already SCALAR_INT_MODE_P/is_a.
> > > > > Ideally we'd make the mode parameter scalar_int_mode too.
> > > > >
> > > > > extract_integral_bit_field currently has:
> > > > >
> > > > >   /* Find a correspondingly-sized integer field, so we can apply
> > > > >  shifts and masks to it.  */
> > > > >   scalar_int_mode int_mode;
> > > > >   if (!int_mode_for_mode (tmode).exists (&int_mode))
> > > > > /* If this fails, we should probably push op0 out to memory and 
> > > > > then
> > > > >do a load.  */
> > > > > int_mode = int_mode_for_mode (mode).require ();
> > > > >
> > > > > which would seem to be redundant after this change.
> > > >
> > > > I'm not sure what exactly the bug is, but extract_integral_bit_field 
> > > > ends
> > > > up creating a lowpart subreg that's not allowed and that ICEs (and I
> > > > can't see a way to check beforehand).  So it seems to me at least
> > > > part of that function doesn't expect non-integral extraction modes.
> > > >
> > > > But who knows - the code is older than I am (OK, not, but older than
> > > > my 

Re: [PATCH] Make sure we're playing with integral modes before call extract_integral_bit_field.

2021-08-24 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 17, 2021 at 9:52 AM Hongtao Liu  wrote:
>
> On Mon, Aug 9, 2021 at 4:34 PM Hongtao Liu  wrote:
> >
> > On Fri, Aug 6, 2021 at 7:27 PM Richard Biener via Gcc-patches
> >  wrote:
> > >
> > > On Fri, Aug 6, 2021 at 11:05 AM Richard Sandiford
> > >  wrote:
> > > >
> > > > Richard Biener via Gcc-patches  writes:
> > > > > On Fri, Aug 6, 2021 at 5:32 AM liuhongt  wrote:
> > > > >>
> > > > >> Hi:
> > > > >> ---
> > > > >> OK, I think sth is amiss here upthread.  insv/extv do look like they
> > > > >> are designed
> > > > >> to work on integer modes (but docs do not say anything about this 
> > > > >> here).
> > > > >> In fact the caller of extract_bit_field_using_extv is named
> > > > >> extract_integral_bit_field.  Of course nothing seems to check what 
> > > > >> kind of
> > > > >> modes we're dealing with, but we're for example happily doing
> > > > >> expand_shift in 'mode'.  In the extract_integral_bit_field call 
> > > > >> 'mode' is
> > > > >> some integer mode and op0 is HFmode?  From the above I get it's
> > > > >> the other way around?  In that case we should wrap the
> > > > >> call to extract_integral_bit_field, extracting in an integer mode 
> > > > >> with the
> > > > >> same size as 'mode' and then converting the result as (subreg:HF 
> > > > >> (reg:HI ...)).
> > > > >> ---
> > > > >>   This is a separate patch as a follow up of upper comments.
> > > > >>
> > > > >> gcc/ChangeLog:
> > > > >>
> > > > >> * expmed.c (extract_bit_field_1): Wrap the call to
> > > > >> extract_integral_bit_field, extracting in an integer mode 
> > > > >> with
> > > > >> the same size as 'tmode' and then converting the result
> > > > >> as (subreg:tmode (reg:imode)).
> > > > >>
> > > > >> gcc/testsuite/ChangeLog:
> > > > >> * gcc.target/i386/float16-5.c: New test.
> > > > >> ---
> > > > >>  gcc/expmed.c  | 19 +++
> > > > >>  gcc/testsuite/gcc.target/i386/float16-5.c | 12 
> > > > >>  2 files changed, 31 insertions(+)
> > > > >>  create mode 100644 gcc/testsuite/gcc.target/i386/float16-5.c
> > > > >>
> > > > >> diff --git a/gcc/expmed.c b/gcc/expmed.c
> > > > >> index 3143f38e057..72790693ef0 100644
> > > > >> --- a/gcc/expmed.c
> > > > >> +++ b/gcc/expmed.c
> > > > >> @@ -1850,6 +1850,25 @@ extract_bit_field_1 (rtx str_rtx, poly_uint64 
> > > > >> bitsize, poly_uint64 bitnum,
> > > > >>op0_mode = opt_scalar_int_mode ();
> > > > >>  }
> > > > >>
> > > > >> +  /* Make sure we are playing with integral modes.  Pun with subregs
> > > > >> + if we aren't. When tmode is HFmode, op0 is SImode, there will 
> > > > >> be ICE
> > > > >> + in extract_integral_bit_field.  */
> > > > >> +  if (int_mode_for_mode (tmode).exists ()
> > > > >
> > > > > check !INTEGRAL_MODE_P (tmode) before, that should be slightly
> > > > > cheaper.  Then imode should always be != tmode.  Maybe
> > > > > even GET_MODE_CLASS (tmode) != MODE_INT since I'm not sure
> > > > > how it behaves for composite modes.
> > > > >
> > > > > Of course the least surprises would happen when we restrict this
> > > > > to FLOAT_MODE_P (tmode).
> > > > >
> > > > > Richard - any preferences?
> > > >
> > > > If the bug is that extract_integral_bit_field is being called with
> > > > a non-integral mode parameter, then it looks odd that we can still
> > > > fall through to it without an integral mode (when exists is false).
> > > >
> > > > If calling extract_integral_bit_field without an integral mode is
> > > > a bug then I think we should have:
> > > >
> > > >   int_mode_for_mode (mode).require ()
> > > >
> > > > whenever mode is not already SCALAR_INT_MODE_P/is_a.
> > > > Ideally we'd make the mode parameter scalar_int_mode too.
> > > >
> > > > extract_integral_bit_field currently has:
> > > >
> > > >   /* Find a correspondingly-sized integer field, so we can apply
> > > >  shifts and masks to it.  */
> > > >   scalar_int_mode int_mode;
> > > >   if (!int_mode_for_mode (tmode).exists (&int_mode))
> > > > /* If this fails, we should probably push op0 out to memory and then
> > > >do a load.  */
> > > > int_mode = int_mode_for_mode (mode).require ();
> > > >
> > > > which would seem to be redundant after this change.
> > >
> > > I'm not sure what exactly the bug is, but extract_integral_bit_field ends
> > > up creating a lowpart subreg that's not allowed and that ICEs (and I
> > > can't see a way to check beforehand).  So it seems to me at least
> > > part of that function doesn't expect non-integral extraction modes.
> > >
> > > But who knows - the code is older than I am (OK, not, but older than
> > > my involvement in GCC ;))
> > >
> > How about attached patch w/ below changelog
> >
> > gcc/ChangeLog:
> >
> > * expmed.c (extract_bit_field_1): Make sure we're playing with
> > integral modes before call extract_integral_bit_field.
> > (extract_integral_bit_field): Add a parameter of type
> > 

Re: [PATCH 4/6] Support -fexcess-precision=16 which will enable FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16 when backend supports _Float16.

2021-08-24 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 17, 2021 at 9:53 AM Hongtao Liu  wrote:
>
> On Fri, Aug 6, 2021 at 2:06 PM Hongtao Liu  wrote:
> >
> > On Tue, Aug 3, 2021 at 10:44 AM Hongtao Liu  wrote:
> > >
> > > On Tue, Aug 3, 2021 at 3:34 AM Joseph Myers  
> > > wrote:
> > > >
> > > > On Mon, 2 Aug 2021, liuhongt via Gcc-patches wrote:
> > > >
> > > > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> > > > > index 7979e240426..dc673c89bc8 100644
> > > > > --- a/gcc/config/i386/i386.c
> > > > > +++ b/gcc/config/i386/i386.c
> > > > > @@ -23352,6 +23352,8 @@ ix86_get_excess_precision (enum 
> > > > > excess_precision_type type)
> > > > >   return (type == EXCESS_PRECISION_TYPE_STANDARD
> > > > >   ? FLT_EVAL_METHOD_PROMOTE_TO_FLOAT
> > > > >   : FLT_EVAL_METHOD_UNPREDICTABLE);
> > > > > +  case EXCESS_PRECISION_TYPE_FLOAT16:
> > > > > + return FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16;
> > > > >default:
> > > > >   gcc_unreachable ();
> > > > >  }
> > > >
> > > > I'd expect an error for -fexcess-precision=16 with -mfpmath=387 (since 
> > > > x87
> > > > doesn't do float or double arithmetic, but -fexcess-precision=16 implies
> > > > that all of _Float16, float and double are represented to the range and
> > > > precision of their type without any excess precision).
> > > >
> > > Yes, additional changes like this.
> > >
> > > modified   gcc/config/i386/i386.c
> > > @@ -23443,6 +23443,9 @@ ix86_get_excess_precision (enum
> > > excess_precision_type type)
> > >   ? FLT_EVAL_METHOD_PROMOTE_TO_FLOAT
> > >   : FLT_EVAL_METHOD_UNPREDICTABLE);
> > >case EXCESS_PRECISION_TYPE_FLOAT16:
> > > + if (TARGET_80387
> > > + && !(TARGET_SSE_MATH && TARGET_SSE))
> > > +   error ("%<-fexcess-precision=16%> is not compatible with 
> > > %<-mfpmath=387%>");
> > >   return FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16;
> > >default:
> > >   gcc_unreachable ();
> > > new file   gcc/testsuite/gcc.target/i386/float16-7.c
> > > @@ -0,0 +1,9 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mfpmath=387 -fexcess-precision=16" } */
> > > +/* { dg-excess-errors "'-fexcess-precision=16' is not compatible with
> > > '-mfpmath=387'" } */
> > > +_Float16
> > > +foo (_Float16 a, _Float16 b)
> > > +{
> > > +  return a + b;/* { dg-error "'-fexcess-precision=16' is not
> > > compatible with '-mfpmath=387'" } */
> > > +}
> > > +
> > >
> > > > --
> > > > Joseph S. Myers
> > > > jos...@codesourcery.com
> > >
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
> >
> >
> > Updated patch and ping for it.
> >
> > Also for backend changes.
> > 1. For backends like m68k/s390 which don't support _Float16 at all, the
> > backend will issue an error for -fexcess-precision=16; I think that
> > should be fine.
> > 2. For backends like arm/aarch64 which do support _Float16, the backend
> > will set FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16 for -fexcess-precision=16
> > even when hardware fp16 instructions are not supported. Would that be
> > OK for arm?
>
> Ping for this patch.
>
> > --
> > BR,
> > Hongtao
>
>
>
> --
> BR,
> Hongtao

Rebased and ping^3.  There are plenty of AVX512FP16 patches blocked by
this one; I'd like someone to help review this patch.
-- 
BR,
Hongtao
From 5deedc50dde5846dff4d0bf0719a7a5facc3723e Mon Sep 17 00:00:00 2001
From: liuhongt 
Date: Mon, 2 Aug 2021 10:56:45 +0800
Subject: [PATCH] Support -fexcess-precision=16 which will enable
 FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16 when backend supports _Float16.

gcc/ada/ChangeLog:

	* gcc-interface/misc.c (gnat_post_options): Issue an error for
	-fexcess-precision=16.

gcc/c-family/ChangeLog:

	* c-common.c (excess_precision_mode_join): Update below comments.
	(c_ts18661_flt_eval_method): Set excess_precision_type to
	EXCESS_PRECISION_TYPE_FLOAT16 when -fexcess-precision=16.
	* c-cppbuiltin.c (cpp_atomic_builtins): Update below comments.
	(c_cpp_flt_eval_method_iec_559): Set excess_precision_type to
	EXCESS_PRECISION_TYPE_FLOAT16 when -fexcess-precision=16.

gcc/ChangeLog:

	* common.opt: Support -fexcess-precision=16.
	* config/aarch64/aarch64.c (aarch64_excess_precision): Return
	FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16 when
	EXCESS_PRECISION_TYPE_FLOAT16.
	* config/arm/arm.c (arm_excess_precision): Ditto.
	* config/i386/i386.c (ix86_get_excess_precision): Ditto.
	* config/m68k/m68k.c (m68k_excess_precision): Issue an error
	when EXCESS_PRECISION_TYPE_FLOAT16.
	* config/s390/s390.c (s390_excess_precision): Ditto.
	* coretypes.h (enum excess_precision_type): Add
	EXCESS_PRECISION_TYPE_FLOAT16.
	* doc/tm.texi (TARGET_C_EXCESS_PRECISION): Update documents.
	* doc/tm.texi.in (TARGET_C_EXCESS_PRECISION): Ditto.
	* doc/extend.texi (Half-Precision): Document
	-fexcess-precision=16.
	* flag-types.h (enum excess_precision): Add
	EXCESS_PRECISION_FLOAT16.
	* target.def (excess_precision): Update document.
	* tree.c (excess_precision_type): Set excess_precision_type to
	EXCESS_PRECISION_FLOAT16 when -fexcess-precision=16.

gcc/fortran/ChangeLog:

	* options.c (gfc_post_options): Issue an error 

[PATCH] Change illegitimate constant into memref of constant pool in change_zero_ext.

2021-08-24 Thread liuhongt via Gcc-patches
Hi:
  This patch extends change_zero_ext to change illegitimate constants
into constant-pool references, which enables the simplification below:

Trying 5 -> 7:
5: r85:V4SF=[`*.LC0']
  REG_EQUAL const_vector
7: r84:V4SF=vec_select(vec_concat(r85:V4SF,r85:V4SF),parallel)
  REG_DEAD r85:V4SF
  REG_EQUAL const_vector
Failed to match this instruction:
(set (reg:V4SF 84)
(const_vector:V4SF [
(const_double:SF 3.0e+0 [0x0.cp+2])
(const_double:SF 2.0e+0 [0x0.8p+2])
(const_double:SF 4.0e+0 [0x0.8p+3])
(const_double:SF 1.0e+0 [0x0.8p+1])
]))

(insn 5 2 7 2 (set (reg:V4SF 85)
(mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0  S16
A128])) 
1600 {movv4sf_internal}
 (expr_list:REG_EQUAL (const_vector:V4SF [
(const_double:SF 4.0e+0 [0x0.8p+3])
(const_double:SF 3.0e+0 [0x0.cp+2])
(const_double:SF 2.0e+0 [0x0.8p+2])
(const_double:SF 1.0e+0 [0x0.8p+1])
])
(nil)))
(insn 7 5 11 2 (set (reg:V4SF 84)
(vec_select:V4SF (vec_concat:V8SF (reg:V4SF 85)
(reg:V4SF 85))
(parallel [
(const_int 1 [0x1])
(const_int 2 [0x2])
(const_int 4 [0x4])
(const_int 7 [0x7])
])))
3015 {sse_shufps_v4sf}
 (expr_list:REG_DEAD (reg:V4SF 85)
(expr_list:REG_EQUAL (const_vector:V4SF [
(const_double:SF 3.0e+0 [0x0.cp+2])
(const_double:SF 2.0e+0 [0x0.8p+2])
(const_double:SF 4.0e+0 [0x0.8p+3])
(const_double:SF 1.0e+0 [0x0.8p+1])
])
(nil

  Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
  Ok for trunk?

gcc/ChangeLog:

PR rtl-optimization/43147
* combine.c (recog_for_combine_1): Adjust comments of ..
(change_zero_ext):.. this, and extend to change illegitimate
constant into constant pool.

gcc/testsuite/ChangeLog:

PR rtl-optimization/43147
* gcc.target/i386/pr43147.c: New test.
* gcc.target/i386/pr22076.c: Adjust testcase.
---
 gcc/combine.c   | 20 +++-
 gcc/testsuite/gcc.target/i386/pr22076.c |  4 ++--
 gcc/testsuite/gcc.target/i386/pr43147.c | 15 +++
 3 files changed, 36 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr43147.c

diff --git a/gcc/combine.c b/gcc/combine.c
index cb5fa401fcb..0b2afdf45af 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -11404,7 +11404,8 @@ recog_for_combine_1 (rtx *pnewpat, rtx_insn *insn, rtx 
*pnotes)
 
 /* Change every ZERO_EXTRACT and ZERO_EXTEND of a SUBREG that can be
expressed as an AND and maybe an LSHIFTRT, to that formulation.
-   Return whether anything was so changed.  */
+   Return whether anything was so changed.
> +   Also change illegitimate constants into memrefs of the constant pool.  */
 
 static bool
 change_zero_ext (rtx pat)
@@ -11417,6 +11418,23 @@ change_zero_ext (rtx pat)
 {
   rtx x = **iter;
   scalar_int_mode mode, inner_mode;
+  machine_mode const_mode = GET_MODE (x);
+
+  /* Change illegitimate constant into memref of constant pool.  */
+  if (CONSTANT_P (x)
+ && !const_vec_duplicate_p (x)
+ && const_mode != BLKmode
+ && GET_CODE (x) != HIGH
+ && GET_MODE_SIZE (const_mode).is_constant ()
+ && !targetm.legitimate_constant_p (const_mode, x)
+ && !targetm.cannot_force_const_mem (const_mode, x))
+   {
+ x = force_const_mem (GET_MODE (x), x);
+ SUBST (**iter, x);
+ changed = true;
+ continue;
+   }
+
>    if (!is_a <scalar_int_mode> (GET_MODE (x), &mode))
continue;
   int size;
diff --git a/gcc/testsuite/gcc.target/i386/pr22076.c 
b/gcc/testsuite/gcc.target/i386/pr22076.c
index 427ffcd4920..866c387280f 100644
--- a/gcc/testsuite/gcc.target/i386/pr22076.c
+++ b/gcc/testsuite/gcc.target/i386/pr22076.c
@@ -15,5 +15,5 @@ void test ()
   x = _mm_add_pi8 (mm0, mm1);
 }
 
-/* { dg-final { scan-assembler-times "movq" 2 } } */
-/* { dg-final { scan-assembler-not "movl" { target nonpic } } } */
+/* { dg-final { scan-assembler-times "movq" 2 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "movl" 4  { target ia32 } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr43147.c 
b/gcc/testsuite/gcc.target/i386/pr43147.c
new file mode 100644
index 000..3c30f917c06
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr43147.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2" } */
+/* { dg-final { scan-assembler "movaps" } } */
+/* { dg-final { scan-assembler-not "shufps" } } */
+
+#include 
+
+__m128
+foo (void)
+{
+  __m128 m = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
+  m = _mm_shuffle_ps(m, m, 0xC9);
+  m = _mm_shuffle_ps(m, m, 0x2D);
+  return m;
+}
-- 
2.27.0



[PATCH, rs6000] Disable gimple fold for float or double vec_minmax when fast-math is not set

2021-08-24 Thread HAO CHEN GUI via Gcc-patches

Hi

   The patch disables the gimple fold of the float and double vec_min/max 
builtins when fast-math is not set. Two test cases are added to verify 
the patch.


   The attachments are the patch diff and change log file.

   Bootstrapped and tested on powerpc64le-linux with no regressions. Is 
this okay for trunk? Any recommendations? Thanks a lot.


* config/rs6000/rs6000-call.c (rs6000_gimple_fold_builtin)
: Modify the expansions.
* gcc.target/powerpc/vec-minmax-1.c: New test.
* gcc.target/powerpc/vec-minmax-2.c: Likewise.
diff --git a/gcc/config/rs6000/rs6000-call.c b/gcc/config/rs6000/rs6000-call.c
index b4e13af4dc6..90527734ceb 100644
--- a/gcc/config/rs6000/rs6000-call.c
+++ b/gcc/config/rs6000/rs6000-call.c
@@ -12159,6 +12159,11 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
   return true;
 /* flavors of vec_min.  */
 case VSX_BUILTIN_XVMINDP:
+case ALTIVEC_BUILTIN_VMINFP:
+  if (!flag_finite_math_only || flag_signed_zeros)
+   return false;
+  /* Fall through to MIN_EXPR.  */
+  gcc_fallthrough ();
 case P8V_BUILTIN_VMINSD:
 case P8V_BUILTIN_VMINUD:
 case ALTIVEC_BUILTIN_VMINSB:
@@ -12167,7 +12172,6 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
 case ALTIVEC_BUILTIN_VMINUB:
 case ALTIVEC_BUILTIN_VMINUH:
 case ALTIVEC_BUILTIN_VMINUW:
-case ALTIVEC_BUILTIN_VMINFP:
   arg0 = gimple_call_arg (stmt, 0);
   arg1 = gimple_call_arg (stmt, 1);
   lhs = gimple_call_lhs (stmt);
@@ -12177,6 +12181,11 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
   return true;
 /* flavors of vec_max.  */
 case VSX_BUILTIN_XVMAXDP:
+case ALTIVEC_BUILTIN_VMAXFP:
+  if (!flag_finite_math_only || flag_signed_zeros)
+   return false;
+  /* Fall through to MAX_EXPR.  */
+  gcc_fallthrough ();
 case P8V_BUILTIN_VMAXSD:
 case P8V_BUILTIN_VMAXUD:
 case ALTIVEC_BUILTIN_VMAXSB:
@@ -12185,7 +12194,6 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
 case ALTIVEC_BUILTIN_VMAXUB:
 case ALTIVEC_BUILTIN_VMAXUH:
 case ALTIVEC_BUILTIN_VMAXUW:
-case ALTIVEC_BUILTIN_VMAXFP:
   arg0 = gimple_call_arg (stmt, 0);
   arg1 = gimple_call_arg (stmt, 1);
   lhs = gimple_call_lhs (stmt);
diff --git a/gcc/testsuite/gcc.target/powerpc/vec-minmax-1.c 
b/gcc/testsuite/gcc.target/powerpc/vec-minmax-1.c
new file mode 100644
index 000..9782d1b9308
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/vec-minmax-1.c
@@ -0,0 +1,51 @@
+/* { dg-do compile { target { powerpc64le-*-* } } } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9" } */
+/* { dg-final { scan-assembler-times {\mxvmax[ds]p\M} 2 } } */
+/* { dg-final { scan-assembler-times {\mxvmin[ds]p\M} 2 } } */
+
+/* This test verifies that float or double vec_min/max are bound to
+   xv[min|max][d|s]p instructions when fast-math is not set.  */
+
+
+#include 
+
+#ifdef _BIG_ENDIAN
+   const int PREF_D = 0;
+#else
+   const int PREF_D = 1;
+#endif
+
+double vmaxd (double a, double b)
+{
+  vector double va = vec_promote (a, PREF_D);
+  vector double vb = vec_promote (b, PREF_D);
+  return vec_extract (vec_max (va, vb), PREF_D);
+}
+
+double vmind (double a, double b)
+{
+  vector double va = vec_promote (a, PREF_D);
+  vector double vb = vec_promote (b, PREF_D);
+  return vec_extract (vec_min (va, vb), PREF_D);
+}
+
+#ifdef _BIG_ENDIAN
+   const int PREF_F = 0;
+#else
+   const int PREF_F = 3;
+#endif
+
+float vmaxf (float a, float b)
+{
+  vector float va = vec_promote (a, PREF_F);
+  vector float vb = vec_promote (b, PREF_F);
+  return vec_extract (vec_max (va, vb), PREF_F);
+}
+
+float vminf (float a, float b)
+{
+  vector float va = vec_promote (a, PREF_F);
+  vector float vb = vec_promote (b, PREF_F);
+  return vec_extract (vec_min (va, vb), PREF_F);
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/vec-minmax-2.c 
b/gcc/testsuite/gcc.target/powerpc/vec-minmax-2.c
new file mode 100644
index 000..d318b933181
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/vec-minmax-2.c
@@ -0,0 +1,51 @@
+/* { dg-do compile { target { powerpc64le-*-* } } } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9 -ffast-math" } */
+/* { dg-final { scan-assembler-times {\mxsmaxcdp\M} 2 } } */
+/* { dg-final { scan-assembler-times {\mxsmincdp\M} 2 } } */
+
+/* This test verifies that float or double vec_min/max can be converted
+   to scalar comparison when fast-math is set.  */
+
+
+#include 
+
+#ifdef _BIG_ENDIAN
+   const int PREF_D = 0;
+#else
+   const int PREF_D = 1;
+#endif
+
+double vmaxd (double a, double b)
+{
+  vector double va = vec_promote (a, PREF_D);
+  vector double vb = vec_promote (b, PREF_D);
+  return vec_extract (vec_max (va, vb), PREF_D);
+}
+
+double vmind (double a, double b)
+{
+  vector double va = vec_promote (a, PREF_D);
+  vector double vb = vec_promote (b, 

Re: [PATCH][v2] Remove --param vect-inner-loop-cost-factor

2021-08-24 Thread Richard Biener via Gcc-patches
On Tue, 24 Aug 2021, Richard Biener wrote:

> On Tue, 24 Aug 2021, Kewen.Lin wrote:
> 
> > Hi Richi,
> > 
> > on 2021/8/23 10:33, Richard Biener via Gcc-patches wrote:
> > > This removes --param vect-inner-loop-cost-factor in favor of looking
> > > at the estimated number of iterations of the inner loop
> > > when available and otherwise just assumes a single inner
> > > iteration which is conservative on the side of not vectorizing.
> > > 
> > 
> > I may be missing something, but the factor seems to be an amplifier: a
> > single inner iteration is only conservative on the side of not
> > vectorizing if vector_cost < scalar_cost; if scalar_cost < vector_cost,
> > the direction will be flipped? ({vector,scalar}_cost here covers only
> > the inner loop part.)
> > 
> > Since we don't calculate/compare costs for the inner loop independently
> > and return early if scalar_cost < vector_cost for the inner loop, I
> > guess a "scalar_cost < vector_cost" case is theoretically possible,
> > especially when targets cost something more on the vector side.
> 
> True.
> 
> > > The alternative is to retain the --param for exactly that case,
> > > not sure if the result is better or not.  The --param is new on
> > > head, it was static '50' before.
> > > 
> > 
> > I think the intention of the --param is to offer ports a way to tweak
> > it (though no ports do so for now :)).  Not sure how sensitive target
> > costing is to this factor, but I also prefer to make its default value
> > 50, as Honza suggested, to avoid more possible tweaking.
> > 
> > If targets want more, maybe we can extend it to:
> > 
> > default_hook:
> >   return estimated or likely_max if either is valid;
> >   return default value;
> >   
> > target hook:
> >   val = default_hook; // or from scratch
> >   tweak the val as it wishes;  
> > 
> > I guess there is no such need for now.
> >
> > > Any strong opinions?
> > > 
> > > Richard.
> > > 
> > > 2021-08-23  Richard Biener  
> > > 
> > >   * doc/invoke.texi (vect-inner-loop-cost-factor): Remove
> > >   documentation.
> > >   * params.opt (--param vect-inner-loop-cost-factor): Remove.
> > >   * tree-vect-loop.c (_loop_vec_info::_loop_vec_info):
> > >   Initialize inner_loop_cost_factor to 1.
> > >   (vect_analyze_loop_form): Initialize inner_loop_cost_factor
> > >   from the estimated number of iterations of the inner loop.
> > > ---
> > >  gcc/doc/invoke.texi  |  5 -
> > >  gcc/params.opt   |  4 
> > >  gcc/tree-vect-loop.c | 12 +++-
> > >  3 files changed, 11 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > index c057cc1e4ae..054950132f6 100644
> > > --- a/gcc/doc/invoke.texi
> > > +++ b/gcc/doc/invoke.texi
> > > @@ -14385,11 +14385,6 @@ code to iterate.  2 allows partial vector loads 
> > > and stores in all loops.
> > >  The parameter only has an effect on targets that support partial
> > >  vector loads and stores.
> > >  
> > > -@item vect-inner-loop-cost-factor
> > > -The factor which the loop vectorizer applies to the cost of statements
> > > -in an inner loop relative to the loop being vectorized.  The default
> > > -value is 50.
> > > -
> > >  @item avoid-fma-max-bits
> > >  Maximum number of bits for which we avoid creating FMAs.
> > >  
> > > diff --git a/gcc/params.opt b/gcc/params.opt
> > > index f9264887b40..f7b19fa430d 100644
> > > --- a/gcc/params.opt
> > > +++ b/gcc/params.opt
> > > @@ -1113,8 +1113,4 @@ Bound on number of runtime checks inserted by the 
> > > vectorizer's loop versioning f
> > >  Common Joined UInteger Var(param_vect_partial_vector_usage) Init(2) 
> > > IntegerRange(0, 2) Param Optimization
> > >  Controls how loop vectorizer uses partial vectors.  0 means never, 1 
> > > means only for loops whose need to iterate can be removed, 2 means for 
> > > all loops.  The default value is 2.
> > >  
> > > --param=vect-inner-loop-cost-factor=
> > > -Common Joined UInteger Var(param_vect_inner_loop_cost_factor) Init(50) 
> > > IntegerRange(1, 99) Param Optimization
> > > -The factor which the loop vectorizer applies to the cost of statements 
> > > in an inner loop relative to the loop being vectorized.
> > > -
> > >  ; This comment is to ensure we retain the blank line above.
> > > diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> > > index c521b43a47c..cb48717f20e 100644
> > > --- a/gcc/tree-vect-loop.c
> > > +++ b/gcc/tree-vect-loop.c
> > > @@ -841,7 +841,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, 
> > > vec_info_shared *shared)
> > >  single_scalar_iteration_cost (0),
> > >  vec_outside_cost (0),
> > >  vec_inside_cost (0),
> > > -inner_loop_cost_factor (param_vect_inner_loop_cost_factor),
> > > +inner_loop_cost_factor (1),
> > >  vectorizable (false),
> > >  can_use_partial_vectors_p (param_vect_partial_vector_usage != 0),
> > >  using_partial_vectors_p (false),
> > > @@ -1519,6 +1519,16 @@ vect_analyze_loop_form (class loop *loop, 
> > > vec_info_shared 

Re: [PATCH] Improved handling of shifts/rotates in bit CCP.

2021-08-24 Thread Richard Biener via Gcc-patches
On Sun, Aug 22, 2021 at 4:03 PM Roger Sayle  wrote:
>
> This patch is the next in the series to improve bit bounds in tree-ssa's
> bit CCP pass, this time: bounds for shifts and rotates by unknown amounts.
> This allows us to optimize expressions such as ((x&15)<<(y&24))&64.
> In this case, the expression (y&24) contains only two unknown bits,
> and can therefore have only four possible values: 0, 8, 16 and 24.
> From this (x&15)<<(y&24) has the nonzero bits 0x0f0f0f0f, and from
> that ((x&15)<<(y&24))&64 must always be zero.
>
> One clever use of computer science in this patch is the use of XOR
> to efficiently enumerate bit patterns in Gray code order.  As the
> order in which we generate values is not significant, it's faster
> and more convenient to enumerate values by flipping one bit at a
> time, rather than in numerical order [which would require carry
> bits and additional logic].
>
> There's a pre-existing ??? comment in tree-ssa-ccp.c that we should
> eventually be able to optimize (x<<(y|8))&255, but this patch takes the
> conservatively paranoid approach of only optimizing cases where the
> shift/rotate is guaranteed to be less than the target precision, and
> therefore avoids changing any cases that potentially might invoke
> undefined behavior.  This patch does optimize (x<<((y&31)|8))&255.
>
> This patch has been tested on x86_64-pc-linux-gnu with "make bootstrap"
> and "make -k check" with no new failures.  OK for mainline?

OK.

Thanks,
Richard.

>
> 2021-08-22  Roger Sayle  
>
> gcc/ChangeLog
> * tree-ssa-ccp.c (get_individual_bits): Helper function to
> extract the individual bits from a widest_int constant (mask).
> (gray_code_bit_flips): New read-only table for efficiently
> enumerating permutations/combinations of bits.
> (bit_value_binop) [LROTATE_EXPR, RROTATE_EXPR]: Handle rotates
> by unknown counts that are guaranteed less than the target
> precision and four or fewer unknown bits by enumeration.
> [LSHIFT_EXPR, RSHIFT_EXPR]: Likewise, also handle shifts by
> enumeration under the same conditions.  Handle remaining
> shifts as a mask based upon the minimum possible shift value.
>
> gcc/testsuite/ChangeLog
> * gcc.dg/tree-ssa/ssa-ccp-41.c: New test case.
>
> Roger
> --
>


Re: [PATCH v2] Fix incomplete computation in fill_always_executed_in_1

2021-08-24 Thread Richard Biener via Gcc-patches
On Tue, 24 Aug 2021, Xionghu Luo wrote:

> 
> 
> On 2021/8/19 20:11, Richard Biener wrote:
> >> -  class loop *inn_loop = loop;
> >>   
> >> if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
> >>   {
> >> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, 
> >> sbitmap contains_call)
> >> to disprove this if possible).  */
> >>  if (bb->flags & BB_IRREDUCIBLE_LOOP)
> >>break;
> >> -
> >> -if (!flow_bb_inside_loop_p (inn_loop, bb))
> >> -  break;
> >> -
> >> -if (bb->loop_father->header == bb)
> >> -  {
> >> -if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >> -  break;
> >> -
> >> -/* In a loop that is always entered we may proceed anyway.
> >> -   But record that we entered it and stop once we leave it.  */
> >> -inn_loop = bb->loop_father;
> >> -  }
> >>}
> >>   
> >> while (1)
> > I'm not sure this will work correct (I'm not sure how the existing
> > code makes it so either...).  That said, I can't poke any hole
> > into the change.  What I see is that definitely
> > 
> >if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >  last = bb;
> > 
> >if (bitmap_bit_p (contains_call, bb->index))
> >  break;
> > 
> > doesn't work reliably since the DOM ordering will process blocks
> > A B and C in random order for
> > 
> >for (;;)
> > {
> >if (cond)
> >  {
> >A: foo ();
> >  }
> >else B:;
> >C:;
> > }
> > 
> > and thus we can end up setting 'last' to C _before_ processing
> > 'A' and thus arriving at the call foo () ...
> > 
> > get_loop_body_in_dom_order does some "special sauce" but not
> > to address the above problem - but it might be that a subtle
> > issue like the above is the reason for the inner loop handling.
> > The inner loop block order does _not_ adhere to this "special sauce",
> > that is - the "Additionally, if a basic block s dominates
> > the latch, then only blocks dominated by s are after it."
> > guarantee holds for the outer loop latch, not for the inner.
> > 
> > Digging into the history of fill_always_executed_in_1 doesn't
> > reveal anything - the inner loop handling has been present
> > since introduction by Zdenek - but usually Zdenek has a reason
> > for doing things as he does;)
> 
> Yes, this is really complicated usage, thanks for pointing it out. :)
> I constructed two cases to verify this with inner loop includes "If A; else 
> B; C". 
> Finding that fill_sons_in_loop in get_loop_body_in_dom_order also checks
> whether the bb dominates the outer loop’s latch; if C dominates the outer
> loop’s latch, C is postponed, the access order is ABC, and 'last' won’t be
> set to C if A or B contains a call;

But it depends on the order of visiting ABC and that's hard to put into
a testcase since it depends on the order of edges and the processing
of the dominance computation.  ABC are simply unordered with respect
to a dominator walk.

> Otherwise, if C doesn’t dominate the outer loop’s latch in fill_sons_in_loop,
> the access order is CAB, but 'last' also won’t be updated to C in
> fill_always_executed_in_1 since there is also a dominance check there; then
> if A or B contains a call, it can break successfully.
> 
> C won't be set to ALWAYS EXECUTED in either circumstance.
> 
> > 
> > Note it might be simply a measure against quadratic complexity,
> > esp. since with your patch we also dive into not always executed
> > subloops as you remove the
> > 
> >if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >  break;
> > 
> > check.  I suggest to evaluate behavior of the patch on a testcase
> > like
> > 
> > void foo (int n, int **k)
> > {
> >for (int i = 0; i < n; ++i)
> >  if (k[0][i])
> >for (int j = 0; j < n; ++j)
> >  if (k[1][j])
> >for (int l = 0; l < n; ++l)
> >  if (k[2][l])
> >...
> > }
> 
> Theoretically the complexity changes from L1(bbs) to
> L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs), so fill_always_executed_in_1's execution
> time is supposed to increase from O(n) to O(n^2)?  The time should depend on
> loop depth and bb counts.  I also drafted a test case with a 73-depth loop
> function and 25 no-ipa function copies, each compiled in lim2 and lim4
> independently.  Total execution time of fill_always_executed_in_1 increased
> from 32ms to 58ms, almost doubled but not quadratic?

It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still
O(n^2).

> It seems reasonable that compile time gets longer, since most bbs are
> checked more often, but that is a MUST to ensure we break early correctly
> at every loop level...
> Though the number of loop nodes could be huge, loop depth will never be so
> large in actual code?

The "in practice" argument is almost always defeated by automatic
program generators ;)

> >  
> > I suspect you'll see quadratic behavior with your 

RE: [ARM] PR66791: Replace builtins for signed vmul_n intrinsics

2021-08-24 Thread Kyrylo Tkachov via Gcc-patches


> -Original Message-
> From: Prathamesh Kulkarni 
> Sent: 24 August 2021 09:02
> To: gcc Patches ; Kyrylo Tkachov
> 
> Subject: Re: [ARM] PR66791: Replace builtins for signed vmul_n intrinsics
> 
> On Fri, 13 Aug 2021 at 16:40, Prathamesh Kulkarni
>  wrote:
> >
> > On Thu, 5 Aug 2021 at 15:44, Prathamesh Kulkarni
> >  wrote:
> > >
> > > On Mon, 12 Jul 2021 at 15:24, Prathamesh Kulkarni
> > >  wrote:
> > > >
> > > > On Mon, 12 Jul 2021 at 15:23, Prathamesh Kulkarni
> > > >  wrote:
> > > > >
> > > > > On Mon, 5 Jul 2021 at 14:47, Prathamesh Kulkarni
> > > > >  wrote:
> > > > > >
> > > > > > Hi,
> > > > > > This patch replaces builtins with __a * __b for signed variants of
> > > > > > vmul_n intrinsics.
> > > > > > As discussed earlier, the patch has issue if __a * __b overflows, 
> > > > > > and
> > > > > > whether we wish to leave
> > > > > > that as UB.
> > > > > ping
> https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=6785eb595981abd93ad85ed
> cfdf1d2e43c0841f5
> > > > Oops sorry, I meant this link:
> > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574428.html
> > > ping * 2 https://gcc.gnu.org/pipermail/gcc-patches/2021-
> July/574428.html
> > ping * 3 https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574428.html
> ping * 4 https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576321.html

I'm not very comfortable with this change. We'd be introducing direct signed 
multiplications that are undefined on overflow in C, but the vmul instructions 
in Neon have well-defined overflow semantics.
So they wouldn't be exactly equivalent.

Thanks,
Kyrill

> 
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > >
> > > > > > Thanks,
> > > > > > Prathamesh


RE: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics

2021-08-24 Thread Kyrylo Tkachov via Gcc-patches


> -Original Message-
> From: Prathamesh Kulkarni 
> Sent: 24 August 2021 09:01
> To: Christophe Lyon 
> Cc: Kyrylo Tkachov ; gcc Patches  patc...@gcc.gnu.org>
> Subject: Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n
> intrinsics
> 
> On Tue, 17 Aug 2021 at 11:55, Prathamesh Kulkarni
>  wrote:
> >
> > On Thu, 12 Aug 2021 at 19:04, Christophe Lyon
> >  wrote:
> > >
> > >
> > >
> > > On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni
>  wrote:
> > >>
> > >> On Wed, 11 Aug 2021 at 22:23, Christophe Lyon
> > >>  wrote:
> > >> >
> > >> >
> > >> >
> > >> > On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches  patc...@gcc.gnu.org> wrote:
> > >> >>
> > >> >>
> > >> >>
> > >> >> > -Original Message-
> > >> >> > From: Prathamesh Kulkarni 
> > >> >> > Sent: 24 June 2021 12:11
> > >> >> > To: gcc Patches ; Kyrylo Tkachov
> > >> >> > 
> > >> >> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n
> intrinsics
> > >> >> >
> > >> >> > Hi,
> > >> >> > This patch replaces builtins for vdup_n and vmov_n.
> > >> >> > The patch results in regression for pr51534.c.
> > >> >> > Consider following function:
> > >> >> >
> > >> >> > uint8x8_t f1 (uint8x8_t a) {
> > >> >> >   return vcgt_u8(a, vdup_n_u8(0));
> > >> >> > }
> > >> >> >
> > >> >> > code-gen before patch:
> > >> >> > f1:
> > >> >> > vmov.i32  d16, #0  @ v8qi
> > >> >> > vcgt.u8 d0, d0, d16
> > >> >> > bx lr
> > >> >> >
> > >> >> > code-gen after patch:
> > >> >> > f1:
> > >> >> > vceq.i8 d0, d0, #0
> > >> >> > vmvn    d0, d0
> > >> >> > bx lr
> > >> >> >
> > >> >> > I am not sure which one is better tho ?
> > >> >>
> > >> >
> > >> > Hi Prathamesh,
> > >> >
> > >> > This patch introduces a regression on non-hardfp configs (eg arm-
> linux-gnueabi or arm-eabi):
> > >> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan-
> assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3
> > >> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan-
> assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3
> > >> >
> > >> > Can you fix this?
> > >> The issue is, for following test:
> > >>
> > >> #include 
> > >>
> > >> uint8x8_t f1 (uint8x8_t a) {
> > >>   return vcge_u8(a, vdup_n_u8(0));
> > >> }
> > >>
> > >> armhf code-gen:
> > >> f1:
> > >> vmov.i32  d0, #0x  @ v8qi
> > >> bx    lr
> > >>
> > >> arm softfp code-gen:
> > >> f1:
> > >> mov r0, #-1
> > >> mov r1, #-1
> > >> bx  lr
> > >>
> > >> The code-gen for both is same upto split2 pass:
> > >>
> > >> (insn 10 6 11 2 (set (reg/i:V8QI 16 s0)
> > >> (const_vector:V8QI [
> > >> (const_int -1 [0x]) repeated x8
> > >> ])) "foo.c":5:1 1052 {*neon_movv8qi}
> > >>  (expr_list:REG_EQUAL (const_vector:V8QI [
> > >> (const_int -1 [0x]) repeated x8
> > >> ])
> > >> (nil)))
> > >> (insn 11 10 13 2 (use (reg/i:V8QI 16 s0)) "foo.c":5:1 -1
> > >>  (nil))
> > >>
> > >> and for softfp target, split2 pass splits the assignment to r0 and r1:
> > >>
> > >> (insn 15 6 16 2 (set (reg:SI 0 r0)
> > >> (const_int -1 [0x])) "foo.c":5:1 740
> {*thumb2_movsi_vfp}
> > >>  (nil))
> > >> (insn 16 15 11 2 (set (reg:SI 1 r1 [+4 ])
> > >> (const_int -1 [0x])) "foo.c":5:1 740
> {*thumb2_movsi_vfp}
> > >>  (nil))
> > >> (insn 11 16 13 2 (use (reg/i:V8QI 0 r0)) "foo.c":5:1 -1
> > >>  (nil))
> > >>
> > >> I suppose we could use a dg-scan for r[0-9]+, #-1 for softfp targets ?
> > >>
> > > Yes, probably, or try with check-function-bodies.
> > Hi,
> > Sorry for the late response. Does the attached patch look OK ?
> ping https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577532.html

Ok.
Thanks,
Kyrill

> 
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > >  Christophe
> > >
> > >> Thanks,
> > >> Prathamesh
> > >> >
> > >> > Thanks
> > >> >
> > >> > Christophe
> > >> >
> > >> >
> > >> >>
> > >> >> I think they're equivalent in practice, in any case the patch itself 
> > >> >> is
> good (move away from RTL builtins).
> > >> >> Ok.
> > >> >> Thanks,
> > >> >> Kyrill
> > >> >>
> > >> >> >
> > >> >> > Also, this patch regressed bf16_dup.c on arm-linux-gnueabi,
> > >> >> > which is due to a missed opt in lowering. I had filed it as
> > >> >> > PR98435, and posted a fix for it here:
> > >> >> > https://gcc.gnu.org/pipermail/gcc-patches/2021-
> June/572648.html
> > >> >> >
> > >> >> > Thanks,
> > >> >> > Prathamesh


RE: [ARM] PR66791: Replace builtin in vld1_dup intrinsics

2021-08-24 Thread Kyrylo Tkachov via Gcc-patches


> -Original Message-
> From: Prathamesh Kulkarni 
> Sent: 24 August 2021 09:01
> To: gcc Patches ; Kyrylo Tkachov
> ; Richard Earnshaw
> 
> Subject: Re: [ARM] PR66791: Replace builtin in vld1_dup intrinsics
> 
> On Fri, 13 Aug 2021 at 16:40, Prathamesh Kulkarni
>  wrote:
> >
> > On Thu, 5 Aug 2021 at 15:37, Prathamesh Kulkarni
> >  wrote:
> > >
> > > On Thu, 29 Jul 2021 at 19:58, Prathamesh Kulkarni
> > >  wrote:
> > > >
> > > > Hi,
> > > > The attached patch replaces builtins in vld1_dup intrinsics with call
> > > > to corresponding vdup_n intrinsic and removes entry for vld1_dup from
> > > > arm_neon_builtins.def.
> > > > Bootstrapped+tested on arm-linux-gnueabihf.
> > > > OK to commit ?
> > > ping https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576321.html
> > ping * 2 https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576321.html
> ping * 3 https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576321.html

Sorry for the slow response.
I don't think this approach improves anything. With the current setup we'd be 
guaranteeing generation of the load-and-dup instruction even at low 
optimisation levels, but with this change we'd be relying on RTL optimisers 
merging the load and dup together. I don't think it gains us anything?

Thanks,
Kyrill

> 
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Thanks,
> > > > Prathamesh


Re: [PATCH] Fix a few problems with download_prerequisites.

2021-08-24 Thread Richard Biener via Gcc-patches
On Tue, Aug 24, 2021 at 8:04 AM apinski--- via Gcc-patches
 wrote:
>
> From: Andrew Pinski 
>
> There are a few problems with download_prerequisites, as
> described in PR 82704.  The first is that on busy-box versions of
> shasum and md5sum the extended option --check doesn't exist,
> so just use -c.  The second issue is that the code choosing which
> shasum program to use is included twice and the two copies differ,
> so move the choice of checksum program to after argument
> parsing.  The last issue is that the --md5 option has been broken for
> some time now, as the program is named md5sum and not just md5
> and nobody updated the switch table to match.

OK

> contrib/ChangeLog:
>
> PR other/82704
> * download_prerequisites: Fix issues with --md5 and
> --sha512 options.
> ---
>  contrib/download_prerequisites | 59 
> +-
>  1 file changed, 30 insertions(+), 29 deletions(-)
>
> diff --git a/contrib/download_prerequisites b/contrib/download_prerequisites
> index 51e715f..8f69b61 100755
> --- a/contrib/download_prerequisites
> +++ b/contrib/download_prerequisites
> @@ -46,18 +46,6 @@ verify=1
>  force=0
>  OS=$(uname)
>
> -case $OS in
> -  "Darwin"|"FreeBSD"|"DragonFly"|"AIX")
> -chksum='shasum -a 512 --check'
> -  ;;
> -  "OpenBSD")
> -chksum='sha512 -c'
> -  ;;
> -  *)
> -chksum='sha512sum -c'
> -  ;;
> -esac
> -
>  if type wget > /dev/null ; then
>fetch='wget'
>  else
> @@ -113,7 +101,7 @@ do
>  done
>  unset arg
>
> -# Emulate Linux's 'md5 --check' on macOS
> +# Emulate Linux's 'md5sum --check' on macOS
>  md5_check() {
># Store the standard input: a line from contrib/prerequisites.md5:
>md5_checksum_line=$(cat -)
> @@ -162,26 +150,10 @@ do
>  verify=0
>  ;;
>  --sha512)
> -case $OS in
> -  "Darwin")
> -chksum='shasum -a 512 --check'
> -  ;;
> -  *)
> -chksum='sha512sum --check'
> -  ;;
> -esac
>  chksum_extension='sha512'
>  verify=1
>  ;;
>  --md5)
> -case $OS in
> -  "Darwin")
> -chksum='md5_check'
> -  ;;
> -  *)
> -chksum='md5 --check'
> -  ;;
> -esac
>  chksum_extension='md5'
>  verify=1
>  ;;
> @@ -212,6 +184,35 @@ done
>  [ "x${argnext}" = x ] || die "Missing argument for option --${argnext}"
>  unset arg argnext
>
> +case $chksum_extension in
> +  sha512)
> +case $OS in
> +  "Darwin"|"FreeBSD"|"DragonFly"|"AIX")
> +chksum='shasum -a 512 --check'
> +  ;;
> +  "OpenBSD")
> +chksum='sha512 -c'
> +  ;;
> +  *)
> +chksum='sha512sum -c'
> +  ;;
> +esac
> +  ;;
> +  md5)
> +case $OS in
> +  "Darwin")
> +chksum='md5_check'
> +  ;;
> +  *)
> +chksum='md5sum -c'
> +  ;;
> +esac
> +;;
> +  *)
> +die "Unkown checksum $chksum_extension"
> +  ;;
> +esac
> +
>  [ -e ./gcc/BASE-VER ]
>  \
>  || die "You must run this script in the top-level GCC source directory"
>
> --
> 1.8.3.1
>
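The portability point about --check versus -c can be demonstrated with any GNU or BusyBox checksum tool (file names below are made up; the digest is the well-known MD5 of the empty file):

```shell
# BusyBox md5sum/sha512sum accept only the short -c form; GNU accepts
# both.  Using -c is therefore the portable spelling the patch moves to.
printf '' > empty.bin
echo "d41d8cd98f00b204e9800998ecf8427e  empty.bin" > sums.md5
md5sum -c sums.md5
```

On success this prints a per-file "OK" line; with `--check` the same invocation fails on BusyBox builds.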


Re: [ARM] PR66791: Replace builtins for signed vmul_n intrinsics

2021-08-24 Thread Prathamesh Kulkarni via Gcc-patches
On Fri, 13 Aug 2021 at 16:40, Prathamesh Kulkarni
 wrote:
>
> On Thu, 5 Aug 2021 at 15:44, Prathamesh Kulkarni
>  wrote:
> >
> > On Mon, 12 Jul 2021 at 15:24, Prathamesh Kulkarni
> >  wrote:
> > >
> > > On Mon, 12 Jul 2021 at 15:23, Prathamesh Kulkarni
> > >  wrote:
> > > >
> > > > On Mon, 5 Jul 2021 at 14:47, Prathamesh Kulkarni
> > > >  wrote:
> > > > >
> > > > > Hi,
> > > > > This patch replaces builtins with __a * __b for signed variants of
> > > > > vmul_n intrinsics.
> > > > > As discussed earlier, the patch has issue if __a * __b overflows, and
> > > > > whether we wish to leave
> > > > > that as UB.
> > > > ping 
> > > > https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=6785eb595981abd93ad85edcfdf1d2e43c0841f5
> > > Oops sorry, I meant this link:
> > > https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574428.html
> > ping * 2 https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574428.html
> ping * 3 https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574428.html
ping * 4 https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576321.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Thanks,
> > > > > Prathamesh


Re: [ARM] PR66791: Replace builtin in vld1_dup intrinsics

2021-08-24 Thread Prathamesh Kulkarni via Gcc-patches
On Fri, 13 Aug 2021 at 16:40, Prathamesh Kulkarni
 wrote:
>
> On Thu, 5 Aug 2021 at 15:37, Prathamesh Kulkarni
>  wrote:
> >
> > On Thu, 29 Jul 2021 at 19:58, Prathamesh Kulkarni
> >  wrote:
> > >
> > > Hi,
> > > The attached patch replaces builtins in vld1_dup intrinsics with call
> > > to corresponding vdup_n intrinsic and removes entry for vld1_dup from
> > > arm_neon_builtins.def.
> > > Bootstrapped+tested on arm-linux-gnueabihf.
> > > OK to commit ?
> > ping https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576321.html
> ping * 2 https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576321.html
ping * 3 https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576321.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Prathamesh


Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics

2021-08-24 Thread Prathamesh Kulkarni via Gcc-patches
On Tue, 17 Aug 2021 at 11:55, Prathamesh Kulkarni
 wrote:
>
> On Thu, 12 Aug 2021 at 19:04, Christophe Lyon
>  wrote:
> >
> >
> >
> > On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni 
> >  wrote:
> >>
> >> On Wed, 11 Aug 2021 at 22:23, Christophe Lyon
> >>  wrote:
> >> >
> >> >
> >> >
> >> > On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches 
> >> >  wrote:
> >> >>
> >> >>
> >> >>
> >> >> > -Original Message-
> >> >> > From: Prathamesh Kulkarni 
> >> >> > Sent: 24 June 2021 12:11
> >> >> > To: gcc Patches ; Kyrylo Tkachov
> >> >> > 
> >> >> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n 
> >> >> > intrinsics
> >> >> >
> >> >> > Hi,
> >> >> > This patch replaces builtins for vdup_n and vmov_n.
> >> >> > The patch results in regression for pr51534.c.
> >> >> > Consider following function:
> >> >> >
> >> >> > uint8x8_t f1 (uint8x8_t a) {
> >> >> >   return vcgt_u8(a, vdup_n_u8(0));
> >> >> > }
> >> >> >
> >> >> > code-gen before patch:
> >> >> > f1:
> >> >> > vmov.i32  d16, #0  @ v8qi
> >> >> > vcgt.u8 d0, d0, d16
> >> >> > bx lr
> >> >> >
> >> >> > code-gen after patch:
> >> >> > f1:
> >> >> > vceq.i8 d0, d0, #0
> >> >> > vmvn    d0, d0
> >> >> > bx lr
> >> >> >
> >> >> > I am not sure which one is better tho ?
> >> >>
> >> >
> >> > Hi Prathamesh,
> >> >
> >> > This patch introduces a regression on non-hardfp configs (eg 
> >> > arm-linux-gnueabi or arm-eabi):
> >> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c 
> >> > scan-assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3
> >> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c 
> >> > scan-assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3
> >> >
> >> > Can you fix this?
> >> The issue is, for following test:
> >>
> >> #include 
> >>
> >> uint8x8_t f1 (uint8x8_t a) {
> >>   return vcge_u8(a, vdup_n_u8(0));
> >> }
> >>
> >> armhf code-gen:
> >> f1:
> >> vmov.i32  d0, #0x  @ v8qi
> >> bx    lr
> >>
> >> arm softfp code-gen:
> >> f1:
> >> mov r0, #-1
> >> mov r1, #-1
> >> bx  lr
> >>
> >> The code-gen for both is same upto split2 pass:
> >>
> >> (insn 10 6 11 2 (set (reg/i:V8QI 16 s0)
> >> (const_vector:V8QI [
> >> (const_int -1 [0x]) repeated x8
> >> ])) "foo.c":5:1 1052 {*neon_movv8qi}
> >>  (expr_list:REG_EQUAL (const_vector:V8QI [
> >> (const_int -1 [0x]) repeated x8
> >> ])
> >> (nil)))
> >> (insn 11 10 13 2 (use (reg/i:V8QI 16 s0)) "foo.c":5:1 -1
> >>  (nil))
> >>
> >> and for softfp target, split2 pass splits the assignment to r0 and r1:
> >>
> >> (insn 15 6 16 2 (set (reg:SI 0 r0)
> >> (const_int -1 [0x])) "foo.c":5:1 740 
> >> {*thumb2_movsi_vfp}
> >>  (nil))
> >> (insn 16 15 11 2 (set (reg:SI 1 r1 [+4 ])
> >> (const_int -1 [0x])) "foo.c":5:1 740 
> >> {*thumb2_movsi_vfp}
> >>  (nil))
> >> (insn 11 16 13 2 (use (reg/i:V8QI 0 r0)) "foo.c":5:1 -1
> >>  (nil))
> >>
> >> I suppose we could use a dg-scan for r[0-9]+, #-1 for softfp targets ?
> >>
> > Yes, probably, or try with check-function-bodies.
> Hi,
> Sorry for the late response. Does the attached patch look OK ?
ping https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577532.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> >  Christophe
> >
> >> Thanks,
> >> Prathamesh
> >> >
> >> > Thanks
> >> >
> >> > Christophe
> >> >
> >> >
> >> >>
> >> >> I think they're equivalent in practice, in any case the patch itself is 
> >> >> good (move away from RTL builtins).
> >> >> Ok.
> >> >> Thanks,
> >> >> Kyrill
> >> >>
> >> >> >
> >> >> > Also, this patch regressed bf16_dup.c on arm-linux-gnueabi,
> >> >> > which is due to a missed opt in lowering. I had filed it as
> >> >> > PR98435, and posted a fix for it here:
> >> >> > https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572648.html
> >> >> >
> >> >> > Thanks,
> >> >> > Prathamesh


Re: [PATCH] Reset PHI base0 flag if it's clear in any argument [PR101977]

2021-08-24 Thread Richard Biener via Gcc-patches
On Tue, Aug 24, 2021 at 1:41 AM Martin Sebor via Gcc-patches
 wrote:
>
> When determining the properties of objects referenced by a PHI's
> arguments, compute_objsize() has logic to filter out null pointers.
> It also has special logic that tries to deal with arguments that
> refer to the same object (as opposed to different objects).  A bug
> in the former prevents the function from clearing the flag called
> BASE0 that indicates that the identities of all the objects are
> known.  The latter logic turns out to be redundant but its presence
> make the logic in the function harder to follow.
>
> The attached patch corrects the former logic by resetting the BASE0
> flag for a PHI result if it's clear for any of its arguments.  It
> also does away with the latter logic, simplifying the code.  Testing
> the patch exposed a couple of other, minor, bugs in using an object's
> total size without considering an offset into it, and failing to reset
> members of reused access_ref objects.
>
> Tested on x86_64-linux.

OK.

Richard.

> Martin


Re: [PATCH] aix: handle 64bit inodes for include directories

2021-08-24 Thread CHIGOT, CLEMENT via Gcc-patches
>>> So my worry here is this is really a host property -- ie, this is
>>> behavior of where GCC runs, not the target for which GCC is generating code.
>>>
>>> That implies that the change in aix.h is wrong.  aix.h is for the
>>> target, not the host -- you don't want to define something like
>>> HOST_STAT_FOR_64BIT_INODES there.
>>>
>>> You'd want to be triggering this behavior via a host fragment, x-aix, or
>>> better yet via an autoconf test.
>> Indeed, would this version be better?  I'm not sure about the configure test.
>> But as we are retrieving the size of dev_t and ino_t just above, I'm assuming
>> those are the ones being used in stat directly.  At least, that's the case on
>> AIX, and this test is only made for AIX.
> It's a clear improvement.  It's still checking for the aix target though:
>
> +# Select the right stat being able to handle 64bit inodes, if needed.
> +if test "$enable_largefile" != no; then
> +  case "$target" in
> +*-*-aix*)
> +  if test "$ac_cv_sizeof_ino_t" == "4" -a "$ac_cv_sizeof_dev_t" ==
> 4; then
> +
> +$as_echo "#define HOST_STAT_FOR_64BIT_INODES stat64x" >>confdefs.h
> +
> +  fi;;
> +  esac
> +fi
>
> Again, we're dealing with a host property.  You might be able to just
> change $target above to $host.  Hmm, that makes me wonder about Canadian
> crosses where host != build.  We may need to do this for both the aix
> host and aix build.

Yes, my bad, I've updated the case. I don't know if there is a usual way
to check both $build and $host. I've tried to avoid code duplication so
tell me if it's okay or if you'd rather have a case for $build and one
for $host.

Thanks,
Clément


0001-aix-handle-64bit-inodes-for-include-directories.patch
Description: 0001-aix-handle-64bit-inodes-for-include-directories.patch


Re: DWARF for extern variable

2021-08-24 Thread Richard Biener via Gcc-patches
On Mon, Aug 23, 2021 at 11:18 PM Indu Bhagat via Gcc-patches
 wrote:
>
> Hello,
>
> What is the expected DWARF for extern variable in the following cases? I
> am seeing that the DWARF generated is different with gcc8.4.1 vs gcc-trunk.
>
> Testcase 1
> --
> extern const char a[];
>
> int foo()
> {
>return a != 0;
> }
>
> Testcase 1 Behavior
> -
> - gcc-trunk has _no_ DWARF for variable a.

Use -fno-eliminate-unused-debug-symbols (the comparison is folded away
very early it seems).  Then we get similar DWARF as with GCC 8.

> - gcc8.4.1 generates following DW_TAG_variable for extern variable a.
> But does not designate it as a non-defining decl (IIUC,
> DW_AT_specification is used for such cases?).

It has DW_AT_declaration.

>
> <..>
>   <1><31>: Abbrev Number: 2 (DW_TAG_array_type)
>  <32>   DW_AT_type: <0x48>
>  <36>   DW_AT_sibling : <0x3c>
>   <2><3a>: Abbrev Number: 3 (DW_TAG_subrange_type)
>   <2><3b>: Abbrev Number: 0
>   <1><3c>: Abbrev Number: 4 (DW_TAG_const_type)
>  <3d>   DW_AT_type: <0x31>
>   <1><41>: Abbrev Number: 5 (DW_TAG_base_type)
>  <42>   DW_AT_byte_size   : 1
>  <43>   DW_AT_encoding: 6(signed char)
>  <44>   DW_AT_name: (indirect string, offset: 0x1df6): char
>   <1><48>: Abbrev Number: 4 (DW_TAG_const_type)
>  <49>   DW_AT_type: <0x41>
>   <1><4d>: Abbrev Number: 6 (DW_TAG_variable)
>  <4e>   DW_AT_name: a
>  <50>   DW_AT_decl_file   : 1
>  <51>   DW_AT_decl_line   : 1
>  <52>   DW_AT_decl_column : 19
>  <53>   DW_AT_type: <0x3c>
>  <57>   DW_AT_external: 1
>  <57>   DW_AT_declaration : 1
> <..>
>
> ---
>
> Testcase 2
> --
> extern const char a[];
> const char a[] = "testme";
>
> Testcase 2 Behavior
> 
> - Both gcc-trunk and gcc8.4.1 generate two DW_TAG_variable DIEs (the
> defining decl holds the reference to the non-defining decl via
> DW_AT_specification)
> - But gcc8.4.1 does not generate any DWARF for the type of the defining
> decl (const char[7]) but gcc-trunk does.
>
> ## DWARF for testcase 2 with gcc-trunk is as follows:
> <...>
>   <1><22>: Abbrev Number: 2 (DW_TAG_array_type)
>  <23>   DW_AT_type: <0x39>
>  <27>   DW_AT_sibling : <0x2d>
>   <2><2b>: Abbrev Number: 5 (DW_TAG_subrange_type)
>   <2><2c>: Abbrev Number: 0
>   <1><2d>: Abbrev Number: 1 (DW_TAG_const_type)
>  <2e>   DW_AT_type: <0x22>
>   <1><32>: Abbrev Number: 3 (DW_TAG_base_type)
>  <33>   DW_AT_byte_size   : 1
>  <34>   DW_AT_encoding: 6(signed char)
>  <35>   DW_AT_name: (indirect string, offset: 0x2035): char
>   <1><39>: Abbrev Number: 1 (DW_TAG_const_type)
>  <3a>   DW_AT_type: <0x32>
>   <1><3e>: Abbrev Number: 6 (DW_TAG_variable)
>  <3f>   DW_AT_name: a
>  <41>   DW_AT_decl_file   : 1
>  <42>   DW_AT_decl_line   : 1
>  <43>   DW_AT_decl_column : 19
>  <44>   DW_AT_type: <0x2d>
>  <48>   DW_AT_external: 1
>  <48>   DW_AT_declaration : 1
>   <1><48>: Abbrev Number: 2 (DW_TAG_array_type)
>  <49>   DW_AT_type: <0x39>
>  <4d>   DW_AT_sibling : <0x58>
>   <2><51>: Abbrev Number: 7 (DW_TAG_subrange_type)
>  <52>   DW_AT_type: <0x5d>
>  <56>   DW_AT_upper_bound : 6
>   <2><57>: Abbrev Number: 0
>   <1><58>: Abbrev Number: 1 (DW_TAG_const_type)
>  <59>   DW_AT_type: <0x48>
>   <1><5d>: Abbrev Number: 3 (DW_TAG_base_type)
>  <5e>   DW_AT_byte_size   : 8
>  <5f>   DW_AT_encoding: 7(unsigned)
>  <60>   DW_AT_name: (indirect string, offset: 0x2023): long
> unsigned int
>   <1><64>: Abbrev Number: 8 (DW_TAG_variable)
>  <65>   DW_AT_specification: <0x3e>
>  <69>   DW_AT_decl_line   : 2
>  <6a>   DW_AT_decl_column : 12
>  <6b>   DW_AT_type: <0x58>

I suppose having both a DW_AT_specification and a DW_AT_type
is somewhat at odds.  It's likely because the definition specifies
the size of the array while the specification does not.  Not sure
what should be best done here.

Richard.

>  <6f>   DW_AT_location: 9 byte block: 3 0 0 0 0 0 0 0 0
> (DW_OP_addr: 0)
>   <1><79>: Abbrev Number: 0
>
> ## DWARF for testcase 2 with gcc8.4.1 is as follows:
>   <1><21>: Abbrev Number: 2 (DW_TAG_array_type)
>  <22>   DW_AT_type: <0x38>
>  <26>   DW_AT_sibling : <0x2c>
>   <2><2a>: Abbrev Number: 3 (DW_TAG_subrange_type)
>   <2><2b>: Abbrev Number: 0
>   <1><2c>: Abbrev Number: 4 (DW_TAG_const_type)
>  <2d>   DW_AT_type: <0x21>
>   <1><31>: Abbrev Number: 5 (DW_TAG_base_type)
>  <32>   DW_AT_byte_size   : 1
>  <33>   DW_AT_encoding: 6(signed char)
>  <34>   DW_AT_name: (indirect string, offset: 0x1e04): char
>   <1><38>: Abbrev Number: 4 (DW_TAG_const_type)
>  <39>   DW_AT_type: <0x31>
>   

Re: [PATCH v2] Fix incomplete computation in fill_always_executed_in_1

2021-08-24 Thread Xionghu Luo via Gcc-patches



On 2021/8/19 20:11, Richard Biener wrote:
>> -  class loop *inn_loop = loop;
>>   
>> if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
>>   {
>> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap 
>> contains_call)
>>   to disprove this if possible).  */
>>if (bb->flags & BB_IRREDUCIBLE_LOOP)
>>  break;
>> -
>> -  if (!flow_bb_inside_loop_p (inn_loop, bb))
>> -break;
>> -
>> -  if (bb->loop_father->header == bb)
>> -{
>> -  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>> -break;
>> -
>> -  /* In a loop that is always entered we may proceed anyway.
>> - But record that we entered it and stop once we leave it.  */
>> -  inn_loop = bb->loop_father;
>> -}
>>  }
>>   
>> while (1)
> I'm not sure this will work correctly (I'm not sure how the existing
> code makes it so either...).  That said, I can't poke any hole
> into the change.  What I see is that definitely
> 
>if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>  last = bb;
> 
>if (bitmap_bit_p (contains_call, bb->index))
>  break;
> 
> doesn't work reliably since the DOM ordering will process blocks
> A B and C in random order for
> 
>for (;;)
> {
>if (cond)
>  {
>A: foo ();
>  }
>else B:;
>C:;
> }
> 
> and thus we can end up setting 'last' to C _before_ processing
> 'A' and thus arriving at the call foo () ...
> 
> get_loop_body_in_dom_order does some "special sauce" but not
> to address the above problem - but it might be that a subtle
> issue like the above is the reason for the inner loop handling.
> The inner loop block order does _not_ adhere to this "special sauce",
> that is - the "Additionally, if a basic block s dominates
> the latch, then only blocks dominated by s are after it."
> guarantee holds for the outer loop latch, not for the inner.
> 
> Digging into the history of fill_always_executed_in_1 doesn't
> reveal anything - the inner loop handling has been present
> since introduction by Zdenek - but usually Zdenek has a reason
> for doing things as he does;)

Yes, this is really complicated usage, thanks for pointing it out. :)
I constructed two cases to verify this, with an inner loop that includes
"if A; else B; C".
I find that fill_sons_in_loop in get_loop_body_in_dom_order also checks
whether the bb dominates the outer loop's latch; if C dominates the outer
loop's latch, C is postponed, the access order is ABC, and 'last' won't be
set to C if A or B contains a call;

Otherwise, if C doesn't dominate the outer loop's latch in fill_sons_in_loop,
the access order is CAB, but 'last' also won't be updated to C in
fill_always_executed_in_1, since there is a dominance check there as well;
then if A or B contains a call, it can break out correctly.

C won't be set to ALWAYS EXECUTED in either circumstance.

> 
> Note it might be simply a measure against quadratic complexity,
> esp. since with your patch we also dive into not always executed
> subloops as you remove the
> 
>if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>  break;
> 
> check.  I suggest to evaluate behavior of the patch on a testcase
> like
> 
> void foo (int n, int **k)
> {
>for (int i = 0; i < n; ++i)
>  if (k[0][i])
>for (int j = 0; j < n; ++j)
>  if (k[1][j])
>for (int l = 0; l < n; ++l)
>  if (k[2][l])
>...
> }

Theoretically the complexity changes from L1(bbs) to
L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs),
so fill_always_executed_in_1's execution time is supposed to increase from
O(n) to O(n²)?  The time depends on loop depth and bb counts.  I also
drafted a test case with a 73-deep loop nest function and 25 no-ipa function
copies, each compiled in lim2 and lim4 respectively.  Total execution time of
fill_always_executed_in_1 increased from 32ms to 58ms, almost doubled but not
quadratic?

It seems reasonable to see compile time getting longer, since most bbs are
checked more often, but that is a MUST to ensure we break out early and
correctly at every loop level...
Though the number of loop nodes could be huge, loop depth will never be so
large in actual code?

>  
> I suspect you'll see quadratic behavior with your patch.  You
> should be at least able to preserve a check like
> 
>/* Do not process not always executed subloops to avoid
>   quadratic behavior.  */
>if (bb->loop_father->header == bb
>&& !dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>  break;
> 
> which is of course not optimistic for cases like
> 
>for (..)
> {
>   if (cond)
> for (..)
>   x = 1; // this is always executed if the inner loop is finite
> }
> 
> but we need to have an eye on the complexity of this function.  I would
> have suggested to do greedy visiting of the loop header successors,
> 

Re: [ping] Re-unify 'omp_build_component_ref' and 'oacc_build_component_ref'

2021-08-24 Thread Richard Biener via Gcc-patches
On Mon, Aug 23, 2021 at 4:30 PM Thomas Schwinge  wrote:
>
> Hi!
>
> On 2021-08-20T09:51:36+0200, Richard Biener  
> wrote:
> > On Thu, Aug 19, 2021 at 10:14 PM Thomas Schwinge
> >  wrote:
> >> Richard, maybe you have an opinion here, in particular about my
> >> "SLP vectorizer" comment below?  Please see
> >> <http://mid.mail-archive.com/87r1f2puss.fsf@euler.schwinge.homeip.net>
> >> for the full context.
> >>
> >> On 2021-08-16T10:21:04+0200, Jakub Jelinek  wrote:
> >> > On Mon, Aug 16, 2021 at 10:08:42AM +0200, Thomas Schwinge wrote:
> >> >>  /* Build COMPONENT_REF and set TREE_THIS_VOLATILE and TREE_READONLY on 
> >> >> it
> >> >> as appropriate.  */
> >> >>
> >> >>  tree
> >> >>  omp_build_component_ref (tree obj, tree field)
> >> >>  {
> >> >> +  tree field_type = TREE_TYPE (field);
> >> >> +  tree obj_type = TREE_TYPE (obj);
> >> >> +  if (!ADDR_SPACE_GENERIC_P (TYPE_ADDR_SPACE (obj_type)))
> >> >> +field_type
> >> >> +  = build_qualified_type (field_type,
> >> >> +  KEEP_QUAL_ADDR_SPACE (TYPE_QUALS 
> >> >> (obj_type)));
> >>
> >> (For later reference: "Kwok's new code" here is to propagate to
> >> 'field_type' any non-generic address space of 'obj_type'.)
> >>
> >> |> Concerning the current 'gcc/omp-low.c:omp_build_component_ref', for the
> >> |> current set of offloading testcases, we never see a
> >> |> '!ADDR_SPACE_GENERIC_P' there, so the address space handling doesn't 
> >> seem
> >> |> to be necessary there (but also won't do any harm: no-op).
> >> >
> >> > Are you sure this can't trigger?
> >> > Say
> >> > extern int __seg_fs a;
> >> >
> >> > void
> >> > foo (void)
> >> > {
> >> >   #pragma omp parallel private (a)
> >> >   a = 2;
> >> > }
> >>
> >> That test case doesn't run into 'omp_build_component_ref' at all,
> >> but I'm attaching an altered and extended variant that does,
> >> "Add 'libgomp.c/address-space-1.c'".  OK to push to master branch?
> >>
> >> In this case, 'omp_build_component_ref' called via host compilation
> >> 'pass_lower_omp', it's the 'field_type' that has 'address-space-1', not
> >> 'obj_type', so indeed Kwok's new code is a no-op:
> >>
> >> (gdb) call debug_tree(field_type)
> >>   >> type  >> size 
> >> unit-size 
> >> align:32 warn_if_not_align:0 symtab:0 alias-set -1 
> >> canonical-type 0x77686498 precision:32 min  >> -2147483648> max 
> >> pointer_to_this >
> >> unsigned DI
> >> size  >> bitsizetype> constant 64>
> >> unit-size  >> 0x77559000 sizetype> constant 8>
> >> align:64 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 
> >> 0x77686b28>
> >>
> >> (gdb) call debug_tree(obj_type)
> >>   >> size  >> bitsizetype> constant 64>
> >> unit-size  >> 0x77559000 sizetype> constant 8>
> >> align:64 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 
> >> 0x77686bd0
> >> fields  >> type  >> 0x77686498 int address-space-1>
> >> unsigned DI size  unit-size 
> >> 
> >> align:64 warn_if_not_align:0 symtab:0 alias-set -1 
> >> canonical-type 0x77686b28>
> >> unsigned DI /home/thomas/shared/gcc/omp/as.c:4:14 size 
> >>  unit-size 
> >> align:64 warn_if_not_align:0 offset_align 128
> >> offset 
> >> bit-offset  context 
> >> > reference_to_this 
> >> >
> >>
> >> The case that Kwok's new code handles, however, is when 'obj_type' has a
> >> non-generic address space, and then propagates that one to 'field_type'.
> >>
> >> For a similar OpenACC example, 'omp_build_component_ref' called via GCN
> >> offloading compilation 'pass_omp_oacc_neuter_broadcast', we've got
> >> without Kwok's new code:
> >>
> >> (gdb) call debug_tree(field_type)
> >>   >> size  >> bitsizetype> constant 8>
> >> unit-size  >> 0x7755 sizetype> constant 1>
> >> align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 
> >> 0x77550b28 precision:1 min  max 
> >> >
> >>
> >> (gdb) call debug_tree(obj_type)
> >>   >> size  >> bitsizetype> constant 8>
> >> unit-size  >> 0x7755 sizetype> constant 1>
> >> align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 
> >> 0x77631000
> >> fields  >> type  >>  unit-size 
> >> align:8 warn_if_not_align:0 symtab:0 alias-set -1 
> >> canonical-type 0x77550b28 precision:1 min  >> 0> max >
> >> unsigned QI :0:0 size  
> >> unit-size 
> >> align:8 warn_if_not_align:0 offset_align 64
> >> offset 
> >> bit-offset  context 
> >> >
> >> pointer_to_this >
> >>
> >> ..., and with Kwok's new code the 'address-space-4' of 'obj_type' is
> >> propagated to 'field_type':
> >>
> >> (gdb) call debug_tree(field_type)
> >>   >> size  >> 

Re: [patch, libgfortran] Further fixes for GFC/CFI descriptor conversions

2021-08-24 Thread Tobias Burnus

Hi Sandra,

On 19.08.21 05:57, Sandra Loosemore wrote:

This patch addresses several bugs in converting from GFC to CFI
descriptors and vice versa. [...]

The root of all the problems addressed here is that GFC descriptors
contain incomplete information; in particular, they only encode the
size of the data type and not its kind.


I think that's not a fundamental problem – it only arises because the
conversion happens in libgfortran and not in FE-generated code. In the FE,
the data type is known – hence, it could be handled properly.

We do definitely plan to move the descriptor conversion to the FE, which
also solves severe alias issues. However, until that patch is ready, we
rely on doing it in libgfortran ...


 libgfortran: Further fixes for GFC/CFI descriptor conversions.

 This patch is for:
 PR100907 - Bind(c): failure handling wide character
 PR100911 - Bind(c): failure handling C_PTR
 PR100914 - Bind(c): errors handling complex
 PR100915 - Bind(c): failure handling C_FUNPTR
 PR100917 - Bind(c): errors handling long double real

[...]
 The Fortran front end does not distinguish between C_PTR and
 C_FUNPTR, mapping both onto BT_VOID.  That is what this patch does also.


Looks like another thing to improve once we moved the conversion code to
the FE.


 2021-08-18  Sandra Loosemore
  José Rui Faustino de Sousa

 gcc/testsuite/
  PR fortran/100911
  PR fortran/100915
  PR fortran/100916
  * gfortran.dg/PR100911.c: New file.
  * gfortran.dg/PR100911.f90: New file.
  * gfortran.dg/PR100914.c: New file.
  * gfortran.dg/PR100914.f90: New file.
  * gfortran.dg/PR100915.c: New file.
  * gfortran.dg/PR100915.f90: New file.

 libgfortran/
  PR fortran/100907
  PR fortran/100911
  PR fortran/100914
  PR fortran/100915
  PR fortran/100917
  * ISO_Fortran_binding.h (CFI_type_cfunptr): Make equivalent to
  CFI_type_cptr.
  * runtime/ISO_Fortran_binding.c (cfi_desc_to_gfc_desc): Fix
  handling of CFI_type_cptr and CFI_type_cfunptr.  Additional error
  checking and code cleanup.
  (gfc_desc_to_cfi_desc): Likewise.  Also correct kind mapping
  for character, complex, and long double types.

...

+++ b/libgfortran/runtime/ISO_Fortran_binding.c
@@ -37,15 +37,16 @@ export_proto(cfi_desc_to_gfc_desc);
  void
  cfi_desc_to_gfc_desc (gfc_array_void *d, CFI_cdesc_t **s_ptr)
  {
+  signed char type;

...

+  GFC_DESCRIPTOR_TYPE (d) = (signed char)type;

No need for a cast.

+case CFI_type_Character:
+  /* FIXME: we can't distinguish between kind/len because
+  the GFC descriptor only encodes the elem_len..
+  Until PR92482 is fixed, assume elem_len refers to the
+  character size and not the string length.  */
+  kind = (signed char)d->elem_len;
+  break;


I wonder what's more common – kind=4 or len > 1.
My gut feeling is that it is len > 1, but the issue will be gone once we
move the code to the FE.
Your patch shows even more reasons for doing so ...

LGTM.

Tobias

-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955


Re: [PATCH][v2] Remove --param vect-inner-loop-cost-factor

2021-08-24 Thread Richard Biener via Gcc-patches
On Tue, 24 Aug 2021, Kewen.Lin wrote:

> Hi Richi,
> 
> on 2021/8/23 10:33, Richard Biener via Gcc-patches wrote:
> > This removes --param vect-inner-loop-cost-factor in favor of looking
> > at the estimated number of iterations of the inner loop
> > when available and otherwise just assumes a single inner
> > iteration which is conservative on the side of not vectorizing.
> > 
> 
> I may be missing something: the factor seems to be an amplifier.  A single
> inner iteration on the side of not vectorizing only relies on
> vector_cost < scalar_cost; if scalar_cost < vector_cost, the direction
> will be flipped?  ({vector,scalar}_cost is only for the inner loop part.)
> 
> Since we don't calculate/compare costs for the inner loop independently
> and early-return if scalar_cost < vector_cost for the inner loop, I guess
> it's possible to have the "scalar_cost < vector_cost" case theoretically,
> especially when targets can cost something more on the vector side.

True.

> > The alternative is to retain the --param for exactly that case,
> > not sure if the result is better or not.  The --param is new on
> > head, it was static '50' before.
> > 
> 
> I think the intention of the --param is to offer ports a way to tweak
> it (no ports do it for now though :)).  Not sure how sensitive target
> costing is to this factor, but I also prefer to make its default
> value 50 as Honza suggested, to avoid more possible tweaking.
> 
> If targets want more, maybe we can extend it to:
> 
> default_hook:
>   return estimated or likely_max if either is valid;
>   return default value;
>   
> target hook:
>   val = default_hook; // or from scratch
>   tweak the val as it wishes;  
> 
> I guess there is no such need for now.
>
> > Any strong opinions?
> > 
> > Richard.
> > 
> > 2021-08-23  Richard Biener  
> > 
> > * doc/invoke.texi (vect-inner-loop-cost-factor): Remove
> > documentation.
> > * params.opt (--param vect-inner-loop-cost-factor): Remove.
> > * tree-vect-loop.c (_loop_vec_info::_loop_vec_info):
> > Initialize inner_loop_cost_factor to 1.
> > (vect_analyze_loop_form): Initialize inner_loop_cost_factor
> > from the estimated number of iterations of the inner loop.
> > ---
> >  gcc/doc/invoke.texi  |  5 -
> >  gcc/params.opt   |  4 
> >  gcc/tree-vect-loop.c | 12 +++-
> >  3 files changed, 11 insertions(+), 10 deletions(-)
> > 
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index c057cc1e4ae..054950132f6 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -14385,11 +14385,6 @@ code to iterate.  2 allows partial vector loads 
> > and stores in all loops.
> >  The parameter only has an effect on targets that support partial
> >  vector loads and stores.
> >  
> > -@item vect-inner-loop-cost-factor
> > -The factor which the loop vectorizer applies to the cost of statements
> > -in an inner loop relative to the loop being vectorized.  The default
> > -value is 50.
> > -
> >  @item avoid-fma-max-bits
> >  Maximum number of bits for which we avoid creating FMAs.
> >  
> > diff --git a/gcc/params.opt b/gcc/params.opt
> > index f9264887b40..f7b19fa430d 100644
> > --- a/gcc/params.opt
> > +++ b/gcc/params.opt
> > @@ -1113,8 +1113,4 @@ Bound on number of runtime checks inserted by the 
> > vectorizer's loop versioning f
> >  Common Joined UInteger Var(param_vect_partial_vector_usage) Init(2) 
> > IntegerRange(0, 2) Param Optimization
> >  Controls how loop vectorizer uses partial vectors.  0 means never, 1 means 
> > only for loops whose need to iterate can be removed, 2 means for all loops. 
> >  The default value is 2.
> >  
> > --param=vect-inner-loop-cost-factor=
> > -Common Joined UInteger Var(param_vect_inner_loop_cost_factor) Init(50) 
> > IntegerRange(1, 99) Param Optimization
> > -The factor which the loop vectorizer applies to the cost of statements in 
> > an inner loop relative to the loop being vectorized.
> > -
> >  ; This comment is to ensure we retain the blank line above.
> > diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> > index c521b43a47c..cb48717f20e 100644
> > --- a/gcc/tree-vect-loop.c
> > +++ b/gcc/tree-vect-loop.c
> > @@ -841,7 +841,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, 
> > vec_info_shared *shared)
> >  single_scalar_iteration_cost (0),
> >  vec_outside_cost (0),
> >  vec_inside_cost (0),
> > -inner_loop_cost_factor (param_vect_inner_loop_cost_factor),
> > +inner_loop_cost_factor (1),
> >  vectorizable (false),
> >  can_use_partial_vectors_p (param_vect_partial_vector_usage != 0),
> >  using_partial_vectors_p (false),
> > @@ -1519,6 +1519,16 @@ vect_analyze_loop_form (class loop *loop, 
> > vec_info_shared *shared)
> >stmt_vec_info inner_loop_cond_info
> > = loop_vinfo->lookup_stmt (inner_loop_cond);
> >STMT_VINFO_TYPE (inner_loop_cond_info) = 
> > loop_exit_ctrl_vec_info_type;
> > +  /* If we have an estimate on the number of iterations of the 

[PATCH] Fix a few problems with download_prerequisites.

2021-08-24 Thread apinski--- via Gcc-patches
From: Andrew Pinski 

There are a few problems with download_prerequisites, described in
PR 82704.  The first is that on busybox versions of shasum and md5sum
the extended option --check doesn't exist, so just use -c.  The second
issue is that the code choosing which shasum program to use is included
twice and the two copies differ, so move the choice of checksum program
to after argument parsing.  The last issue is that the --md5 option has
been broken for some time now, as the program is named md5sum and not
just md5; nobody updated the switch table to be correct.

contrib/ChangeLog:

PR other/82704
* download_prerequisites: Fix issues with --md5 and
--sha512 options.
---
 contrib/download_prerequisites | 59 +-
 1 file changed, 30 insertions(+), 29 deletions(-)

diff --git a/contrib/download_prerequisites b/contrib/download_prerequisites
index 51e715f..8f69b61 100755
--- a/contrib/download_prerequisites
+++ b/contrib/download_prerequisites
@@ -46,18 +46,6 @@ verify=1
 force=0
 OS=$(uname)
 
-case $OS in
-  "Darwin"|"FreeBSD"|"DragonFly"|"AIX")
-chksum='shasum -a 512 --check'
-  ;;
-  "OpenBSD")
-chksum='sha512 -c'
-  ;;
-  *)
-chksum='sha512sum -c'
-  ;;
-esac
-
 if type wget > /dev/null ; then
   fetch='wget'
 else
@@ -113,7 +101,7 @@ do
 done
 unset arg
 
-# Emulate Linux's 'md5 --check' on macOS
+# Emulate Linux's 'md5sum --check' on macOS
 md5_check() {
   # Store the standard input: a line from contrib/prerequisites.md5:
   md5_checksum_line=$(cat -)
@@ -162,26 +150,10 @@ do
 verify=0
 ;;
 --sha512)
-case $OS in
-  "Darwin")
-chksum='shasum -a 512 --check'
-  ;;
-  *)
-chksum='sha512sum --check'
-  ;;
-esac
 chksum_extension='sha512'
 verify=1
 ;;
 --md5)
-case $OS in
-  "Darwin")
-chksum='md5_check'
-  ;;
-  *)
-chksum='md5 --check'
-  ;;
-esac
 chksum_extension='md5'
 verify=1
 ;;
@@ -212,6 +184,35 @@ done
 [ "x${argnext}" = x ] || die "Missing argument for option --${argnext}"
 unset arg argnext
 
+case $chksum_extension in
+  sha512)
+case $OS in
+  "Darwin"|"FreeBSD"|"DragonFly"|"AIX")
+chksum='shasum -a 512 --check'
+  ;;
+  "OpenBSD")
+chksum='sha512 -c'
+  ;;
+  *)
+chksum='sha512sum -c'
+  ;;
+esac
+  ;;
+  md5)
+case $OS in
+  "Darwin")
+chksum='md5_check'
+  ;;
+  *)
+chksum='md5sum -c'
+  ;;
+esac
+;;
+  *)
+die "Unkown checksum $chksum_extension"
+  ;;
+esac
+
 [ -e ./gcc/BASE-VER ] \
 || die "You must run this script in the top-level GCC source directory"
 
-- 
1.8.3.1