PING^3 [PATCH] rs6000: Remove builtin mask check from builtin_decl [PR102347]

2021-11-21 Thread Kewen.Lin via Gcc-patches
Hi,

As shown by the discussion and testing results in the main thread, this
patch should be safe.

Ping for this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580357.html

BR,
Kewen


>> on 2021/9/28 4:13 PM, Kewen.Lin via Gcc-patches wrote:
>>> Hi,
>>>
>>> As discussed in PR102347, builtin_decl is currently invoked very
>>> early, when the function_decl for builtin functions is made up;
>>> at that time rs6000_builtin_mask can be wrong for those builtins
>>> sitting in #pragma/attribute target functions, though it will be
>>> updated properly later when LTO processes all nodes.
>>>
>>> This patch is to align with the practice i386 port adopts, also
>>> align with r10-7462 by relaxing builtin mask checking in some places.
>>>
>>> Bootstrapped and regress-tested on powerpc64le-linux-gnu P9 and
>>> powerpc64-linux-gnu P8.
>>>
>>> Is it ok for trunk?
>>>
>>> BR,
>>> Kewen
>>> -
>>> gcc/ChangeLog:
>>>
>>> PR target/102347
>>> * config/rs6000/rs6000-call.c (rs6000_builtin_decl): Remove builtin
>>> mask check.
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>> PR target/102347
>>> * gcc.target/powerpc/pr102347.c: New test.
>>>
>>> ---
>>>  gcc/config/rs6000/rs6000-call.c | 14 --
>>>  gcc/testsuite/gcc.target/powerpc/pr102347.c | 15 +++
>>>  2 files changed, 19 insertions(+), 10 deletions(-)
>>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr102347.c
>>>
>>> diff --git a/gcc/config/rs6000/rs6000-call.c b/gcc/config/rs6000/rs6000-call.c
>>> index fd7f24da818..15e0e09c07d 100644
>>> --- a/gcc/config/rs6000/rs6000-call.c
>>> +++ b/gcc/config/rs6000/rs6000-call.c
>>> @@ -13775,23 +13775,17 @@ rs6000_init_builtins (void)
>>>  }
>>>  }
>>>
>>> -/* Returns the rs6000 builtin decl for CODE.  */
>>> +/* Returns the rs6000 builtin decl for CODE.  Note that we don't check
>>> +   the builtin mask here since there could be some #pragma/attribute
>>> +   target functions and the rs6000_builtin_mask could be wrong when
>>> +   this checking happens, though it will be updated properly later.  */
>>>
>>>  tree
>>>  rs6000_builtin_decl (unsigned code, bool initialize_p ATTRIBUTE_UNUSED)
>>>  {
>>> -  HOST_WIDE_INT fnmask;
>>> -
>>>if (code >= RS6000_BUILTIN_COUNT)
>>>  return error_mark_node;
>>>
>>> -  fnmask = rs6000_builtin_info[code].mask;
>>> -  if ((fnmask & rs6000_builtin_mask) != fnmask)
>>> -{
>>> -  rs6000_invalid_builtin ((enum rs6000_builtins)code);
>>> -  return error_mark_node;
>>> -}
>>> -
>>>return rs6000_builtin_decls[code];
>>>  }
>>>
>>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr102347.c b/gcc/testsuite/gcc.target/powerpc/pr102347.c
>>> new file mode 100644
>>> index 000..05c439a8dac
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/powerpc/pr102347.c
>>> @@ -0,0 +1,15 @@
>>> +/* { dg-do link } */
>>> +/* { dg-require-effective-target power10_ok } */
>>> +/* { dg-require-effective-target lto } */
>>> +/* { dg-options "-flto -mdejagnu-cpu=power9" } */
>>> +
>>> +/* Verify there are no error messages in LTO mode.  */
>>> +
>>> +#pragma GCC target "cpu=power10"
>>> +int main ()
>>> +{
>>> +  float *b;
>>> +  __vector_quad c;
>>> +  __builtin_mma_disassemble_acc (b, &c);
>>> +  return 0;
>>> +}
>>> --
>>> 2.27.0
>>>
>>



PING^6 [PATCH] rs6000: Fix some issues in rs6000_can_inline_p [PR102059]

2021-11-21 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this patch:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578552.html

One related patch [1] is ready to commit; its test cases rely on
this patch unless they are changed.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579658.html

BR,
Kewen

>>>>> on 2021/9/1 2:55 PM, Kewen.Lin via Gcc-patches wrote:
>>>>>> Hi!
>>>>>>
>>>>>> This patch is to fix the inconsistent behavior between non-LTO
>>>>>> and LTO modes.  As Martin pointed out, the function
>>>>>> rs6000_can_inline_p currently makes the callee inlinable if
>>>>>> callee_tree is NULL, but that is wrong; we should use the command
>>>>>> line options from target_option_default_node as the default.  It
>>>>>> also replaces rs6000_isa_flags with the flags from
>>>>>> target_option_default_node when caller_tree is NULL, as
>>>>>> rs6000_isa_flags could have changed since initialization.
>>>>>>
>>>>>> It also extends the scope of the check for the case where the
>>>>>> callee has explicitly set options; for test case pr102059-2.c,
>>>>>> inlining could previously happen unexpectedly, which is fixed
>>>>>> accordingly.
>>>>>>
>>>>>> As Richi/Mike pointed out, some tuning flags like MASK_P8_FUSION
>>>>>> can be neglected for inlining, so this patch also excludes them
>>>>>> when the callee is attributed with always_inline.
>>>>>>
>>>>>> Bootstrapped and regtested on powerpc64le-linux-gnu Power9.
>>>>>>
>>>>>> BR,
>>>>>> Kewen
>>>>>> -
>>>>>> gcc/ChangeLog:
>>>>>>
>>>>>>  PR ipa/102059
>>>>>>  * config/rs6000/rs6000.c (rs6000_can_inline_p): Adjust with
>>>>>>  target_option_default_node and consider always_inline_safe flags.
>>>>>>
>>>>>> gcc/testsuite/ChangeLog:
>>>>>>
>>>>>>  PR ipa/102059
>>>>>>  * gcc.target/powerpc/pr102059-1.c: New test.
>>>>>>  * gcc.target/powerpc/pr102059-2.c: New test.
>>>>>>  * gcc.target/powerpc/pr102059-3.c: New test.
>>>>>>  * gcc.target/powerpc/pr102059-4.c: New test.
>>>>>>
>>>>>



PING^4 [PATCH v2] rs6000: Modify the way for extra penalized cost

2021-11-21 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580358.html

BR,
Kewen

>>> on 2021/9/28 4:16 PM, Kewen.Lin via Gcc-patches wrote:
>>>> Hi,
>>>>
>>>> This patch follows the discussions here[1][2], where Segher
>>>> pointed out that the existing way of guarding the extra
>>>> penalized cost for strided/elementwise loads with a magic
>>>> bound does not scale.
>>>>
>>>> Computing the penalty as nunits * stmt_cost can produce a
>>>> much-exaggerated penalized cost, e.g. for V16QI on P8 it is
>>>> 16 * 20 = 320, which is why a bound was needed.  To make it
>>>> better and more readable, the penalized cost is simplified
>>>> as:
>>>>
>>>> unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
>>>> unsigned extra_cost = nunits * adjusted_cost;
>>>>
>>>> For V2DI/V2DF it uses a penalized cost of 2 for each scalar
>>>> load, while for the other modes it uses 1.  This is mainly
>>>> concluded from performance evaluations.  One possibly related
>>>> factor: the more units a vector construction has, the more
>>>> instructions are used, and there are more chances to schedule
>>>> them well (they may even run in parallel when enough units are
>>>> available), so it seems reasonable not to penalize them
>>>> further.
>>>>
>>>> The SPEC2017 evaluations on Power8/Power9/Power10 at option
>>>> sets O2-vect and Ofast-unroll show this change is neutral.
>>>>
>>>> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>>>>
>>>> Is it ok for trunk?
>>>>
>>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
>>>> [2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580099.html
>>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579529.html
>>>>
>>>> BR,
>>>> Kewen
>>>> -
>>>> gcc/ChangeLog:
>>>>
>>>>* config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
>>>>the way to compute extra penalized cost.  Remove useless parameter.
>>>>(rs6000_add_stmt_cost): Adjust the call to function
>>>>rs6000_update_target_cost_per_stmt.
>>>>
>>>>
>>>> ---
>>>>  gcc/config/rs6000/rs6000.c | 31 ++-
>>>>  1 file changed, 18 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>>>> index dd42b0964f1..8200e1152c2 100644
>>>> --- a/gcc/config/rs6000/rs6000.c
>>>> +++ b/gcc/config/rs6000/rs6000.c
>>>> @@ -5422,7 +5422,6 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>>>enum vect_cost_for_stmt kind,
>>>>struct _stmt_vec_info *stmt_info,
>>>>enum vect_cost_model_location where,
>>>> -  int stmt_cost,
>>>>unsigned int orig_count)
>>>>  {
>>>>
>>>> @@ -5462,17 +5461,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>>>{
>>>>  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>>>>  unsigned int nunits = vect_nunits_for_cost (vectype);
>>>> -unsigned int extra_cost = nunits * stmt_cost;
>>>> -/* As function rs6000_builtin_vectorization_cost shows, we have
>>>> -   priced much on V16QI/V8HI vector construction as their units,
>>>> -   if we penalize them with nunits * stmt_cost, it can result in
>>>> -   an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>>>> -   is 20 and nunits is 16, the extra cost is 320 which looks
>>>> -   much exaggerated.  So let's use one maximum bound for the
>>>> -   extra penalized cost for vector construction here.  */
>>>> -const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>>>> -if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>>>> -  extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>>>> +/* Don't expect strided/elementwise loads for just 1 nunit.  */
>>>> +gcc_assert (nunits > 1);
>>>> +/* i386 port adopts nunits * stmt_cost as the penalized cost
>>>> +   for this kind of penal

PING^7 [PATCH v2] combine: Tweak the condition of last_set invalidation

2021-11-21 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572555.html

BR,
Kewen

>>>>>> on 2021/6/11 9:16 PM, Kewen.Lin via Gcc-patches wrote:
>>>>>>> Hi Segher,
>>>>>>>
>>>>>>> Thanks for the review!
>>>>>>>
>>>>>>> on 2021/6/10 4:17 AM, Segher Boessenkool wrote:
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> On Wed, Dec 16, 2020 at 04:49:49PM +0800, Kewen.Lin wrote:
>>>>>>>>> Currently we have the check:
>>>>>>>>>
>>>>>>>>>   if (!insn
>>>>>>>>>       || (value && rsp->last_set_table_tick >= label_tick_ebb_start))
>>>>>>>>>     rsp->last_set_invalid = 1;
>>>>>>>>>
>>>>>>>>> which means that if we want to record some value for some reg
>>>>>>>>> and this reg was referred to before in a valid scope,
>>>>>>>>
>>>>>>>> If we already know it is *set* in this same extended basic block.
>>>>>>>> Possibly by the same instruction btw.
>>>>>>>>
>>>>>>>>> we invalidate the set of the reg (set last_set_invalid to 1).
>>>>>>>>> This avoids finding the wrong set for a reg reference, as in a
>>>>>>>>> case like:
>>>>>>>>>
>>>>>>>>>... op regX  // this regX could find wrong last_set below
>>>>>>>>>regX = ...   // if we think this set is valid
>>>>>>>>>... op regX
>>>>>>>>
>>>>>>>> Yup, exactly.
>>>>>>>>
>>>>>>>>> But because retries exist, last_set_table_tick could have been
>>>>>>>>> set by some later reference insn, yet we see it as set when
>>>>>>>>> retrying the set insn (for that reg) again, such as:
>>>>>>>>>
>>>>>>>>>insn 1
>>>>>>>>>insn 2
>>>>>>>>>
>>>>>>>>>regX = ... --> (a)
>>>>>>>>>... op regX--> (b)
>>>>>>>>>
>>>>>>>>>insn 3
>>>>>>>>>
>>>>>>>>>// assume all in the same BB.
>>>>>>>>>
>>>>>>>>> Assuming we combine 1, 2 -> 3 successfully and replace them
>>>>>>>>> with two (3 insns -> 2 insns),
>>>>>>>>
>>>>>>>> This will delete insn 1 and write the combined result to insns 2 and 3.
>>>>>>>>
>>>>>>>>> retrying from insn1 or insn2 again:
>>>>>>>>
>>>>>>>> Always 2, but your point remains valid.
>>>>>>>>
>>>>>>>>> it will scan insn (a) again, the below condition holds for regX:
>>>>>>>>>
>>>>>>>>>   (value && rsp->last_set_table_tick >= label_tick_ebb_start)
>>>>>>>>>
>>>>>>>>> and it will mark this set as invalid.  But actually the
>>>>>>>>> last_set_table_tick here was set by insn (b) before retrying,
>>>>>>>>> so it should be safe to treat the set as valid.
>>>>>>>>
>>>>>>>> Yup.
>>>>>>>>
>>>>>>>>> This proposal is to check whether the last_set_table reference
>>>>>>>>> safely happens after the current set, and keep the set valid
>>>>>>>>> if so.
>>>>>>>>
>>>>>>>>> A full SPEC2017 build shows this patch increases successful
>>>>>>>>> combines from 1902208 to 1902243 (trivial though).
>>>>>>>>
>>>>>>>> Do you have some example, or maybe even a testcase?  :-)
>>>>>>>>
>>>>>>>
>>>>>>> Sorry for the late reply, it took some time to get one reduced case.
>>>>>>>
>>>>>>> typedef struct SA *pa_t;
>>>>>>>
>>>>>>> struct SC {
>>>>>>>   int h;
>>>>>>>

Re: [PATCH 07/15] xtensa: Fix non-robust split condition in define_insn_and_split

2021-11-21 Thread Kewen.Lin via Gcc-patches
on 2021/11/20 6:26 AM, augustine.sterl...@gmail.com wrote:
> On Thu, Nov 11, 2021 at 3:25 AM Kewen Lin  wrote:
>> gcc/ChangeLog:
>>
>> * config/xtensa/xtensa.md (movdi_internal, movdf_internal): Fix split
>> condition.
> 
> I had been hoping Max would reply (as I'm just doing legacy work
> around this these days), but seeing that he hasn't. This is approved.
> 

Thanks Augustine!  Committed as r12-5444.

BR,
Kewen


Re: [PATCH 12/15] i386: Fix non-robust split condition in define_insn_and_split

2021-11-17 Thread Kewen.Lin via Gcc-patches
Hi Uros,

on 2021/11/17 3:13 PM, Uros Bizjak wrote:
> On Thu, Nov 11, 2021 at 12:25 PM Kewen Lin  wrote:
>>
>> This patch fixes some non-robust split conditions in some
>> define_insn_and_splits, making each of them apply on top of the
>> corresponding condition of the define_insn part; otherwise the
>> splitting could be performed unexpectedly.
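The problem can be seen in a stripped-down, hypothetical define_insn_and_split (not taken from any port; the pattern and predicate names are made up).  Without a leading "&&", the split condition stands entirely on its own instead of extending the insn condition:

```
;; Non-robust: the bare split condition "reload_completed" is used by
;; itself, so the splitter may fire even for insns that the insn
;; condition would no longer accept.
(define_insn_and_split "*example_add"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (plus:SI (match_operand:SI 1 "register_operand" "r")
                 (match_operand:SI 2 "register_operand" "r")))]
  "example_insn_condition_p ()"
  "#"
  "reload_completed"              ;; non-robust
  [(set (match_dup 0) (match_dup 1))]
  "")

;; Robust: with "&& reload_completed" the effective split condition
;; becomes "example_insn_condition_p () && reload_completed".
```

This is exactly the shape of the one-character changes in the diff below: each bare split condition gains a leading "&&".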
>>
>> gcc/ChangeLog:
>>
>> * config/i386/i386.md (*add3_doubleword, *addv4_doubleword,
>> *addv4_doubleword_1, *sub3_doubleword,
>> *subv4_doubleword, *subv4_doubleword_1,
>> *add3_doubleword_cc_overflow_1, *divmodsi4_const,
>> *neg2_doubleword, *tls_dynamic_gnu2_combine_64_): Fix split condition.
> 
> OK.
> 

Thanks!  Committed as r12-5334.

BR,
Kewen

> Thanks,
> Uros.
> 
>> ---
>>  gcc/config/i386/i386.md | 20 ++--
>>  1 file changed, 10 insertions(+), 10 deletions(-)
>>
>> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
>> index 6eb9de81921..2bd09e502ae 100644
>> --- a/gcc/config/i386/i386.md
>> +++ b/gcc/config/i386/i386.md
>> @@ -5491,7 +5491,7 @@ (define_insn_and_split "*add3_doubleword"
>> (clobber (reg:CC FLAGS_REG))]
>>"ix86_binary_operator_ok (PLUS, mode, operands)"
>>"#"
>> -  "reload_completed"
>> +  "&& reload_completed"
>>[(parallel [(set (reg:CCC FLAGS_REG)
>>(compare:CCC
>>  (plus:DWIH (match_dup 1) (match_dup 2))
>> @@ -6300,7 +6300,7 @@ (define_insn_and_split "*addv4_doubleword"
>> (plus: (match_dup 1) (match_dup 2)))]
>>"ix86_binary_operator_ok (PLUS, mode, operands)"
>>"#"
>> -  "reload_completed"
>> +  "&& reload_completed"
>>[(parallel [(set (reg:CCC FLAGS_REG)
>>(compare:CCC
>>  (plus:DWIH (match_dup 1) (match_dup 2))
>> @@ -6347,7 +6347,7 @@ (define_insn_and_split "*addv4_doubleword_1"
>> && CONST_SCALAR_INT_P (operands[2])
>> && rtx_equal_p (operands[2], operands[3])"
>>"#"
>> -  "reload_completed"
>> +  "&& reload_completed"
>>[(parallel [(set (reg:CCC FLAGS_REG)
>>(compare:CCC
>>  (plus:DWIH (match_dup 1) (match_dup 2))
>> @@ -6641,7 +6641,7 @@ (define_insn_and_split "*sub3_doubleword"
>> (clobber (reg:CC FLAGS_REG))]
>>"ix86_binary_operator_ok (MINUS, mode, operands)"
>>"#"
>> -  "reload_completed"
>> +  "&& reload_completed"
>>[(parallel [(set (reg:CC FLAGS_REG)
>>(compare:CC (match_dup 1) (match_dup 2)))
>>   (set (match_dup 0)
>> @@ -6817,7 +6817,7 @@ (define_insn_and_split "*subv4_doubleword"
>> (minus: (match_dup 1) (match_dup 2)))]
>>"ix86_binary_operator_ok (MINUS, mode, operands)"
>>"#"
>> -  "reload_completed"
>> +  "&& reload_completed"
>>[(parallel [(set (reg:CC FLAGS_REG)
>>(compare:CC (match_dup 1) (match_dup 2)))
>>   (set (match_dup 0)
>> @@ -6862,7 +6862,7 @@ (define_insn_and_split "*subv4_doubleword_1"
>> && CONST_SCALAR_INT_P (operands[2])
>> && rtx_equal_p (operands[2], operands[3])"
>>"#"
>> -  "reload_completed"
>> +  "&& reload_completed"
>>[(parallel [(set (reg:CC FLAGS_REG)
>>(compare:CC (match_dup 1) (match_dup 2)))
>>   (set (match_dup 0)
>> @@ -7542,7 +7542,7 @@ (define_insn_and_split "*add3_doubleword_cc_overflow_1"
>> (plus: (match_dup 1) (match_dup 2)))]
>>"ix86_binary_operator_ok (PLUS, mode, operands)"
>>"#"
>> -  "reload_completed"
>> +  "&& reload_completed"
>>[(parallel [(set (reg:CCC FLAGS_REG)
>>(compare:CCC
>>  (plus:DWIH (match_dup 1) (match_dup 2))
>> @@ -9000,7 +9000,7 @@ (define_insn_and_split "*divmodsi4_const"
>> (clobber (reg:CC FLAGS_REG))]
>>"!optimize_function_for_size_p (cfun)"
>>"#"
>> -  "reload_completed"
>> +  "&& reload_completed"
>>[(set (match_dup 0) (match_dup 2))
>> (set (match_dup 1) (match_dup 4))
>> (parallel [(set (match_dup 0)
>> @@ -10515,7 +10515,7 @@ (define_insn_and_split "*neg2_doubleword"
>> (clobber (reg:CC FLAGS_REG))]
>>"ix86_unary_operator_ok (NEG, mode, operands)"
>>"#"
>> -  "reload_completed"
>> +  "&& reload_completed"
>>[(parallel
>>  [(set (reg:CCC FLAGS_REG)
>>   (ne:CCC (match_dup 1) (const_int 0)))
>> @@ -16898,7 +16898,7 @@ (define_insn_and_split "*tls_dynamic_gnu2_combine_64_"
>> (clobber (reg:CC FLAGS_REG))]
>>"TARGET_64BIT && TARGET_GNU2_TLS"
>>"#"
>> -  ""
>> +  "&& 1"
>>[(set (match_dup 0) (match_dup 4))]
>>  {
>>    operands[4] = can_create_pseudo_p () ? gen_reg_rtx (ptr_mode) : operands[0];
>> --
>> 2.27.0
>>



Re: [PATCH 06/15] visium: Fix non-robust split condition in define_insn_and_split

2021-11-16 Thread Kewen.Lin via Gcc-patches
Hi Eric,

on 2021/11/17 12:57 AM, Eric Botcazou wrote:
>> gcc/ChangeLog:
>>
>>  * config/visium/visium.md (*add3_insn, *addsi3_insn, *addi3_insn,
>>  *sub3_insn, *subsi3_insn, *subdi3_insn, *neg2_insn,
>>  *negdi2_insn, *and3_insn, *ior3_insn, *xor3_insn,
>>  *one_cmpl2_insn, *ashl3_insn, *ashr3_insn,
>>  *lshr3_insn, *trunchiqi2_insn, *truncsihi2_insn,
>>  *truncdisi2_insn, *extendqihi2_insn, *extendqisi2_insn,
>>  *extendhisi2_insn, *extendsidi2_insn, *zero_extendqihi2_insn,
>>  *zero_extendqisi2_insn, *zero_extendsidi2_insn): Fix split condition.
> 
> OK for mainline, thanks.
> 

Thanks!  Committed as r12-5332.

BR,
Kewen


Re: [PATCH] rs6000/doc: Rename future cpu with power10

2021-11-11 Thread Kewen.Lin via Gcc-patches
on 2021/11/10 6:03 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Nov 10, 2021 at 05:39:27PM +0800, Kewen.Lin wrote:
>> @@ -27779,10 +27779,10 @@ Enable/disable the @var{__float128} keyword for IEEE 128-bit floating point
>>  and use either software emulation for IEEE 128-bit floating point or
>>  hardware instructions.
>>
>> -The VSX instruction set (@option{-mvsx}, @option{-mcpu=power7},
>> -@option{-mcpu=power8}), or @option{-mcpu=power9} must be enabled to
>> -use the IEEE 128-bit floating point support.  The IEEE 128-bit
>> -floating point support only works on PowerPC Linux systems.
>> +The VSX instruction set (@option{-mvsx}, @option{-mcpu=power7} (or later
>> +@var{cpu_type})) must be enabled to use the IEEE 128-bit floating point
>> +support.  The IEEE 128-bit floating point support only works on PowerPC
>> +Linux systems.
> 
> I'd just say -mvsx.  This is default on for -mcpu=power7 and later, and
> cannot be enabled elsewhere, but that is beside the point.
> 
> If you say more than the essentials here it becomes harder to read
> (simply because there is more to read then), harder to find what you
> are looking for, and harder to keep it updated if things change (like
> what this patch is for :-) )
> 
> The part about "works only on Linux" isn't quite true.  "Is only
> supported on Linux" is a bit better.
> 
>>  Generate (do not generate) addressing modes using prefixed load and
>> -store instructions when the option @option{-mcpu=future} is used.
>> +store instructions.  The @option{-mprefixed} option requires that
>> +the option @option{-mcpu=power10} (or later @var{cpu_type}) is enabled.
> 
> Just "or later" please.  The "CPU_TYPE" thing is local to the -mcpu=
> description, let's not refer to it from elsewhere.
> 
>>  @item -mmma
>>  @itemx -mno-mma
>>  @opindex mmma
>>  @opindex mno-mma
>> -Generate (do not generate) the MMA instructions when the option
>> -@option{-mcpu=future} is used.
>> +Generate (do not generate) the MMA instructions.  The @option{-mma}
>> +option requires that the option @option{-mcpu=power10} (or later
>> +@var{cpu_type}) is enabled.
> 
> (once more)
> 
> Okay for trunk with those changes.  Thanks!
> 
> 

Thanks!  All comments are addressed and committed as r12-5143.

BR,
Kewen

> Segher
> 





Re: [PATCH] rs6000/doc: Rename future cpu with power10

2021-11-10 Thread Kewen.Lin via Gcc-patches
Hi Segher,

on 2021/11/10 4:52 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Nov 10, 2021 at 01:41:25PM +0800, Kewen.Lin wrote:
>> Commit 5d9d0c94588 renamed future to power10 and ace60939fd2
>> updated the documentation for "future" renaming.  This patch
>> is to rename the remaining "future architecture" references in
>> documentation.
> 
> Good find :-)
> 
>> @@ -28613,7 +28613,7 @@ the offset with a symbol reference to a canary in 
>> the TLS block.
>>  @opindex mpcrel
>>  @opindex mno-pcrel
>>  Generate (do not generate) pc-relative addressing when the option
>> -@option{-mcpu=future} is used.  The @option{-mpcrel} option requires
>> +@option{-mcpu=power10} is used.  The @option{-mpcrel} option requires
>>  that the medium code model (@option{-mcmodel=medium}) and prefixed
>>  addressing (@option{-mprefixed}) options are enabled.
> 
> It still sounds strange, and factually incorrect really: the -mpcrel
> option says to use pc-relative processing, no matter if -mcpu=power10 is
> used or not.  For example, it will work fine with later CPUs as well.
> 

Good point!  The comment also applies to mma, prefixed and float128.

> So maybe this should just delete from after "addressing" to the end of
> that line?  It already says what the prerequisites are, on the very next
> line :-)
> 

Thanks for the suggestion.  The updated version is inlined below.
I am not sure whether the update for float128 looks good enough to you.

Could you please have a look again?

BR,
Kewen
-
gcc/ChangeLog:

* doc/invoke.texi: Change references to "future cpu" to "power10",
"-mcpu=future" to "-mcpu=power10".  Adjust with "later cpu_type".

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 2ea23d07c4c..aa0a20924bf 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -27597,7 +27597,7 @@ Supported values for @var{cpu_type} are @samp{401}, @samp{403},
 @samp{e6500}, @samp{ec603e}, @samp{G3}, @samp{G4}, @samp{G5},
 @samp{titan}, @samp{power3}, @samp{power4}, @samp{power5}, @samp{power5+},
 @samp{power6}, @samp{power6x}, @samp{power7}, @samp{power8},
-@samp{power9}, @samp{future}, @samp{powerpc}, @samp{powerpc64},
+@samp{power9}, @samp{power10}, @samp{powerpc}, @samp{powerpc64},
 @samp{powerpc64le}, @samp{rs64}, and @samp{native}.

 @option{-mcpu=powerpc}, @option{-mcpu=powerpc64}, and
@@ -27779,10 +27779,10 @@ Enable/disable the @var{__float128} keyword for IEEE 128-bit floating point
 and use either software emulation for IEEE 128-bit floating point or
 hardware instructions.

-The VSX instruction set (@option{-mvsx}, @option{-mcpu=power7},
-@option{-mcpu=power8}), or @option{-mcpu=power9} must be enabled to
-use the IEEE 128-bit floating point support.  The IEEE 128-bit
-floating point support only works on PowerPC Linux systems.
+The VSX instruction set (@option{-mvsx}, @option{-mcpu=power7} (or later
+@var{cpu_type})) must be enabled to use the IEEE 128-bit floating point
+support.  The IEEE 128-bit floating point support only works on PowerPC
+Linux systems.

 The default for @option{-mfloat128} is enabled on PowerPC Linux
 systems using the VSX instruction set, and disabled on other systems.
@@ -28612,24 +28612,25 @@ the offset with a symbol reference to a canary in the TLS block.
 @itemx -mno-pcrel
 @opindex mpcrel
 @opindex mno-pcrel
-Generate (do not generate) pc-relative addressing when the option
-@option{-mcpu=future} is used.  The @option{-mpcrel} option requires
-that the medium code model (@option{-mcmodel=medium}) and prefixed
-addressing (@option{-mprefixed}) options are enabled.
+Generate (do not generate) pc-relative addressing.  The @option{-mpcrel}
+option requires that the medium code model (@option{-mcmodel=medium})
+and prefixed addressing (@option{-mprefixed}) options are enabled.

 @item -mprefixed
 @itemx -mno-prefixed
 @opindex mprefixed
 @opindex mno-prefixed
 Generate (do not generate) addressing modes using prefixed load and
-store instructions when the option @option{-mcpu=future} is used.
+store instructions.  The @option{-mprefixed} option requires that
+the option @option{-mcpu=power10} (or later @var{cpu_type}) is enabled.

 @item -mmma
 @itemx -mno-mma
 @opindex mmma
 @opindex mno-mma
-Generate (do not generate) the MMA instructions when the option
-@option{-mcpu=future} is used.
+Generate (do not generate) the MMA instructions.  The @option{-mma}
+option requires that the option @option{-mcpu=power10} (or later
+@var{cpu_type}) is enabled.

 @item -mrop-protect
 @itemx -mno-rop-protect



[PATCH] rs6000/doc: Rename future cpu with power10

2021-11-09 Thread Kewen.Lin via Gcc-patches
Hi,

Commit 5d9d0c94588 renamed future to power10 and ace60939fd2
updated the documentation for "future" renaming.  This patch
is to rename the remaining "future architecture" references in
documentation.

Is it ok for trunk?

BR,
Kewen
-
gcc/ChangeLog:

* doc/invoke.texi: Change references to "future cpu" to "power10",
"-mcpu=future" to "-mcpu=power10".

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 2ea23d07c4c..7b23165c477 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -27597,7 +27597,7 @@ Supported values for @var{cpu_type} are @samp{401}, @samp{403},
 @samp{e6500}, @samp{ec603e}, @samp{G3}, @samp{G4}, @samp{G5},
 @samp{titan}, @samp{power3}, @samp{power4}, @samp{power5}, @samp{power5+},
 @samp{power6}, @samp{power6x}, @samp{power7}, @samp{power8},
-@samp{power9}, @samp{future}, @samp{powerpc}, @samp{powerpc64},
+@samp{power9}, @samp{power10}, @samp{powerpc}, @samp{powerpc64},
 @samp{powerpc64le}, @samp{rs64}, and @samp{native}.

 @option{-mcpu=powerpc}, @option{-mcpu=powerpc64}, and
@@ -28613,7 +28613,7 @@ the offset with a symbol reference to a canary in the TLS block.
 @opindex mpcrel
 @opindex mno-pcrel
 Generate (do not generate) pc-relative addressing when the option
-@option{-mcpu=future} is used.  The @option{-mpcrel} option requires
+@option{-mcpu=power10} is used.  The @option{-mpcrel} option requires
 that the medium code model (@option{-mcmodel=medium}) and prefixed
 addressing (@option{-mprefixed}) options are enabled.

@@ -28622,14 +28622,14 @@ addressing (@option{-mprefixed}) options are enabled.
 @opindex mprefixed
 @opindex mno-prefixed
 Generate (do not generate) addressing modes using prefixed load and
-store instructions when the option @option{-mcpu=future} is used.
+store instructions when the option @option{-mcpu=power10} is used.

 @item -mmma
 @itemx -mno-mma
 @opindex mmma
 @opindex mno-mma
 Generate (do not generate) the MMA instructions when the option
-@option{-mcpu=future} is used.
+@option{-mcpu=power10} is used.

 @item -mrop-protect
 @itemx -mno-rop-protect



Re: Values of WIDE_INT_MAX_ELTS in gcc11 and gcc12 are different

2021-11-04 Thread Kewen.Lin via Gcc-patches
Hi Qing,

on 2021/11/5 4:37 AM, Qing Zhao via Gcc-patches wrote:
> Hi,
> 
> I noticed that the macro “WIDE_INT_MAX_ELTS” has different values in GCC11 
> and GCC12 (on the same X86 machine)
> 
> For gcc11:
> 
> wide int max elts =3
> 
> For gcc12:
> 
> wide int max elts =9
> 
> Does anyone know what’s the reason for this difference? 
> 

I guess it's due to commit r12-979 (782e57f2c09).

For

  #define WIDE_INT_MAX_ELTS \
    ((MAX_BITSIZE_MODE_ANY_INT + HOST_BITS_PER_WIDE_INT) \
     / HOST_BITS_PER_WIDE_INT)

Before the change, the MAX_BITSIZE_MODE_ANY_INT is explicitly set as 160.

  -#define MAX_BITSIZE_MODE_ANY_INT (160)

  it's (160+64)/64 = 3

After the change, MAX_BITSIZE_MODE_ANY_INT is computed in the function
emit_max_int and becomes 512.

  it's (512+64)/64 = 9

As the commit log says, the previous 160 bits seems to have been a
workaround for a problem that is now gone; the commit switches to the
default computation to align with the other ports.

BR,
Kewen

> Thanks a lot for any help.
> 
> Qing
>


PING^2 [PATCH] rs6000: Remove builtin mask check from builtin_decl [PR102347]

2021-11-04 Thread Kewen.Lin via Gcc-patches
Hi,

As shown by the discussion and testing results in the main thread, this
patch should be safe.

Ping for this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580357.html

BR,
Kewen

> 
> on 2021/9/28 4:13 PM, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> As discussed in PR102347, builtin_decl is currently invoked very
>> early, when the function_decl for builtin functions is made up;
>> at that time rs6000_builtin_mask can be wrong for those builtins
>> sitting in #pragma/attribute target functions, though it will be
>> updated properly later when LTO processes all nodes.
>>
>> This patch is to align with the practice i386 port adopts, also
>> align with r10-7462 by relaxing builtin mask checking in some places.
>>
>> Bootstrapped and regress-tested on powerpc64le-linux-gnu P9 and
>> powerpc64-linux-gnu P8.
>>
>> Is it ok for trunk?
>>
>> BR,
>> Kewen
>> -
>> gcc/ChangeLog:
>>
>>  PR target/102347
>>  * config/rs6000/rs6000-call.c (rs6000_builtin_decl): Remove builtin
>>  mask check.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  PR target/102347
>>  * gcc.target/powerpc/pr102347.c: New test.
>>
>> ---
>>  gcc/config/rs6000/rs6000-call.c | 14 --
>>  gcc/testsuite/gcc.target/powerpc/pr102347.c | 15 +++
>>  2 files changed, 19 insertions(+), 10 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr102347.c
>>
>> diff --git a/gcc/config/rs6000/rs6000-call.c b/gcc/config/rs6000/rs6000-call.c
>> index fd7f24da818..15e0e09c07d 100644
>> --- a/gcc/config/rs6000/rs6000-call.c
>> +++ b/gcc/config/rs6000/rs6000-call.c
>> @@ -13775,23 +13775,17 @@ rs6000_init_builtins (void)
>>  }
>>  }
>>
>> -/* Returns the rs6000 builtin decl for CODE.  */
>> +/* Returns the rs6000 builtin decl for CODE.  Note that we don't check
>> +   the builtin mask here since there could be some #pragma/attribute
>> +   target functions and the rs6000_builtin_mask could be wrong when
>> +   this checking happens, though it will be updated properly later.  */
>>
>>  tree
>>  rs6000_builtin_decl (unsigned code, bool initialize_p ATTRIBUTE_UNUSED)
>>  {
>> -  HOST_WIDE_INT fnmask;
>> -
>>if (code >= RS6000_BUILTIN_COUNT)
>>  return error_mark_node;
>>
>> -  fnmask = rs6000_builtin_info[code].mask;
>> -  if ((fnmask & rs6000_builtin_mask) != fnmask)
>> -{
>> -  rs6000_invalid_builtin ((enum rs6000_builtins)code);
>> -  return error_mark_node;
>> -}
>> -
>>return rs6000_builtin_decls[code];
>>  }
>>
>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr102347.c b/gcc/testsuite/gcc.target/powerpc/pr102347.c
>> new file mode 100644
>> index 000..05c439a8dac
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr102347.c
>> @@ -0,0 +1,15 @@
>> +/* { dg-do link } */
>> +/* { dg-require-effective-target power10_ok } */
>> +/* { dg-require-effective-target lto } */
>> +/* { dg-options "-flto -mdejagnu-cpu=power9" } */
>> +
>> +/* Verify there are no error messages in LTO mode.  */
>> +
>> +#pragma GCC target "cpu=power10"
>> +int main ()
>> +{
>> +  float *b;
>> +  __vector_quad c;
>> +  __builtin_mma_disassemble_acc (b, &c);
>> +  return 0;
>> +}
>> --
>> 2.27.0
>>
> 


PING^5 [PATCH] rs6000: Fix some issues in rs6000_can_inline_p [PR102059]

2021-11-04 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this patch:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578552.html

One related patch [1] is ready to commit; its test cases rely on
this patch unless they are changed.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579658.html

BR,
Kewen

>>>> on 2021/9/1 2:55 PM, Kewen.Lin via Gcc-patches wrote:
>>>>> Hi!
>>>>>
>>>>> This patch is to fix the inconsistent behavior between non-LTO
>>>>> and LTO modes.  As Martin pointed out, the function
>>>>> rs6000_can_inline_p currently makes the callee inlinable if
>>>>> callee_tree is NULL, but that is wrong; we should use the command
>>>>> line options from target_option_default_node as the default.  It
>>>>> also replaces rs6000_isa_flags with the flags from
>>>>> target_option_default_node when caller_tree is NULL, as
>>>>> rs6000_isa_flags could have changed since initialization.
>>>>>
>>>>> It also extends the scope of the check for the case where the callee
>>>>> has explicitly set options; for test case pr102059-2.c, inlining could
>>>>> previously happen unexpectedly, and it's fixed accordingly.
>>>>>
>>>>> As Richi/Mike pointed out, some tuning flags like MASK_P8_FUSION
>>>>> can be neglected for inlining; this patch also excludes them when
>>>>> the callee is attributed with always_inline.
>>>>>
>>>>> Bootstrapped and regtested on powerpc64le-linux-gnu Power9.
>>>>>
>>>>> BR,
>>>>> Kewen
>>>>> -
>>>>> gcc/ChangeLog:
>>>>>
>>>>>   PR ipa/102059
>>>>>   * config/rs6000/rs6000.c (rs6000_can_inline_p): Adjust with
>>>>>   target_option_default_node and consider always_inline_safe flags.
>>>>>
>>>>> gcc/testsuite/ChangeLog:
>>>>>
>>>>>   PR ipa/102059
>>>>>   * gcc.target/powerpc/pr102059-1.c: New test.
>>>>>   * gcc.target/powerpc/pr102059-2.c: New test.
>>>>>   * gcc.target/powerpc/pr102059-3.c: New test.
>>>>>   * gcc.target/powerpc/pr102059-4.c: New test.
>>>>>
>>>>


PING^3 [PATCH v2] rs6000: Modify the way for extra penalized cost

2021-11-04 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580358.html

BR,
Kewen

>> on 2021/9/28 下午4:16, Kewen.Lin via Gcc-patches wrote:
>>> Hi,
>>>
>>> This patch follows the discussions here[1][2], where Segher
>>> pointed out that the existing way to guard the extra penalized
>>> cost for strided/elementwise loads with a magic bound does
>>> not scale.
>>>
>>> Computing nunits * stmt_cost can yield a much exaggerated
>>> penalized cost, for example: for V16QI on P8 it's
>>> 16 * 20 = 320, which is why we needed a bound.  To make it
>>> better and more readable, the penalized cost is simplified
>>> as:
>>>
>>> unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
>>> unsigned extra_cost = nunits * adjusted_cost;
>>>
>>> For V2DI/V2DF, it uses a penalized cost of 2 for each scalar load,
>>> while for the other modes it uses 1.  This is mainly concluded
>>> from the performance evaluations.  One related factor might be:
>>> the more units a vector construction has, the more instructions
>>> are used, and there are more chances to schedule them better
>>> (even run them in parallel when enough units are available), so
>>> it seems reasonable not to penalize them more.
>>>
>>> The SPEC2017 evaluations on Power8/Power9/Power10 at option
>>> sets O2-vect and Ofast-unroll show this change is neutral.
>>>
>>> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>>>
>>> Is it ok for trunk?
>>>
>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
>>> [2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580099.html
>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579529.html
>>>
>>> BR,
>>> Kewen
>>> -
>>> gcc/ChangeLog:
>>>
>>> * config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
>>> the way to compute extra penalized cost.  Remove useless parameter.
>>> (rs6000_add_stmt_cost): Adjust the call to function
>>> rs6000_update_target_cost_per_stmt.
>>>
>>>
>>> ---
>>>  gcc/config/rs6000/rs6000.c | 31 ++-
>>>  1 file changed, 18 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>>> index dd42b0964f1..8200e1152c2 100644
>>> --- a/gcc/config/rs6000/rs6000.c
>>> +++ b/gcc/config/rs6000/rs6000.c
>>> @@ -5422,7 +5422,6 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data 
>>> *data,
>>> enum vect_cost_for_stmt kind,
>>> struct _stmt_vec_info *stmt_info,
>>> enum vect_cost_model_location where,
>>> -   int stmt_cost,
>>> unsigned int orig_count)
>>>  {
>>>
>>> @@ -5462,17 +5461,23 @@ rs6000_update_target_cost_per_stmt 
>>> (rs6000_cost_data *data,
>>> {
>>>   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>>>   unsigned int nunits = vect_nunits_for_cost (vectype);
>>> - unsigned int extra_cost = nunits * stmt_cost;
>>> - /* As function rs6000_builtin_vectorization_cost shows, we have
>>> -priced much on V16QI/V8HI vector construction as their units,
>>> -if we penalize them with nunits * stmt_cost, it can result in
>>> -an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>>> -is 20 and nunits is 16, the extra cost is 320 which looks
>>> -much exaggerated.  So let's use one maximum bound for the
>>> -extra penalized cost for vector construction here.  */
>>> - const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>>> - if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>>> -   extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>>> + /* Don't expect strided/elementwise loads for just 1 nunit.  */
>>> + gcc_assert (nunits > 1);
>>> + /* i386 port adopts nunits * stmt_cost as the penalized cost
>>> +for this kind of penalization, we used to follow it but
>>> +found it could result in an unreliable body cost especially
>>> +for V16QI/V8HI modes.  To make it better, we choose this
>>> +new heuristic: for each scalar load, we use 2 as penalized
>>> +cost for

PING^6 [PATCH v2] combine: Tweak the condition of last_set invalidation

2021-11-04 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572555.html

BR,
Kewen

>>>>> on 2021/6/11 下午9:16, Kewen.Lin via Gcc-patches wrote:
>>>>>> Hi Segher,
>>>>>>
>>>>>> Thanks for the review!
>>>>>>
>>>>>> on 2021/6/10 上午4:17, Segher Boessenkool wrote:
>>>>>>> Hi!
>>>>>>>
>>>>>>> On Wed, Dec 16, 2020 at 04:49:49PM +0800, Kewen.Lin wrote:
>>>>>>>> Currently we have the check:
>>>>>>>>
>>>>>>>>   if (!insn
>>>>>>>>  || (value && rsp->last_set_table_tick >= 
>>>>>>>> label_tick_ebb_start))
>>>>>>>>rsp->last_set_invalid = 1; 
>>>>>>>>
>>>>>>>> which means if we want to record some value for some reg and
>>>>>>>> this reg got referred to before in a valid scope,
>>>>>>>
>>>>>>> If we already know it is *set* in this same extended basic block.
>>>>>>> Possibly by the same instruction btw.
>>>>>>>
>>>>>>>> we invalidate the
>>>>>>>> set of reg (last_set_invalid to 1).  It avoids to find the wrong
>>>>>>>> set for one reg reference, such as the case like:
>>>>>>>>
>>>>>>>>... op regX  // this regX could find wrong last_set below
>>>>>>>>regX = ...   // if we think this set is valid
>>>>>>>>... op regX
>>>>>>>
>>>>>>> Yup, exactly.
>>>>>>>
>>>>>>>> But because retries exist, the last_set_table_tick could have
>>>>>>>> been set by some later reference insns, yet when we retry the set
>>>>>>>> insn (for that reg) again we see it as already set, such as:
>>>>>>>>
>>>>>>>>insn 1
>>>>>>>>insn 2
>>>>>>>>
>>>>>>>>regX = ... --> (a)
>>>>>>>>... op regX--> (b)
>>>>>>>>
>>>>>>>>insn 3
>>>>>>>>
>>>>>>>>// assume all in the same BB.
>>>>>>>>
>>>>>>>> Assuming we combine 1, 2 -> 3 successfully and replace them with two
>>>>>>>> (3 insns -> 2 insns),
>>>>>>>
>>>>>>> This will delete insn 1 and write the combined result to insns 2 and 3.
>>>>>>>
>>>>>>>> retrying from insn1 or insn2 again:
>>>>>>>
>>>>>>> Always 2, but your point remains valid.
>>>>>>>
>>>>>>>> it will scan insn (a) again, the below condition holds for regX:
>>>>>>>>
>>>>>>>>   (value && rsp->last_set_table_tick >= label_tick_ebb_start)
>>>>>>>>
>>>>>>>> it will mark this set as an invalid set.  But actually the
>>>>>>>> last_set_table_tick here was set by insn (b) before retrying, so it
>>>>>>>> should be safe to take this as a valid set.
>>>>>>>
>>>>>>> Yup.
>>>>>>>
>>>>>>>> This proposal checks whether the last_set_table update safely
>>>>>>>> happens after the current set, and keeps the set valid if so.
>>>>>>>
>>>>>>>> Full SPEC2017 building shows this patch gets more successful combines
>>>>>>>> from 1902208 to 1902243 (trivial though).
>>>>>>>
>>>>>>> Do you have some example, or maybe even a testcase?  :-)
>>>>>>>
>>>>>>
>>>>>> Sorry for the late reply, it took some time to get one reduced case.
>>>>>>
>>>>>> typedef struct SA *pa_t;
>>>>>>
>>>>>> struct SC {
>>>>>>   int h;
>>>>>>   pa_t elem[];
>>>>>> };
>>>>>>
>>>>>> struct SD {
>>>>>>   struct SC *e;
>>>>>> };
>>>>>>
>>>>>> struct SA {
>>>>>>   struct {
>>>>>> struct SD 

Re: [PATCH] vect: Add bias parameter for partial vectorization

2021-11-03 Thread Kewen.Lin via Gcc-patches
Hi Robin,

on 2021/11/3 上午4:16, Robin Dapp wrote:
> Hi,
> 
> thanks for the helpful comments. The attached v2 addresses the following
> points from them:
> 
>  - Save the bias in loop_vinfo and set it once in vect_verify_loop_lens.
>  - Add code to handle the bias in vect_set_loop_controls_directly.
>  - Adjust costing.
>  - Add comments for the new len_load parameter as well as change wording.
>  - Include the rs6000 change directly.
> 
> I'm not entirely satisfied with the bias code for the loop controls,
> mainly because of side effects I might not have considered.  The test
> suites show no new regressions and bootstrap succeeded, though.
> 
> Regards
>  Robin
> 
> 
> vll-v2.patch
> 
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index bf033e31c1c..dc2756f83e9 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -5637,7 +5637,8 @@
>  (define_expand "len_load_v16qi"
>[(match_operand:V16QI 0 "vlogical_operand")
> (match_operand:V16QI 1 "memory_operand")
> -   (match_operand:QI 2 "gpc_reg_operand")]
> +   (match_operand:QI 2 "gpc_reg_operand")
> +   (match_operand:QI 3 "zero_constant")]
>"TARGET_P9_VECTOR && TARGET_64BIT"
>  {
>rtx mem = XEXP (operands[1], 0);
> @@ -5651,6 +5652,7 @@
>[(match_operand:V16QI 0 "memory_operand")
> (match_operand:V16QI 1 "vlogical_operand")
> (match_operand:QI 2 "gpc_reg_operand")
> +   (match_operand:QI 3 "zero_constant")
>]
>"TARGET_P9_VECTOR && TARGET_64BIT"
>  {

Nice, thanks!

> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 2b41cb7fb7b..265c76f1609 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5213,7 +5213,10 @@ which must be a vector mode.  Operand 2 has whichever 
> integer mode the
>  target prefers.  If operand 2 exceeds the number of elements in mode
>  @var{m}, the behavior is undefined.  If the target prefers the length
>  to be measured in bytes rather than elements, it should only implement
> -this pattern for vectors of @code{QI} elements.
> +this pattern for vectors of @code{QI} elements.  Operand 3 specifies
> +a bias predicate that determines whether a length of zero is permitted
> +or not.  If permitted, the predicate should only allow a zero immediate,
> +otherwhise it should only allow an immediate value of -1.
>  
>  This pattern is not allowed to @code{FAIL}.
>  
> @@ -5226,7 +5229,10 @@ a vector mode.  Operand 2 has whichever integer mode 
> the target prefers.
>  If operand 2 exceeds the number of elements in mode @var{m}, the behavior
>  is undefined.  If the target prefers the length to be measured in bytes
>  rather than elements, it should only implement this pattern for vectors
> -of @code{QI} elements.
> +of @code{QI} elements.  Operand 3 specifies a bias predicate that
> +determines whether a length of zero is permitted or not.  If permitted,
> +the predicate should only allow a zero constant, otherwhise it should
> +only allow an immediate value of -1.
>  

Nit: s/otherwhise/otherwise/ (same for len_load).

Since these optabs are also for length in elements (although there is no
such usage yet), I guess the current bias -1 support would work well
for length in elements too?  Nice!  :)

>  This pattern is not allowed to @code{FAIL}.
>  
> diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> index 8312d08aab2..90fdc440248 100644
> --- a/gcc/internal-fn.c
> +++ b/gcc/internal-fn.c
> @@ -2696,9 +2696,9 @@ expand_call_mem_ref (tree type, gcall *stmt, int index)
>  static void
>  expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>  {
> -  class expand_operand ops[3];
> -  tree type, lhs, rhs, maskt;
> -  rtx mem, target, mask;
> +  class expand_operand ops[4];
> +  tree type, lhs, rhs, maskt, biast;
> +  rtx mem, target, mask, bias;
>insn_code icode;
>  
>maskt = gimple_call_arg (stmt, 2);
> @@ -2723,11 +2723,20 @@ expand_partial_load_optab_fn (internal_fn, gcall 
> *stmt, convert_optab optab)
>create_output_operand ([0], target, TYPE_MODE (type));
>create_fixed_operand ([1], mem);
>if (optab == len_load_optab)
> -create_convert_operand_from ([2], mask, TYPE_MODE (TREE_TYPE 
> (maskt)),
> -  TYPE_UNSIGNED (TREE_TYPE (maskt)));
> +{
> +  create_convert_operand_from ([2], mask, TYPE_MODE (TREE_TYPE 
> (maskt)),
> +TYPE_UNSIGNED (TREE_TYPE (maskt)));
> +  biast = gimple_call_arg (stmt, 3);
> +  bias = expand_normal (biast);
> +  create_input_operand ([3], bias, QImode);
> +  expand_insn (icode, 4, ops);
> +}
>else
> +{
>  create_input_operand ([2], mask, TYPE_MODE (TREE_TYPE (maskt)));
> -  expand_insn (icode, 3, ops);
> +expand_insn (icode, 3, ops);
> +}
> +
>if (!rtx_equal_p (target, ops[0].value))
>  emit_move_insn (target, ops[0].value);
>  }
> @@ -2741,9 +2750,9 @@ expand_partial_load_optab_fn (internal_fn, gcall *stmt, 
> convert_optab optab)
>  static void
>  

Re: [PATCH] vect: Add bias parameter for partial vectorization

2021-10-28 Thread Kewen.Lin via Gcc-patches
Hi Robin,

on 2021/10/28 下午10:44, Robin Dapp wrote:
> Hi,
> 
> as discussed in
> https://gcc.gnu.org/pipermail/gcc-patches/2021-October/582627.html this
> introduces a bias parameter for the len_load/len_store ifns as well as
> optabs that is meant to distinguish between Power and s390 variants.
> The default is a bias of 0, while in s390's case vll/vstl do not support
> lengths of zero bytes and a bias of -1 should be used.
> 
> Bootstrapped and regtested on Power9 (--with-cpu=power9) and s390
> (--with-arch=z15).
> 
> The tiny changes in the Power backend I will post separately.
> 

Thanks for extending this!

I guess your separate Power (rs6000) patch will be committed together with
this one? Otherwise I'm worried that the existing rs6000 partial vector
cases could fail, since the existing rs6000 optabs lack the new operand,
which isn't optional.

You might need to update the documentation doc/md.texi for the new operand
in sections len_load_@var{m} and len_store_@var{m}, and might want to add
the costing consideration for this non-zero biasing in hunk
"
  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
{
"
of function vect_estimate_min_profitable_iters.

I may be overthinking this, but it seems we could add an assertion in
function vect_verify_loop_lens to ensure that the load bias matches the
store bias (internal_len_load_bias_supported ==
internal_len_store_bias_supported), to guard against mixed biasing from
some odd targets or optab typos.

BR,
Kewen


Re: [PATCH] rs6000: Fix ICE of vect cost related to V1TI [PR102767]

2021-10-27 Thread Kewen.Lin via Gcc-patches
on 2021/10/28 上午9:43, David Edelsohn wrote:
> On Wed, Oct 27, 2021 at 9:30 PM Kewen.Lin  wrote:
>>
>> Hi David,
>>
>> Thanks for the review!
>>
>> on 2021/10/27 下午9:12, David Edelsohn wrote:
>>> On Sun, Oct 24, 2021 at 11:04 PM Kewen.Lin  wrote:
>>>>
>>>> Hi,
>>>>
>>>> As PR102767 shows, the commit r12-3482 exposed one ICE in function
>>>> rs6000_builtin_vectorization_cost.  We claim V1TI supports movmisalign
>>>> on rs6000 (see define_expand "movmisalign"), so it returns true in
>>>> rs6000_builtin_support_vector_misalignment for misalign 8.  Later in
>>>> the cost querying rs6000_builtin_vectorization_cost, we don't have
>>>> the arms to handle the V1TI input under (TARGET_VSX &&
>>>> TARGET_ALLOW_MOVMISALIGN).
>>>>
>>>> The proposed fix is to add handling for V1TI, simply using the
>>>> doubleword cost, which is clearly bigger than the scalar cost, so no
>>>> vectorization will happen; this keeps consistency and avoids the ICE.
>>>> Another thought is to not support movmisalign for V1TI, but that
>>>> sounds like a bad idea since it doesn't match reality.
>>>>
>>>> Bootstrapped and regtested on powerpc64le-linux-gnu P9 and
>>>> powerpc64-linux-gnu P8.
>>>>
>>>> Is it ok for trunk?
>>>>
>>>> BR,
>>>> Kewen
>>>> -
>>>> gcc/ChangeLog:
>>>>
>>>> PR target/102767
>>>> * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): 
>>>> Consider
>>>> V1T1 mode for unaligned load and store.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>> PR target/102767
>>>> * gcc.target/powerpc/ppc-fortran/pr102767.f90: New file.
>>>>
>>>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>>>> index b7ea1483da5..73d3e06c3fc 100644
>>>> --- a/gcc/config/rs6000/rs6000.c
>>>> +++ b/gcc/config/rs6000/rs6000.c
>>>> @@ -5145,7 +5145,8 @@ rs6000_builtin_vectorization_cost (enum 
>>>> vect_cost_for_stmt type_of_cost,
>>>> if (TARGET_VSX && TARGET_ALLOW_MOVMISALIGN)
>>>>   {
>>>> elements = TYPE_VECTOR_SUBPARTS (vectype);
>>>> -   if (elements == 2)
>>>> +   /* See PR102767, consider V1TI to keep consistency.  */
>>>> +   if (elements == 2 || elements == 1)
>>>>   /* Double word aligned.  */
>>>>   return 4;
>>>>
>>>> @@ -5184,10 +5185,11 @@ rs6000_builtin_vectorization_cost (enum 
>>>> vect_cost_for_stmt type_of_cost,
>>>>
>>>>  if (TARGET_VSX && TARGET_ALLOW_MOVMISALIGN)
>>>>{
>>>> -elements = TYPE_VECTOR_SUBPARTS (vectype);
>>>> -if (elements == 2)
>>>> -  /* Double word aligned.  */
>>>> -  return 2;
>>>> +   elements = TYPE_VECTOR_SUBPARTS (vectype);
>>>> +   /* See PR102767, consider V1TI to keep consistency.  */
>>>> +   if (elements == 2 || elements == 1)
>>>> + /* Double word aligned.  */
>>>> + return 2;
>>>
>>> This section of the patch incorrectly changes the indentation.  Please
>>> use the correct indentation.
>>>
>>
>> The indentation change is intentional since the original indentation is
>> wrong (more than 8 leading spaces on those lines); there are more wrongly
>> indented lines above the first changed line, but I thought it seemed a
>> bad idea to fix them too when they are unrelated to what this patch
>> wants to fix, so I left them alone.
>>
>> With the above clarification, may I push this patch without any updates
>> for the mentioned indentation issue?
> 
> If you correct the indentation, you should adjust it for the entire
> block, not just the lines that you change.  If you want to fix the
> entire block to TAB+spaces as well, okay.  You didn't mention that you
> were fixing the indentation in the explanation of the patch.
> 

Sorry for not mentioning that.  Got it, I'll reformat the entire block then,
also with additional notes in the commit log.

Thanks again.

BR,
Kewen

> Thank, David
> 
>>
>>>>
>>>>  if (

Re: [PATCH] rs6000: Fix ICE of vect cost related to V1TI [PR102767]

2021-10-27 Thread Kewen.Lin via Gcc-patches
Hi David,

Thanks for the review!

on 2021/10/27 下午9:12, David Edelsohn wrote:
> On Sun, Oct 24, 2021 at 11:04 PM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> As PR102767 shows, the commit r12-3482 exposed one ICE in function
>> rs6000_builtin_vectorization_cost.  We claim V1TI supports movmisalign
>> on rs6000 (see define_expand "movmisalign"), so it returns true in
>> rs6000_builtin_support_vector_misalignment for misalign 8.  Later in
>> the cost querying rs6000_builtin_vectorization_cost, we don't have
>> the arms to handle the V1TI input under (TARGET_VSX &&
>> TARGET_ALLOW_MOVMISALIGN).
>>
>> The proposed fix is to add handling for V1TI, simply using the
>> doubleword cost, which is clearly bigger than the scalar cost, so no
>> vectorization will happen; this keeps consistency and avoids the ICE.
>> Another thought is to not support movmisalign for V1TI, but that
>> sounds like a bad idea since it doesn't match reality.
>>
>> Bootstrapped and regtested on powerpc64le-linux-gnu P9 and
>> powerpc64-linux-gnu P8.
>>
>> Is it ok for trunk?
>>
>> BR,
>> Kewen
>> -
>> gcc/ChangeLog:
>>
>> PR target/102767
>> * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): 
>> Consider
>> V1T1 mode for unaligned load and store.
>>
>> gcc/testsuite/ChangeLog:
>>
>> PR target/102767
>> * gcc.target/powerpc/ppc-fortran/pr102767.f90: New file.
>>
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index b7ea1483da5..73d3e06c3fc 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>> @@ -5145,7 +5145,8 @@ rs6000_builtin_vectorization_cost (enum 
>> vect_cost_for_stmt type_of_cost,
>> if (TARGET_VSX && TARGET_ALLOW_MOVMISALIGN)
>>   {
>> elements = TYPE_VECTOR_SUBPARTS (vectype);
>> -   if (elements == 2)
>> +   /* See PR102767, consider V1TI to keep consistency.  */
>> +   if (elements == 2 || elements == 1)
>>   /* Double word aligned.  */
>>   return 4;
>>
>> @@ -5184,10 +5185,11 @@ rs6000_builtin_vectorization_cost (enum 
>> vect_cost_for_stmt type_of_cost,
>>
>>  if (TARGET_VSX && TARGET_ALLOW_MOVMISALIGN)
>>{
>> -elements = TYPE_VECTOR_SUBPARTS (vectype);
>> -if (elements == 2)
>> -  /* Double word aligned.  */
>> -  return 2;
>> +   elements = TYPE_VECTOR_SUBPARTS (vectype);
>> +   /* See PR102767, consider V1TI to keep consistency.  */
>> +   if (elements == 2 || elements == 1)
>> + /* Double word aligned.  */
>> + return 2;
> 
> This section of the patch incorrectly changes the indentation.  Please
> use the correct indentation.
> 

The indentation change is intentional since the original indentation is
wrong (more than 8 leading spaces on those lines); there are more wrongly
indented lines above the first changed line, but I thought it seemed a
bad idea to fix them too when they are unrelated to what this patch
wants to fix, so I left them alone.

With the above clarification, may I push this patch without any updates
for the mentioned indentation issue?

>>
>>  if (elements == 4)
>>{
>> diff --git a/gcc/testsuite/gcc.target/powerpc/ppc-fortran/pr102767.f90 
>> b/gcc/testsuite/gcc.target/powerpc/ppc-fortran/pr102767.f90
>> new file mode 100644
>> index 000..a4122482989
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/ppc-fortran/pr102767.f90
>> @@ -0,0 +1,21 @@
>> +! { dg-require-effective-target powerpc_vsx_ok }
>> +! { dg-options "-mvsx -O2 -ftree-vectorize -mno-efficient-unaligned-vsx" }
>> +
>> +INTERFACE
>> +  FUNCTION elemental_mult (a, b, c)
>> +type(*), DIMENSION(..) :: a, b, c
>> +  END
>> +END INTERFACE
>> +
>> +allocatable  z
>> +integer, dimension(2,2) :: a, b
>> +call test_CFI_address
>> +contains
>> +  subroutine test_CFI_address
>> +if (elemental_mult (z, x, y) .ne. 0) stop
>> +a = reshape ([4,3,2,1], [2,2])
>> +b = reshape ([2,3,4,5], [2,2])
>> +if (elemental_mult (i, a, b) .ne. 0) stop
>> +  end
>> +end
>> +
>>
> 
> The patch is okay with the indentation correction.
> 
> Thanks, David
> 

Thanks!

BR,
Kewen


Re: [committed] testsuite: Fix up gcc.dg/pr102897.c testcase [PR102897]

2021-10-27 Thread Kewen.Lin via Gcc-patches
Hi Jakub,

on 2021/10/27 下午3:51, Jakub Jelinek wrote:
> On Tue, Oct 26, 2021 at 11:40:01AM +0800, Kewen.Lin via Gcc-patches wrote:
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.dg/pr102897.c: New test.
> 
> The testcase FAILs on i686-linux due to:
> FAIL: gcc.dg/pr102897.c (test for excess errors)
> Excess errors:
> .../gcc/gcc/testsuite/gcc.dg/pr102897.c:11:1: warning: MMX vector return 
> without MMX enabled changes the ABI [-Wpsabi]
> .../gcc/gcc/testsuite/gcc.dg/pr102897.c:10:10: warning: MMX vector argument 
> without MMX enabled changes the ABI [-Wpsabi]
> Fixed by adding -Wno-psabi.
> 
> Tested on x86_64-linux and i686-linux, committed to trunk as obvious.
> 

Thanks for fixing this up!

BR,
Kewen

> 2021-10-27  Jakub Jelinek  
> 
>   * gcc.dg/pr102897.c: Add -Wno-psabi to dg-options.
> 
> --- gcc/testsuite/gcc.dg/pr102897.c.jj2021-10-27 09:00:28.848276246 
> +0200
> +++ gcc/testsuite/gcc.dg/pr102897.c   2021-10-27 09:40:45.628296807 +0200
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
>  /* Specify C99 to avoid the warning/error on compound literals.  */
> -/* { dg-options "-O1 -std=c99" } */
> +/* { dg-options "-O1 -std=c99 -Wno-psabi" } */
>  
>  /* Verify that there is no ICE.  */
>  
> 
> 
>   Jakub
>


Re: [PATCH] forwprop: Remove incorrect assertion [PR102897]

2021-10-26 Thread Kewen.Lin via Gcc-patches
Hi Richi,

on 2021/10/26 下午3:50, Richard Biener wrote:
> On Tue, Oct 26, 2021 at 5:40 AM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> As PR102897 shows, there is one incorrect assertion in function
>> simplify_permutation, which is based on the wrong assumption that
>> all cases with op2_type == tgt_type are handled previously; the
>> proposed fix is to remove this incorrect assertion.
>>
>> Bootstrapped and regtested on x86_64-redhat-linux,
>> aarch64-linux-gnu and powerpc64{,le}-linux-gnu.
> 
> I think you need to enable optimization in the new testcase, gcc.dg/ only
> runs -O0 by default which wouldn't trigger forwprop?  Please verify the
> testcase ICEs before the fix.
> 

Thanks for catching!  You are right, the optimization option is required,
I just verified it and committed with the additional "-O1" as r12-4705.

> Otherwise OK.
> 

Thanks!

BR,
Kewen

> Thanks,
> Richard,
> 
>> BR,
>> Kewen
>> -
>> gcc/ChangeLog:
>>
>> PR tree-optimization/102897
>> * tree-ssa-forwprop.c (simplify_permutation): Remove a wrong 
>> assertion.
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.dg/pr102897.c: New test.
>> ---
>>  gcc/testsuite/gcc.dg/pr102897.c | 16 
>>  gcc/tree-ssa-forwprop.c |  2 --
>>  2 files changed, 16 insertions(+), 2 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.dg/pr102897.c
>>
>> diff --git a/gcc/testsuite/gcc.dg/pr102897.c 
>> b/gcc/testsuite/gcc.dg/pr102897.c
>> new file mode 100644
>> index 000..d96b0e48ccc
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/pr102897.c
>> @@ -0,0 +1,16 @@
>> +/* { dg-do compile } */
>> +/* Specify C99 to avoid the warning/error on compound literals.  */
>> +/* { dg-options "-std=c99" } */
>> +
>> +/* Verify that there is no ICE.  */
>> +
>> +typedef __attribute__((vector_size(8))) signed char int8x8_t;
>> +typedef __attribute__((vector_size(8))) unsigned char uint8x8_t;
>> +
>> +int8x8_t fn1 (int8x8_t val20, char tmp)
>> +{
>> +  uint8x8_t __trans_tmp_3;
>> +  __trans_tmp_3 = (uint8x8_t){tmp};
>> +  int8x8_t __a = (int8x8_t) __trans_tmp_3;
>> +  return __builtin_shuffle (__a, val20, (uint8x8_t){0});
>> +}
>> diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
>> index 5b30d4c1a76..a830bab78ba 100644
>> --- a/gcc/tree-ssa-forwprop.c
>> +++ b/gcc/tree-ssa-forwprop.c
>> @@ -2267,8 +2267,6 @@ simplify_permutation (gimple_stmt_iterator *gsi)
>>   if (!VECTOR_TYPE_P (tgt_type))
>> return 0;
>>   tree op2_type = TREE_TYPE (op2);
>> - /* Should have folded this before.  */
>> - gcc_assert (op2_type != tgt_type);
>>
>>   /* Figure out the shrunk factor.  */
>>   poly_uint64 tgt_units = TYPE_VECTOR_SUBPARTS (tgt_type);
>> --
>> 2.27.0


[PATCH] forwprop: Remove incorrect assertion [PR102897]

2021-10-25 Thread Kewen.Lin via Gcc-patches
Hi,

As PR102897 shows, there is one incorrect assertion in function
simplify_permutation, which is based on the wrong assumption that
all cases with op2_type == tgt_type are handled previously; the
proposed fix is to remove this incorrect assertion.

Bootstrapped and regtested on x86_64-redhat-linux,
aarch64-linux-gnu and powerpc64{,le}-linux-gnu.

BR,
Kewen
-
gcc/ChangeLog:

PR tree-optimization/102897
* tree-ssa-forwprop.c (simplify_permutation): Remove a wrong assertion.

gcc/testsuite/ChangeLog:

* gcc.dg/pr102897.c: New test.
---
 gcc/testsuite/gcc.dg/pr102897.c | 16 
 gcc/tree-ssa-forwprop.c |  2 --
 2 files changed, 16 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr102897.c

diff --git a/gcc/testsuite/gcc.dg/pr102897.c b/gcc/testsuite/gcc.dg/pr102897.c
new file mode 100644
index 000..d96b0e48ccc
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr102897.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* Specify C99 to avoid the warning/error on compound literals.  */
+/* { dg-options "-std=c99" } */
+
+/* Verify that there is no ICE.  */
+
+typedef __attribute__((vector_size(8))) signed char int8x8_t;
+typedef __attribute__((vector_size(8))) unsigned char uint8x8_t;
+
+int8x8_t fn1 (int8x8_t val20, char tmp)
+{
+  uint8x8_t __trans_tmp_3;
+  __trans_tmp_3 = (uint8x8_t){tmp};
+  int8x8_t __a = (int8x8_t) __trans_tmp_3;
+  return __builtin_shuffle (__a, val20, (uint8x8_t){0});
+}
diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
index 5b30d4c1a76..a830bab78ba 100644
--- a/gcc/tree-ssa-forwprop.c
+++ b/gcc/tree-ssa-forwprop.c
@@ -2267,8 +2267,6 @@ simplify_permutation (gimple_stmt_iterator *gsi)
  if (!VECTOR_TYPE_P (tgt_type))
return 0;
  tree op2_type = TREE_TYPE (op2);
- /* Should have folded this before.  */
- gcc_assert (op2_type != tgt_type);

  /* Figure out the shrunk factor.  */
  poly_uint64 tgt_units = TYPE_VECTOR_SUBPARTS (tgt_type);
--
2.27.0


[PATCH] rs6000: Fix ICE of vect cost related to V1TI [PR102767]

2021-10-24 Thread Kewen.Lin via Gcc-patches
Hi,

As PR102767 shows, the commit r12-3482 exposed one ICE in function
rs6000_builtin_vectorization_cost.  We claim V1TI supports movmisalign
on rs6000 (see define_expand "movmisalign"), so it returns true in
rs6000_builtin_support_vector_misalignment for misalign 8.  Later in
the cost querying rs6000_builtin_vectorization_cost, we don't have
the arms to handle the V1TI input under (TARGET_VSX &&
TARGET_ALLOW_MOVMISALIGN).

The proposed fix is to add handling for V1TI, simply using the
doubleword cost, which is clearly bigger than the scalar cost, so no
vectorization will happen; this keeps consistency and avoids the ICE.
Another thought is to not support movmisalign for V1TI, but that
sounds like a bad idea since it doesn't match reality.

Bootstrapped and regtested on powerpc64le-linux-gnu P9 and
powerpc64-linux-gnu P8.

Is it ok for trunk?

BR,
Kewen
-
gcc/ChangeLog:

PR target/102767
* config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Consider
V1T1 mode for unaligned load and store.

gcc/testsuite/ChangeLog:

PR target/102767
* gcc.target/powerpc/ppc-fortran/pr102767.f90: New file.

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index b7ea1483da5..73d3e06c3fc 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5145,7 +5145,8 @@ rs6000_builtin_vectorization_cost (enum 
vect_cost_for_stmt type_of_cost,
if (TARGET_VSX && TARGET_ALLOW_MOVMISALIGN)
  {
elements = TYPE_VECTOR_SUBPARTS (vectype);
-   if (elements == 2)
+   /* See PR102767, consider V1TI to keep consistency.  */
+   if (elements == 2 || elements == 1)
  /* Double word aligned.  */
  return 4;

@@ -5184,10 +5185,11 @@ rs6000_builtin_vectorization_cost (enum 
vect_cost_for_stmt type_of_cost,

 if (TARGET_VSX && TARGET_ALLOW_MOVMISALIGN)
   {
-elements = TYPE_VECTOR_SUBPARTS (vectype);
-if (elements == 2)
-  /* Double word aligned.  */
-  return 2;
+   elements = TYPE_VECTOR_SUBPARTS (vectype);
+   /* See PR102767, consider V1TI to keep consistency.  */
+   if (elements == 2 || elements == 1)
+ /* Double word aligned.  */
+ return 2;

 if (elements == 4)
   {
diff --git a/gcc/testsuite/gcc.target/powerpc/ppc-fortran/pr102767.f90 
b/gcc/testsuite/gcc.target/powerpc/ppc-fortran/pr102767.f90
new file mode 100644
index 000..a4122482989
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/ppc-fortran/pr102767.f90
@@ -0,0 +1,21 @@
+! { dg-require-effective-target powerpc_vsx_ok }
+! { dg-options "-mvsx -O2 -ftree-vectorize -mno-efficient-unaligned-vsx" }
+
+INTERFACE
+  FUNCTION elemental_mult (a, b, c)
+type(*), DIMENSION(..) :: a, b, c
+  END
+END INTERFACE
+
+allocatable  z
+integer, dimension(2,2) :: a, b
+call test_CFI_address
+contains
+  subroutine test_CFI_address
+if (elemental_mult (z, x, y) .ne. 0) stop
+a = reshape ([4,3,2,1], [2,2])
+b = reshape ([2,3,4,5], [2,2])
+if (elemental_mult (i, a, b) .ne. 0) stop
+  end
+end
+



[PATCH] vect: Don't update inits for simd_lane_access DRs [PR102789]

2021-10-24 Thread Kewen.Lin via Gcc-patches
Hi,

As PR102789 shows, when the vectorizer does some peeling for alignment
in the prologue, function vect_update_inits_of_drs updates the
inits of some drs.  But as the failing case shows, we shouldn't update
the dr for simd_lane_access: it has fixed-length storage meant for the
main loop, and the update can push the access out of bounds, touching
unexpected elements.

I tried to test this broadly to ensure it's safe, since I was not
sure if it's reasonable to exclude all kinds of simd_lane_access drs.
The testing didn't catch any failures, so I hope this is on the right
track.

It's bootstrapped and regtested on:
  - x86_64-redhat-linux
  - aarch64-linux-gnu
  - powerpc64le-linux-gnu P9
  - powerpc64-linux-gnu P8 and P7.

Is it ok for trunk?

BR,
Kewen
-
gcc/ChangeLog:

PR tree-optimization/102789
* tree-vect-loop-manip.c (vect_update_inits_of_drs): Do not
update inits of simd_lane_access.

diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 4988c93fdb6..378b1026baa 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1820,7 +1820,8 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
   FOR_EACH_VEC_ELT (datarefs, i, dr)
 {
   dr_vec_info *dr_info = loop_vinfo->lookup_dr (dr);
-  if (!STMT_VINFO_GATHER_SCATTER_P (dr_info->stmt))
+  if (!STMT_VINFO_GATHER_SCATTER_P (dr_info->stmt)
+ && !STMT_VINFO_SIMD_LANE_ACCESS_P (dr_info->stmt))
vect_update_init_of_dr (dr_info, niters, code);
 }
 }


PING^1 [PATCH] rs6000: Remove builtin mask check from builtin_decl [PR102347]

2021-10-20 Thread Kewen.Lin via Gcc-patches
Hi,

As the discussions and testing results in the main thread indicate,
this patch should be safe.

Ping for this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580357.html

BR,
Kewen

on 2021/9/28 4:13 PM, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> As the discussion in PR102347 shows, builtin_decl is currently invoked
> very early, when making up the function_decl for builtin functions;
> at that time the rs6000_builtin_mask can be wrong for those
> builtins sitting in #pragma/attribute target functions, though it
> will be updated properly later when LTO processes all nodes.
> 
> This patch aligns with the practice the i386 port adopts, and also
> with r10-7462, by relaxing the builtin mask checking in some places.
> 
> Bootstrapped and regress-tested on powerpc64le-linux-gnu P9 and
> powerpc64-linux-gnu P8.
> 
> Is it ok for trunk?
> 
> BR,
> Kewen
> -
> gcc/ChangeLog:
> 
>   PR target/102347
>   * config/rs6000/rs6000-call.c (rs6000_builtin_decl): Remove builtin
>   mask check.
> 
> gcc/testsuite/ChangeLog:
> 
>   PR target/102347
>   * gcc.target/powerpc/pr102347.c: New test.
> 
> ---
>  gcc/config/rs6000/rs6000-call.c | 14 --
>  gcc/testsuite/gcc.target/powerpc/pr102347.c | 15 +++
>  2 files changed, 19 insertions(+), 10 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr102347.c
> 
> diff --git a/gcc/config/rs6000/rs6000-call.c b/gcc/config/rs6000/rs6000-call.c
> index fd7f24da818..15e0e09c07d 100644
> --- a/gcc/config/rs6000/rs6000-call.c
> +++ b/gcc/config/rs6000/rs6000-call.c
> @@ -13775,23 +13775,17 @@ rs6000_init_builtins (void)
>  }
>  }
> 
> -/* Returns the rs6000 builtin decl for CODE.  */
> +/* Returns the rs6000 builtin decl for CODE.  Note that we don't check
> +   the builtin mask here since there could be some #pragma/attribute
> +   target functions and the rs6000_builtin_mask could be wrong when
> +   this checking happens, though it will be updated properly later.  */
> 
>  tree
>  rs6000_builtin_decl (unsigned code, bool initialize_p ATTRIBUTE_UNUSED)
>  {
> -  HOST_WIDE_INT fnmask;
> -
>if (code >= RS6000_BUILTIN_COUNT)
>  return error_mark_node;
> 
> -  fnmask = rs6000_builtin_info[code].mask;
> -  if ((fnmask & rs6000_builtin_mask) != fnmask)
> -{
> -  rs6000_invalid_builtin ((enum rs6000_builtins)code);
> -  return error_mark_node;
> -}
> -
>return rs6000_builtin_decls[code];
>  }
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr102347.c b/gcc/testsuite/gcc.target/powerpc/pr102347.c
> new file mode 100644
> index 000..05c439a8dac
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr102347.c
> @@ -0,0 +1,15 @@
> +/* { dg-do link } */
> +/* { dg-require-effective-target power10_ok } */
> +/* { dg-require-effective-target lto } */
> +/* { dg-options "-flto -mdejagnu-cpu=power9" } */
> +
> +/* Verify there are no error messages in LTO mode.  */
> +
> +#pragma GCC target "cpu=power10"
> +int main ()
> +{
> +  float *b;
> +  __vector_quad c;
> +  __builtin_mma_disassemble_acc (b, );
> +  return 0;
> +}
> --
> 2.27.0
> 



PING^4 [PATCH] rs6000: Fix some issues in rs6000_can_inline_p [PR102059]

2021-10-20 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this patch:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578552.html

One related patch [1] is ready to commit, whose test cases rely on
this patch if no changes are applied to them.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579658.html

BR,
Kewen

>>> on 2021/9/1 2:55 PM, Kewen.Lin via Gcc-patches wrote:
>>>> Hi!
>>>>
>>>> This patch is to fix the inconsistent behaviors for non-LTO mode
>>>> and LTO mode.  As Martin pointed out, currently the function
>>>> rs6000_can_inline_p simply makes it inlinable if callee_tree is
>>>> NULL, but that's wrong; we should use the command line options
>>>> from target_option_default_node as default.  It also replaces
>>>> rs6000_isa_flags with the one from target_option_default_node
>>>> when caller_tree is NULL as rs6000_isa_flags could probably
>>>> change since initialization.
>>>>
>>>> It also extends the scope of the check for the case that callee
>>>> has explicit set options, for test case pr102059-2.c inlining can
>>>> happen unexpectedly before, it's fixed accordingly.
>>>>
>>>> As Richi/Mike pointed out, some tuning flags like MASK_P8_FUSION
>>>> can be neglected for inlining, this patch also excludes them when
>>>> the callee is attributed by always_inline.
>>>>
>>>> Bootstrapped and regtested on powerpc64le-linux-gnu Power9.
>>>>
>>>> BR,
>>>> Kewen
>>>> -
>>>> gcc/ChangeLog:
>>>>
>>>>PR ipa/102059
>>>>* config/rs6000/rs6000.c (rs6000_can_inline_p): Adjust with
>>>>target_option_default_node and consider always_inline_safe flags.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>>PR ipa/102059
>>>>* gcc.target/powerpc/pr102059-1.c: New test.
>>>>* gcc.target/powerpc/pr102059-2.c: New test.
>>>>* gcc.target/powerpc/pr102059-3.c: New test.
>>>>* gcc.target/powerpc/pr102059-4.c: New test.
>>>>
>>>


PING^2 [PATCH v2] rs6000: Modify the way for extra penalized cost

2021-10-20 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580358.html

BR,
Kewen

> on 2021/9/28 4:16 PM, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> This patch follows the discussions here[1][2], where Segher
>> pointed out the existing way to guard the extra penalized
>> cost for strided/elementwise loads with a magic bound does
>> not scale.
>>
>> Computing the penalty as nunits * stmt_cost can yield a much
>> exaggerated penalized cost; for example, for V16QI on P8 it's
>> 16 * 20 = 320, which is why we needed a bound.  To make it
>> better and more readable, the penalized cost is simplified
>> as:
>>
>> unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
>> unsigned extra_cost = nunits * adjusted_cost;
>>
>> For V2DI/V2DF, it uses 2 penalized cost for each scalar load
>> while for the other modes, it uses 1.  It's mainly concluded
>> from the performance evaluations.  One possibly related point
>> is that the more units a vector construction has, the more
>> instructions it uses, and the more chances there are to schedule
>> them well (even run them in parallel when enough units are
>> available), so it seems reasonable not to penalize them more.
>>
>> The SPEC2017 evaluations on Power8/Power9/Power10 at option
>> sets O2-vect and Ofast-unroll show this change is neutral.
>>
>> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
>> [2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580099.html
>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579529.html
>>
>> BR,
>> Kewen
>> -
>> gcc/ChangeLog:
>>
>>  * config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
>>  the way to compute extra penalized cost.  Remove useless parameter.
>>  (rs6000_add_stmt_cost): Adjust the call to function
>>  rs6000_update_target_cost_per_stmt.
>>
>>
>> ---
>>  gcc/config/rs6000/rs6000.c | 31 ++-
>>  1 file changed, 18 insertions(+), 13 deletions(-)
>>
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index dd42b0964f1..8200e1152c2 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>> @@ -5422,7 +5422,6 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>  enum vect_cost_for_stmt kind,
>>  struct _stmt_vec_info *stmt_info,
>>  enum vect_cost_model_location where,
>> -int stmt_cost,
>>  unsigned int orig_count)
>>  {
>>
>> @@ -5462,17 +5461,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>  {
>>tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>>unsigned int nunits = vect_nunits_for_cost (vectype);
>> -  unsigned int extra_cost = nunits * stmt_cost;
>> -  /* As function rs6000_builtin_vectorization_cost shows, we have
>> - priced much on V16QI/V8HI vector construction as their units,
>> - if we penalize them with nunits * stmt_cost, it can result in
>> - an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>> - is 20 and nunits is 16, the extra cost is 320 which looks
>> - much exaggerated.  So let's use one maximum bound for the
>> - extra penalized cost for vector construction here.  */
>> -  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>> -  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>> -extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>> +  /* Don't expect strided/elementwise loads for just 1 nunit.  */
>> +  gcc_assert (nunits > 1);
>> +  /* i386 port adopts nunits * stmt_cost as the penalized cost
>> + for this kind of penalization, we used to follow it but
>> + found it could result in an unreliable body cost especially
>> + for V16QI/V8HI modes.  To make it better, we choose this
>> + new heuristic: for each scalar load, we use 2 as penalized
>> + cost for the case with 2 nunits and use 1 for the other
>> + cases.  It's without much supporting theory, mainly
>> + concluded from the broad performance evaluations on Power8,
>> + Power9 and Power10.  One possibly related point is that:
>> + vector construction for more units wo

PING^5 [PATCH v2] combine: Tweak the condition of last_set invalidation

2021-10-20 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572555.html

BR,
Kewen

> 
>>>> on 2021/6/11 9:16 PM, Kewen.Lin via Gcc-patches wrote:
>>>>> Hi Segher,
>>>>>
>>>>> Thanks for the review!
>>>>>
>>>>>> on 2021/6/10 4:17 AM, Segher Boessenkool wrote:
>>>>>> Hi!
>>>>>>
>>>>>> On Wed, Dec 16, 2020 at 04:49:49PM +0800, Kewen.Lin wrote:
>>>>>>> Currently we have the check:
>>>>>>>
>>>>>>>   if (!insn
>>>>>>>   || (value && rsp->last_set_table_tick >= 
>>>>>>> label_tick_ebb_start))
>>>>>>> rsp->last_set_invalid = 1; 
>>>>>>>
>>>>>>> which means if we want to record some value for some reg and
>>>>>>> this reg got refered before in a valid scope,
>>>>>>
>>>>>> If we already know it is *set* in this same extended basic block.
>>>>>> Possibly by the same instruction btw.
>>>>>>
>>>>>>> we invalidate the
>>>>>>> set of reg (last_set_invalid to 1).  It avoids to find the wrong
>>>>>>> set for one reg reference, such as the case like:
>>>>>>>
>>>>>>>... op regX  // this regX could find wrong last_set below
>>>>>>>regX = ...   // if we think this set is valid
>>>>>>>... op regX
>>>>>>
>>>>>> Yup, exactly.
>>>>>>
>>>>>>> But because combine retries insns, the last_set_table_tick could
>>>>>>> have been set by some later reference insn, yet we see it as set
>>>>>>> when retrying the set insn (for that reg) again, such as:
>>>>>>>
>>>>>>>insn 1
>>>>>>>insn 2
>>>>>>>
>>>>>>>regX = ... --> (a)
>>>>>>>... op regX--> (b)
>>>>>>>
>>>>>>>insn 3
>>>>>>>
>>>>>>>// assume all in the same BB.
>>>>>>>
>>>>>>> Assuming we combine 1, 2 -> 3 successfully and replace them as two
>>>>>>> (3 insns -> 2 insns),
>>>>>>
>>>>>> This will delete insn 1 and write the combined result to insns 2 and 3.
>>>>>>
>>>>>>> retrying from insn1 or insn2 again:
>>>>>>
>>>>>> Always 2, but your point remains valid.
>>>>>>
>>>>>>> it will scan insn (a) again, the below condition holds for regX:
>>>>>>>
>>>>>>>   (value && rsp->last_set_table_tick >= label_tick_ebb_start)
>>>>>>>
>>>>>>> it will mark this set as invalid set.  But actually the
>>>>>>> last_set_table_tick here is set by insn (b) before retrying, so it
>>>>>>> should be safe to be taken as valid set.
>>>>>>
>>>>>> Yup.
>>>>>>
>>>>>>> This proposal is to check whether the last_set_table safely happens
>>>>>>> after the current set, make the set still valid if so.
>>>>>>
>>>>>>> Full SPEC2017 building shows this patch gets more successful combines
>>>>>>> from 1902208 to 1902243 (trivial though).
>>>>>>
>>>>>> Do you have some example, or maybe even a testcase?  :-)
>>>>>>
>>>>>
>>>>> Sorry for the late reply, it took some time to get one reduced case.
>>>>>
>>>>> typedef struct SA *pa_t;
>>>>>
>>>>> struct SC {
>>>>>   int h;
>>>>>   pa_t elem[];
>>>>> };
>>>>>
>>>>> struct SD {
>>>>>   struct SC *e;
>>>>> };
>>>>>
>>>>> struct SA {
>>>>>   struct {
>>>>> struct SD f[1];
>>>>>   } g;
>>>>> };
>>>>>
>>>>> void foo(pa_t *k, char **m) {
>>>>>   int l, i;
>>>>>   pa_t a;
>>>>>   l = (int)a->g.f[5].e;
>>>>>   i = 0;
>>>>>   for (; i < l; i++) {
>>>>> k[i]

Re: [PATCH] Adjust testcase for O2 vectorization.

2021-10-15 Thread Kewen.Lin via Gcc-patches
on 2021/10/14 6:56 PM, Kewen.Lin via Gcc-patches wrote:
> Hi Hongtao,
> 
> on 2021/10/14 3:11 PM, liuhongt wrote:
>> Hi Kewen:
>>   Could you help to verify if this patch fixes those regressions
>> for rs6000 port.
>>
> 
> The ppc64le run just finished, there are still some regressions:
> 
> NA->XPASS: c-c++-common/Wstringop-overflow-2.c  -Wc++-compat   (test for 
> warnings, line 194)
> NA->XPASS: c-c++-common/Wstringop-overflow-2.c  -Wc++-compat   (test for 
> warnings, line 212)
> NA->XPASS: c-c++-common/Wstringop-overflow-2.c  -Wc++-compat   (test for 
> warnings, line 296)
> NA->XPASS: c-c++-common/Wstringop-overflow-2.c  -Wc++-compat   (test for 
> warnings, line 314)
> NA->FAIL: gcc.dg/Wstringop-overflow-21-novec.c (test for excess errors)
> NA->FAIL: gcc.dg/Wstringop-overflow-21-novec.c  (test for warnings, line 18)
> NA->FAIL: gcc.dg/Wstringop-overflow-21-novec.c  (test for warnings, line 29)
> NA->FAIL: gcc.dg/Wstringop-overflow-21-novec.c  (test for warnings, line 45)
> NA->FAIL: gcc.dg/Wstringop-overflow-21-novec.c  (test for warnings, line 55)
> NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c note (test for warnings, line 
> 104)
> NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c note (test for warnings, line 
> 137)
> NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c note (test for warnings, line 
> 19)
> NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c note (test for warnings, line 
> 39)
> NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c note (test for warnings, line 
> 56)
> NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c note (test for warnings, line 
> 70)
> NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c (test for excess errors)
> NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c  (test for warnings, line 116)
> NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c  (test for warnings, line 131)
> NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c  (test for warnings, line 146)
> NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c  (test for warnings, line 33)
> NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c  (test for warnings, line 50)
> NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c  (test for warnings, line 64)
> NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c  (test for warnings, line 78)
> NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c  (test for warnings, line 97)
> PASS->FAIL: c-c++-common/Wstringop-overflow-2.c  -std=gnu++14 (test for 
> excess errors)
> NA->FAIL: c-c++-common/Wstringop-overflow-2.c  -std=gnu++14  (test for 
> warnings, line 229)
> NA->FAIL: c-c++-common/Wstringop-overflow-2.c  -std=gnu++14  (test for 
> warnings, line 230)
> NA->FAIL: c-c++-common/Wstringop-overflow-2.c  -std=gnu++14  (test for 
> warnings, line 331)
> NA->FAIL: c-c++-common/Wstringop-overflow-2.c  -std=gnu++14  (test for 
> warnings, line 332)
> // omitting -std=gnu++17, -std=gnu++2a, -std=gnu++98
> 
> I'll have a look and get back to you tomorrow.
> 

The failure of c-c++-common/Wstringop-overflow-2.c is due to the
current proc check_vect_slp_vnqihi_store_usage caching a single
result even though the outcome can vary for different input patterns.
For rs6000 the test for v2qi fails, and the cached result then makes
the v4qi check fail unexpectedly (it should pass).  I adjusted the
caching for the users check_effective_target_vect_slp_v*_store and
also refactored a bit.  One trivial change is to add one new argument
macro so that we compile only the corresponding foo* function instead
of all of them, which hopefully keeps the debugging output compact.

For the failure in Wstringop-overflow-76-novec.c, there is one typo
compared to the original Wstringop-overflow-76.c.  I guess it failed
on x86 too?  It would be surprising if it passed there.
As for the failure in Wstringop-overflow-21-novec.c, I confirmed it's
just noise; it was due to typos that have since been patched.

One new round ppc64le testing just finished with below diff and all
previous regressions are fixed without any new regressions.


diff --git a/gcc/testsuite/gcc.dg/Wstringop-overflow-76-novec.c b/gcc/testsuite/gcc.dg/Wstringop-overflow-76-novec.c
index d000b587a65..1132348c5f4 100644
--- a/gcc/testsuite/gcc.dg/Wstringop-overflow-76-novec.c
+++ b/gcc/testsuite/gcc.dg/Wstringop-overflow-76-novec.c
@@ -82,7 +82,7 @@ void max_d8_p (char *q, int i)
 struct A3_5
 {
  char a3[3];  // { dg-message "at offset 3 into destination object 'a3' of size 3" "pr??" { xfail *-*-* } }
-  char a5[5];
+  char a5[5];  // { dg-message "at offset 5 into destination object 'a5' of size 5" "note" }
 };

 void max_A3_A5 (int i, struct A3_5 *pa3_5)
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 530c5769614..8736b908ec7 100644
--- a/gcc/testsuite/lib/target-supp

Re: [PATCH] Adjust testcase for O2 vectorization.

2021-10-14 Thread Kewen.Lin via Gcc-patches
Hi Hongtao,

on 2021/10/14 3:11 PM, liuhongt wrote:
> Hi Kewen:
>   Could you help to verify if this patch fixes those regressions
> for rs6000 port.
> 

The ppc64le run just finished, there are still some regressions:

NA->XPASS: c-c++-common/Wstringop-overflow-2.c  -Wc++-compat   (test for 
warnings, line 194)
NA->XPASS: c-c++-common/Wstringop-overflow-2.c  -Wc++-compat   (test for 
warnings, line 212)
NA->XPASS: c-c++-common/Wstringop-overflow-2.c  -Wc++-compat   (test for 
warnings, line 296)
NA->XPASS: c-c++-common/Wstringop-overflow-2.c  -Wc++-compat   (test for 
warnings, line 314)
NA->FAIL: gcc.dg/Wstringop-overflow-21-novec.c (test for excess errors)
NA->FAIL: gcc.dg/Wstringop-overflow-21-novec.c  (test for warnings, line 18)
NA->FAIL: gcc.dg/Wstringop-overflow-21-novec.c  (test for warnings, line 29)
NA->FAIL: gcc.dg/Wstringop-overflow-21-novec.c  (test for warnings, line 45)
NA->FAIL: gcc.dg/Wstringop-overflow-21-novec.c  (test for warnings, line 55)
NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c note (test for warnings, line 
104)
NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c note (test for warnings, line 
137)
NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c note (test for warnings, line 19)
NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c note (test for warnings, line 39)
NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c note (test for warnings, line 56)
NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c note (test for warnings, line 70)
NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c (test for excess errors)
NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c  (test for warnings, line 116)
NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c  (test for warnings, line 131)
NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c  (test for warnings, line 146)
NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c  (test for warnings, line 33)
NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c  (test for warnings, line 50)
NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c  (test for warnings, line 64)
NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c  (test for warnings, line 78)
NA->FAIL: gcc.dg/Wstringop-overflow-76-novec.c  (test for warnings, line 97)
PASS->FAIL: c-c++-common/Wstringop-overflow-2.c  -std=gnu++14 (test for excess 
errors)
NA->FAIL: c-c++-common/Wstringop-overflow-2.c  -std=gnu++14  (test for 
warnings, line 229)
NA->FAIL: c-c++-common/Wstringop-overflow-2.c  -std=gnu++14  (test for 
warnings, line 230)
NA->FAIL: c-c++-common/Wstringop-overflow-2.c  -std=gnu++14  (test for 
warnings, line 331)
NA->FAIL: c-c++-common/Wstringop-overflow-2.c  -std=gnu++14  (test for 
warnings, line 332)
// omitting -std=gnu++17, -std=gnu++2a, -std=gnu++98

I'll have a look and get back to you tomorrow.

BR,
Kewen

> As discussed in [1], this patch add xfail/target selector to those
> testcases, also make a copy of them so that they can be tested w/o
> vectorization.
> 
> Newly added xfail/target selectors are used to check the vectorization
> capability of continuous byte/double bytes storage, these scenarios
> are exactly the part of the testcases that regressed after O2
> vectorization.
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581456.html.
> 
> gcc/testsuite/ChangeLog
> 
>   PR middle-end/102722
>   PR middle-end/102697
>   PR middle-end/102462
>   PR middle-end/102706
>   * c-c++-common/Wstringop-overflow-2.c: Adjust testcase with new
>   xfail/target selector.
>   * gcc.dg/Warray-bounds-51.c: Ditto.
>   * gcc.dg/Warray-parameter-3.c: Ditto.
>   * gcc.dg/Wstringop-overflow-14.c: Ditto.
>   * gcc.dg/Wstringop-overflow-21.c: Ditto.
>   * gcc.dg/Wstringop-overflow-68.c: Ditto.
>   * gcc.dg/Wstringop-overflow-76.c: Ditto.
>   * gcc.dg/Warray-bounds-48.c: Ditto.
>   * lib/target-supports.exp (check_vect_slp_vnqihi_store_usage):
>   New function.
>   (check_effective_target_vect_slp_v2qi_store): Ditto.
>   (check_effective_target_vect_slp_v4qi_store): Ditto.
>   (check_effective_target_vect_slp_v8qi_store): Ditto.
>   (check_effective_target_vect_slp_v16qi_store): Ditto.
>   (check_effective_target_vect_slp_v2hi_store): Ditto.
>   (check_effective_target_vect_slp_v4hi_store): Ditto.
>   * c-c++-common/Wstringop-overflow-2-novec.c: New test.
>   * gcc.dg/Warray-bounds-51-novec.c: New test.
>   * gcc.dg/Warray-bounds-48-novec.c: New test.
>   * gcc.dg/Warray-parameter-3-novec.c: New test.
>   * gcc.dg/Wstringop-overflow-14-novec.c: New test.
>   * gcc.dg/Wstringop-overflow-21-novec.c: New test.
>   * gcc.dg/Wstringop-overflow-76-novec.c: New test.
> ---
>  .../c-c++-common/Wstringop-overflow-2-novec.c | 348 +
>  .../c-c++-common/Wstringop-overflow-2.c   |  26 +-
>  gcc/testsuite/gcc.dg/Warray-bounds-48-novec.c | 364 ++
>  gcc/testsuite/gcc.dg/Warray-bounds-48.c   |   6 +-
>  gcc/testsuite/gcc.dg/Warray-bounds-51-novec.c |  61 +++
>  gcc/testsuite/gcc.dg/Warray-bounds-51.c   

Re: [PATCH] rs6000/test: Adjust some cases due to O2 vect [PR102658]

2021-10-13 Thread Kewen.Lin via Gcc-patches
on 2021/10/13 2:29 PM, Hongtao Liu via Gcc-patches wrote:
> On Wed, Oct 13, 2021 at 11:34 AM Hongtao Liu  wrote:
>>
>> On Tue, Oct 12, 2021 at 11:49 PM Martin Sebor  wrote:
>>>
>>> On 10/11/21 8:31 PM, Hongtao Liu wrote:
>>>> On Tue, Oct 12, 2021 at 4:08 AM Martin Sebor via Gcc-patches
>>>>  wrote:
>>>>>
>>>>> On 10/11/21 11:43 AM, Segher Boessenkool wrote:
>>>>>> On Mon, Oct 11, 2021 at 10:23:03AM -0600, Martin Sebor wrote:
>>>>>>> On 10/11/21 9:30 AM, Segher Boessenkool wrote:
>>>>>>>> On Mon, Oct 11, 2021 at 10:47:00AM +0800, Kewen.Lin wrote:
>>>>>>>>> - For generic test cases, it follows the existing suggested
>>>>>>>>> practice with necessary target/xfail selector.
>>>>>>>>
>>>>>>>> Not such a great choice.  Many of those tests do not make sense with
>>>>>>>> vectorisation enabled.  This should have been thought about, in some
>>>>>>>> cases resulting in not running the test with vectorisation enabled, and
>>>>>>>> in some cases duplicating the test, once with and once without
>>>>>>>> vectorisation.
>>>>>>>
>>>>>>> The tests detect bugs that are present both with and without
>>>>>>> vetctorization, so they should pass both ways.
>>>>>>
>>>>>> Then it should be tested both ways!  This is my point.
>>>>>
>>>>> Agreed.  (Most warnings are tested with just one set of options,
>>>>> but it's becoming apparent that the middle end ones should be
>>>>> exercised more extensively.)
>>>>>
>>>>>>
>>>>>>> That they don't
>>>>>>> tells us that that the warnings need work (they were written with
>>>>>>> an assumption that doesn't hold anymore).
>>>>>>
>>>>>> They were written in world A.  In world B many things behave
>>>>>> differently.  Transplanting the testcases from A to B without any extra
>>>>>> analysis will not test what the testcases wanted to test, and possibly
>>>>>> nothing at all anymore.
>>>>>
>>>>> Absolutely.
>>>>>
>>>>>>
>>>>>>> We need to track that
>>>>>>> work somehow, but simply xfailing them without making a record
>>>>>>> of what underlying problem the xfails correspond to isn't the best
>>>>>>> way.  In my experience, what works well is opening a bug for each
>>>>>>> distinct limitation (if one doesn't already exist) and adding
>>>>>>> a reference to it as a comment to the xfail.
>>>>>>
>>>>>> Probably, yes.
>>>>>>
>>>>>>>> But you are just following established practice, so :-)
>>>>>>
>>>>>> I also am okay with this.  If it was decided x86 does not have to deal
>>>>>> with these (generic!) problems, then why should we do other people's
>>>>>> work?
>>>>>
>>>>> I don't know that anything was decided.  I think those changes
>>>>> were made in haste, and (as you noted in your review of these
>>>>> updates to them), were incomplete (missing comments referencing
>>>>> the underlying bugs or limitations).  Now that we've noticed it
>>>>> we should try to fix it.  I'm not expecting you (or Kwen) to do
>>>>> other people's work, but it would help to let them/us know that
>>>>> there is work for us to do.  I only noticed the problem by luck.
>>>>>
>>>>>>>>> -  struct A1 a = { 0, { 1 } };   // { dg-warning
>>>>>>>>> "\\\[-Wstringop-overflow" "" { target { i?86-*-* x86_64-*-* } } }
>>>>>>>>> +  struct A1 a = { 0, { 1 } };   // { dg-warning
>>>>>>>>> "\\\[-Wstringop-overflow" "" { target { i?86-*-* x86_64-*-* 
>>>>>>>>> powerpc*-*-*
>>>>>>>>> } } }
>>>>>>>
>>>>>>> As I mentioned in the bug, when adding xfails for regressions
>>>>>>> please be sure to reference the bug that tracks the underlying
>>>>>>> root cause.]
>>>>>>
>>>>>> You are say

Re: [PATCH] rs6000: Remove builtin mask check from builtin_decl [PR102347]

2021-10-12 Thread Kewen.Lin via Gcc-patches
Hi Bill!

on 2021/10/13 12:36 AM, Bill Schmidt wrote:
> Hi Kewen,
> 
> On 10/11/21 1:30 AM, Kewen.Lin wrote:
>> Hi Segher,
>>
>> Thanks for the comments.
>>
>>> on 2021/10/1 6:13 AM, Segher Boessenkool wrote:
>>> Hi!
>>>
>>> On Thu, Sep 30, 2021 at 11:06:50AM +0800, Kewen.Lin wrote:
>>>
>>> [ huge snip ]
>>>
>>>> Based on the understanding and testing, I think it's safe to adopt this 
>>>> patch.
>>>> Do both Peter and you agree the rs6000_expand_builtin will catch the 
>>>> invalid built-in?
>>>> Is there some special case which probably escapes out?
>>> The function rs6000_builtin_decl has a terribly generic name.  Where all
>>> is it called from?  Do all such places allow the change in semantics?
>>> Do any comments or other documentation need to change?  Is the function
>>> name still good?
>>
>> % grep -rE "\<builtin_decl\> \(" .
>> ./gcc/config/avr/avr-c.c:  fold = targetm.builtin_decl (id, true);
>> ./gcc/config/avr/avr-c.c:  fold = targetm.builtin_decl (id, true);
>> ./gcc/config/avr/avr-c.c:  fold = targetm.builtin_decl (id, true);
>> ./gcc/config/aarch64/aarch64.c:  return aarch64_sve::builtin_decl 
>> (subcode, initialize_p);
>> ./gcc/config/aarch64/aarch64-protos.h:  tree builtin_decl (unsigned, bool);
>> ./gcc/config/aarch64/aarch64-sve-builtins.cc:builtin_decl (unsigned int 
>> code, bool)
>> ./gcc/tree-streamer-in.c:  tree result = targetm.builtin_decl 
>> (fcode, true);
>>
>> % grep -rE "\<rs6000_builtin_decl\> \(" .
>> ./gcc/config/rs6000/rs6000-c.c:  if (rs6000_builtin_decl 
>> (instance->bifid, false) != error_mark_node
>> ./gcc/config/rs6000/rs6000-c.c:  if (rs6000_builtin_decl 
>> (instance->bifid, false) != error_mark_node
>> ./gcc/config/rs6000/rs6000-c.c:  if (rs6000_builtin_decl 
>> (instance->bifid, false) != error_mark_node
>> ./gcc/config/rs6000/rs6000-gen-builtins.c:  "extern tree 
>> rs6000_builtin_decl (unsigned, "
>> ./gcc/config/rs6000/rs6000-call.c:rs6000_builtin_decl (unsigned code, bool 
>> initialize_p ATTRIBUTE_UNUSED)
>> ./gcc/config/rs6000/rs6000-internal.h:extern tree rs6000_builtin_decl 
>> (unsigned code,
>>
>> As above, the call sites are mainly in
>>   1) function unpack_ts_function_decl_value_fields in gcc/tree-streamer-in.c
>>   2) function altivec_resolve_new_overloaded_builtin in 
>> gcc/config/rs6000/rs6000-c.c
>>
>> 2) is newly introduced by Bill's bif rewriting patch series, all uses in it 
>> are
>> along with rs6000_new_builtin_is_supported which adopts a new way to check 
>> bif
>> supported or not (the old rs6000_builtin_is_supported_p uses builtin mask), 
>> so
>> I think the builtin mask checking is useless (unexpected?) for these uses.
> 
> Things are a bit confused because we are part way through the patch series.
> rs6000_builtin_decl will be changed to redirect to rs6000_new_builtin_decl 
> when
> using the new builtin support.  That function will be:
> 
> static tree
> rs6000_new_builtin_decl (unsigned code, bool initialize_p ATTRIBUTE_UNUSED)
> {
>   rs6000_gen_builtins fcode = (rs6000_gen_builtins) code;
> 
>   if (fcode >= RS6000_OVLD_MAX)
> return error_mark_node;
> 
>   if (!rs6000_new_builtin_is_supported (fcode))
> {
>   rs6000_invalid_new_builtin (fcode);
>   return error_mark_node;
> }
> 
>   return rs6000_builtin_decls_x[code];
> }
> 
> So, as you surmise, this will be using the new method of testing for builtin 
> validity.
> You can ignore the rs6000-c.c and rs6000-gen-builtins.c references of 
> rs6000_builtin_decl
> for purposes of fixing the existing way of doing things.
> 

Thanks for the explanation, it makes more sense. 

>>
>> Besides, the description for this hook:
>>
>> "tree TARGET_BUILTIN_DECL (unsigned code, bool initialize_p) [Target Hook]
>> Define this hook if you have any machine-specific built-in functions that 
>> need to be
>> defined. It should be a function that returns the builtin function 
>> declaration for the
>> builtin function code code. If there is no such builtin and it cannot be 
>> initialized at
>> this time if initialize p is true the function should return NULL_TREE. If 
>> code is out
>> of range the function should return error_mark_node."
>>
>> It would only return error_mark_node when the code is out of range.  The 
>> current
>> rs6000_builtin_decl returns error_mark_node not only for "out of range"

PING^3 [PATCH] rs6000: Fix some issues in rs6000_can_inline_p [PR102059]

2021-10-12 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this patch:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578552.html

One related patch [1] is ready to commit, whose test cases rely on
this patch if no changes are applied to them.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579658.html

BR,
Kewen

>> on 2021/9/1 2:55 PM, Kewen.Lin via Gcc-patches wrote:
>>> Hi!
>>>
>>> This patch is to fix the inconsistent behaviors for non-LTO mode
>>> and LTO mode.  As Martin pointed out, currently the function
>>> rs6000_can_inline_p simply makes it inlinable if callee_tree is
>>> NULL, but that's wrong; we should use the command line options
>>> from target_option_default_node as default.  It also replaces
>>> rs6000_isa_flags with the one from target_option_default_node
>>> when caller_tree is NULL as rs6000_isa_flags could probably
>>> change since initialization.
>>>
>>> It also extends the scope of the check for the case where the callee
>>> has explicitly set options; for test case pr102059-2.c, inlining could
>>> happen unexpectedly before, and this is fixed accordingly.
>>>
>>> As Richi/Mike pointed out, some tuning flags like MASK_P8_FUSION
>>> can be neglected for inlining, this patch also excludes them when
>>> the callee is attributed by always_inline.
>>>
>>> Bootstrapped and regtested on powerpc64le-linux-gnu Power9.
>>>
>>> BR,
>>> Kewen
>>> -
>>> gcc/ChangeLog:
>>>
>>> PR ipa/102059
>>> * config/rs6000/rs6000.c (rs6000_can_inline_p): Adjust with
>>> target_option_default_node and consider always_inline_safe flags.
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>> PR ipa/102059
>>> * gcc.target/powerpc/pr102059-1.c: New test.
>>> * gcc.target/powerpc/pr102059-2.c: New test.
>>> * gcc.target/powerpc/pr102059-3.c: New test.
>>> * gcc.target/powerpc/pr102059-4.c: New test.
>>>
>>


PING^1 [PATCH v2] rs6000: Modify the way for extra penalized cost

2021-10-12 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580358.html

BR,
Kewen

on 2021/9/28 4:16 PM, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> This patch follows the discussions here[1][2], where Segher
> pointed out that the existing way to guard the extra penalized
> cost for strided/elementwise loads with a magic bound does
> not scale.
> 
> The nunits * stmt_cost approach can yield a much-exaggerated
> penalized cost, such as: for V16QI on P8, it's
> 16 * 20 = 320; that's why we needed one bound.  To make it
> better and more readable, the penalized cost is simplified
> as:
> 
> unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
> unsigned extra_cost = nunits * adjusted_cost;
> 
> For V2DI/V2DF, it uses 2 penalized cost for each scalar load
> while for the other modes, it uses 1.  This is mainly concluded
> from the performance evaluations.  One possibly related point:
> the more units a vector is constructed from, the more
> instructions are used, which gives more chances to schedule
> them well (they may even run in parallel when enough units are
> available at the time), so it seems reasonable not to penalize
> them more.
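For illustration, the old capped heuristic and the new per-nunits one can be
sketched as a tiny standalone computation (hypothetical helper names; the
constants are taken from the discussion above):

```c
#include <assert.h>

/* Old heuristic: nunits * stmt_cost, capped by a magic bound of 12.  */
static unsigned
old_extra_cost (unsigned nunits, unsigned stmt_cost)
{
  unsigned extra_cost = nunits * stmt_cost;
  const unsigned MAX_PENALIZED_COST_FOR_CTOR = 12;
  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
  return extra_cost;
}

/* New heuristic: 2 per scalar load for 2 nunits, otherwise 1.  */
static unsigned
new_extra_cost (unsigned nunits)
{
  unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
  return nunits * adjusted_cost;
}
```

E.g. for V16QI on Power8 (stmt_cost 20, nunits 16) the old scheme computes
320 and caps it at 12, while the new scheme yields 16; for V2DI/V2DF
(nunits 2) the new scheme yields 4.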
> 
> The SPEC2017 evaluations on Power8/Power9/Power10 at option
> sets O2-vect and Ofast-unroll show this change is neutral.
> 
> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
> 
> Is it ok for trunk?
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
> [2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580099.html
> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579529.html
> 
> BR,
> Kewen
> -
> gcc/ChangeLog:
> 
>   * config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
>   the way to compute extra penalized cost.  Remove useless parameter.
>   (rs6000_add_stmt_cost): Adjust the call to function
>   rs6000_update_target_cost_per_stmt.
> 
> 
> ---
>  gcc/config/rs6000/rs6000.c | 31 ++-
>  1 file changed, 18 insertions(+), 13 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index dd42b0964f1..8200e1152c2 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -5422,7 +5422,6 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data 
> *data,
>   enum vect_cost_for_stmt kind,
>   struct _stmt_vec_info *stmt_info,
>   enum vect_cost_model_location where,
> - int stmt_cost,
>   unsigned int orig_count)
>  {
> 
> @@ -5462,17 +5461,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data 
> *data,
>   {
> tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> unsigned int nunits = vect_nunits_for_cost (vectype);
> -   unsigned int extra_cost = nunits * stmt_cost;
> -   /* As function rs6000_builtin_vectorization_cost shows, we have
> -  priced much on V16QI/V8HI vector construction as their units,
> -  if we penalize them with nunits * stmt_cost, it can result in
> -  an unreliable body cost, eg: for V16QI on Power8, stmt_cost
> -  is 20 and nunits is 16, the extra cost is 320 which looks
> -  much exaggerated.  So let's use one maximum bound for the
> -  extra penalized cost for vector construction here.  */
> -   const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
> -   if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
> - extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
> +   /* Don't expect strided/elementwise loads for just 1 nunit.  */
> +   gcc_assert (nunits > 1);
> +   /* i386 port adopts nunits * stmt_cost as the penalized cost
> +  for this kind of penalization, we used to follow it but
> +  found it could result in an unreliable body cost especially
> +  for V16QI/V8HI modes.  To make it better, we choose this
> +  new heuristic: for each scalar load, we use 2 as penalized
> +  cost for the case with 2 nunits and use 1 for the other
> +  cases.  It's without much supporting theory, mainly
> +  concluded from the broad performance evaluations on Power8,
> +  Power9 and Power10.  One possibly related point is that:
> +  vector construction for more units would use more insns,
> +  it has more chances to schedule them better (even run in
> +  parallelly when enough available units at that time), so
> +  it seems reasonable not to penalize that much for them.  */
> +   unsigned int adjusted_cost = (nunits == 2) ? 2 : 1;
> +   uns

PING^4 [PATCH v2] combine: Tweak the condition of last_set invalidation

2021-10-12 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572555.html

BR,
Kewen

>>> on 2021/6/11 9:16 PM, Kewen.Lin via Gcc-patches wrote:
>>>> Hi Segher,
>>>>
>>>> Thanks for the review!
>>>>
>>>>> on 2021/6/10 4:17 AM, Segher Boessenkool wrote:
>>>>> Hi!
>>>>>
>>>>> On Wed, Dec 16, 2020 at 04:49:49PM +0800, Kewen.Lin wrote:
>>>>>> Currently we have the check:
>>>>>>
>>>>>>   if (!insn
>>>>>>|| (value && rsp->last_set_table_tick >= label_tick_ebb_start))
>>>>>>  rsp->last_set_invalid = 1; 
>>>>>>
>>>>>> which means if we want to record some value for some reg and
>>>>>> this reg was referred to before in a valid scope,
>>>>>
>>>>> If we already know it is *set* in this same extended basic block.
>>>>> Possibly by the same instruction btw.
>>>>>
>>>>>> we invalidate the
>>>>>> set of reg (last_set_invalid to 1).  It avoids to find the wrong
>>>>>> set for one reg reference, such as the case like:
>>>>>>
>>>>>>... op regX  // this regX could find wrong last_set below
>>>>>>regX = ...   // if we think this set is valid
>>>>>>... op regX
>>>>>
>>>>> Yup, exactly.
>>>>>
>>>>>> But because of the retry mechanism, the last_set_table_tick could
>>>>>> have been set by some later reference insns, yet we see it as set
>>>>>> when retrying the set insn (for that reg) again, such as:
>>>>>>
>>>>>>insn 1
>>>>>>insn 2
>>>>>>
>>>>>>regX = ... --> (a)
>>>>>>... op regX--> (b)
>>>>>>
>>>>>>insn 3
>>>>>>
>>>>>>// assume all in the same BB.
>>>>>>
>>>>>> Assuming we combine 1, 2 -> 3 successfully and replace them as two
>>>>>> (3 insns -> 2 insns),
>>>>>
>>>>> This will delete insn 1 and write the combined result to insns 2 and 3.
>>>>>
>>>>>> retrying from insn1 or insn2 again:
>>>>>
>>>>> Always 2, but your point remains valid.
>>>>>
>>>>>> it will scan insn (a) again, the below condition holds for regX:
>>>>>>
>>>>>>   (value && rsp->last_set_table_tick >= label_tick_ebb_start)
>>>>>>
>>>>>> it will mark this set as an invalid set.  But actually the
>>>>>> last_set_table_tick here was set by insn (b) before retrying, so it
>>>>>> should be safe to take it as a valid set.
>>>>>
>>>>> Yup.
>>>>>
>>>>>> This proposal is to check whether the last_set_table safely happens
>>>>>> after the current set, make the set still valid if so.
>>>>>
>>>>> Full SPEC2017 building shows this patch gets more successful combines
>>>>>> from 1902208 to 1902243 (trivial though).
>>>>>
>>>>> Do you have some example, or maybe even a testcase?  :-)
>>>>>
>>>>
>>>> Sorry for the late reply, it took some time to get one reduced case.
>>>>
>>>> typedef struct SA *pa_t;
>>>>
>>>> struct SC {
>>>>   int h;
>>>>   pa_t elem[];
>>>> };
>>>>
>>>> struct SD {
>>>>   struct SC *e;
>>>> };
>>>>
>>>> struct SA {
>>>>   struct {
>>>> struct SD f[1];
>>>>   } g;
>>>> };
>>>>
>>>> void foo(pa_t *k, char **m) {
>>>>   int l, i;
>>>>   pa_t a;
>>>>   l = (int)a->g.f[5].e;
>>>>   i = 0;
>>>>   for (; i < l; i++) {
>>>> k[i] = a->g.f[5].e->elem[i];
>>>> m[i] = "";
>>>>   }
>>>> }
>>>>
>>>> Baseline is r12-0 and the option is "-O3 -mcpu=power9
>>>> -fno-strict-aliasing";
>>>> with this patch, the generated assembly saves two rlwinm instructions.
>>>>
>>>>>> +  /* Record the luid of the insn whose expression involving regist

Re: [PATCH] rs6000: Remove builtin mask check from builtin_decl [PR102347]

2021-10-11 Thread Kewen.Lin via Gcc-patches
Hi Segher,

Thanks for the comments.

on 2021/10/1 6:13 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, Sep 30, 2021 at 11:06:50AM +0800, Kewen.Lin wrote:
> 
> [ huge snip ]
> 
>> Based on the understanding and testing, I think it's safe to adopt this
>> patch.  Do both you and Peter agree that rs6000_expand_builtin will catch
>> the invalid built-in?  Is there some special case that could escape?
> 
> The function rs6000_builtin_decl has a terribly generic name.  Where all
> is it called from?  Do all such places allow the change in semantics?
> Do any comments or other documentation need to change?  Is the function
> name still good?


% grep -rE "\<builtin_decl\> \(" .
./gcc/config/avr/avr-c.c:  fold = targetm.builtin_decl (id, true);
./gcc/config/avr/avr-c.c:  fold = targetm.builtin_decl (id, true);
./gcc/config/avr/avr-c.c:  fold = targetm.builtin_decl (id, true);
./gcc/config/aarch64/aarch64.c:  return aarch64_sve::builtin_decl (subcode, 
initialize_p);
./gcc/config/aarch64/aarch64-protos.h:  tree builtin_decl (unsigned, bool);
./gcc/config/aarch64/aarch64-sve-builtins.cc:builtin_decl (unsigned int code, 
bool)
./gcc/tree-streamer-in.c:  tree result = targetm.builtin_decl (fcode, 
true);

% grep -rE "\<rs6000_builtin_decl\> \(" .
./gcc/config/rs6000/rs6000-c.c: if (rs6000_builtin_decl (instance->bifid, 
false) != error_mark_node
./gcc/config/rs6000/rs6000-c.c: if (rs6000_builtin_decl (instance->bifid, 
false) != error_mark_node
./gcc/config/rs6000/rs6000-c.c: if (rs6000_builtin_decl (instance->bifid, 
false) != error_mark_node
./gcc/config/rs6000/rs6000-gen-builtins.c: "extern tree 
rs6000_builtin_decl (unsigned, "
./gcc/config/rs6000/rs6000-call.c:rs6000_builtin_decl (unsigned code, bool 
initialize_p ATTRIBUTE_UNUSED)
./gcc/config/rs6000/rs6000-internal.h:extern tree rs6000_builtin_decl (unsigned 
code,

As above, the call sites are mainly in
  1) function unpack_ts_function_decl_value_fields in gcc/tree-streamer-in.c
  2) function altivec_resolve_new_overloaded_builtin in 
gcc/config/rs6000/rs6000-c.c

2) is newly introduced by Bill's bif rewriting patch series; all of its uses
go together with rs6000_new_builtin_is_supported, which adopts a new way to
check whether a bif is supported (the old rs6000_builtin_is_supported_p uses
the builtin mask), so I think the builtin mask checking is useless
(unexpected?) for these uses.

Besides, the description for this hook:

"tree TARGET_BUILTIN_DECL (unsigned code, bool initialize_p) [Target Hook]
Define this hook if you have any machine-specific built-in functions that need 
to be
defined. It should be a function that returns the builtin function declaration 
for the
builtin function code code. If there is no such builtin and it cannot be 
initialized at
this time if initialize_p is true the function should return NULL_TREE. If code 
is out
of range the function should return error_mark_node."

It would only return error_mark_node when the code is out of range.  The current
rs6000_builtin_decl returns error_mark_node not only for "out of range"; that
looks inconsistent, and this patch also revises it.
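The documented contract can be sketched with stand-in types (hypothetical
names, not the rs6000 implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void *tree;                   /* stand-in for GCC's tree */
static int error_mark_storage;
#define error_mark_node ((tree) &error_mark_storage)

enum { EXAMPLE_BUILTIN_COUNT = 4 };
static tree example_builtin_decls[EXAMPLE_BUILTIN_COUNT];

/* Per the hook documentation: error_mark_node only when CODE is out
   of range; NULL_TREE when the built-in does not exist or cannot be
   initialized at this time.  */
static tree
example_builtin_decl (unsigned code, bool initialize_p)
{
  (void) initialize_p;
  if (code >= EXAMPLE_BUILTIN_COUNT)
    return error_mark_node;            /* out of range */
  return example_builtin_decls[code];  /* may be NULL, i.e. NULL_TREE */
}
```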

The hook was introduced by commit e9e4b3a892d0d19418f23bb17bdeac33f9a8bfd2;
it was meant to ensure the bif function_decl is valid (checking whether the
bif code is in range and the corresponding entry in the bif table is not
NULL).  Maybe a better name would be check_and_get_builtin_decl?  CC Richi,
he may have more insights.

> 
>> By the way, I tested the bif rewriting patch series V5; it couldn't make
>> the original case in the PR (S5) pass.  I may have missed something, or
>> the series I used isn't up-to-date.  Could you help to have a try?  I
>> agree with Peter: if the rewriting can fix this issue, then we don't need
>> this patch for trunk any more, and I'm happy to abandon it.  :)
> 
> (Mail lines are 70 or so chars max, so that they can be quoted a few
> levels).
> 

ah, OK, thanks.  :)

> If we do need a band-aid for 10 and 11 (and we do as far as I can see),
> I'd like to see one for just MMA there, and let all other badness fade
> into history.  Unless you can convince me (in the patch / commit
> message) that this is safe :-)

Just fixing MMA seems incomplete to me, since we can easily construct a
failing non-MMA case.  I asked in the other thread: is there any
possibility for an invalid target-specific bif to escape from the function
rs6000_expand_builtin?  (Note that folding won't handle invalid bifs, so
invalid bifs won't get folded early.)  If not, I think it would be safe.

> 
> Whichever way you choose, it is likely best to do the same on 10 and 11
> as on trunk, since it will all be replaced on trunk soon anyway.
> 

OK, will see Bill's reply (he should be back from vacation soon).  :)

BR,
Kewen


[PATCH] rs6000/test: Adjust some cases due to O2 vect [PR102658]

2021-10-10 Thread Kewen.Lin via Gcc-patches
Hi,

As PR102658 shows, commit r12-4240 enables vectorization at O2,
some test cases need to be adjusted accordingly for the rs6000 port.

- For target specific test cases, this adds -fno-tree-vectorize
to retain original test points, otherwise vectorization can
make some expected scalar instructions gone or generate some
unexpected instructions for vector construction.

- For generic test cases, it follows the existing suggested
practice with necessary target/xfail selector.

Tested with expected results on powerpc64le-linux-gnu and
powerpc64-linux-gnu.

Is it ok for trunk?

BR,
Kewen
-
gcc/testsuite/ChangeLog:

PR testsuite/102658
* c-c++-common/Wstringop-overflow-2.c: Adjust for rs6000 port.
* g++.dg/warn/Wuninitialized-13.C: Likewise.
* gcc.dg/Warray-parameter-3.c: Likewise.
* gcc.dg/Wstringop-overflow-21.c: Likewise.
* gcc.dg/Wstringop-overflow-68.c: Likewise.
* gcc.dg/Wstringop-overflow-76.c: Likewise.
* gcc.target/powerpc/dform-1.c: Adjust as vectorization enabled at O2.
* gcc.target/powerpc/dform-2.c: Likewise.
* gcc.target/powerpc/pr80510-2.c: Likewise.

---

diff --git a/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c 
b/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c
index 7d29b5f48c7..5d83caddc4e 100644
--- a/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c
+++ b/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c
@@ -221,10 +221,10 @@ void ga1_1 (void)
   a1_1.a[1] = 1;// { dg-warning "\\\[-Wstringop-overflow" }
   a1_1.a[2] = 2;// { dg-warning "\\\[-Wstringop-overflow" }

-  struct A1 a = { 0, { 1 } };   // { dg-warning "\\\[-Wstringop-overflow" "" { 
target { i?86-*-* x86_64-*-* } } }
+  struct A1 a = { 0, { 1 } };   // { dg-warning "\\\[-Wstringop-overflow" "" { 
target { i?86-*-* x86_64-*-* powerpc*-*-* } } }
   a.a[0] = 0;
-  a.a[1] = 1;   // { dg-warning "\\\[-Wstringop-overflow" "" { 
xfail { i?86-*-* x86_64-*-* } } }
-  a.a[2] = 2;   // { dg-warning "\\\[-Wstringop-overflow" "" { 
xfail { i?86-*-* x86_64-*-* } } }
+  a.a[1] = 1;   // { dg-warning "\\\[-Wstringop-overflow" "" { 
xfail { i?86-*-* x86_64-*-* powerpc*-*-* } } }
+  a.a[2] = 2;   // { dg-warning "\\\[-Wstringop-overflow" "" { 
xfail { i?86-*-* x86_64-*-* powerpc*-*-* } } }
   sink ();
 }

@@ -320,10 +320,10 @@ void ga1i_1 (void)
   a1i_1.a[1] = 1;   // { dg-warning "\\\[-Wstringop-overflow" }
   a1i_1.a[2] = 2;   // { dg-warning "\\\[-Wstringop-overflow" }

-  struct A1 a = { 0, { 1 } };   // { dg-warning "\\\[-Wstringop-overflow" "" { 
target { i?86-*-* x86_64-*-* } } }
+  struct A1 a = { 0, { 1 } };   // { dg-warning "\\\[-Wstringop-overflow" "" { 
target { i?86-*-* x86_64-*-* powerpc*-*-* } } }
   a.a[0] = 1;
-  a.a[1] = 2;   // { dg-warning "\\\[-Wstringop-overflow" "" { 
xfail { i?86-*-* x86_64-*-* } } }
-  a.a[2] = 3;   // { dg-warning "\\\[-Wstringop-overflow" "" { 
xfail { i?86-*-* x86_64-*-* } } }
+  a.a[1] = 2;   // { dg-warning "\\\[-Wstringop-overflow" "" { 
xfail { i?86-*-* x86_64-*-* powerpc*-*-* } } }
+  a.a[2] = 3;   // { dg-warning "\\\[-Wstringop-overflow" "" { 
xfail { i?86-*-* x86_64-*-* powerpc*-*-* } } }
   sink ();
 }

diff --git a/gcc/testsuite/g++.dg/warn/Wuninitialized-13.C 
b/gcc/testsuite/g++.dg/warn/Wuninitialized-13.C
index 210e74c3c3b..4ad897a6486 100644
--- a/gcc/testsuite/g++.dg/warn/Wuninitialized-13.C
+++ b/gcc/testsuite/g++.dg/warn/Wuninitialized-13.C
@@ -5,7 +5,7 @@
 struct shared_count {
   shared_count () { }
   shared_count (shared_count &r)
-: pi (r.pi) { } // { dg-warning "\\\[-Wuninitialized" "" { xfail { 
i?86-*-* x86_64-*-* } } }
+: pi (r.pi) { } // { dg-warning "\\\[-Wuninitialized" "" { xfail { 
i?86-*-* x86_64-*-* powerpc*-*-* } } }
   int pi;
 };

diff --git a/gcc/testsuite/gcc.dg/Warray-parameter-3.c 
b/gcc/testsuite/gcc.dg/Warray-parameter-3.c
index e8a269c85c6..f7404be8742 100644
--- a/gcc/testsuite/gcc.dg/Warray-parameter-3.c
+++ b/gcc/testsuite/gcc.dg/Warray-parameter-3.c
@@ -77,7 +77,7 @@ gia3 (int a[3])
 __attribute__ ((noipa)) void
 gcas3 (char a[static 3])
 {
-  a[0] = 0; a[1] = 1; a[2] = 2; // { dg-warning "\\\[-Wstringop-overflow" "" { 
target { i?86-*-* x86_64-*-* } } }
+  a[0] = 0; a[1] = 1; a[2] = 2; // { dg-warning "\\\[-Wstringop-overflow" "" { 
target { i?86-*-* x86_64-*-* powerpc*-*-* } } }
   a[3] = 3;   // { dg-warning "\\\[-Warray-bounds" }
 }

diff --git a/gcc/testsuite/gcc.dg/Wstringop-overflow-21.c 
b/gcc/testsuite/gcc.dg/Wstringop-overflow-21.c
index d88bde9c740..2db6a52b22b 100644
--- a/gcc/testsuite/gcc.dg/Wstringop-overflow-21.c
+++ b/gcc/testsuite/gcc.dg/Wstringop-overflow-21.c
@@ -23,10 +23,10 @@ void test_store_zero_length (int i)
 {
   char a[3];
   struct S0 *p = (struct S0*)a;
-  p->a = 0; // { dg-warning "\\\[-Wstringop-overflow" 

Re: [PATCH] Adjust testcase for O2 vectorization enabling

2021-10-10 Thread Kewen.Lin via Gcc-patches
Hi Hongtao,

on 2021/10/11 10:10 AM, liuhongt via Gcc-patches wrote:
> libgomp/ChangeLog:
> 
>   * testsuite/libgomp.graphite/force-parallel-8.c: Add 
> -fno-tree-vectorize.
> ---
>  libgomp/testsuite/libgomp.graphite/force-parallel-8.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/libgomp/testsuite/libgomp.graphite/force-parallel-8.c 
> b/libgomp/testsuite/libgomp.graphite/force-parallel-8.c
> index f9e07039172..0f3d138cebd 100644
> --- a/libgomp/testsuite/libgomp.graphite/force-parallel-8.c
> +++ b/libgomp/testsuite/libgomp.graphite/force-parallel-8.c
> @@ -1,4 +1,4 @@
> -/* { dg-additional-options "-fdisable-tree-thread1 
> -fdisable-tree-vrp-thread1" } */
> +/* { dg-additional-options "-fdisable-tree-thread1 
> -fdisable-tree-vrp-thread1 -fno-tree-vectorize" } */
>  
>  #define N 1500
>  
> 

Thanks for doing this!

This issue was observed in rs6000 specific PR102658 as well.

I've looked into it a bit; it's caused by the "conditional store
replacement", which is originally disabled when vectorization is off, as
the code below shows.

  /* If either vectorization or if-conversion is disabled then do
 not sink any stores.  */
  if (param_max_stores_to_sink == 0
  || (!flag_tree_loop_vectorize && !flag_tree_slp_vectorize)
  || !flag_tree_loop_if_convert)
return false;

The new change makes the innermost loop look like

for (int c1 = 0; c1 <= 1499; c1 += 1) {
  if (c1 <= 500) {
 S_10(c0, c1);
  } else {
  S_9(c0, c1);
  }
  S_11(c0, c1);
} 

and cannot be split as:

for (int c1 = 0; c1 <= 500; c1 += 1)
  S_10(c0, c1);

for (int c1 = 501; c1 <= 1499; c1 += 1)
  S_9(c0, c1);

So instead of disabling vectorization, could we just disable this
conditional store replacement with the parameter "--param max-stores-to-sink=0"?
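For illustration, the transformation in question can be sketched in plain C
(hypothetical functions, not GCC internals): conditional store replacement
sinks the two branch-local stores into one unconditional store fed by a
select, which is exactly what "--param max-stores-to-sink=0" suppresses.

```c
#include <assert.h>

/* Source shape before phiopt's conditional store replacement.  */
static void
conditional_stores (int *a, int i, int cond)
{
  if (cond)
    a[i] = 1;
  else
    a[i] = 2;
}

/* Conceptually after the replacement: one sunk, unconditional store.  */
static void
sunk_store (int *a, int i, int cond)
{
  int tmp = cond ? 1 : 2;
  a[i] = tmp;
}
```

Both behave the same, but the sunk form keeps an unconditional store in the
merged block, which is what prevents Graphite from splitting the loop above.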

I tested this proposal on ppc64le, it should work as well.

What do you think of it?

BR,
Kewen


[PATCH] testsuite: Add missing comment for some dg-warning

2021-10-09 Thread Kewen.Lin via Gcc-patches
Hi,

This patch fixes the typos introduced by commit r12-4240.

The dg-warning format looks like:

{ dg-warning regexp [comment [{ target/xfail selector } [line] ]] }

Some dg-warnings such as:

{ dg-warning "\\\[-Wstringop-overflow" { target { i?86-*-* x86_64-*-* } } }

miss the comment field, it makes target selector not take effect.

For targets other than { i?86-*-* x86_64-*-* }, such cases
fail or pass unexpectedly.

Is it ok for trunk?

BR,
Kewen
---
gcc/testsuite/ChangeLog:

* c-c++-common/Wstringop-overflow-2.c: Add missing comment.
* gcc.dg/Warray-bounds-51.c: Likewise.
* gcc.dg/Warray-parameter-3.c: Likewise.
* gcc.dg/Wstringop-overflow-14.c: Likewise.
* gcc.dg/Wstringop-overflow-21.c: Likewise.
* gcc.dg/Wstringop-overflow-76.c: Likewise.

-
diff --git a/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c 
b/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c
index 7e9da8a02cb..7d29b5f48c7 100644
--- a/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c
+++ b/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c
@@ -221,7 +221,7 @@ void ga1_1 (void)
   a1_1.a[1] = 1;// { dg-warning "\\\[-Wstringop-overflow" }
   a1_1.a[2] = 2;// { dg-warning "\\\[-Wstringop-overflow" }

-  struct A1 a = { 0, { 1 } };   // { dg-warning "\\\[-Wstringop-overflow" { 
target { i?86-*-* x86_64-*-* } } }
+  struct A1 a = { 0, { 1 } };   // { dg-warning "\\\[-Wstringop-overflow" "" { 
target { i?86-*-* x86_64-*-* } } }
   a.a[0] = 0;
   a.a[1] = 1;   // { dg-warning "\\\[-Wstringop-overflow" "" { 
xfail { i?86-*-* x86_64-*-* } } }
   a.a[2] = 2;   // { dg-warning "\\\[-Wstringop-overflow" "" { 
xfail { i?86-*-* x86_64-*-* } } }
@@ -320,7 +320,7 @@ void ga1i_1 (void)
   a1i_1.a[1] = 1;   // { dg-warning "\\\[-Wstringop-overflow" }
   a1i_1.a[2] = 2;   // { dg-warning "\\\[-Wstringop-overflow" }

-  struct A1 a = { 0, { 1 } };   // { dg-warning "\\\[-Wstringop-overflow" { 
target { i?86-*-* x86_64-*-* } } }
+  struct A1 a = { 0, { 1 } };   // { dg-warning "\\\[-Wstringop-overflow" "" { 
target { i?86-*-* x86_64-*-* } } }
   a.a[0] = 1;
   a.a[1] = 2;   // { dg-warning "\\\[-Wstringop-overflow" "" { 
xfail { i?86-*-* x86_64-*-* } } }
   a.a[2] = 3;   // { dg-warning "\\\[-Wstringop-overflow" "" { 
xfail { i?86-*-* x86_64-*-* } } }
diff --git a/gcc/testsuite/gcc.dg/Warray-bounds-51.c 
b/gcc/testsuite/gcc.dg/Warray-bounds-51.c
index b0b8bdb7938..c12f1407385 100644
--- a/gcc/testsuite/gcc.dg/Warray-bounds-51.c
+++ b/gcc/testsuite/gcc.dg/Warray-bounds-51.c
@@ -38,7 +38,7 @@ void test_struct_char_vla_location (void)
   } s;

   s.cvla[0] = __LINE__;
-  s.cvla[nelts - 1] = 0; // { dg-warning "\\\[-Wstringop-overflow" { target { 
i?86-*-* x86_64-*-* } } }
+  s.cvla[nelts - 1] = 0; // { dg-warning "\\\[-Wstringop-overflow" "" { target 
{ i?86-*-* x86_64-*-* } } }
   s.cvla[nelts] = 0;  // { dg-warning "\\\[-Warray-bounds" }

   sink ();
diff --git a/gcc/testsuite/gcc.dg/Warray-parameter-3.c 
b/gcc/testsuite/gcc.dg/Warray-parameter-3.c
index e2c47e1ed36..e8a269c85c6 100644
--- a/gcc/testsuite/gcc.dg/Warray-parameter-3.c
+++ b/gcc/testsuite/gcc.dg/Warray-parameter-3.c
@@ -77,7 +77,7 @@ gia3 (int a[3])
 __attribute__ ((noipa)) void
 gcas3 (char a[static 3])
 {
-  a[0] = 0; a[1] = 1; a[2] = 2; // { dg-warning "\\\[-Wstringop-overflow" { 
target { i?86-*-* x86_64-*-* } } }
+  a[0] = 0; a[1] = 1; a[2] = 2; // { dg-warning "\\\[-Wstringop-overflow" "" { 
target { i?86-*-* x86_64-*-* } } }
   a[3] = 3;   // { dg-warning "\\\[-Warray-bounds" }
 }

diff --git a/gcc/testsuite/gcc.dg/Wstringop-overflow-14.c 
b/gcc/testsuite/gcc.dg/Wstringop-overflow-14.c
index b648f5b41b1..7ac0154fc25 100644
--- a/gcc/testsuite/gcc.dg/Wstringop-overflow-14.c
+++ b/gcc/testsuite/gcc.dg/Wstringop-overflow-14.c
@@ -35,7 +35,7 @@ void test_memcpy_cond (int i)
 void test_int16 (void)
 {
   char *p = a4 + 1;
-  *(int16_t*)p = 0;// { dg-warning "writing 4 bytes into a region of size 
3" { target { i?86-*-* x86_64-*-* } } }
+  *(int16_t*)p = 0;// { dg-warning "writing 4 bytes into a region of size 
3" "" { target { i?86-*-* x86_64-*-* } } }
   *(int16_t*)(p + 2) = 0;   // { dg-warning "writing 2 bytes into a region of 
size 1" "" { xfail { i?86-*-* x86_64-*-* } } }
 }

diff --git a/gcc/testsuite/gcc.dg/Wstringop-overflow-21.c 
b/gcc/testsuite/gcc.dg/Wstringop-overflow-21.c
index e88f7b47894..d88bde9c740 100644
--- a/gcc/testsuite/gcc.dg/Wstringop-overflow-21.c
+++ b/gcc/testsuite/gcc.dg/Wstringop-overflow-21.c
@@ -23,7 +23,7 @@ void test_store_zero_length (int i)
 {
   char a[3];
   struct S0 *p = (struct S0*)a;
-  p->a = 0; // { dg-warning "\\\[-Wstringop-overflow" 
{ target { i?86-*-* x86_64-*-* } } }
+  p->a = 0; // { dg-warning "\\\[-Wstringop-overflow" 
"" { target { i?86-*-* x86_64-*-* } } }
   p->b[0] = 0;
   p->b[1] = 

Re: [PATCH] rs6000: Remove builtin mask check from builtin_decl [PR102347]

2021-09-29 Thread Kewen.Lin via Gcc-patches
Hi Bill,

on 2021/9/29 7:59 PM, Bill Schmidt wrote:
> Hi Kewen,
> 
> On 9/28/21 9:34 PM, Kewen.Lin wrote:
>> Hi Bill,
>>
>> Thanks for your prompt comments!
>>
>>> on 2021/9/29 3:24 AM, Bill Schmidt wrote:
>>> Hi Kewen,
>>>
>>> Although I agree that what we do now is tragically bad (and will be fixed 
>>> in the builtin rewrite), this seems a little too cavalier to remove all 
>>> checking during initialization without adding any checking somewhere else. 
>>> :-)  We still need to check for invalid usage when the builtin is expanded, 
>>> and I don't think the old code does this at all.
>>>
>> If I read the code right, the following places check for invalid usage:
>>   1) for folding, rs6000_gimple_fold_builtin -> 
>> rs6000_builtin_is_supported_p -> check mask
>>   -> defer to expand if invalid.
>>   2) for expanding, obtain func_valid_p, error in rs6000_invalid_builtin.
>>
>> Both places seem to exist before the builtin rewrite, am I missing something?
>>
>> btw, I remember I used a gcc built with my fix to compile one test case
>> which is supposed to fail due to its invalid builtin usage at option
>> -flto; it failed (errored) as expected, but at the LTRANS phase, since
>> that is when expansion happens for the non-fat-objects scenario.
> 
> OK.  If you are comfortable that this will be caught when the builtin is 
> actually not valid, then I'll
> withdraw my objection.  Can you test it?  I know that we've been trying to 
> fix these cases piecemeal
> in the old support, and as Peter says it's important to backport this, we 
> need the solution.  I just
> want to be sure we're not breaking something, and test coverage in this area 
> is pretty terrible.
> 

Thanks for the comments and the trust!  I found I forgot to mention the
function name rs6000_expand_builtin for the expanding part; specifically,
the function has:

...
  bool func_valid_p = ((rs6000_builtin_mask & mask) == mask);
...
  if (!func_valid_p)
{
  rs6000_invalid_builtin (fcode);   // It emits error here.

  /* Given it is invalid, just generate a normal call.  */
  return expand_call (exp, target, ignore);
}

IIUC, all invalid built-ins will eventually be caught by this function (as 
mentioned
before, the built-in gimple folding would bypass the invalid built-ins).

I tested the below case:

#ifndef EXPECT_ERROR
#pragma GCC target "cpu=power10"
#endif
int main() {
  float *b;
  __vector_quad c;
  __builtin_mma_disassemble_acc(b, &c);
  return 0;
}

Option set 1 (S1): -mcpu=power9 -c
Option set 2 (S2): -mcpu=power9 -c -DEXPECT_ERROR
Option set 3 (S3): -mcpu=power9 -c -flto
Option set 4 (S4): -mcpu=power9 -c -flto -DEXPECT_ERROR
Option set 5 (S5): -mcpu=power9 -flto (lto linking)
Option set 6 (S6): -mcpu=power9 -flto -DEXPECT_ERROR (lto linking)
Option set 7 (S7): -mcpu=power9 -c -flto -ffat-lto-objects
Option set 8 (S8): -mcpu=power9 -c -flto -ffat-lto-objects -DEXPECT_ERROR

  w/o fix  w/ fix
S1PASS PASS
S2ERRORERROR
S3PASS PASS
S4PASS PASS
S5ERRORPASS
S6ERRORERROR
S7PASS PASS
S8ERRORERROR

As above, this patch fixes the unexpected error in S5 and keeps the other
PASS/ERROR results as in the original.  Note that the S4 PASS is expected,
since expansion isn't needed when generating non-fat LTO objects; the error
happens during linking (S6).

Based on the understanding and testing, I think it's safe to adopt this patch.
Do both you and Peter agree that rs6000_expand_builtin will catch the invalid
built-in?  Is there some special case that could escape?

By the way, I tested the bif rewriting patch series V5; it couldn't make the
original case in the PR (S5) pass.  I may have missed something, or the
series I used isn't up-to-date.  Could you help to have a try?  I agree with
Peter: if the rewriting can fix this issue, then we don't need this patch
for trunk any more, and I'm happy to abandon it.  :)

BR,
Kewen

> Thanks!
> Bill
> 
>>
>>> Unless you are planning to do a backport, I think the proper way forward 
>>> here is to just wait for the new builtin support to land.  In the new code, 
>>> we initialize all built-ins up front, and check properly at expansion time 
>>> whether the builtin is enabled in the environment that obtains during 
>>> expand.
>> Good to know that!  Nice!  btw, for this issue itself, the current 
>> implementation (without rewriting)
>> also initializes the built-ins in the table since MMA built-ins guarded in 
>> TARGET_EXTRA_BUILTINS,
>> the r

Re: [PATCH] rs6000: Remove builtin mask check from builtin_decl [PR102347]

2021-09-28 Thread Kewen.Lin via Gcc-patches
Hi Bill,

Thanks for your prompt comments!

on 2021/9/29 3:24 AM, Bill Schmidt wrote:
> Hi Kewen,
> 
> Although I agree that what we do now is tragically bad (and will be fixed in 
> the builtin rewrite), this seems a little too cavalier to remove all checking 
> during initialization without adding any checking somewhere else. :-)  We 
> still need to check for invalid usage when the builtin is expanded, and I 
> don't think the old code does this at all.
> 

If I read the code right, the following places check for invalid usage:
  1) for folding, rs6000_gimple_fold_builtin -> rs6000_builtin_is_supported_p 
-> check mask
  -> defer to expand if invalid.
  2) for expanding, obtain func_valid_p, error in rs6000_invalid_builtin.

Both places seem to exist before the builtin rewrite, am I missing something?

btw, I remember I used a gcc built with my fix to compile one test case
which is supposed to fail due to its invalid builtin usage at option -flto;
it failed (errored) as expected, but at the LTRANS phase, since that is when
expansion happens for the non-fat-objects scenario.

> Unless you are planning to do a backport, I think the proper way forward here 
> is to just wait for the new builtin support to land.  In the new code, we 
> initialize all built-ins up front, and check properly at expansion time 
> whether the builtin is enabled in the environment that obtains during expand.

Good to know that!  Nice!  btw, for this issue itself, the current
implementation (without the rewriting) also initializes the built-ins in the
table, since MMA built-ins are guarded by TARGET_EXTRA_BUILTINS; the root
cause is that rs6000_builtin_mask can't be set up (switched) as expected,
since the check happens too early, right when the built-in function_decl is
being created.

BR,
Kewen

> 
> My two cents,
> Bill
> 
> On 9/28/21 3:13 AM, Kewen.Lin wrote:
>> Hi,
>>
>> As the discussion in PR102347, currently builtin_decl is invoked so
>> early, it's when making up the function_decl for builtin functions,
>> at that time the rs6000_builtin_mask could be wrong for those
>> builtins sitting in #pragma/attribute target functions, though it
>> will be updated properly later when LTO processes all nodes.
>>
>> This patch is to align with the practice i386 port adopts, also
>> align with r10-7462 by relaxing builtin mask checking in some places.
>>
>> Bootstrapped and regress-tested on powerpc64le-linux-gnu P9 and
>> powerpc64-linux-gnu P8.
>>
>> Is it ok for trunk?
>>
>> BR,
>> Kewen
>> -
>> gcc/ChangeLog:
>>
>>  PR target/102347
>>  * config/rs6000/rs6000-call.c (rs6000_builtin_decl): Remove builtin
>>  mask check.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  PR target/102347
>>  * gcc.target/powerpc/pr102347.c: New test.
>>
>> ---
>>  gcc/config/rs6000/rs6000-call.c | 14 --
>>  gcc/testsuite/gcc.target/powerpc/pr102347.c | 15 +++
>>  2 files changed, 19 insertions(+), 10 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr102347.c
>>
>> diff --git a/gcc/config/rs6000/rs6000-call.c 
>> b/gcc/config/rs6000/rs6000-call.c
>> index fd7f24da818..15e0e09c07d 100644
>> --- a/gcc/config/rs6000/rs6000-call.c
>> +++ b/gcc/config/rs6000/rs6000-call.c
>> @@ -13775,23 +13775,17 @@ rs6000_init_builtins (void)
>>  }
>>  }
>>
>> -/* Returns the rs6000 builtin decl for CODE.  */
>> +/* Returns the rs6000 builtin decl for CODE.  Note that we don't check
>> +   the builtin mask here since there could be some #pragma/attribute
>> +   target functions and the rs6000_builtin_mask could be wrong when
>> +   this checking happens, though it will be updated properly later.  */
>>
>>  tree
>>  rs6000_builtin_decl (unsigned code, bool initialize_p ATTRIBUTE_UNUSED)
>>  {
>> -  HOST_WIDE_INT fnmask;
>> -
>>if (code >= RS6000_BUILTIN_COUNT)
>>  return error_mark_node;
>>
>> -  fnmask = rs6000_builtin_info[code].mask;
>> -  if ((fnmask & rs6000_builtin_mask) != fnmask)
>> -{
>> -  rs6000_invalid_builtin ((enum rs6000_builtins)code);
>> -  return error_mark_node;
>> -}
>> -
>>return rs6000_builtin_decls[code];
>>  }
>>
>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr102347.c 
>> b/gcc/testsuite/gcc.target/powerpc/pr102347.c
>> new file mode 100644
>> index 000..05c439a8dac
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr102347.c
>> @@ -0,0 +1,15 @@
>> +/* { dg-do link } */
>> +/* { dg-require-effective-target power10_ok } */
>> +/* { dg-require-effective-target lto } */
>> +/* { dg-options "-flto -mdejagnu-cpu=power9" } */
>> +
>> +/* Verify there are no error messages in LTO mode.  */
>> +
>> +#pragma GCC target "cpu=power10"
>> +int main ()
>> +{
>> +  float *b;
>> +  __vector_quad c;
>> +  __builtin_mma_disassemble_acc (b, &c);
>> +  return 0;
>> +}
>> --
>> 2.27.0
>>
> 




PING^2 [PATCH] rs6000: Fix some issues in rs6000_can_inline_p [PR102059]

2021-09-28 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this patch:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578552.html

One related patch [1] is ready to commit; its test cases rely on
this patch unless changes are applied to them.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579658.html

BR,
Kewen

on 2021/9/15 4:42 PM, Kewen.Lin via Gcc-patches wrote:
> Hi!
> 
> Gentle ping this patch:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578552.html
> 
> BR,
> Kewen
> 
> on 2021/9/1 2:55 PM, Kewen.Lin via Gcc-patches wrote:
>> Hi!
>>
>> This patch is to fix the inconsistent behaviors for non-LTO mode
>> and LTO mode.  As Martin pointed out, currently the function
>> rs6000_can_inline_p simply makes it inlinable if callee_tree is
>> NULL, but it's wrong, we should use the command line options
>> from target_option_default_node as default.  It also replaces
>> rs6000_isa_flags with the one from target_option_default_node
>> when caller_tree is NULL as rs6000_isa_flags could probably
>> change since initialization.
>>
>> It also extends the scope of the check for the case that callee
>> has explicit set options, for test case pr102059-2.c inlining can
>> happen unexpectedly before, it's fixed accordingly.
>>
>> As Richi/Mike pointed out, some tuning flags like MASK_P8_FUSION
>> can be neglected for inlining, this patch also exludes them when
>> the callee is attributed by always_inline.
>>
>> Bootstrapped and regtested on powerpc64le-linux-gnu Power9.
>>
>> BR,
>> Kewen
>> -
>> gcc/ChangeLog:
>>
>>  PR ipa/102059
>>  * config/rs6000/rs6000.c (rs6000_can_inline_p): Adjust with
>>  target_option_default_node and consider always_inline_safe flags.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  PR ipa/102059
>>  * gcc.target/powerpc/pr102059-1.c: New test.
>>  * gcc.target/powerpc/pr102059-2.c: New test.
>>  * gcc.target/powerpc/pr102059-3.c: New test.
>>  * gcc.target/powerpc/pr102059-4.c: New test.
>>
> 


Re: [PATCH] rs6000: Modify the way for extra penalized cost

2021-09-28 Thread Kewen.Lin via Gcc-patches
Hi Segher,

on 2021/9/23 6:36 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Sep 21, 2021 at 11:24:08AM +0800, Kewen.Lin wrote:
>>> on 2021/9/18 6:01 AM, Segher Boessenkool wrote:
>>> On Thu, Sep 16, 2021 at 09:14:15AM +0800, Kewen.Lin wrote:
>>>> The way with nunits * stmt_cost can get one much exaggerated
>>>> penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320,
>>>> that's why we need one bound.  To make it scale, this patch
>>>> doesn't use nunits * stmt_cost any more, but it still keeps
>>>> nunits since there are actually nunits scalar loads there.  So
>>>> it uses one cost adjusted from stmt_cost, since the current
>>>> stmt_cost sort of considers nunits, we can stablize the cost
>>>> for big nunits and retain the cost for small nunits.  After
>>>> some tries, this patch gets the adjusted cost as:
>>>>
>>>> stmt_cost / (log2(nunits) * log2(nunits))
>>>
>>> So for  V16QI it gives *16/(4*4) so *1
>>> V8HI  it gives *8/(3*3)  so *8/9
>>> V4SI  it gives *4/(2*2)  so *1
>>> V2DI  it gives *2/(1*1)  so *2
>>> and for V1TI  it gives *1/(0*0) which is UB (no, does not crash for us,
>>> just gives wildly wrong answers; the div returns 0 on recent systems).
>>
>> I don't expected we will have V1TI for strided/elementwise load,
>> if it's one unit vector, it's the whole vector itself.
>> Besides, the below assertion should exclude it already.
> 
> Yes.  But ignoring the UB for unexpectedly large vector components, the
> 1 / 1.111 / 1 / 2  scoring does not make much sense.  The formulas
> "look" smooth and even sort of reasonable, but as soon as you look at
> what it *means*, and realise the domain if the function is discrete
> (only four or five possible inputs), and then see how the function
> behaves on that...  Hrm :-)
> 
>>> This of course is assuming nunits will always be a power of 2, but I'm
>>> sure that we have many other places in the compiler assuming that
>>> already, so that is fine.  And if one day this stops being true we will
>>> get a nice ICE, pretty much the best we could hope for.
>>
>> Yeah, exact_log2 returns -1 for non power of 2 input, for example:
> 
> Exactly.
> 
>>>> +unsigned int adjusted_cost = stmt_cost / nunits_sq;
>>>
>>> But this can divide by 0.  Or are we somehow guaranteed that nunits
>>> will never be 1?  Yes the log2 check above, sure, but that ICEs if this
>>> is violated; is there anything that actually guarantees it is true?
>>
>> As I mentioned above, I don't expect we can have nunits 1 strided/ew load,
>> and the ICE should check this and ensure dividing by zero never happens.  :)
> 
> Can you assert that *directly* then please?
> 

Fixed in v2.

>>> A magic crazy formula like this is no good.  If you want to make the
>>> cost of everything but V2D* be the same, and that of V2D* be twice that,
>>> that is a weird heuristic, but we can live with that perhaps.  But that
>>> beats completely unexplained (and unexplainable) magic!
>>>
>>> Sorry.
>>
>> That's all right, thanks for the comments!  let's improve it.  :)
> 
> I like that spirit :-)
> 
>> How about just assigning 2 for V2DI and 1 for the others for the
>> penalized_cost_per_load with some detailed commentary, it should have
>> the same effect with this "magic crazy formula", but I guess it can
>> be more clear.
> 
> That is fine yes!  (Well, V2DF the same I guess?  Or you'll need very
> detailed commentary :-) )
> 
> It is fine to say "this is just a heuristic without much supporting
> theory" in places.  That is what most of our --param= are as well, for
> example.  If counting two-element vectors as twice as expensive as all
> other vectors helps performance, then so be it: if there is no better
> way to cost things (or we do not know one), then what else are we to do?
> 
> 

Thanks a lot for the suggestion; I just posted v2:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580358.html

BR,
Kewen


[PATCH v2] rs6000: Modify the way for extra penalized cost

2021-09-28 Thread Kewen.Lin via Gcc-patches
Hi,

This patch follows the discussions here[1][2], where Segher
pointed out the existing way to guard the extra penalized
cost for strided/elementwise loads with a magic bound does
not scale.

Using nunits * stmt_cost can produce a much-exaggerated
penalized cost; for example, for V16QI on P8 it is
16 * 20 = 320, which is why we needed a bound.  To make it
better and more readable, the penalized cost is simplified
as:

unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
unsigned extra_cost = nunits * adjusted_cost;

For V2DI/V2DF, it uses a penalized cost of 2 for each scalar
load, while for the other modes it uses 1.  This is mainly
concluded from the performance evaluations.  One possibly
related point: the more units a vector is constructed from,
the more instructions are used, which gives more chances to
schedule them better (even run them in parallel when enough
units are available at that time), so it seems reasonable
not to penalize them more.

The SPEC2017 evaluations on Power8/Power9/Power10 at option
sets O2-vect and Ofast-unroll show this change is neutral.

Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.

Is it ok for trunk?

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
[2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580099.html
v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579529.html

BR,
Kewen
-
gcc/ChangeLog:

* config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
the way to compute extra penalized cost.  Remove useless parameter.
(rs6000_add_stmt_cost): Adjust the call to function
rs6000_update_target_cost_per_stmt.


---
 gcc/config/rs6000/rs6000.c | 31 ++-
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index dd42b0964f1..8200e1152c2 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5422,7 +5422,6 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data 
*data,
enum vect_cost_for_stmt kind,
struct _stmt_vec_info *stmt_info,
enum vect_cost_model_location where,
-   int stmt_cost,
unsigned int orig_count)
 {

@@ -5462,17 +5461,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data 
*data,
{
  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
  unsigned int nunits = vect_nunits_for_cost (vectype);
- unsigned int extra_cost = nunits * stmt_cost;
- /* As function rs6000_builtin_vectorization_cost shows, we have
-priced much on V16QI/V8HI vector construction as their units,
-if we penalize them with nunits * stmt_cost, it can result in
-an unreliable body cost, eg: for V16QI on Power8, stmt_cost
-is 20 and nunits is 16, the extra cost is 320 which looks
-much exaggerated.  So let's use one maximum bound for the
-extra penalized cost for vector construction here.  */
- const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
- if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
-   extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
+ /* Don't expect strided/elementwise loads for just 1 nunit.  */
+ gcc_assert (nunits > 1);
+ /* i386 port adopts nunits * stmt_cost as the penalized cost
+for this kind of penalization, we used to follow it but
+found it could result in an unreliable body cost especially
+for V16QI/V8HI modes.  To make it better, we choose this
+new heuristic: for each scalar load, we use 2 as penalized
+cost for the case with 2 nunits and use 1 for the other
+cases.  It's without much supporting theory, mainly
+concluded from the broad performance evaluations on Power8,
+Power9 and Power10.  One possibly related point is that:
+vector construction for more units would use more insns,
+it has more chances to schedule them better (even run in
+parallelly when enough available units at that time), so
+it seems reasonable not to penalize that much for them.  */
+ unsigned int adjusted_cost = (nunits == 2) ? 2 : 1;
+ unsigned int extra_cost = nunits * adjusted_cost;
  data->extra_ctor_cost += extra_cost;
}
 }
@@ -5510,7 +5515,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, 
int count,
   cost_data->cost[where] += retval;

   rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
- stmt_cost, orig_count);
+ orig_count);
 }

   return retval;
--
2.27.0



[PATCH] rs6000: Remove builtin mask check from builtin_decl [PR102347]

2021-09-28 Thread Kewen.Lin via Gcc-patches
Hi,

As the discussion in PR102347, currently builtin_decl is invoked so
early, it's when making up the function_decl for builtin functions,
at that time the rs6000_builtin_mask could be wrong for those
builtins sitting in #pragma/attribute target functions, though it
will be updated properly later when LTO processes all nodes.

This patch is to align with the practice i386 port adopts, also
align with r10-7462 by relaxing builtin mask checking in some places.

Bootstrapped and regress-tested on powerpc64le-linux-gnu P9 and
powerpc64-linux-gnu P8.

Is it ok for trunk?

BR,
Kewen
-
gcc/ChangeLog:

PR target/102347
* config/rs6000/rs6000-call.c (rs6000_builtin_decl): Remove builtin
mask check.

gcc/testsuite/ChangeLog:

PR target/102347
* gcc.target/powerpc/pr102347.c: New test.

---
 gcc/config/rs6000/rs6000-call.c | 14 --
 gcc/testsuite/gcc.target/powerpc/pr102347.c | 15 +++
 2 files changed, 19 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr102347.c

diff --git a/gcc/config/rs6000/rs6000-call.c b/gcc/config/rs6000/rs6000-call.c
index fd7f24da818..15e0e09c07d 100644
--- a/gcc/config/rs6000/rs6000-call.c
+++ b/gcc/config/rs6000/rs6000-call.c
@@ -13775,23 +13775,17 @@ rs6000_init_builtins (void)
 }
 }

-/* Returns the rs6000 builtin decl for CODE.  */
+/* Returns the rs6000 builtin decl for CODE.  Note that we don't check
+   the builtin mask here since there could be some #pragma/attribute
+   target functions and the rs6000_builtin_mask could be wrong when
+   this checking happens, though it will be updated properly later.  */

 tree
 rs6000_builtin_decl (unsigned code, bool initialize_p ATTRIBUTE_UNUSED)
 {
-  HOST_WIDE_INT fnmask;
-
   if (code >= RS6000_BUILTIN_COUNT)
 return error_mark_node;

-  fnmask = rs6000_builtin_info[code].mask;
-  if ((fnmask & rs6000_builtin_mask) != fnmask)
-{
-  rs6000_invalid_builtin ((enum rs6000_builtins)code);
-  return error_mark_node;
-}
-
   return rs6000_builtin_decls[code];
 }

diff --git a/gcc/testsuite/gcc.target/powerpc/pr102347.c 
b/gcc/testsuite/gcc.target/powerpc/pr102347.c
new file mode 100644
index 000..05c439a8dac
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr102347.c
@@ -0,0 +1,15 @@
+/* { dg-do link } */
+/* { dg-require-effective-target power10_ok } */
+/* { dg-require-effective-target lto } */
+/* { dg-options "-flto -mdejagnu-cpu=power9" } */
+
+/* Verify there are no error messages in LTO mode.  */
+
+#pragma GCC target "cpu=power10"
+int main ()
+{
+  float *b;
+  __vector_quad c;
+  __builtin_mma_disassemble_acc (b, &c);
+  return 0;
+}
--
2.27.0



Re: [PATCH v2] ipa-inline: Add target info into fn summary [PR102059]

2021-09-21 Thread Kewen.Lin via Gcc-patches
on 2021/9/21 5:39 PM, Richard Biener wrote:
> On Tue, Sep 21, 2021 at 11:31 AM Martin Jambor  wrote:
>>
>> Hi,
>>
>> On Tue, Sep 21 2021, Kewen.Lin wrote:
>>> on 2021/9/17 7:26 PM, Martin Jambor wrote:
>>>> On Fri, Sep 17 2021, Kewen.Lin wrote:
>> [...]
>>>>>
>>>>> Sorry that I failed to use 16 bit-fields for this, I figured out that
>>>>> the bit-fields can not be address-taken or passed as non-const reference.
>>>>> The gentype also failed to recognize uint16_t if I used uint16_t directly
>>>>> in ipa-fnsummary.h.  Finally I used unsigned int instead.
>>>>>
>>>>
>>>> well, you could have used:
>>>>
>>>>   unsigned int target_info : 16;
>>>>
>>>> for the field (and uint16_t when passed to hooks).
>>>>
>>>> But I am not sure if it is that crucial.
>>>>
>>>
>>> I may miss something, specifically I tried with:
>>>
>>> 1)
>>>
>>>   unsigned int target_info : 16;
>>>   unsigned inlinable : 1;
>>>   ...
>>>
>>>   update_ipa_fn_target_info (uint16_t &, const gimple *)
>>
>> Yeah, you would have to copy the bit-field into a temporary, pass
>> reference to that in the hook and then copy it back.  At least that is
>> what I meant but since we apparently want unsigned int everywhere, it
>> does not matter.

Ah, I misunderstood; I thought you didn't want it this way since it seems
inefficient.  I will keep that in mind next time, since it looks like a
tradeoff.  :)

> 
> Or use a by-value interface:
> 
>  uint16_t update_ipa_fn_target_info (uint16_t in, const gimple *stmt);
> 
> with the function returning the (changed) set.

Yeah, I considered this like:

  uint16_t update_ipa_fn_target_info (const uint16_t, const gimple*, bool&)

but thought it might look weird to others at first glance, so I gave up
on it.  :(

BR,
Kewen


Re: [PATCH] rs6000: Parameterize some const values for density test

2021-09-21 Thread Kewen.Lin via Gcc-patches
on 2021/9/21 8:03 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Sep 21, 2021 at 01:47:19PM +0800, Kewen.Lin wrote:
>> on 2021/9/18 6:26 AM, Segher Boessenkool wrote:
>>>> +  if (data->nloads > (unsigned int) rs6000_density_load_num_threshold
>>>> +&& load_pct > (unsigned int) rs6000_density_load_pct_threshold)
>>>
>>> Those variables are unsigned int already.  Don't cast please.
>>
>> Unfortunately this is required by bootstrapping.  The UInteger for the
>> param definition is really confusing, in the underlying implementation
>> it's still "signed".  If you grep "(unsigned) param", you can see a few
>> examples.  I guess the "UInteger" is mainly for the param value range
>> checking.
> 
> Huh, I see.  Is that a bug?  It certainly is surprising!  Please open a
> PR if you think it could/should be improved, put me on Cc:?
> 

I guess it's not a bug; "UInteger" is mainly for opt/param value range
checking, but it could be improved.  PR102440 has been filed, as you
suggested.  :)

>>>> +-param=rs6000-density-pct-threshold=
>>>> +Target Undocumented Joined UInteger Var(rs6000_density_pct_threshold) 
>>>> Init(85) IntegerRange(0, 99) Param
>>>
>>> So make this and all other percentages (0, 100) please.
>>
>> I thought 99 is enough for the RHS in ">". just realized it's more clear
>> with 100.  Will fix!
> 
> 99 will work fine, but it's not the best choice for the user, who will
> expect that a percentage can be anything from 0% to 100%.
> 
>>>> +When costing for loop vectorization, we probably need to penalize the 
>>>> loop body cost if the existing cost model may not adequately reflect 
>>>> delays from unavailable vector resources.  We collect the cost for 
>>>> vectorized statements and non-vectorized statements separately, check the 
>>>> proportion of vec_cost to total cost of vec_cost and non vec_cost, and 
>>>> penalize only if the proportion exceeds the threshold specified by this 
>>>> parameter.  The default value is 85.
>>>
>>> It would be good if we can use line breaks in the source code for things
>>> like this, but I don't think we can.  This message is mainly used for
>>> "--help=param", and it is good there to have as short messages as you
>>> can.  But given the nature of params you need quite a few words often,
>>> and you do not want to say so little that things are no clear, either.
>>>
>>> So, dunno :-)
>>
>> I did some testings, the line breaks writing can still survive in the
>> "--help=param" show, the lines are concatenated with " ".  Although
>> there seems no this kind of writing practices, I am guessing you want
>> me to do line breaks for their descriptions?  If so, I will make them
>> short as the above "Target Undocumented..." line.  Or do you want it
>> to align source code ColumnLimit 80 (for these cases, it would look
>> shorter)?
> 
> It would help if was more readable in the surce code, one line of close
> to 500 columns is not very manageable :-)
> 
> But the thing that matters is what it will look like in the --help=
> output (and/or the manual).
> 

OK, I've used a column limit of 80 for that.  The --help= output before and
after adding the line breaks looks the same (smoother than I expected).  :)

Committed in r12-3767, thanks!

BR,
Kewen


Re: [PATCH] ipa-fnsummary: Remove inconsistent bp_pack_value

2021-09-21 Thread Kewen.Lin via Gcc-patches
on 2021/9/21 2:16 PM, Richard Biener wrote:
> On Tue, Sep 21, 2021 at 4:09 AM Kewen.Lin  wrote:
>>
>> Hi Richi,
>>
>> Thanks for the review!
>>
>> on 2021/9/17 6:04 PM, Richard Biener wrote:
>>> On Fri, Sep 17, 2021 at 12:03 PM Richard Biener
>>>  wrote:
>>>>
>>>> On Fri, Sep 17, 2021 at 11:43 AM Kewen.Lin  wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> When changing target_info with bitfield, I happened to find this
>>>>> inconsistent streaming in and out.  We have the streaming in:
>>>>>
>>>>>   bp_pack_value (&bp, info->inlinable, 1);
>>>>>   bp_pack_value (&bp, false, 1);
>>>>>   bp_pack_value (&bp, info->fp_expressions, 1);
>>>>>
>>>>> while the streaming out:
>>>>>
>>>>>   info->inlinable = bp_unpack_value (&bp, 1);
>>>>>   info->fp_expressions = bp_unpack_value (&bp, 1)
>>>>>
>>>>> The cleanup of Cilk Plus support seemed to miss to remove the bit
>>>>> streaming out but change with streaming false.
>>>>>
>>>>> By hacking fp_expression_p to return true always, I can see it
>>>>> reads the wrong fp_expressions value (false) out in wpa dumping.
>>>>>
>>>>> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>>>>>
>>>>> Is it ok for trunk?
>>>>
>>>> OK for trunk and all affected branches (note we need to bump the
>>>> LTO minor version there).  The issue comes from the removal
>>>> of cilk+ in r8-4956 which removed the bp_unpack but replaced
>>>> the bp_pack ...
>>>>
>>>> It's a correctness issue as we'll read fp_expressions as always 'false'
>>>
>>> Btw, on branches we could also simply unpack a dummy bit to avoid
>>> changing the format.
>>>
>>
>> Committed in r12-3721.  Thanks!
>>
>> As suggested, the patch for branches is listed below.
>>
>> Is ok for branches 9, 10 and 11 after some trunk burn in time?
> 
> It's OK for branches without waiting, maybe you can do a LTO bootstrap
> on the branches for extra safety (just in case we're triggering some hidden
> issues due to the fix).

Thanks!  LTO bootstrapped on the branches, separately committed via
r11-9024, r10-10146 and r9-9740.

BR,
Kewen

> 
> Thanks,
> Richard.
> 
>> BR,
>> Kewen
>> -
>> gcc/ChangeLog:
>>
>> * ipa-fnsummary.c (inline_read_section): Unpack a dummy bit
>> to keep consistent with the side of streaming out.
>>
>> ---
>> diff --git a/gcc/ipa-fnsummary.c b/gcc/ipa-fnsummary.c
>> index 18bbae145b9..bf635c1f78a 100644
>> --- a/gcc/ipa-fnsummary.c
>> +++ b/gcc/ipa-fnsummary.c
>> @@ -4403,13 +4403,20 @@ inline_read_section (struct lto_file_decl_data 
>> *file_data, const char *data,
>>bp = streamer_read_bitpack (&ib);
>>if (info)
>> {
>> -  info->inlinable = bp_unpack_value (&bp, 1);
>> -  info->fp_expressions = bp_unpack_value (&bp, 1);
>> + info->inlinable = bp_unpack_value (&bp, 1);
>> + /* On the side of streaming out, there is still one bit
>> +streamed out between inlinable and fp_expressions bits,
>> +which was used for cilk+ before but now always false.
>> +To remove the bit packing need to bump LTO minor version,
>> +so unpack a dummy bit here to keep consistent instead.  */
>> + bp_unpack_value (&bp, 1);
>> + info->fp_expressions = bp_unpack_value (&bp, 1);
>> }
>>else
>> {
>> -  bp_unpack_value (&bp, 1);
>> -  bp_unpack_value (&bp, 1);
>> + bp_unpack_value (&bp, 1);
>> + bp_unpack_value (&bp, 1);
>> + bp_unpack_value (&bp, 1);
>> }
>>
>>count2 = streamer_read_uhwi ();
>>


Re: [PATCH v3] ipa-inline: Add target info into fn summary [PR102059]

2021-09-21 Thread Kewen.Lin via Gcc-patches
Hi Martin,

Thanks for the review.

on 2021/9/18 7:31 PM, Martin Jambor wrote:
> Hi,
> 
> On Fri, Sep 17 2021, Segher Boessenkool wrote:
>> On Fri, Sep 17, 2021 at 05:42:38PM +0800, Kewen.Lin wrote:
>>> Against v2 [2], this v3 addressed Martin's review comments:
>>>   - Replace HWI auto_vec with unsigned int for target_info
>>> to avoid overkill (also Segher's comments), adjust some
>>> places need to be updated for this change.
>>
>> I'd have used a single HWI (always 64 bits), but an int (always at least
>> 32 bits in GCC) is fine as well, sure.
> 
> Let's have it as unsigned int then (in a separate thread I was
> suggesting to go even smaller).
> 
>> That easily fits one line.  (Many more examples here btw).
>>
>>> +bool
>>> +rs6000_fn_has_any_of_these_mask_bits (enum rs6000_builtins code,
>>> + HOST_WIDE_INT mask)
>>> +{
>>> +  gcc_assert (code < RS6000_BUILTIN_COUNT);
>>
>> We don't have this assert anywhere else, so lose it here as well?
>>
>> If we want such checking we should make an inline accessor function for
>> this, and check it there.  But we already do a check in
>> rs6000_builtin_decl (and one in def_builtin, but that one has an
>> off-by-one error in it).
>>
>>> +extern bool rs6000_fn_has_any_of_these_mask_bits (enum rs6000_builtins 
>>> code,
>>> + HOST_WIDE_INT mask);
>>
>> The huge unwieldy name suggests it might not be the best abstraction you
>> could use, btw ;-)
>>
>>> +static bool
>>> +rs6000_update_ipa_fn_target_info (unsigned int &info, const gimple *stmt)
>>> +{
>>> +  /* Assume inline asm can use any instruction features.  */
>>> +  if (gimple_code (stmt) == GIMPLE_ASM)
>>
>> This should be fine for HTM, but it may be a bit *too* pessimistic for
>> other features.  We'll see when we get there :-)
>>
>>> +@deftypefn {Target Hook} bool TARGET_NEED_IPA_FN_TARGET_INFO (const_tree 
>>> @var{decl}, unsigned int& @var{info})
>>> +Allow target to check early whether it is necessary to analyze all gimple
>>> +statements in the given function to update target specific information for
>>> +inlining.  See hook @code{update_ipa_fn_target_info} for usage example of
>> [ ... ]
>>> +The default version of this hook returns false.
>>
>> And that is really the only reason to have this premature optimisation:
>> targets that do not care do not have to pay the price, however trivial
>> that price may be, which is a good idea politically ;-)
>>
>>> +/* { dg-final { scan-tree-dump-times "Inlining foo/\[0-9\]* " 1 "einline"} 
>>> } */
>>
>> If you use {} instead of "" you don't need the backslashes.
>>
>>> +default_update_ipa_fn_target_info (uint16_t &, const gimple *)
>>
>> I'm surprised the compiler didn't warn about this btw.
>>
>> The rs6000 parts are okay for trunk (with the trivial cleanups please).
>> Thanks!
> 
>> +/* By default, return false to not need to collect any target information
>> +   for inlining.  Target maintainer should re-define the hook if the
>> +   target want to take advantage of it.  */
>> +
>> +bool
>> +default_need_ipa_fn_target_info (const_tree, uint16_t &)
>> +{
>> +  return false;
>> +}
>> +
>> +bool
>> +default_update_ipa_fn_target_info (uint16_t &, const gimple *)
>> +{
>> +  return false;
>> +}
>
> The parameters have uint16_t type here but you apparently decided to use
> unsigned int everywhere else; you probably forgot to change them here
> too.

Yeah, I did forget to change them. :(  Thanks for catching that!

> the IPA bits are OK too (after the type mismatch is fixed).
> 

Thanks!

BR,
Kewen


Re: [PATCH v3] ipa-inline: Add target info into fn summary [PR102059]

2021-09-21 Thread Kewen.Lin via Gcc-patches
Hi Segher,

Thanks for the review!

on 2021/9/17 10:14 PM, Segher Boessenkool wrote:
> On Fri, Sep 17, 2021 at 05:42:38PM +0800, Kewen.Lin wrote:
>> Against v2 [2], this v3 addressed Martin's review comments:
>>   - Replace HWI auto_vec with unsigned int for target_info
>> to avoid overkill (also Segher's comments), adjust some
>> places need to be updated for this change.
> 
> I'd have used a single HWI (always 64 bits), but an int (always at least
> 32 bits in GCC) is fine as well, sure.
> 
>>  * config/rs6000/rs6000-internal.h
>>  (rs6000_fn_has_any_of_these_mask_bits): New declare.
> 
> You can break that after the ":"...  Just :-)
> 

OK, will adjust.

>>  * doc/tm.texi.in (TARGET_UPDATE_IPA_FN_TARGET_INFO): Document new
>>  hook.
> 
> That easily fits one line.  (Many more examples here btw).
> 

I noticed some existing practice seems to prefer breaking it early, maybe
because it gives a better overall alignment?  I am guessing the requirement
for where to break isn't that strict?

>> +bool
>> +rs6000_fn_has_any_of_these_mask_bits (enum rs6000_builtins code,
>> +  HOST_WIDE_INT mask)
>> +{
>> +  gcc_assert (code < RS6000_BUILTIN_COUNT);
> 
> We don't have this assert anywhere else, so lose it here as well?
> 
> If we want such checking we should make an inline accessor function for
> this, and check it there.  But we already do a check in
> rs6000_builtin_decl (and one in def_builtin, but that one has an
> off-by-one error in it).

OK, will remove it.

> 
>> +extern bool rs6000_fn_has_any_of_these_mask_bits (enum rs6000_builtins code,
>> +  HOST_WIDE_INT mask);
> 
> The huge unwieldy name suggests it might not be the best abstraction you
> could use, btw ;-)
> 
>> +static bool
>> +rs6000_update_ipa_fn_target_info (unsigned int &info, const gimple *stmt)
>> +{
>> +  /* Assume inline asm can use any instruction features.  */
>> +  if (gimple_code (stmt) == GIMPLE_ASM)
> 
> This should be fine for HTM, but it may be a bit *too* pessimistic for
> other features.  We'll see when we get there :-)
> 
>> +@deftypefn {Target Hook} bool TARGET_NEED_IPA_FN_TARGET_INFO (const_tree 
>> @var{decl}, unsigned int& @var{info})
>> +Allow target to check early whether it is necessary to analyze all gimple
>> +statements in the given function to update target specific information for
>> +inlining.  See hook @code{update_ipa_fn_target_info} for usage example of
> [ ... ]
>> +The default version of this hook returns false.
> 
> And that is really the only reason to have this premature optimisation:
> targets that do not care do not have to pay the price, however trivial
> that price may be, which is a good idea politically ;-)
> 
>> +/* { dg-final { scan-tree-dump-times "Inlining foo/\[0-9\]* " 1 "einline"} 
>> } */
> 
> If you use {} instead of "" you don't need the backslashes.
> 

OK, will adjust.

>> +default_update_ipa_fn_target_info (uint16_t &, const gimple *)
> 
> I'm surprised the compiler didn't warn about this btw.
> 

So am I.  I am guessing it's because I only bootstrapped it on ppc64le,
which re-defines the hook?  Anyway, I will test on x86 as well before
committing it.

> The rs6000 parts are okay for trunk (with the trivial cleanups please).
> Thanks!
> 

Thanks again!
BR,
Kewen


Re: [PATCH] rs6000: Parameterize some const values for density test

2021-09-20 Thread Kewen.Lin via Gcc-patches
on 2021/9/18 6:26 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Sep 15, 2021 at 04:52:49PM +0800, Kewen.Lin wrote:
>> This patch follows the discussion here[1], where Segher suggested
>> parameterizing those exact magic constants for density heuristics,
>> to make it easier to tweak if need.
>>
>> Since these heuristics are quite internal, I make these parameters
>> as undocumented and be mainly used by developers.
> 
> Okido.
> 
>> +  if (data->nloads > (unsigned int) rs6000_density_load_num_threshold
>> +  && load_pct > (unsigned int) rs6000_density_load_pct_threshold)
> 
> Those variables are unsigned int already.  Don't cast please.
> 

Unfortunately this cast is required for bootstrapping.  The UInteger in the
param definition is really confusing; in the underlying implementation it's
still signed.  If you grep for "(unsigned) param", you can see a few
examples.  I guess "UInteger" is mainly for param value range checking.

>> +-param=rs6000-density-pct-threshold=
>> +Target Undocumented Joined UInteger Var(rs6000_density_pct_threshold) Init(85) IntegerRange(0, 99) Param
> 
> So make this and all other percentages (0, 100) please.
> 

I thought 99 was enough since the value is the RHS of ">", but just realized
it's clearer with 100.  Will fix!

>> +When costing for loop vectorization, we probably need to penalize the loop 
>> body cost if the existing cost model may not adequately reflect delays from 
>> unavailable vector resources.  We collect the cost for vectorized statements 
>> and non-vectorized statements separately, check the proportion of vec_cost 
>> to total cost of vec_cost and non vec_cost, and penalize only if the 
>> proportion exceeds the threshold specified by this parameter.  The default 
>> value is 85.
> 
> It would be good if we can use line breaks in the source code for things
> like this, but I don't think we can.  This message is mainly used for
> "--help=param", and it is good there to have as short messages as you
> can.  But given the nature of params you need quite a few words often,
> and you do not want to say so little that things are no clear, either.
> 
> So, dunno :-)

I did some testing: line breaks in the source survive in the "--help=param"
output, where the lines are concatenated with " ".  Although there doesn't
seem to be an existing practice of writing descriptions this way, I am
guessing you want me to add line breaks to them?  If so, I will make them as
short as the "Target Undocumented..." line above.  Or do you want them to
respect the source code's 80-column limit (for these cases, they would look
shorter)?

> 
> Okay for trunk with these fixes and what Bill mentioned in the other
> thread.  Thanks!
> 

OK, thanks again!

BR,
Kewen


Re: [PATCH] rs6000: Parameterize some const values for density test

2021-09-20 Thread Kewen.Lin via Gcc-patches
Hi Bill,

Thanks for the review!

on 2021/9/18 上午12:27, Bill Schmidt wrote:
> Hi Kewen,
> 
> On 9/15/21 3:52 AM, Kewen.Lin wrote:
>> Hi,
>>
>> This patch follows the discussion here[1], where Segher suggested
>> parameterizing those exact magic constants for density heuristics,
>> to make it easier to tweak if need.
>>
>> Since these heuristics are quite internal, I make these parameters
>> undocumented; they are mainly intended for developers.
>>
>> The change here should be "No Functional Change".  I verified
>> it with SPEC2017 at option sets O2-vect and Ofast-unroll on Power8;
>> the result is neutral as expected.
>>
>> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
>>
>> BR,
>> Kewen
>> -
>> gcc/ChangeLog:
>>
>> * config/rs6000/rs6000.opt (rs6000-density-pct-threshold,
>> rs6000-density-size-threshold, rs6000-density-penalty,
>> rs6000-density-load-pct-threshold,
>> rs6000-density-load-num-threshold): New parameter.
>> * config/rs6000/rs6000.c (rs6000_density_test): Adjust with
>> corresponding parameters.
>>
>> ---
>>   gcc/config/rs6000/rs6000.c   | 22 +++---
>>   gcc/config/rs6000/rs6000.opt | 21 +
>>   2 files changed, 28 insertions(+), 15 deletions(-)
>>
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index 9bc826e3a50..4ab23b0ab33 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>> @@ -5284,9 +5284,6 @@ struct rs6000_cost_data
>>   static void
>>   rs6000_density_test (rs6000_cost_data *data)
>>   {
>> -  const int DENSITY_PCT_THRESHOLD = 85;
>> -  const int DENSITY_SIZE_THRESHOLD = 70;
>> -  const int DENSITY_PENALTY = 10;
>>     struct loop *loop = data->loop_info;
>>     basic_block *bbs = get_loop_body (loop);
>>     int nbbs = loop->num_nodes;
>> @@ -5322,26 +5319,21 @@ rs6000_density_test (rs6000_cost_data *data)
>>     free (bbs);
>>     density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
>>
>> -  if (density_pct > DENSITY_PCT_THRESHOLD
>> -  && vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD)
>> +  if (density_pct > rs6000_density_pct_threshold
>> +  && vec_cost + not_vec_cost > rs6000_density_size_threshold)
>>   {
>> -  data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100;
>> +  data->cost[vect_body] = vec_cost * (100 + rs6000_density_penalty) / 100;
>>     if (dump_enabled_p ())
>>   dump_printf_loc (MSG_NOTE, vect_location,
>>    "density %d%%, cost %d exceeds threshold, penalizing "
>> - "loop body cost by %d%%\n", density_pct,
>> - vec_cost + not_vec_cost, DENSITY_PENALTY);
>> + "loop body cost by %u%%\n", density_pct,
>> + vec_cost + not_vec_cost, rs6000_density_penalty);
>>   }
>>
>>     /* Check whether we need to penalize the body cost to account
>>    for excess strided or elementwise loads.  */
>>     if (data->extra_ctor_cost > 0)
>>   {
>> -  /* Threshold for load stmts percentage in all vectorized stmts.  */
>> -  const int DENSITY_LOAD_PCT_THRESHOLD = 45;
>> -  /* Threshold for total number of load stmts.  */
>> -  const int DENSITY_LOAD_NUM_THRESHOLD = 20;
>> -
>>     gcc_assert (data->nloads <= data->nstmts);
>>     unsigned int load_pct = (data->nloads * 100) / data->nstmts;
>>
>> @@ -5355,8 +5347,8 @@ rs6000_density_test (rs6000_cost_data *data)
>>     the loads.
>>    One typical case is the innermost loop of the hotspot of SPEC2017
>>    503.bwaves_r without loop interchange.  */
>> -  if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
>> -  && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
>> +  if (data->nloads > (unsigned int) rs6000_density_load_num_threshold
>> +  && load_pct > (unsigned int) rs6000_density_load_pct_threshold)
>>   {
>>     data->cost[vect_body] += data->extra_ctor_cost;
>>     if (dump_enabled_p ())
>> diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
>> index 0538db387dc..563983f3269 100644
>> --- a/gcc/config/rs6000/rs6000.opt
>> +++ b/gcc/config/rs6000/rs6000.opt
>

Re: [PATCH] rs6000: Modify the way for extra penalized cost

2021-09-20 Thread Kewen.Lin via Gcc-patches
Hi Segher,

Thanks for the review!

on 2021/9/18 上午6:01, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, Sep 16, 2021 at 09:14:15AM +0800, Kewen.Lin wrote:
>> The way with nunits * stmt_cost can get one much exaggerated
>> penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320,
>> that's why we need one bound.  To make it scale, this patch
>> doesn't use nunits * stmt_cost any more, but it still keeps
>> nunits since there are actually nunits scalar loads there.  So
>> it uses one cost adjusted from stmt_cost, since the current
>> stmt_cost sort of considers nunits, we can stabilize the cost
>> for big nunits and retain the cost for small nunits.  After
>> some tries, this patch gets the adjusted cost as:
>>
>> stmt_cost / (log2(nunits) * log2(nunits))
> 
> So for  V16QI it gives *16/(4*4) so *1
> V8HI  it gives *8/(3*3)  so *8/9
> V4SI  it gives *4/(2*2)  so *1
> V2DI  it gives *2/(1*1)  so *2
> and for V1TI  it gives *1/(0*0) which is UB (no, does not crash for us,
> just gives wildly wrong answers; the div returns 0 on recent systems).
> 

I don't expect we will have V1TI for strided/elementwise loads; if a vector
has only one unit, the load is just the whole vector itself.
Besides, the assertion below should exclude it already.

>> For V16QI, the adjusted cost would be 1 and total penalized
>> cost is 16, it isn't exaggerated.  For V2DI, the adjusted
>> cost would be 2 and total penalized cost is 4, which is the
>> same as before.  btw, I tried to use one single log2(nunits),
>> but the penalized cost is still big enough and can't fix the
>> degraded bmk blender_r.
> 
> Does it make sense to treat V2DI (and V2DF) as twice more expensive than
> other vectors, which are all pretty much equal cost (except those that
> end up with cost 0)?  If so, there are simpler ways to do that.
> 

Yeah, the SPEC2017 evaluation looks good with this.  The vectorization
costing framework doesn't (and can't) consider the dependent insn chain,
the number of available units, etc. the way local scheduling does, so we
have to use some heuristics to handle special cases.  Constructing a
vector with more units takes more instructions, which gives the scheduler
more chances to arrange them well (they may even run in parallel when
enough units are available at the time), so we don't need to penalize
those cases as much.  For V2DI, the load results feed directly into the
construction; the current stmt_cost only accounts for the merging and is
just 2, and the bwaves experiment showed that penalizing it by one is not
enough.

>> +  int nunits_log2 = exact_log2 (nunits);
>> +  gcc_assert (nunits_log2 > 0);
>> +  unsigned int nunits_sq = nunits_log2 * nunits_log2;
> 
>> = 0
> 
> This of course is assuming nunits will always be a power of 2, but I'm
> sure that we have many other places in the compiler assuming that
> already, so that is fine.  And if one day this stops being true we will
> get a nice ICE, pretty much the best we could hope for.
> 

Yeah, exact_log2 returns -1 for non power of 2 input, for example:

input output
0 ->-1
1 ->0
2 ->1
3 ->-1

>> +  unsigned int adjusted_cost = stmt_cost / nunits_sq;
> 
> But this can divide by 0.  Or are we somehow guaranteed that nunits
> will never be 1?  Yes the log2 check above, sure, but that ICEs if this
> is violated; is there anything that actually guarantees it is true?
> 

As I mentioned above, I don't expect we can have a strided/elementwise load
with nunits 1, and the assertion should catch that case, ensuring division
by zero never happens.  :)

>> +  gcc_assert (adjusted_cost > 0);
> 
> I don't see how you guarantee this, either.
> 

It's mainly to guard against the case where one day we tweak the
construction costs in rs6000_builtin_vectorization_cost and end up
generating unexpected values here.  For now these expected values are
guaranteed by the current costs and the formula.

> 
> A magic crazy formula like this is no good.  If you want to make the
> cost of everything but V2D* be the same, and that of V2D* be twice that,
> that is a weird heuristic, but we can live with that perhaps.  But that
> beats completely unexplained (and unexplainable) magic!
> 
> Sorry.
> 

That's all right, thanks for the comments!  let's improve it.  :)
How about just assigning 2 for V2DI and 1 for the others as a
penalized_cost_per_load, with some detailed commentary?  It should have the
same effect as this "magic crazy formula", but I guess it would be clearer.

BR,
Kewen


Re: [PATCH] rs6000: Modify the way for extra penalized cost

2021-09-20 Thread Kewen.Lin via Gcc-patches
Hi Bill,

Thanks for the review!

on 2021/9/18 上午12:34, Bill Schmidt wrote:
> Hi Kewen,
> 
> On 9/15/21 8:14 PM, Kewen.Lin wrote:
>> Hi,
>>
>> This patch follows the discussion here[1], where Segher pointed
>> out the existing way to guard the extra penalized cost for
>> strided/elementwise loads with a magic bound doesn't scale.
>>
>> The way with nunits * stmt_cost can get one much exaggerated
>> penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320,
>> that's why we need one bound.  To make it scale, this patch
>> doesn't use nunits * stmt_cost any more, but it still keeps
>> nunits since there are actually nunits scalar loads there.  So
>> it uses one cost adjusted from stmt_cost, since the current
>> stmt_cost sort of considers nunits, we can stabilize the cost
>> for big nunits and retain the cost for small nunits.  After
>> some tries, this patch gets the adjusted cost as:
>>
>>  stmt_cost / (log2(nunits) * log2(nunits))
>>
>> For V16QI, the adjusted cost would be 1 and total penalized
>> cost is 16, it isn't exaggerated.  For V2DI, the adjusted
>> cost would be 2 and total penalized cost is 4, which is the
>> same as before.  btw, I tried to use one single log2(nunits),
>> but the penalized cost is still big enough and can't fix the
>> degraded bmk blender_r.
>>
>> The separated SPEC2017 evaluations on Power8, Power9 and Power10
>> at option sets O2-vect and Ofast-unroll showed this change is
>> neutral (that is same effect as before).
>>
>> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
>>
>> BR,
>> Kewen
>> -
>> gcc/ChangeLog:
>>
>> * config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
>> the way to compute extra penalized cost.
>>
>> ---
>>   gcc/config/rs6000/rs6000.c | 28 +---
>>   1 file changed, 17 insertions(+), 11 deletions(-)
>>
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index 4ab23b0ab33..e08b94c0447 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>> @@ -5454,17 +5454,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>   {
>>     tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>>     unsigned int nunits = vect_nunits_for_cost (vectype);
>> -  unsigned int extra_cost = nunits * stmt_cost;
>> -  /* As function rs6000_builtin_vectorization_cost shows, we have
>> - priced much on V16QI/V8HI vector construction as their units,
>> - if we penalize them with nunits * stmt_cost, it can result in
>> - an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>> - is 20 and nunits is 16, the extra cost is 320 which looks
>> - much exaggerated.  So let's use one maximum bound for the
>> - extra penalized cost for vector construction here.  */
>> -  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>> -  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>> -    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>> +  /* As function rs6000_builtin_vectorization_cost shows, we
>> + have priced much on V16QI/V8HI vector construction by
>> + considering their units, if we penalize them with nunits
>> + * stmt_cost here, it can result in an unreliable body cost,
> 
> This might be confusing to the reader, since you have deleted the calculation 
> of nunits * stmt_cost.  Could you instead write this to indicate that we used 
> to adjust in this way, and it had this particular downside, so that's why 
> you're choosing this heuristic? It's a minor thing but I think people reading 
> the code will be confused otherwise.
> 

Good point!  I'll update the commentary to explain it, thanks!!

BR,
Kewen 

> I think the heuristic is generally reasonable, and certainly better than what 
> we had before!
> 
> LGTM with adjusted commentary, so recommend maintainers approve.
> 
> Thanks for the patch!
> Bill
>> + eg: for V16QI on Power8, stmt_cost is 20 and nunits is 16,
>> + the penalty will be 320 which looks much exaggerated.  But
>> + there are actually nunits scalar loads, so we try to adopt
>> + one reasonable penalized cost for each load rather than
>> + stmt_cost.  Here, with stmt_cost dividing by log2(nunits)^2,
>> + we can still retain the necessary penalty for small nunits
>> + meanwhile stabilize the penalty for big nunits.  */
>> +  int nunits_log2 = exact_log2 (nunits);
>> +  gcc_assert (nunits_log2 > 0);
>> +  unsigned int nunits_sq = nunits_log2 * nunits_log2;
>> +  unsigned int adjusted_cost = stmt_cost / nunits_sq;
>> +  gcc_assert (adjusted_cost > 0);
>> +  unsigned int extra_cost = nunits * adjusted_cost;
>>     data->extra_ctor_cost += extra_cost;
>>   }
>>   }
>> -- 
>> 2.25.1
> 


Re: [PATCH v2] ipa-inline: Add target info into fn summary [PR102059]

2021-09-20 Thread Kewen.Lin via Gcc-patches
Hi Martin,

on 2021/9/17 下午7:26, Martin Jambor wrote:
> Hi,
> 
> On Fri, Sep 17 2021, Kewen.Lin wrote:
>> on 2021/9/16 下午9:19, Martin Jambor wrote:
>>> On Thu, Sep 16 2021, Kewen.Lin wrote:
>>>> on 2021/9/15 下午8:51, Martin Jambor wrote:
>>>>> On Wed, Sep 08 2021, Kewen.Lin wrote:
>>>>>>
>>>>>
>>>>> [...]
>>>>>
>>>>>> diff --git a/gcc/ipa-fnsummary.h b/gcc/ipa-fnsummary.h
>>>>>> index 78399b0b9bb..300b8da4507 100644
>>>>>> --- a/gcc/ipa-fnsummary.h
>>>>>> +++ b/gcc/ipa-fnsummary.h
>>>>>> @@ -193,6 +194,9 @@ public:
>>>>>>vec *loop_strides;
>>>>>>/* Parameters tested by builtin_constant_p.  */
>>>>>>vec GTY((skip)) builtin_constant_p_parms;
>>>>>> +  /* Like fp_expressions, but it's to hold some target specific information,
>>>>>> + such as some target specific isa flags.  */
>>>>>> +  auto_vec GTY((skip)) target_info;
>>>>>>/* Estimated growth for inlining all copies of the function before 
>>>>>> start
>>>>>>   of small functions inlining.
>>>>>>   This value will get out of date as the callers are duplicated, but
>>>>>
>>>>> Segher already wrote in the first thread that a vector of HOST_WIDE_INTs
>>>>> is an overkill and I agree.  So at least make the new field just a
>>>>> HOST_WIDE_INT or better yet, an unsigned int.  But I would even go
>>>>> further and make target_info only a 16-bit bit-field, place it after the
>>>>> other bit-fields in class ipa_fn_summary and pass it to the hooks as
>>>>> uint16_t.  Unless you have plans which require more space, I think we
>>>>> should be conservative here.
>>>>>
>>>>
>>>> OK, yeah, the consideration is mainly for the scenario that target has
>>>> a few bits to care about.  I just realized that to avoid inefficient
>>>> bitwise operation for mapping target info bits to isa_flag bits, target
>> can rearrange the sparse bits in isa_flag, so it's not a big deal.
>>>> Thanks for re-raising this!  I'll use the 16 bits bit-field in v3 as you
>>>> suggested, if you don't mind, I will put it before the existing bit-fields
>>>> to have a good alignment.
>>>
>>> All right.
>>>
>>
Sorry that I failed to use a 16-bit bit-field for this; I figured out that
bit-fields cannot be address-taken or passed by non-const reference.
gengtype also failed to recognize uint16_t when I used uint16_t directly
in ipa-fnsummary.h.  Finally I used unsigned int instead.
>>
> 
> well, you could have used:
> 
>   unsigned int target_info : 16;
> 
> for the field (and uint16_t when passed to hooks).
> 
> But I am not sure if it is that crucial.
> 

I may be missing something; specifically, I tried:

1)

  unsigned int target_info : 16;
  unsigned inlinable : 1;
  ...

  update_ipa_fn_target_info (uint16_t &, const gimple *)

2)

  unsigned int target_info : 16;
  unsigned inlinable : 1;
  ...

  update_ipa_fn_target_info (uint16_t *, const gimple *)

The above two ways failed due to:

"Because bit fields do not necessarily begin at the beginning of a byte,
address of a bit field cannot be taken. Pointers and non-const references
to bit fields are not possible." as [1].

Although we can change the hook prototype to

  bool update_ipa_fn_target_info (const uint16_t, const gimple*, uint16_t&)

or

  uint16_t update_ipa_fn_target_info (const uint16_t, const gimple*, bool&)

to workaround bit field limitation, it looks weird and inefficient.

3)

  ...
  unsigned int fp_expressions : 1;
  uint16_t target_info;

  update_ipa_fn_target_info (uint16_t &, const gimple *)

it fails due to gengtype erroring:

gcc/ipa-fnsummary.h:171: undefined type `uint16_t'
gengtype: didn't write state file tmp-gtype.state after errors


Then I gave up and guessed it's not so crucial, like you said,
and used unsigned int instead. :)

[1] https://en.cppreference.com/w/cpp/language/bit_field

BR,
Kewen


Re: [PATCH] ipa-fnsummary: Remove inconsistent bp_pack_value

2021-09-20 Thread Kewen.Lin via Gcc-patches
Hi Richi,

Thanks for the review!

on 2021/9/17 下午6:04, Richard Biener wrote:
> On Fri, Sep 17, 2021 at 12:03 PM Richard Biener
>  wrote:
>>
>> On Fri, Sep 17, 2021 at 11:43 AM Kewen.Lin  wrote:
>>>
>>> Hi,
>>>
>>> When changing target_info to a bitfield, I happened to find this
>>> inconsistency between streaming out and streaming in.  We have on the
>>> streaming-out side:
>>>
>>>   bp_pack_value (&bp, info->inlinable, 1);
>>>   bp_pack_value (&bp, false, 1);
>>>   bp_pack_value (&bp, info->fp_expressions, 1);
>>>
>>> while on the streaming-in side:
>>>
>>>   info->inlinable = bp_unpack_value (&bp, 1);
>>>   info->fp_expressions = bp_unpack_value (&bp, 1);
>>>
>>> The Cilk Plus support cleanup removed the bit unpacking on the read
>>> side but kept the packing on the write side, changing it to stream a
>>> constant false.
>>>
>>> By hacking fp_expression_p to return true always, I can see it
>>> reads the wrong fp_expressions value (false) out in wpa dumping.
>>>
>>> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>>>
>>> Is it ok for trunk?
>>
>> OK for trunk and all affected branches (note we need to bump the
>> LTO minor version there).  The issue comes from the removal
>> of cilk+ in r8-4956 which removed the bp_unpack but replaced
>> the bp_pack ...
>>
>> It's a correctness issue as we'll read fp_expressions as always 'false'
> 
> Btw, on branches we could also simply unpack a dummy bit to avoid
> changing the format.
> 

Committed in r12-3721.  Thanks!

As suggested, the patch for branches is listed below.

Is ok for branches 9, 10 and 11 after some trunk burn in time?

BR,
Kewen
-
gcc/ChangeLog:

* ipa-fnsummary.c (inline_read_section): Unpack a dummy bit
to keep consistent with the side of streaming out.

---
diff --git a/gcc/ipa-fnsummary.c b/gcc/ipa-fnsummary.c
index 18bbae145b9..bf635c1f78a 100644
--- a/gcc/ipa-fnsummary.c
+++ b/gcc/ipa-fnsummary.c
@@ -4403,13 +4403,20 @@ inline_read_section (struct lto_file_decl_data *file_data, const char *data,
       bp = streamer_read_bitpack (&ib);
       if (info)
 	{
-	  info->inlinable = bp_unpack_value (&bp, 1);
-	  info->fp_expressions = bp_unpack_value (&bp, 1);
+	  info->inlinable = bp_unpack_value (&bp, 1);
+	  /* On the side of streaming out, there is still one bit
+	     streamed out between inlinable and fp_expressions bits,
+	     which was used for cilk+ before but now always false.
+	     To remove the bit packing need to bump LTO minor version,
+	     so unpack a dummy bit here to keep consistent instead.  */
+	  bp_unpack_value (&bp, 1);
+	  info->fp_expressions = bp_unpack_value (&bp, 1);
 	}
       else
 	{
-	  bp_unpack_value (&bp, 1);
-	  bp_unpack_value (&bp, 1);
+	  bp_unpack_value (&bp, 1);
+	  bp_unpack_value (&bp, 1);
+	  bp_unpack_value (&bp, 1);
 	}

   count2 = streamer_read_uhwi ();



Re: [PATCH v2] ipa-inline: Add target info into fn summary [PR102059]

2021-09-17 Thread Kewen.Lin via Gcc-patches
Hi Martin,

on 2021/9/16 下午9:19, Martin Jambor wrote:
> Hi,
> 
> On Thu, Sep 16 2021, Kewen.Lin wrote:
>> Hi Martin,
>>
>> Thanks for the review comments!
>>
>> on 2021/9/15 下午8:51, Martin Jambor wrote:
>>> Hi,
>>>
>>> since this is inlining-related, I would somewhat prefer Honza to have a
>>> look too, but I have the following comments:
>>>
>>> On Wed, Sep 08 2021, Kewen.Lin wrote:
>>>>
>>>
>>> [...]
>>>
>>>> diff --git a/gcc/ipa-fnsummary.h b/gcc/ipa-fnsummary.h
>>>> index 78399b0b9bb..300b8da4507 100644
>>>> --- a/gcc/ipa-fnsummary.h
>>>> +++ b/gcc/ipa-fnsummary.h
>>>> @@ -193,6 +194,9 @@ public:
>>>>vec *loop_strides;
>>>>/* Parameters tested by builtin_constant_p.  */
>>>>vec GTY((skip)) builtin_constant_p_parms;
>>>> +  /* Like fp_expressions, but it's to hold some target specific information,
>>>> + such as some target specific isa flags.  */
>>>> +  auto_vec GTY((skip)) target_info;
>>>>/* Estimated growth for inlining all copies of the function before start
>>>>   of small functions inlining.
>>>>   This value will get out of date as the callers are duplicated, but
>>>
>>> Segher already wrote in the first thread that a vector of HOST_WIDE_INTs
>>> is an overkill and I agree.  So at least make the new field just a
>>> HOST_WIDE_INT or better yet, an unsigned int.  But I would even go
>>> further and make target_info only a 16-bit bit-field, place it after the
>>> other bit-fields in class ipa_fn_summary and pass it to the hooks as
>>> uint16_t.  Unless you have plans which require more space, I think we
>>> should be conservative here.
>>>
>>
>> OK, yeah, the consideration is mainly for the scenario that target has
>> a few bits to care about.  I just realized that to avoid inefficient
>> bitwise operation for mapping target info bits to isa_flag bits, target
>> can rearrange the sparse bits in isa_flag, so it's not a big deal.
>> Thanks for re-raising this!  I'll use the 16 bits bit-field in v3 as you
>> suggested, if you don't mind, I will put it before the existing bit-fields
>> to have a good alignment.
> 
> All right.
> 

Sorry that I failed to use a 16-bit bit-field for this; I figured out that
bit-fields cannot be address-taken or passed by non-const reference.
gengtype also failed to recognize uint16_t when I used uint16_t directly
in ipa-fnsummary.h.  Finally I used unsigned int instead.


>>>>/* When optimizing and analyzing for IPA inliner, initialize loop 
>>>> optimizer
>>>>   so we can produce proper inline hints.
>>>> @@ -2659,6 +2669,12 @@ analyze_function_body (struct cgraph_node *node, 
>>>> bool early)
>>>>   bb_predicate,
>>>>   bb_predicate);
>>>>  
>>>> +  /* Only look for target information for inlinable functions.  */
>>>> +  bool scan_for_target_info =
>>>> +info->inlinable
>>>> +&& targetm.target_option.need_ipa_fn_target_info (node->decl,
>>>> +info->target_info);
>>>> +
>>>>if (fbi.info)
>>>>  compute_bb_predicates (, node, info, params_summary);
>>>>const profile_count entry_count = ENTRY_BLOCK_PTR_FOR_FN (cfun)->count;
>>>> @@ -2876,6 +2892,10 @@ analyze_function_body (struct cgraph_node *node, 
>>>> bool early)
>>>>  if (dump_file)
>>>>fprintf (dump_file, "   fp_expression set\n");
>>>>}
>>>> +if (scan_for_target_info)
>>>> +  scan_for_target_info =
>>>> +targetm.target_option.update_ipa_fn_target_info
>>>> +(info->target_info, stmt);
>>>>}
>>>
>>> Practically it probably does not matter, but why is this in the "if
>>> (this_time || this_size)" block?  Although I can see that setting
>>> fp_expression is also done that way... but it seems like copying a
>>> mistake to me.
>>
>> Yeah, I felt target info scanning is similar to fp_expression scanning,
>> so I just followed the same way.  If I read it right, the case
>> !(this_time || this_size) means the STMT won't be weighted to any RTL
>> insn from both time and size perspectives, so guarding it seems to avoid

[PATCH] ipa-fnsummary: Remove inconsistent bp_pack_value

2021-09-17 Thread Kewen.Lin via Gcc-patches
Hi,

When changing target_info to a bitfield, I happened to find this
inconsistency between streaming out and streaming in.  We have on the
streaming-out side:

  bp_pack_value (&bp, info->inlinable, 1);
  bp_pack_value (&bp, false, 1);
  bp_pack_value (&bp, info->fp_expressions, 1);

while on the streaming-in side:

  info->inlinable = bp_unpack_value (&bp, 1);
  info->fp_expressions = bp_unpack_value (&bp, 1);

The Cilk Plus support cleanup removed the bit unpacking on the read side
but kept the packing on the write side, changing it to stream a constant
false.

By hacking fp_expression_p to return true always, I can see it
reads the wrong fp_expressions value (false) out in wpa dumping.

Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.

Is it ok for trunk?

BR,
Kewen
-
gcc/ChangeLog:

* ipa-fnsummary.c (ipa_fn_summary_write): Remove inconsistent
bitfield streaming out.

diff --git a/gcc/ipa-fnsummary.c b/gcc/ipa-fnsummary.c
index 2470937460f..31199919405 100644
--- a/gcc/ipa-fnsummary.c
+++ b/gcc/ipa-fnsummary.c
@@ -4652,7 +4652,6 @@ ipa_fn_summary_write (void)
   info->time.stream_out (ob);
   bp = bitpack_create (ob->main_stream);
   bp_pack_value (, info->inlinable, 1);
-  bp_pack_value (, false, 1);
   bp_pack_value (, info->fp_expressions, 1);
   streamer_write_bitpack ();
   streamer_write_uhwi (ob, vec_safe_length (info->conds));


[PATCH v3] ipa-inline: Add target info into fn summary [PR102059]

2021-09-17 Thread Kewen.Lin via Gcc-patches
Hi!

Power ISA 2.07 (Power8) introduces the transactional memory
feature, but ISA 3.1 (Power10) removes it.  This exposes one
troublesome issue, as PR102059 shows.  A user defines a function
with target pragma cpu=power10, and it calls a function with
attribute always_inline which inherits the command line option
cpu=power8, which enables HTM implicitly.  The current isa_flags
check doesn't allow this inlining due to "target specific
option mismatch", and an error message is emitted.

Normally, the callee function isn't intended to exploit the HTM
feature, but the default flag setting makes it look as if it does.
As Richi raised in the PR, we have the fp_expressions flag in the
function summary, which allows us to check whether the function
actually contains any floating point expressions, to avoid overkill.
So this patch follows a similar idea but is more target
specific: for this rs6000 port-specific requirement on the HTM
feature check, we would like to check for rs6000-specific HTM
built-in functions and inline assembly.  This allows targets
to do their own customized checks and updates.

It introduces two target hooks, need_ipa_fn_target_info and
update_ipa_fn_target_info.  The former allows the target to do
some preliminary checks and decide whether to collect target
specific information for the given function; for some special
cases, it can predict the analysis result and set it early
without any scanning.  The latter lets analyze_function_body
pass gimple stmts down just like the fp_expressions handling,
so the target can do its own tricks.  I initially made them one
hook with a boolean to indicate whether it's the initial query,
but the code looked a bit ugly; separating them seems to give
better readability.

Against v2 [2], this v3 addressed Martin's review comments:
  - Replace the HWI auto_vec with unsigned int for target_info
    to avoid overkill (also per Segher's comments), and adjust
    the places that need to be updated for this change.
  - Annotate that target_info won't be streamed for offloading
    target compilers.
  - Scan all gimple statements instead of only those with
    non-zero size/time weights.

Against v1 [1], v2 addressed Richi's and Segher's review
comments, mainly consisting of:
  - Extend it to cover non always_inline.
  - Exclude the case for offload streaming.
  - Some function naming and formatting issues.
  - Adjust rs6000_can_inline_p.
  - Add new cases.

Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.

Any comments are highly appreciated!

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578555.html
[2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579045.html
--
gcc/ChangeLog:

PR ipa/102059
* config/rs6000/rs6000-call.c (rs6000_fn_has_any_of_these_mask_bits):
New function.
* config/rs6000/rs6000-internal.h
(rs6000_fn_has_any_of_these_mask_bits): New declare.
* config/rs6000/rs6000.c (TARGET_NEED_IPA_FN_TARGET_INFO): New macro.
(TARGET_UPDATE_IPA_FN_TARGET_INFO): Likewise.
(rs6000_need_ipa_fn_target_info): New function.
(rs6000_update_ipa_fn_target_info): Likewise.
(rs6000_can_inline_p): Adjust for ipa function summary target info.
* config/rs6000/rs6000.h (RS6000_FN_TARGET_INFO_HTM): New macro.
* ipa-fnsummary.c (ipa_dump_fn_summary): Adjust for ipa function
summary target info.
(analyze_function_body): Adjust for ipa function summary target
info and call hook rs6000_need_ipa_fn_target_info and
rs6000_update_ipa_fn_target_info.
(ipa_merge_fn_summary_after_inlining): Adjust for ipa function
summary target info.
(inline_read_section): Likewise.
(ipa_fn_summary_write): Likewise.
* ipa-fnsummary.h (ipa_fn_summary::target_info): New member.
* doc/tm.texi: Regenerate.
* doc/tm.texi.in (TARGET_UPDATE_IPA_FN_TARGET_INFO): Document new
hook.
(TARGET_NEED_IPA_FN_TARGET_INFO): Likewise.
* target.def (update_ipa_fn_target_info): New hook.
(need_ipa_fn_target_info): Likewise.
* targhooks.c (default_need_ipa_fn_target_info): New function.
(default_update_ipa_fn_target_info): Likewise.
* targhooks.h (default_update_ipa_fn_target_info): New declare.
(default_need_ipa_fn_target_info): Likewise.

gcc/testsuite/ChangeLog:

PR ipa/102059
* gcc.dg/lto/pr102059-1_0.c: New test.
* gcc.dg/lto/pr102059-1_1.c: New test.
* gcc.dg/lto/pr102059-1_2.c: New test.
* gcc.dg/lto/pr102059-2_0.c: New test.
* gcc.dg/lto/pr102059-2_1.c: New test.
* gcc.dg/lto/pr102059-2_2.c: New test.
* gcc.target/powerpc/pr102059-5.c: New test.
* gcc.target/powerpc/pr102059-6.c: New test.
* gcc.target/powerpc/pr102059-7.c: New test.

---
 gcc/config/rs6000/rs6000-call.c   | 12 +++
 gcc/config/rs6000/rs6000-internal.h   |  3 +
 gcc/config/rs6000/rs6000.c| 87 +--
 

Re: [PATCH v2] ipa-inline: Add target info into fn summary [PR102059]

2021-09-15 Thread Kewen.Lin via Gcc-patches
Hi Martin,

Thanks for the review comments!

on 2021/9/15 8:51 PM, Martin Jambor wrote:
> Hi,
> 
> since this is inlining-related, I would somewhat prefer Honza to have a
> look too, but I have the following comments:
> 
> On Wed, Sep 08 2021, Kewen.Lin wrote:
>>
> 
> [...]
> 
>> diff --git a/gcc/ipa-fnsummary.h b/gcc/ipa-fnsummary.h
>> index 78399b0b9bb..300b8da4507 100644
>> --- a/gcc/ipa-fnsummary.h
>> +++ b/gcc/ipa-fnsummary.h
>> @@ -193,6 +194,9 @@ public:
>>vec *loop_strides;
>>/* Parameters tested by builtin_constant_p.  */
>>vec GTY((skip)) builtin_constant_p_parms;
>> +  /* Like fp_expressions, but it's to hold some target specific information,
>> + such as some target specific isa flags.  */
>> +  auto_vec GTY((skip)) target_info;
>>/* Estimated growth for inlining all copies of the function before start
>>   of small functions inlining.
>>   This value will get out of date as the callers are duplicated, but
> 
> Segher already wrote in the first thread that a vector of HOST_WIDE_INTs
> is an overkill and I agree.  So at least make the new field just a
> HOST_WIDE_INT or better yet, an unsigned int.  But I would even go
> further and make target_info only a 16-bit bit-field, place it after the
> other bit-fields in class ipa_fn_summary and pass it to the hooks as
> uint16_t.  Unless you have plans which require more space, I think we
> should be conservative here.
> 

OK, yeah, the consideration was mainly for the scenario where a target
has quite a few bits to care about.  I just realized that to avoid
inefficient bitwise operations when mapping target info bits to isa_flag
bits, a target can rearrange the sparse bits in isa_flag, so it's not a
big deal.  Thanks for re-raising this!  I'll use the 16-bit bit-field in
v3 as you suggested; if you don't mind, I will put it before the existing
bit-fields to have good alignment.
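For readers following along, here is a minimal sketch of the 16-bit
bit-field approach being settled on; the struct layout, field names, and
the HTM bit below are illustrative stand-ins, not the actual
ipa_fn_summary or rs6000 definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one target-specific bit, analogous to the
   RS6000_FN_TARGET_INFO_HTM macro added by the patch.  */
#define FN_TARGET_INFO_HTM 0x1

/* Sketch of the summary layout: the 16-bit target_info sits before the
   existing one-bit fields, as discussed above, for good alignment.  */
struct fn_summary_sketch
{
  uint16_t target_info;
  unsigned inlinable : 1;
  unsigned fp_expressions : 1;
};

/* Record that a scanned statement requires the HTM feature.  */
static void
note_htm_use (struct fn_summary_sketch *s)
{
  s->target_info |= FN_TARGET_INFO_HTM;
}

/* Query whether the summarized function needs HTM.  */
static int
uses_htm_p (const struct fn_summary_sketch *s)
{
  return (s->target_info & FN_TARGET_INFO_HTM) != 0;
}
```

A caller in the spirit of rs6000_can_inline_p would then test the bit
instead of relying on the callee's default isa_flags.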

> I am also not sure if I agree that the field should not be streamed for
> offloading, but since we do not have an offloading compiler needing them
> I guess for now that is OK. But it should be documented in the comment
> describing the field that it is not streamed to offloading compilers.
> 

Good point, will add it in v3.

> [...]
> 
> 
>> diff --git a/gcc/ipa-fnsummary.c b/gcc/ipa-fnsummary.c
>> index 2470937460f..72091b6193f 100644
>> --- a/gcc/ipa-fnsummary.c
>> +++ b/gcc/ipa-fnsummary.c
>> @@ -2608,6 +2617,7 @@ analyze_function_body (struct cgraph_node *node, bool 
>> early)
>>info->conds = NULL;
>>info->size_time_table.release ();
>>info->call_size_time_table.release ();
>> +  info->target_info.release ();
>>  
>>/* When optimizing and analyzing for IPA inliner, initialize loop 
>> optimizer
>>   so we can produce proper inline hints.
>> @@ -2659,6 +2669,12 @@ analyze_function_body (struct cgraph_node *node, bool 
>> early)
>> bb_predicate,
>> bb_predicate);
>>  
>> +  /* Only look for target information for inlinable functions.  */
>> +  bool scan_for_target_info =
>> +info->inlinable
>> +&& targetm.target_option.need_ipa_fn_target_info (node->decl,
>> +  info->target_info);
>> +
>>if (fbi.info)
>>  compute_bb_predicates (, node, info, params_summary);
>>const profile_count entry_count = ENTRY_BLOCK_PTR_FOR_FN (cfun)->count;
>> @@ -2876,6 +2892,10 @@ analyze_function_body (struct cgraph_node *node, bool 
>> early)
>>if (dump_file)
>>  fprintf (dump_file, "   fp_expression set\n");
>>  }
>> +  if (scan_for_target_info)
>> +scan_for_target_info =
>> +  targetm.target_option.update_ipa_fn_target_info
>> +  (info->target_info, stmt);
>>  }
> 
> Practically it probably does not matter, but why is this in the "if
> (this_time || this_size)" block?  Although I can see that setting
> fp_expression is also done that way... but it seems like copying a
> mistake to me.

Yeah, I felt the target info scanning is similar to the fp_expression
scanning, so I just followed the same way.  If I read it right, the case
!(this_time || this_size) means the STMT won't be weighted as any RTL
insn from either the time or the size perspective, so guarding on it seems
to avoid unnecessary scanning.  I assumed that target bifs and inline asm
would not be evaluated as zero cost; that seems safe so far for the HTM usage.

Do you worry about some special STMT which is weighted to zero but still
needs to be checked for target info in the long term?
If so, I'll move it out in v3.
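To make the concern concrete, here is a hypothetical sketch of the guard
pattern (all names are invented for illustration): a statement that
carries target info but happens to be weighted to zero would be skipped,
which is exactly the special case asked about above.

```c
#include <assert.h>

/* weights[i] plays the role of this_time || this_size for statement i;
   has_info[i] says whether statement i would set target info.
   Returns whether any target info was found under the guard.  */
static int
scan_with_guard (const int *weights, const int *has_info, int n)
{
  int found = 0;
  for (int i = 0; i < n; i++)
    {
      /* The guard: zero-weighted statements are never scanned.  */
      if (!weights[i])
        continue;
      if (has_info[i])
        found = 1;
    }
  return found;
}
```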
> 
> All that said, the overall approach seems correct to me.
> 

Thanks again.
BR,
Kewen


[PATCH] rs6000: Modify the way for extra penalized cost

2021-09-15 Thread Kewen.Lin via Gcc-patches
Hi,

This patch follows the discussion here[1], where Segher pointed
out the existing way to guard the extra penalized cost for
strided/elementwise loads with a magic bound doesn't scale.

The nunits * stmt_cost formula can yield a much exaggerated
penalized cost, e.g. for V16QI on P8 it's 16 * 20 = 320,
which is why we needed a bound.  To make it scale, this patch
no longer uses nunits * stmt_cost, but it still keeps
nunits since there are actually nunits scalar loads there.
Instead it uses a cost adjusted from stmt_cost; since the current
stmt_cost already considers nunits to some degree, we can stabilize
the cost for big nunits while retaining the cost for small nunits.
After some experiments, this patch settles on the adjusted cost:

stmt_cost / (log2(nunits) * log2(nunits))

For V16QI, the adjusted cost is 1 and the total penalized
cost is 16, which isn't exaggerated.  For V2DI, the adjusted
cost is 2 and the total penalized cost is 4, which is the
same as before.  By the way, I tried a single log2(nunits)
divisor, but the penalized cost was still big enough that it
couldn't fix the degraded benchmark 526.blender_r.
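The adjusted-cost formula can be checked numerically with a small sketch;
the stmt_cost values below come from the examples in this mail (20 for
V16QI on Power8, and 2 for V2DI as implied by its adjusted cost of 2),
and the function shape is an illustration rather than the actual
rs6000.c code:

```c
#include <assert.h>

/* floor(log2(x)) for a power-of-two x, mirroring exact_log2 for the
   vector unit counts used here (2, 4, 8, 16).  */
static unsigned int
log2_pow2 (unsigned int x)
{
  unsigned int n = 0;
  while (x > 1)
    {
      x >>= 1;
      n++;
    }
  return n;
}

/* Total penalized cost: nunits scalar loads, each priced at
   stmt_cost / (log2(nunits) * log2(nunits)).  */
static unsigned int
penalized_cost (unsigned int nunits, unsigned int stmt_cost)
{
  unsigned int l = log2_pow2 (nunits);
  unsigned int adjusted = stmt_cost / (l * l);
  return nunits * adjusted;
}
```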

The separated SPEC2017 evaluations on Power8, Power9 and Power10
at option sets O2-vect and Ofast-unroll showed this change is
neutral (that is same effect as before).

Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.

Is it ok for trunk?

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html

BR,
Kewen
-
gcc/ChangeLog:

* config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
the way to compute extra penalized cost.

---
 gcc/config/rs6000/rs6000.c | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 4ab23b0ab33..e08b94c0447 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5454,17 +5454,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data 
*data,
{
  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
  unsigned int nunits = vect_nunits_for_cost (vectype);
- unsigned int extra_cost = nunits * stmt_cost;
- /* As function rs6000_builtin_vectorization_cost shows, we have
-priced much on V16QI/V8HI vector construction as their units,
-if we penalize them with nunits * stmt_cost, it can result in
-an unreliable body cost, eg: for V16QI on Power8, stmt_cost
-is 20 and nunits is 16, the extra cost is 320 which looks
-much exaggerated.  So let's use one maximum bound for the
-extra penalized cost for vector construction here.  */
- const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
- if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
-   extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
+ /* As function rs6000_builtin_vectorization_cost shows, we
+have priced much on V16QI/V8HI vector construction by
+considering their units, if we penalize them with nunits
+* stmt_cost here, it can result in an unreliable body cost,
+eg: for V16QI on Power8, stmt_cost is 20 and nunits is 16,
+the penalty will be 320 which looks much exaggerated.  But
+there are actually nunits scalar loads, so we try to adopt
+one reasonable penalized cost for each load rather than
+stmt_cost.  Here, with stmt_cost dividing by log2(nunits)^2,
+we can still retain the necessary penalty for small nunits
+meanwhile stabilize the penalty for big nunits.  */
+ int nunits_log2 = exact_log2 (nunits);
+ gcc_assert (nunits_log2 > 0);
+ unsigned int nunits_sq = nunits_log2 * nunits_log2;
+ unsigned int adjusted_cost = stmt_cost / nunits_sq;
+ gcc_assert (adjusted_cost > 0);
+ unsigned int extra_cost = nunits * adjusted_cost;
  data->extra_ctor_cost += extra_cost;
}
 }
--
2.25.1


[PATCH] rs6000: Parameterize some const values for density test

2021-09-15 Thread Kewen.Lin via Gcc-patches
Hi,

This patch follows the discussion here[1], where Segher suggested
parameterizing those exact magic constants for the density heuristics,
to make them easier to tweak if needed.

Since these heuristics are quite internal, I make these parameters
undocumented; they are mainly meant for developers.

The change here should be "No Functional Change".  But I verified
it with SPEC2017 at option sets O2-vect and Ofast-unroll on Power8,
the result is neutral as expected.
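As a sketch of what the parameterized density test computes (the default
values match the --param initializers in the patch; the function itself
is a simplified illustration, not the real rs6000_density_test):

```c
#include <assert.h>

/* Defaults of the new undocumented --param knobs.  */
static const unsigned int density_pct_threshold = 85;
static const unsigned int density_size_threshold = 70;
static const unsigned int density_penalty = 10;

/* Return the (possibly penalized) vector body cost: penalize by
   density_penalty percent when the vectorized-cost proportion exceeds
   density_pct_threshold and the loop is big enough overall.  */
static unsigned int
density_test (unsigned int vec_cost, unsigned int not_vec_cost)
{
  unsigned int density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
  if (density_pct > density_pct_threshold
      && vec_cost + not_vec_cost > density_size_threshold)
    return vec_cost * (100 + density_penalty) / 100;
  return vec_cost;
}
```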

Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.

Is it ok for trunk?

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html

BR,
Kewen
-
gcc/ChangeLog:

* config/rs6000/rs6000.opt (rs6000-density-pct-threshold,
rs6000-density-size-threshold, rs6000-density-penalty,
rs6000-density-load-pct-threshold,
rs6000-density-load-num-threshold): New parameter.
* config/rs6000/rs6000.c (rs6000_density_test): Adjust with
corresponding parameters.

---
 gcc/config/rs6000/rs6000.c   | 22 +++---
 gcc/config/rs6000/rs6000.opt | 21 +
 2 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 9bc826e3a50..4ab23b0ab33 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5284,9 +5284,6 @@ struct rs6000_cost_data
 static void
 rs6000_density_test (rs6000_cost_data *data)
 {
-  const int DENSITY_PCT_THRESHOLD = 85;
-  const int DENSITY_SIZE_THRESHOLD = 70;
-  const int DENSITY_PENALTY = 10;
   struct loop *loop = data->loop_info;
   basic_block *bbs = get_loop_body (loop);
   int nbbs = loop->num_nodes;
@@ -5322,26 +5319,21 @@ rs6000_density_test (rs6000_cost_data *data)
   free (bbs);
   density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);

-  if (density_pct > DENSITY_PCT_THRESHOLD
-  && vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD)
+  if (density_pct > rs6000_density_pct_threshold
+  && vec_cost + not_vec_cost > rs6000_density_size_threshold)
 {
-  data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100;
+  data->cost[vect_body] = vec_cost * (100 + rs6000_density_penalty) / 100;
   if (dump_enabled_p ())
dump_printf_loc (MSG_NOTE, vect_location,
 "density %d%%, cost %d exceeds threshold, penalizing "
-"loop body cost by %d%%\n", density_pct,
-vec_cost + not_vec_cost, DENSITY_PENALTY);
+"loop body cost by %u%%\n", density_pct,
+vec_cost + not_vec_cost, rs6000_density_penalty);
 }

   /* Check whether we need to penalize the body cost to account
  for excess strided or elementwise loads.  */
   if (data->extra_ctor_cost > 0)
 {
-  /* Threshold for load stmts percentage in all vectorized stmts.  */
-  const int DENSITY_LOAD_PCT_THRESHOLD = 45;
-  /* Threshold for total number of load stmts.  */
-  const int DENSITY_LOAD_NUM_THRESHOLD = 20;
-
   gcc_assert (data->nloads <= data->nstmts);
   unsigned int load_pct = (data->nloads * 100) / data->nstmts;

@@ -5355,8 +5347,8 @@ rs6000_density_test (rs6000_cost_data *data)
  the loads.
 One typical case is the innermost loop of the hotspot of SPEC2017
 503.bwaves_r without loop interchange.  */
-  if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
- && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
+  if (data->nloads > (unsigned int) rs6000_density_load_num_threshold
+ && load_pct > (unsigned int) rs6000_density_load_pct_threshold)
{
  data->cost[vect_body] += data->extra_ctor_cost;
  if (dump_enabled_p ())
diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
index 0538db387dc..563983f3269 100644
--- a/gcc/config/rs6000/rs6000.opt
+++ b/gcc/config/rs6000/rs6000.opt
@@ -639,3 +639,24 @@ Enable instructions that guard against return-oriented 
programming attacks.
 mprivileged
 Target Var(rs6000_privileged) Init(0)
 Generate code that will run in privileged state.
+
+-param=rs6000-density-pct-threshold=
+Target Undocumented Joined UInteger Var(rs6000_density_pct_threshold) Init(85) 
IntegerRange(0, 99) Param
+When costing for loop vectorization, we probably need to penalize the loop 
body cost if the existing cost model may not adequately reflect delays from 
unavailable vector resources.  We collect the cost for vectorized statements 
and non-vectorized statements separately, check the proportion of vec_cost to 
total cost of vec_cost and non vec_cost, and penalize only if the proportion 
exceeds the threshold specified by this parameter.  The default value is 85.
+
+-param=rs6000-density-size-threshold=
+Target Undocumented Joined UInteger Var(rs6000_density_size_threshold) 
Init(70) IntegerRange(0, 99) Param
+Like parameter rs6000-density-pct-threshold, we also check the total sum of 
vec_cost and non vec_cost, and penalize only if 

PING^1 [PATCH] rs6000: Fix some issues in rs6000_can_inline_p [PR102059]

2021-09-15 Thread Kewen.Lin via Gcc-patches
Hi!

Gentle ping this patch:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578552.html

BR,
Kewen

on 2021/9/1 2:55 PM, Kewen.Lin via Gcc-patches wrote:
> Hi!
> 
> This patch is to fix the inconsistent behaviors for non-LTO mode
> and LTO mode.  As Martin pointed out, currently the function
> rs6000_can_inline_p simply makes it inlinable if callee_tree is
> NULL, but it's wrong, we should use the command line options
> from target_option_default_node as default.  It also replaces
> rs6000_isa_flags with the one from target_option_default_node
> when caller_tree is NULL as rs6000_isa_flags could probably
> change since initialization.
> 
> It also extends the scope of the check to the case where the callee
> has explicitly set options; for test case pr102059-2.c, inlining could
> happen unexpectedly before, and it's fixed accordingly.
> 
> As Richi/Mike pointed out, some tuning flags like MASK_P8_FUSION
> can be neglected for inlining; this patch also excludes them when
> the callee is attributed with always_inline.
> 
> Bootstrapped and regtested on powerpc64le-linux-gnu Power9.
> 
> BR,
> Kewen
> -
> gcc/ChangeLog:
> 
>   PR ipa/102059
>   * config/rs6000/rs6000.c (rs6000_can_inline_p): Adjust with
>   target_option_default_node and consider always_inline_safe flags.
> 
> gcc/testsuite/ChangeLog:
> 
>   PR ipa/102059
>   * gcc.target/powerpc/pr102059-1.c: New test.
>   * gcc.target/powerpc/pr102059-2.c: New test.
>   * gcc.target/powerpc/pr102059-3.c: New test.
>   * gcc.target/powerpc/pr102059-4.c: New test.
> 



PING^1 [PATCH] rs6000: Remove useless toc-fusion option

2021-09-15 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this patch:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578553.html


BR,
Kewen

on 2021/9/1 2:56 PM, Kewen.Lin via Gcc-patches wrote:
> Hi!
> 
> Option toc-fusion was previously intended for Power9 toc fusion,
> but Power9 ended up not supporting fusion at all, so this patch
> removes this now-useless option.
> 
> Is it ok for trunk?
> 
> BR,
> Kewen
> -
> gcc/ChangeLog:
> 
>   * config/rs6000/rs6000.opt (-mtoc-fusion): Remove.
> 





Re: [PATCH v2] ipa-inline: Add target info into fn summary [PR102059]

2021-09-14 Thread Kewen.Lin via Gcc-patches
rs & bif_htmcr_bit)
> #define bif_is_mma(x)   ((x).bifattrs & bif_mma_bit)
> #define bif_is_quad(x)  ((x).bifattrs & bif_quad_bit)
> #define bif_is_pair(x)  ((x).bifattrs & bif_pair_bit)
> #define bif_is_mmaint(x)((x).bifattrs & bif_mmaint_bit)
> #define bif_is_no32bit(x)   ((x).bifattrs & bif_no32bit_bit)
> #define bif_is_32bit(x) ((x).bifattrs & bif_32bit_bit)
> #define bif_is_cpu(x)   ((x).bifattrs & bif_cpu_bit)
> #define bif_is_ldstmask(x)  ((x).bifattrs & bif_ldstmask_bit)
> #define bif_is_lxvrse(x)((x).bifattrs & bif_lxvrse_bit)
> #define bif_is_lxvrze(x)((x).bifattrs & bif_lxvrze_bit)
> #define bif_is_endian(x)((x).bifattrs & bif_endian_bit)
> 
> extern bifdata rs6000_builtin_info_x[RS6000_BIF_MAX];
> 
> On 9/8/21 2:43 AM, Kewen.Lin wrote:
>> Hi!
>>
>> Power ISA 2.07 (Power8) introduces transactional memory feature
>> but ISA3.1 (Power10) removes it.  It exposes one troublesome
>> issue as PR102059 shows.  Users define some function with
>> target pragma cpu=power10 then it calls one function with
>> attribute always_inline which inherits command line option
>> cpu=power8 which enables HTM implicitly.  The current isa_flags
>> check doesn't allow this inlining due to "target specific
>> option mismatch" and error mesasge is emitted.
>>
>> Normally, the callee function isn't intended to exploit HTM
>> feature, but the default flag setting make it look it has.
>> As Richi raised in the PR, we have fp_expressions flag in
>> function summary, and allow us to check the function actually
>> contains any floating point expressions to avoid overkill.
>> So this patch follows the similar idea but is more target
>> specific, for this rs6000 port specific requirement on HTM
>> feature check, we would like to check rs6000 specific HTM
>> built-in functions and inline assembly, it allows targets
>> to do their own customized checks and updates.
>>
>> It introduces two target hooks need_ipa_fn_target_info and
>> update_ipa_fn_target_info.  The former allows the target to do
>> some preliminary checks and decide whether to collect target
>> specific information for this function.  For some special cases,
>> it can predict the analysis result and push it early without
>> any scanning.  The latter allows analyze_function_body
>> to pass gimple stmts down just like the fp_expressions handling,
>> so the target can do its own tricks.  I initially put them as one
>> hook with a boolean to indicate whether it's the initial call, but
>> the code looked a bit ugly; separating them seems to give
>> better readability.
>>
>> To make it simple, this patch uses HOST_WIDE_INT to record the
>> flags just like what we use for isa_flags.  For rs6000's HTM
>> need, one HOST_WIDE_INT variable is quite enough, but it seems
>> good to have one auto_vec for scalability as I noticed some
>> targets have more than one HOST_WIDE_INT flag.
>>
>> Against v1 [1], this v2 addressed Richi's and Segher's review
>> comments, mainly consists of:
>>   - Extend it to cover non always_inline.
>>   - Exclude the case for offload streaming.
>>   - Some function naming and formatting issues.
>>   - Adjust rs6000_can_inline_p.
>>   - Add new cases.
>>
>> The patch has been bootstrapped and regress-tested on
>> powerpc64le-linux-gnu Power9.
>>
>> Any comments are highly appreciated!
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578555.html
>>
>> BR,
>> Kewen
>> -
>> gcc/ChangeLog:
>>
>>  PR ipa/102059
>>  * config/rs6000/rs6000-call.c (rs6000_fn_has_any_of_these_mask_bits):
>>  New function.
>>  * config/rs6000/rs6000-internal.h
>>  (rs6000_fn_has_any_of_these_mask_bits): New declaration.
>>  * config/rs6000/rs6000.c (TARGET_NEED_IPA_FN_TARGET_INFO): New macro.
>>  (TARGET_UPDATE_IPA_FN_TARGET_INFO): Likewise.
>>  (rs6000_need_ipa_fn_target_info): New function.
>>  (rs6000_update_ipa_fn_target_info): Likewise.
>>  (rs6000_can_inline_p): Adjust for ipa function summary target info.
>>  * ipa-fnsummary.c (ipa_dump_fn_summary): Adjust for ipa function
>>  summary target info.
>>  (analyze_function_body): Adjust for ipa function summary target
>>  info and call hook rs6000_need_ipa_fn_target_info and
>>  rs6000_update_ipa_fn_target_info.
>>  (ipa_merge_fn_summary_after_inlining): Adjust for ipa function
>>  summary target info.
>>  (inline_read_sect

[committed] rs6000: Remove typedef for struct rs6000_cost_data

2021-09-13 Thread Kewen.Lin via Gcc-patches
Hi,

This patch follows Segher's suggestion in [1] to get rid of
the typedef; it was pre-approved there.

Bootstrapped and regtested on powerpc64le-linux-gnu Power9.

Pushed to trunk as r12-3468.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579115.html

BR,
Kewen
-
gcc/ChangeLog:

* config/rs6000/rs6000.c (struct rs6000_cost_data): Remove typedef.
(rs6000_init_cost): Adjust.

--
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index b7ea1483da5..39d428db8e6 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5262,7 +5262,7 @@ rs6000_preferred_simd_mode (scalar_mode mode)
   return word_mode;
 }

-typedef struct _rs6000_cost_data
+struct rs6000_cost_data
 {
   struct loop *loop_info;
   unsigned cost[3];
@@ -5271,7 +5271,7 @@ typedef struct _rs6000_cost_data
   bool vect_nonmem;
   /* Indicates this is costing for the scalar version of a loop or block.  */
   bool costing_for_scalar;
-} rs6000_cost_data;
+};

 /* Test for likely overcommitment of vector hardware resources.  If a
loop iteration is relatively large, and too large a percentage of
@@ -5337,7 +5337,7 @@ rs6000_density_test (rs6000_cost_data *data)
 static void *
 rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
 {
-  rs6000_cost_data *data = XNEW (struct _rs6000_cost_data);
+  rs6000_cost_data *data = XNEW (rs6000_cost_data);
   data->loop_info = loop_info;
   data->cost[vect_prologue] = 0;
   data->cost[vect_body] = 0;


Re: [PATCH v4] rs6000: Add load density heuristic

2021-09-09 Thread Kewen.Lin via Gcc-patches
on 2021/9/10 11:22 AM, Kewen.Lin via Gcc-patches wrote:
> Hi Segher and Bill,
> 
> Thanks a lot for your reviews and helps!
> 
> on 2021/9/10 1:19 AM, Bill Schmidt wrote:
>> On 9/9/21 11:11 AM, Segher Boessenkool wrote:
>>> Hi!
>>>
>>> On Wed, Sep 08, 2021 at 02:57:14PM +0800, Kewen.Lin wrote:
>>>>>> +  /* If we have strided or elementwise loads into a vector, it's
>>>>>> + possible to be bounded by latency and execution resources for
>>>>>> + many scalar loads.  Try to account for this by scaling the
>>>>>> + construction cost by the number of elements involved, when
>>>>>> + handling each matching statement we record the possible extra
>>>>>> + penalized cost into target cost, in the end of costing for
>>>>>> + the whole loop, we do the actual penalization once some load
>>>>>> + density heuristics are satisfied.  */
>>>>> The above comment is quite hard to read.  Can you please break up the last
>>>>> sentence into at least two sentences?
>>>> How about the below:
>>>>
>>>> +  /* If we have strided or elementwise loads into a vector, it's
>>> "strided" is not a word: it properly is "stridden", which does not read
>>> very well either.  "Have loads by stride, or by element, ..."?  Is that
>>> good English, and easier to understand?
>>
>> No, this is OK.  "Strided loads" is a term of art used by the vectorizer; 
>> whether or not it was the Queen's English, it's what we have...  (And I 
>> think you might only find "bestridden" in some 18th or 19th century English 
>> poetry... :-)
>>>
>>>> +    possible to be bounded by latency and execution resources for
>>>> +    many scalar loads.  Try to account for this by scaling the
>>>> +    construction cost by the number of elements involved.  For
>>>> +    each matching statement, we record the possible extra
>>>> +    penalized cost into the relevant field in target cost.  When
>>>> +    we want to finalize the whole loop costing, we will check if
>>>> +    those related load density heuristics are satisfied, and add
>>>> +    this accumulated penalized cost if yes.  */
>>>>
>>>>> Otherwise this looks good to me, and I recommend maintainers approve with
>>>>> that clarified.
>>> Does that text look good to you now Bill?  It is still kinda complex,
>>> maybe you see a way to make it simpler.
>>
>> I think it's OK now.  The complexity at least matches the code now instead 
>> of exceeding it. :-P  j/k...
>>
> 
> Just noticed Bill helped to revise it, I will use that nice paragraph.
> (Thanks, Bill!)
> 
>>>
>>>> * config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>>>> nstmts, nloads and extra_ctor_cost.
>>>> (rs6000_density_test): Add load density related heuristics and the
>>>> checks, do extra costing on vector construction statements if need.
>>> "and the checks"?  Oh, "and checks"?  It is probably fine to just leave
>>> out this whole phrase part :-)
>>>
>>> Don't use commas like this in changelogs.  s/, do/.  Do/  Yes this is a
>>> bit boring text that way, but that is the purpose: it makes it simpler
>>> to read (and read quickly, even merely scan).
>>>
> 
> Thanks for the explanation, will fix.
> 
>>>> @@ -5262,6 +5262,12 @@ typedef struct _rs6000_cost_data
>>> [ Btw, you can get rid of the typedef now, just have a struct with the
>>> non-underscore name, we have C++ now.  Such a mechanical change (as
>>> separate patch!) is pre-approved. ]
>>>
>>>> +  /* Check if we need to penalize the body cost for latency and
>>>> + execution resources bound from strided or elementwise loads
>>>> + into a vector.  */
>>> Bill, is that clear enough?  I'm sure something nicer would help here,
>>> but it's hard for me to write anything :-)
>>
>> Perhaps:  "Check whether we need to penalize the body cost to account for 
>> excess strided or elementwise loads."
> 
> Thanks, will update.
> 
>>>
>>>> +  if (data->extra_ctor_cost > 0)
>>>> +    {
>>>> +  /* Threshold for load stmts percentage in all vectorized stmts.  */
>>>> +  const int D

Re: [PATCH v4] rs6000: Add load density heuristic

2021-09-09 Thread Kewen.Lin via Gcc-patches
Hi Segher and Bill,

Thanks a lot for your reviews and helps!

on 2021/9/10 1:19 AM, Bill Schmidt wrote:
> On 9/9/21 11:11 AM, Segher Boessenkool wrote:
>> Hi!
>>
>> On Wed, Sep 08, 2021 at 02:57:14PM +0800, Kewen.Lin wrote:
>>>>> +  /* If we have strided or elementwise loads into a vector, it's
>>>>> + possible to be bounded by latency and execution resources for
>>>>> + many scalar loads.  Try to account for this by scaling the
>>>>> + construction cost by the number of elements involved, when
>>>>> + handling each matching statement we record the possible extra
>>>>> + penalized cost into target cost, in the end of costing for
>>>>> + the whole loop, we do the actual penalization once some load
>>>>> + density heuristics are satisfied.  */
>>>> The above comment is quite hard to read.  Can you please break up the last
>>>> sentence into at least two sentences?
>>> How about the below:
>>>
>>> +  /* If we have strided or elementwise loads into a vector, it's
>> "strided" is not a word: it properly is "stridden", which does not read
>> very well either.  "Have loads by stride, or by element, ..."?  Is that
>> good English, and easier to understand?
> 
> No, this is OK.  "Strided loads" is a term of art used by the vectorizer; 
> whether or not it was the Queen's English, it's what we have...  (And I think 
> you might only find "bestridden" in some 18th or 19th century English 
> poetry... :-)
>>
>>> +    possible to be bounded by latency and execution resources for
>>> +    many scalar loads.  Try to account for this by scaling the
>>> +    construction cost by the number of elements involved.  For
>>> +    each matching statement, we record the possible extra
>>> +    penalized cost into the relevant field in target cost.  When
>>> +    we want to finalize the whole loop costing, we will check if
>>> +    those related load density heuristics are satisfied, and add
>>> +    this accumulated penalized cost if yes.  */
>>>
>>>> Otherwise this looks good to me, and I recommend maintainers approve with
>>>> that clarified.
>> Does that text look good to you now Bill?  It is still kinda complex,
>> maybe you see a way to make it simpler.
> 
> I think it's OK now.  The complexity at least matches the code now instead of 
> exceeding it. :-P  j/k...
> 

Just noticed Bill helped to revise it, I will use that nice paragraph.
(Thanks, Bill!)

>>
>>> * config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>>> nstmts, nloads and extra_ctor_cost.
>>> (rs6000_density_test): Add load density related heuristics and the
>>> checks, do extra costing on vector construction statements if need.
>> "and the checks"?  Oh, "and checks"?  It is probably fine to just leave
>> out this whole phrase part :-)
>>
>> Don't use commas like this in changelogs.  s/, do/.  Do/  Yes this is a
>> bit boring text that way, but that is the purpose: it makes it simpler
>> to read (and read quickly, even merely scan).
>>

Thanks for the explanation, will fix.

>>> @@ -5262,6 +5262,12 @@ typedef struct _rs6000_cost_data
>> [ Btw, you can get rid of the typedef now, just have a struct with the
>> non-underscore name, we have C++ now.  Such a mechanical change (as
>> separate patch!) is pre-approved. ]
>>
>>> +  /* Check if we need to penalize the body cost for latency and
>>> + execution resources bound from strided or elementwise loads
>>> + into a vector.  */
>> Bill, is that clear enough?  I'm sure something nicer would help here,
>> but it's hard for me to write anything :-)
> 
> Perhaps:  "Check whether we need to penalize the body cost to account for 
> excess strided or elementwise loads."

Thanks, will update.

>>
>>> +  if (data->extra_ctor_cost > 0)
>>> +    {
>>> +  /* Threshold for load stmts percentage in all vectorized stmts.  */
>>> +  const int DENSITY_LOAD_PCT_THRESHOLD = 45;
>> Threshold for what?
>>
>> 45% is awfully exact.  Can you make this a param?
>>
>>> +  /* Threshold for total number of load stmts.  */
>>> +  const int DENSITY_LOAD_NUM_THRESHOLD = 20;
>> Same.
> 
> 
> We have similar magic constants in here already.  Parameterizing is possible, 
> but I'm more interested in making sure the numbers are 

Re: [PATCH v4] rs6000: Add load density heuristic

2021-09-08 Thread Kewen.Lin via Gcc-patches
on 2021/9/8 2:57 PM, Kewen.Lin via Gcc-patches wrote:
> Hi Bill,
> 
> Thanks for the review comments!
> 
> on 2021/9/3 11:57 PM, Bill Schmidt wrote:
>> Hi Kewen,
>>
>> Sorry that we lost track of this patch!  The heuristic approach looks good.  
>> It is limited in scope and won't kick in often, and the case you're trying 
>> to account for is important.
>>
>> At the time you submitted this, I think reliable P10 testing wasn't 
>> possible.  Now that it is, could you please do a quick sniff test to make 
>> sure there aren't any adjustments that need to be made for P10?  I doubt it, 
>> but worth checking.
>>
> 
> Good point, thanks for the reminder!  I did one full SPEC2017 run on
> Power10 with Ofast unroll, for which this patch is neutral, and one
> SPEC2017 run at O2 vectorization (cheap cost) to verify whether the
> bwaves_r degradation existed and whether it could be fixed by this
> patch.  The result shows the degradation did exist and got fixed by
> this patch; besides, it got an extra 3.93% speedup against O2, and
> another benchmark, 554.roms_r, got a 3.24% speedup.
> 

Hmm, sorry, this improvement on 554.roms_r doesn't look reliable; I just
ran it with 10 iterations both with and without the patch, both suffer
from jitter, and their best scores are close.  But note that the bwaves_r
scores are quite stable, so that result is reliable.

BR,
Kewen

> In short, the Power10 evaluation result shows this patch is positive.
> 
>> Otherwise I have one comment below...
>>
>> On 7/28/21 12:22 AM, Kewen.Lin wrote:
>>> Hi,
>>>
>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
>>>
>>> This v3 addressed William's review comments in
>>> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576154.html
>>>
>>> It's mainly to deal with the bwaves_r degradation due to vector
>>> construction fed by strided loads.
>>>
>>> As Richi's comments [1], this follows the similar idea to over
>>> price the vector construction fed by VMAT_ELEMENTWISE or
>>> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
>>> construction costing immediately, it firstly records how many
>>> loads and vectorized statements in the given loop, later in
>>> rs6000_density_test (called by finish_cost) it computes the
>>> load density ratio against all vectorized stmts, and check
>>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
>>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
>>> if both thresholds are exceeded.
>>>
>>> Note that this new load density heuristic check is based on
>>> some fields in target cost which are updated as needed when
>>> scanning each add_stmt_cost entry, it's independent of the
>>> current function rs6000_density_test which requires to scan
>>> non_vect stmts.  Since it's checking the load stmts count
>>> vs. all vectorized stmts, it's kind of density, so I put
>>> it in function rs6000_density_test.  With the same reason to
>>> keep it independent, I didn't put it as an else arm of the
>>> current existing density threshold check hunk or before this
>>> hunk.
>>>
>>> In the investigation of -1.04% degradation from 526.blender_r
>>> on Power8, I noticed that the extra penalized cost 320 on one
>>> single vector construction with type V16QI is much exaggerated,
>>> which makes the final body cost unreliable, so this patch adds
>>> one maximum bound for the extra penalized cost for each vector
>>> construction statement.
>>>
>>> Bootstrapped & regtested *again* on powerpc64le-linux-gnu P9.
>>>
>>> Full SPEC2017 performance evaluation on Power8/Power9 with
>>> option combinations (with v2, as v3 is NFC against v2):
>>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>>   * {-O3, -Ofast} {,-funroll-loops}
>>>
>>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>>> remarkable was observed.
>>>
> 
> ...
> 
>>> +  /* Gather some information when we are costing the vectorized instruction
>>> + for the statements located in a loop body.  */
>>> +  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>>> +{
>>> +  data->nstmts += orig_count;
>>> +
>>> +  if (kind == scalar_load || kind == vector_load
>>> + || kind == unaligned_load || kind == vector_gather_load)
>>> +   data->nloads += orig_count;
>>> +
>>> +  /* If we have strided or element

[PATCH v2] ipa-inline: Add target info into fn summary [PR102059]

2021-09-08 Thread Kewen.Lin via Gcc-patches
Hi!

Power ISA 2.07 (Power8) introduces transactional memory feature
but ISA3.1 (Power10) removes it.  It exposes one troublesome
issue as PR102059 shows.  Users define some function with
target pragma cpu=power10 then it calls one function with
attribute always_inline which inherits command line option
cpu=power8 which enables HTM implicitly.  The current isa_flags
check doesn't allow this inlining due to "target specific
option mismatch" and an error message is emitted.

Normally, the callee function isn't intended to exploit the HTM
feature, but the default flag setting makes it look like it is.
As Richi raised in the PR, we have fp_expressions flag in
function summary, and allow us to check the function actually
contains any floating point expressions to avoid overkill.
So this patch follows the similar idea but is more target
specific, for this rs6000 port specific requirement on HTM
feature check, we would like to check rs6000 specific HTM
built-in functions and inline assembly, it allows targets
to do their own customized checks and updates.

It introduces two target hooks need_ipa_fn_target_info and
update_ipa_fn_target_info.  The former allows target to do
some previous check and decides to collect target specific
information for this function or not.  For some special case,
it can predict the analysis result and push it early without
any scannings.  The latter allows the analyze_function_body
to pass gimple stmts down just like fp_expressions handlings,
target can do its own tricks.  I initially put them as one hook
with one boolean to indicate whether it's the initial call, but
the code looked a bit ugly; separating them seems to give
better readability.

To make it simple, this patch uses HOST_WIDE_INT to record the
flags just like what we use for isa_flags.  For rs6000's HTM
need, one HOST_WIDE_INT variable is quite enough, but it seems
good to have one auto_vec for scalability as I noticed some
targets have more than one HOST_WIDE_INT flag.
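For readers skimming the thread, here is a rough standalone model of how the
two hooks could cooperate; all names, the statement-kind enum, and the flag
value are illustrative assumptions, not the patch's actual GIMPLE-based code:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bit; the real OPTION_MASK_HTM value differs.  */
#define OPTION_MASK_HTM (1ULL << 0)

/* Stand-ins for the statement kinds the rs6000 hooks care about.  */
enum stmt_kind { STMT_PLAIN, STMT_ASM, STMT_HTM_BUILTIN };

/* Model of update_ipa_fn_target_info: record HTM usage into *INFO and
   return false once all interesting bits are set, so the caller can
   stop scanning early.  */
static bool
update_target_info (unsigned long long *info, enum stmt_kind kind)
{
  if (kind == STMT_ASM || kind == STMT_HTM_BUILTIN)
    *info |= OPTION_MASK_HTM;
  return (*info & OPTION_MASK_HTM) == 0;
}

/* Model of the scanning loop in analyze_function_body.  */
static unsigned long long
scan_function (const enum stmt_kind *stmts, int n)
{
  unsigned long long info = 0;
  for (int i = 0; i < n; i++)
    if (!update_target_info (&info, stmts[i]))
      break;
  return info;
}
```

A function containing only plain statements ends up with no HTM bit recorded,
so a later inlining decision could treat its HTM flag mismatch as harmless.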

Against v1 [1], this v2 addressed Richi's and Segher's review
comments, mainly consists of:
  - Extend it to cover non always_inline.
  - Exclude the case for offload streaming.
  - Some function naming and formatting issues.
  - Adjust rs6000_can_inline_p.
  - Add new cases.

The patch has been bootstrapped and regress-tested on
powerpc64le-linux-gnu Power9.

Any comments are highly appreciated!

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578555.html

BR,
Kewen
-
gcc/ChangeLog:

PR ipa/102059
* config/rs6000/rs6000-call.c (rs6000_fn_has_any_of_these_mask_bits):
New function.
* config/rs6000/rs6000-internal.h
(rs6000_fn_has_any_of_these_mask_bits): New declare.
* config/rs6000/rs6000.c (TARGET_NEED_IPA_FN_TARGET_INFO): New macro.
(TARGET_UPDATE_IPA_FN_TARGET_INFO): Likewise.
(rs6000_need_ipa_fn_target_info): New function.
(rs6000_update_ipa_fn_target_info): Likewise.
(rs6000_can_inline_p): Adjust for ipa function summary target info.
* ipa-fnsummary.c (ipa_dump_fn_summary): Adjust for ipa function
summary target info.
(analyze_function_body): Adjust for ipa function summary target
info and call hook rs6000_need_ipa_fn_target_info and
rs6000_update_ipa_fn_target_info.
(ipa_merge_fn_summary_after_inlining): Adjust for ipa function
summary target info.
(inline_read_section): Likewise.
(ipa_fn_summary_write): Likewise.
* ipa-fnsummary.h (ipa_fn_summary::target_info): New member.
* doc/tm.texi: Regenerate.
* doc/tm.texi.in (TARGET_UPDATE_IPA_FN_TARGET_INFO): Document new
hook.
(TARGET_NEED_IPA_FN_TARGET_INFO): Likewise.
* target.def (update_ipa_fn_target_info): New hook.
(need_ipa_fn_target_info): Likewise.
* targhooks.c (default_need_ipa_fn_target_info): New function.
(default_update_ipa_fn_target_info): Likewise.
* targhooks.h (default_update_ipa_fn_target_info): New declare.
(default_need_ipa_fn_target_info): Likewise.

gcc/testsuite/ChangeLog:

PR ipa/102059
* gcc.dg/lto/pr102059-1_0.c: New test.
* gcc.dg/lto/pr102059-1_1.c: New test.
* gcc.dg/lto/pr102059-1_2.c: New test.
* gcc.dg/lto/pr102059-2_0.c: New test.
* gcc.dg/lto/pr102059-2_1.c: New test.
* gcc.dg/lto/pr102059-2_2.c: New test.
* gcc.target/powerpc/pr102059-5.c: New test.
* gcc.target/powerpc/pr102059-6.c: New test.
* gcc.target/powerpc/pr102059-7.c: New test.
diff --git a/gcc/config/rs6000/rs6000-call.c b/gcc/config/rs6000/rs6000-call.c
index fd7f24da818..6af0d15ed87 100644
--- a/gcc/config/rs6000/rs6000-call.c
+++ b/gcc/config/rs6000/rs6000-call.c
@@ -13795,6 +13795,18 @@ rs6000_builtin_decl (unsigned code, bool initialize_p 
ATTRIBUTE_UNUSED)
   return rs6000_builtin_decls[code];
 }
 
+/* Return true if the builtin with CODE has any mask bits set
+   which are specified 

PING^3 [PATCH v2] combine: Tweak the condition of last_set invalidation

2021-09-08 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572555.html

BR,
Kewen

on 2021/7/15 上午10:00, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> Gentle ping this:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572555.html
> 
> BR,
> Kewen
> 
> on 2021/6/28 下午3:00, Kewen.Lin via Gcc-patches wrote:
>> Hi!
>>
>> I'd like to gentle ping this:
>>
>> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572555.html
>>
>>
>> BR,
>> Kewen
>>
>> on 2021/6/11 下午9:16, Kewen.Lin via Gcc-patches wrote:
>>> Hi Segher,
>>>
>>> Thanks for the review!
>>>
>>> on 2021/6/10 上午4:17, Segher Boessenkool wrote:
>>>> Hi!
>>>>
>>>> On Wed, Dec 16, 2020 at 04:49:49PM +0800, Kewen.Lin wrote:
>>>>> Currently we have the check:
>>>>>
>>>>>   if (!insn
>>>>> || (value && rsp->last_set_table_tick >= label_tick_ebb_start))
>>>>>   rsp->last_set_invalid = 1; 
>>>>>
>>>>> which means if we want to record some value for some reg and
>>>>> this reg got referred to before in a valid scope,
>>>>
>>>> If we already know it is *set* in this same extended basic block.
>>>> Possibly by the same instruction btw.
>>>>
>>>>> we invalidate the
>>>>> set of reg (last_set_invalid to 1).  It avoids to find the wrong
>>>>> set for one reg reference, such as the case like:
>>>>>
>>>>>... op regX  // this regX could find wrong last_set below
>>>>>regX = ...   // if we think this set is valid
>>>>>... op regX
>>>>
>>>> Yup, exactly.
>>>>
>>>>> But because of retry's existence, the last_set_table_tick could
>>>>> be set by some later reference insns, but we see it's set due
>>>>> to retry on the set (for that reg) insn again, such as:
>>>>>
>>>>>insn 1
>>>>>insn 2
>>>>>
>>>>>regX = ... --> (a)
>>>>>... op regX--> (b)
>>>>>
>>>>>insn 3
>>>>>
>>>>>// assume all in the same BB.
>>>>>
>>>>> Assuming we combine 1, 2 -> 3 successfully and replace them as two
>>>>> (3 insns -> 2 insns),
>>>>
>>>> This will delete insn 1 and write the combined result to insns 2 and 3.
>>>>
>>>>> retrying from insn1 or insn2 again:
>>>>
>>>> Always 2, but your point remains valid.
>>>>
>>>>> it will scan insn (a) again, the below condition holds for regX:
>>>>>
>>>>>   (value && rsp->last_set_table_tick >= label_tick_ebb_start)
>>>>>
>>>>> it will mark this set as invalid set.  But actually the
>>>>> last_set_table_tick here is set by insn (b) before retrying, so it
>>>>> should be safe to be taken as valid set.
>>>>
>>>> Yup.
>>>>
>>>>> This proposal is to check whether the last_set_table safely happens
>>>>> after the current set, make the set still valid if so.
>>>>
>>>>> Full SPEC2017 building shows this patch gets more successful combines
>>>>> from 1902208 to 1902243 (trivial though).
>>>>
>>>> Do you have some example, or maybe even a testcase?  :-)
>>>>
>>>
>>> Sorry for the late reply, it took some time to get one reduced case.
>>>
>>> typedef struct SA *pa_t;
>>>
>>> struct SC {
>>>   int h;
>>>   pa_t elem[];
>>> };
>>>
>>> struct SD {
>>>   struct SC *e;
>>> };
>>>
>>> struct SA {
>>>   struct {
>>> struct SD f[1];
>>>   } g;
>>> };
>>>
>>> void foo(pa_t *k, char **m) {
>>>   int l, i;
>>>   pa_t a;
>>>   l = (int)a->g.f[5].e;
>>>   i = 0;
>>>   for (; i < l; i++) {
>>> k[i] = a->g.f[5].e->elem[i];
>>> m[i] = "";
>>>   }
>>> }
>>>
>>> Baseline is r12-0 and the option is "-O3 -mcpu=power9 -fno-strict-aliasing",
>>> with this patch, the generated assembly can save two rlwinm s.
>>>
>>>>> +  /* Record the luid of the insn whose expression involving 

Re: [PATCH v2] rs6000: Add load density heuristic

2021-09-08 Thread Kewen.Lin via Gcc-patches
Hi Segher,

Thanks for the comments!

on 2021/9/7 上午7:43, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Jul 28, 2021 at 10:59:50AM +0800, Kewen.Lin wrote:
>>>> +/* As a visitor function for each statement cost entry handled in
>>>> +   function add_stmt_cost, gather some information and update its
>>>> +   relevant fields in target cost accordingly.  */
>>>
>>> I got lost trying to parse that..  (could be just me :-) 
>>>
>>> Possibly instead something like
>>> /* Helper function for add_stmt_cost ; gather information and update
>>> the target_cost fields accordingly.  */
>>
>> OK, will update.  I was thinking for each entry handled in function
>> add_stmt_cost, this helper acts like a visitor, trying to visit each
>> entry and take some actions if some conditions are satisifed.
> 
> It (thankfully!) has nothing to do with the "visitor pattern", so some
> other name might be better :-)
> 
>>> Maybe clearer to read if you rearrange slightly and flatten it ?  I
>>> defer to others on that..
>>>
>>> if ((kind == vec_to_scalar
>>>  || kind == vec_perm
>>>  || kind == vec_promote_demote
>>>  || kind == vec_construct
>>>  || kind == scalar_to_vec)
>>> || (kind == vector_stmt && where == vect_body)
>>
>> This hunk is factored out from function rs6000_add_stmt_cost, maybe I
>> can keep the original formatting?  The formatting tool isn't so smart,
>> and sometimes rearrange things to become unexpected (although it meets
>> the basic rule, not so elegant), sigh.
> 
> It has too many parens, making grouping where there is none, that is the
> core issue.
> 
>   if (kind == vec_to_scalar
>   || kind == vec_perm
>   || kind == vec_promote_demote
>   || kind == vec_construct
>   || kind == scalar_to_vec
>   || (kind == vector_stmt && where == vect_body))
> 
> 

Good catch, I've updated it in V4.

BR,
Kewen


[PATCH v4] rs6000: Add load density heuristic

2021-09-08 Thread Kewen.Lin via Gcc-patches
Hi Bill,

Thanks for the review comments!

on 2021/9/3 下午11:57, Bill Schmidt wrote:
> Hi Kewen,
> 
> Sorry that we lost track of this patch!  The heuristic approach looks good.  
> It is limited in scope and won't kick in often, and the case you're trying to 
> account for is important.
> 
> At the time you submitted this, I think reliable P10 testing wasn't possible. 
>  Now that it is, could you please do a quick sniff test to make sure there 
> aren't any adjustments that need to be made for P10?  I doubt it, but worth 
> checking.
> 

Good point, thanks for the reminder!  I did one full SPEC2017 run on Power10 
with Ofast and unrolling, where this patch is neutral, and one SPEC2017 run 
at O2 with vectorization (cheap cost model) to verify whether the bwaves_r 
degradation still existed and whether this patch fixes it.  The result shows 
the degradation did exist and got fixed by this patch; besides, we got an 
extra 3.93% speedup against O2, and another benchmark, 554.roms_r, got a 
3.24% speedup.

In short, the Power10 evaluation result shows this patch is positive.

> Otherwise I have one comment below...
> 
> On 7/28/21 12:22 AM, Kewen.Lin wrote:
>> Hi,
>>
>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
>>
>> This v3 addressed William's review comments in
>> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576154.html
>>
>> It's mainly to deal with the bwaves_r degradation due to vector
>> construction fed by strided loads.
>>
>> As Richi's comments [1], this follows the similar idea to over
>> price the vector construction fed by VMAT_ELEMENTWISE or
>> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
>> construction costing immediately, it firstly records how many
>> loads and vectorized statements in the given loop, later in
>> rs6000_density_test (called by finish_cost) it computes the
>> load density ratio against all vectorized stmts, and check
>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
>> if both thresholds are exceeded.
>>
>> Note that this new load density heuristic check is based on
>> some fields in target cost which are updated as needed when
>> scanning each add_stmt_cost entry, it's independent of the
>> current function rs6000_density_test which requires to scan
>> non_vect stmts.  Since it's checking the load stmts count
>> vs. all vectorized stmts, it's kind of density, so I put
>> it in function rs6000_density_test.  With the same reason to
>> keep it independent, I didn't put it as an else arm of the
>> current existing density threshold check hunk or before this
>> hunk.
>>
>> In the investigation of -1.04% degradation from 526.blender_r
>> on Power8, I noticed that the extra penalized cost 320 on one
>> single vector construction with type V16QI is much exaggerated,
>> which makes the final body cost unreliable, so this patch adds
>> one maximum bound for the extra penalized cost for each vector
>> construction statement.
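As a reader's sketch of the heuristic described above (threshold values and
function names here are illustrative placeholders, not necessarily the
patch's actual constants):

```c
#include <assert.h>

/* Illustrative thresholds; the patch's actual values may differ.  */
#define DENSITY_LOAD_NUM_THRESHOLD 20
#define DENSITY_LOAD_PCT_THRESHOLD 45

/* At finish_cost time: if the loop is load-dense enough, apply the
   extra penalty that was recorded while costing each vector
   construction statement.  */
static unsigned
apply_load_density_penalty (unsigned nloads, unsigned nstmts,
                            unsigned body_cost, unsigned extra_cost)
{
  unsigned load_pct = nstmts ? nloads * 100 / nstmts : 0;
  if (nloads > DENSITY_LOAD_NUM_THRESHOLD
      && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
    body_cost += extra_cost;
  return body_cost;
}
```

Both thresholds must be exceeded before the penalty kicks in, which keeps the
heuristic from firing on loops with only a handful of loads.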
>>
>> Bootstrapped & regtested *again* on powerpc64le-linux-gnu P9.
>>
>> Full SPEC2017 performance evaluation on Power8/Power9 with
>> option combinations (with v2, as v3 is NFC against v2):
>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>   * {-O3, -Ofast} {,-funroll-loops}
>>
>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>> remarkable was observed.
>>

...

>>+  /* Gather some information when we are costing the vectorized instruction
>>+ for the statements located in a loop body.  */
>>+  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>>+{
>>+  data->nstmts += orig_count;
>>+
>>+  if (kind == scalar_load || kind == vector_load
>>+   || kind == unaligned_load || kind == vector_gather_load)
>>+ data->nloads += orig_count;
>>+
>>+  /* If we have strided or elementwise loads into a vector, it's
>>+  possible to be bounded by latency and execution resources for
>>+  many scalar loads.  Try to account for this by scaling the
>>+  construction cost by the number of elements involved, when
>>+  handling each matching statement we record the possible extra
>>+  penalized cost into target cost, in the end of costing for
>>+  the whole loop, we do the actual penalization once some load
>>+  density heuristics are satisfied.  */
> 
> The above comment is quite hard to read.  Can you please break up the last
> sentence into at least two sentences?
> 

How about the below:

+  /* If we have stride

Re: [RFC/PATCH] ipa-inline: Add target info into fn summary [PR102059]

2021-09-02 Thread Kewen.Lin via Gcc-patches
Hi Segher,

Thanks for the comments!

on 2021/9/3 上午1:44, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Sep 01, 2021 at 03:02:22PM +0800, Kewen.Lin wrote:
>> It introduces two target hooks need_ipa_fn_target_info and
>> update_ipa_fn_target_info.  The former allows target to do
>> some previous check and decides to collect target specific
>> information for this function or not.  For some special case,
>> it can predict the analysis result and push it early without
>> any scannings.  The latter allows the analyze_function_body
>> to pass gimple stmts down just like fp_expressions handlings,
>> target can do its own tricks.
>>
>> To make it simple, this patch uses HOST_WIDE_INT to record the
>> flags just like what we use for isa_flags.  For rs6000's HTM
>> need, one HOST_WIDE_INT variable is quite enough, but it seems
>> good to have one auto_vec for scalability as I noticed some
>> targets have more than one HOST_WIDE_INT flag.  For now, this
>> target information collection is only for always_inline function,
>> function ipa_merge_fn_summary_after_inlining deals with target
>> information merging.
> 
> These flags can in principle be separate from any flags the target
> keeps, so 64 bits will be enough for a long time.  If we want to
> architect that better, we should really architect the way all targets
> do target flags first.  Let's not go there now :-)
> 
> So just one HOST_WIDE_INT, not a stack of them please?

I considered this; it's fine to use this customized bit in the target hook,
but back in target hook can_inline_p we would have to decode them back to
the bits in isa_flags separately, which is less efficient than just using
the whole mask when there are more interesting bits.

As in the discussion with Richi, theoretically speaking a target can, at its
own discretion, try to scan for many ISA features, so there could be many
more bits.  Another thing inspiring me to make it a vector is that the i386
port's ix86_can_inline_p currently checks x_ix86_target_flags,
x_ix86_isa_flags, x_ix86_isa_flags2, arch, tune, etc.; one HOST_WIDE_INT
doesn't seem good enough for it if it wants to check more.  ;-)

> 
>> --- a/gcc/config/rs6000/rs6000-call.c
>> +++ b/gcc/config/rs6000/rs6000-call.c
>> @@ -13642,6 +13642,17 @@ rs6000_builtin_decl (unsigned code, bool 
>> initialize_p ATTRIBUTE_UNUSED)
>>return rs6000_builtin_decls[code];
>>  }
>>  
>> +/* Return true if the builtin with CODE has any mask bits set
>> +   which are specified by MASK.  */
>> +
>> +bool
>> +rs6000_builtin_mask_set_p (unsigned code, HOST_WIDE_INT mask)
>> +{
>> +  gcc_assert (code < RS6000_BUILTIN_COUNT);
>> +  HOST_WIDE_INT fnmask = rs6000_builtin_info[code].mask;
>> +  return fnmask & mask;
>> +}
> 
> The "_p" does not say that "any bits" part, which is crucial here.  So
> name this something like "rs6000_fn_has_any_of_these_mask_bits"?  Yes
> the name sucks, because this interface does :-P
> 

Thanks for the name, will fix it.  :)

> Its it useful to have "any" semantics at all?  Otherwise, require this
> to be passed just a single bit?
> 

Since we can't restrict callers to passing in a single bit, we would have to
assert it with something like:

   gcc_assert (__builtin_popcount (mask) == 1);

to claim it's checking a single bit.  But the implementation logic still
supports checking any bits, so I thought we could just claim it checks any
bits, with a single bit as one special case.

Yeah, I'm not sure whether there is a need to check any bits, but something
like checking whether the FRSQRTE and FRSQRTES bifs exist can pass
(RS6000_BTM_FRSQRTE | RS6000_BTM_FRSQRTES), so is it fine to keep the
any-bits semantics?
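A standalone model of the "any of these bits" semantics being debated here
(the mask values are made up for illustration and are not the real rs6000
RS6000_BTM_* constants):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative builtin mask bits, not the real rs6000 values.  */
#define RS6000_BTM_FRSQRTE  (1ULL << 0)
#define RS6000_BTM_FRSQRTES (1ULL << 1)

/* Any-bit semantics: true if FNMASK and MASK share at least one bit.
   A single-bit MASK is just a special case of this check.  */
static bool
has_any_of_these_mask_bits (unsigned long long fnmask,
                            unsigned long long mask)
{
  return (fnmask & mask) != 0;
}
```

With an ORed mask, a builtin carrying either bit passes, which is exactly the
FRSQRTE/FRSQRTES example in the paragraph above.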

> The implicit "!!" (or "!= 0", same thing) that casting to bool does
> might be better explicit, too?  A cast to bool changes value so is more
> surprising than other casts.

OK, will fix it.

> 
>> +  /* Assume inline asm can use any instruction features.  */
>> +  if (gimple_code (stmt) == GIMPLE_ASM)
>> +{
>> +  info[0] = -1;
>> +  return false;
>> +}
> 
> What is -1 here?  "All options set"?  Does that work?  Reliably?
> 

Good question.  In the current implementation it's reliable, since we apply
the "~" operation first and then "&" with the interesting bits
(OPTION_MASK_HTM here), but your concern about conflicting bits co-existing
is reasonable.  I intended to cover any future interesting bits, but I agree
it's better to just set the corresponding interesting bits to make it clear.

Will fix it.

>> +  if (fndecl && fndecl_built_in_p (fndecl, BUILT_IN_MD))
>> +{
>> +  enum rs6000_builtins fcode =
>> +(enu

Re: [RFC/PATCH] ipa-inline: Add target info into fn summary [PR102059]

2021-09-02 Thread Kewen.Lin via Gcc-patches
on 2021/9/2 下午7:51, Richard Biener wrote:
> On Thu, Sep 2, 2021 at 1:13 PM Kewen.Lin  wrote:
>>
>> Hi Richi,
>>
>> Thanks for the comments!
>>
>> on 2021/9/2 下午5:25, Richard Biener wrote:
>>> On Wed, Sep 1, 2021 at 9:02 AM Kewen.Lin  wrote:
>>>>
>>>> Hi!
>>>>
>>>> Power ISA 2.07 (Power8) introduces transactional memory feature
>>>> but ISA3.1 (Power10) removes it.  It exposes one troublesome
>>>> issue as PR102059 shows.  Users define some function with
>>>> target pragma cpu=power10 then it calls one function with
>>>> attribute always_inline which inherits command line option
>>>> cpu=power8 which enables HTM implicitly.  The current isa_flags
>>>> check doesn't allow this inlining due to "target specific
>>>> option mismatch" and an error message is emitted.
>>>>
>>>> Normally, the callee function isn't intended to exploit the HTM
>>>> feature, but the default flag setting makes it look like it is.
>>>> As Richi raised in the PR, we have fp_expressions flag in
>>>> function summary, and allow us to check the function actually
>>>> contains any floating point expressions to avoid overkill.
>>>> So this patch follows the similar idea but is more target
>>>> specific, for this rs6000 port specific requirement on HTM
>>>> feature check, we would like to check rs6000 specific HTM
>>>> built-in functions and inline assembly, it allows targets
>>>> to do their own customized checks and updates.
>>>>
>>>> It introduces two target hooks need_ipa_fn_target_info and
>>>> update_ipa_fn_target_info.  The former allows target to do
>>>> some previous check and decides to collect target specific
>>>> information for this function or not.  For some special case,
>>>> it can predict the analysis result and push it early without
>>>> any scannings.  The latter allows the analyze_function_body
>>>> to pass gimple stmts down just like fp_expressions handlings,
>>>> target can do its own tricks.  I initially put them as one hook
>>>> with one boolean to indicate whether it's the initial call, but
>>>> the code looked a bit ugly; separating them seems to give
>>>> better readability.
>>>>
>>>> To make it simple, this patch uses HOST_WIDE_INT to record the
>>>> flags just like what we use for isa_flags.  For rs6000's HTM
>>>> need, one HOST_WIDE_INT variable is quite enough, but it seems
>>>> good to have one auto_vec for scalability as I noticed some
>>>> targets have more than one HOST_WIDE_INT flag.  For now, this
>>>> target information collection is only for always_inline function,
>>>> function ipa_merge_fn_summary_after_inlining deals with target
>>>> information merging.
>>>>
>>>> The patch has been bootstrapped and regress-tested on
>>>> powerpc64le-linux-gnu Power9.
>>>>
>>>> Is it on the right track?
>>>
>>> +  if (always_inline)
>>> +{
>>> +  cgraph_node *callee_node = cgraph_node::get (callee);
>>> +  if (ipa_fn_summaries && ipa_fn_summaries->get (callee_node) != NULL)
>>> +   {
>>> + if (dump_file)
>>> +   ipa_dump_fn_summary (dump_file, callee_node);
>>> + const vec<HOST_WIDE_INT> &info =
>>> +   ipa_fn_summaries->get (callee_node)->target_info;
>>> + if (!info.is_empty ())
>>> +   always_inline_safe_mask |= ~info[0] & OPTION_MASK_HTM;
>>> +   }
>>> +
>>> +  caller_isa &= ~always_inline_safe_mask;
>>> +  callee_isa &= ~always_inline_safe_mask;
>>> +}
>>>
>>> that's a bit convoluted but obviously the IPA info can be used for
>>> non-always_inline cases as well.
>>>
>>> As said above the info can be produced for not always-inline functions
>>> as well, the usual case would be for LTO inlining across TUs compiled
>>> with different target options.  In your case the special -mcpu=power10
>>> TU would otherwise not be able to inline from a general -mcpu=power8 TU.
>>>
>>
>> Agree it can be extended to non-always_inline cases.  Since always_inline
>> is kind of user "forced" requirement and compiler emits error if it fails
>> to inline, while non-always_inline will have warning instead.  Considering
>> the scanning might be considered as costly for 

Re: [RFC/PATCH] ipa-inline: Add target info into fn summary [PR102059]

2021-09-02 Thread Kewen.Lin via Gcc-patches
Hi Richi,

Thanks for the comments!

on 2021/9/2 下午5:25, Richard Biener wrote:
> On Wed, Sep 1, 2021 at 9:02 AM Kewen.Lin  wrote:
>>
>> Hi!
>>
>> Power ISA 2.07 (Power8) introduces transactional memory feature
>> but ISA3.1 (Power10) removes it.  It exposes one troublesome
>> issue as PR102059 shows.  Users define some function with
>> target pragma cpu=power10 then it calls one function with
>> attribute always_inline which inherits command line option
>> cpu=power8 which enables HTM implicitly.  The current isa_flags
>> check doesn't allow this inlining due to "target specific
>> option mismatch" and an error message is emitted.
>>
>> Normally, the callee function isn't intended to exploit the HTM
>> feature, but the default flag setting makes it look like it is.
>> As Richi raised in the PR, we have fp_expressions flag in
>> function summary, and allow us to check the function actually
>> contains any floating point expressions to avoid overkill.
>> So this patch follows the similar idea but is more target
>> specific, for this rs6000 port specific requirement on HTM
>> feature check, we would like to check rs6000 specific HTM
>> built-in functions and inline assembly, it allows targets
>> to do their own customized checks and updates.
>>
>> It introduces two target hooks need_ipa_fn_target_info and
>> update_ipa_fn_target_info.  The former allows target to do
>> some previous check and decides to collect target specific
>> information for this function or not.  For some special case,
>> it can predict the analysis result and push it early without
>> any scannings.  The latter allows the analyze_function_body
>> to pass gimple stmts down just like fp_expressions handlings,
>> target can do its own tricks.  I initially put them as one hook
>> with one boolean to indicate whether it's the initial call, but
>> the code looked a bit ugly; separating them seems to give
>> better readability.
>>
>> To make it simple, this patch uses HOST_WIDE_INT to record the
>> flags just like what we use for isa_flags.  For rs6000's HTM
>> need, one HOST_WIDE_INT variable is quite enough, but it seems
>> good to have one auto_vec for scalability as I noticed some
>> targets have more than one HOST_WIDE_INT flag.  For now, this
>> target information collection is only for always_inline function,
>> function ipa_merge_fn_summary_after_inlining deals with target
>> information merging.
>>
>> The patch has been bootstrapped and regress-tested on
>> powerpc64le-linux-gnu Power9.
>>
>> Is it on the right track?
> 
> +  if (always_inline)
> +{
> +  cgraph_node *callee_node = cgraph_node::get (callee);
> +  if (ipa_fn_summaries && ipa_fn_summaries->get (callee_node) != NULL)
> +   {
> + if (dump_file)
> +   ipa_dump_fn_summary (dump_file, callee_node);
> + const vec<HOST_WIDE_INT> &info =
> +   ipa_fn_summaries->get (callee_node)->target_info;
> + if (!info.is_empty ())
> +   always_inline_safe_mask |= ~info[0] & OPTION_MASK_HTM;
> +   }
> +
> +  caller_isa &= ~always_inline_safe_mask;
> +  callee_isa &= ~always_inline_safe_mask;
> +}
> 
> that's a bit convoluted but obviously the IPA info can be used for
> non-always_inline cases as well.
> 
> As said above the info can be produced for not always-inline functions
> as well, the usual case would be for LTO inlining across TUs compiled
> with different target options.  In your case the special -mcpu=power10
> TU would otherwise not be able to inline from a general -mcpu=power8 TU.
> 

Agreed, it can be extended to non-always_inline cases.  always_inline is a
kind of user-"forced" requirement: the compiler emits an error if it fails
to inline, while non-always_inline gets a warning instead.  Considering that
the scanning might be costly for some big functions, I guessed it would be
good to start with always_inline as a first step.  But if different target
options among LTO TUs is a common use case, I think it's worth extending it
now.
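A hedged sketch of the relaxation in the quoted rs6000_can_inline_p hunk,
reduced to a pure subset test; the real check involves more state, and the
function name and flag values here are illustrative assumptions:

```c
#include <assert.h>
#include <stdbool.h>

/* If the callee's function summary proves a flag (e.g. the HTM bit) is
   unused, drop it from both sides before the usual subset check.  */
static bool
can_inline_after_relaxation (unsigned long long caller_isa,
                             unsigned long long callee_isa,
                             unsigned long long safe_mask)
{
  caller_isa &= ~safe_mask;
  callee_isa &= ~safe_mask;
  /* Callee flags must be a subset of caller flags.  */
  return (caller_isa & callee_isa) == callee_isa;
}
```

In the PR102059 scenario the callee inherits an implicit HTM bit the
cpu=power10 caller lacks; once the summary proves HTM is unused, masking it
out lets the inlining proceed.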

> On the streaming side we possibly have to take care about the
> GPU offloading path where we likely want to avoid pushing host target
> bits to the GPU target in some way.
> 

I guess this comment is about lto_stream_offload_p.  I just did some quick
checks: this flag seems to guard things into the offload_lto section, while
the function summary has its own section, so it seems fine?

> Your case is specifically looking for HTM target builtins - for more general
> cases, like for example deciding whether we can inline across, say,
> -mlzcnt on x86 the 

[RFC/PATCH] ipa-inline: Add target info into fn summary [PR102059]

2021-09-01 Thread Kewen.Lin via Gcc-patches
Hi!

Power ISA 2.07 (Power8) introduces transactional memory feature
but ISA3.1 (Power10) removes it.  It exposes one troublesome
issue as PR102059 shows.  Users define some function with
target pragma cpu=power10 then it calls one function with
attribute always_inline which inherits command line option
cpu=power8 which enables HTM implicitly.  The current isa_flags
check doesn't allow this inlining due to "target specific
option mismatch" and an error message is emitted.

Normally, the callee function isn't intended to exploit the HTM
feature, but the default flag setting makes it look like it is.
As Richi raised in the PR, we have fp_expressions flag in
function summary, and allow us to check the function actually
contains any floating point expressions to avoid overkill.
So this patch follows the similar idea but is more target
specific, for this rs6000 port specific requirement on HTM
feature check, we would like to check rs6000 specific HTM
built-in functions and inline assembly, it allows targets
to do their own customized checks and updates.

It introduces two target hooks need_ipa_fn_target_info and
update_ipa_fn_target_info.  The former allows target to do
some previous check and decides to collect target specific
information for this function or not.  For some special case,
it can predict the analysis result and push it early without
any scannings.  The latter allows the analyze_function_body
to pass gimple stmts down just like fp_expressions handlings,
target can do its own tricks.  I initially put them as one hook
with one boolean to indicate whether it's the initial call, but
the code looked a bit ugly; separating them seems to give
better readability.

To keep it simple, this patch uses a HOST_WIDE_INT to record the
flags, just like what we use for isa_flags.  For rs6000's HTM
need, one HOST_WIDE_INT variable is quite enough, but it seems
good to have one auto_vec for scalability, as I noticed some
targets have more than one HOST_WIDE_INT worth of flags.  For now,
this target information collection is only done for always_inline
functions; the function ipa_merge_fn_summary_after_inlining deals
with target information merging.

The patch has been bootstrapped and regress-tested on
powerpc64le-linux-gnu Power9.

Is it on the right track?

Any comments are highly appreciated!

BR,
Kewen
--
gcc/ChangeLog:

PR ipa/102059
* config/rs6000/rs6000-call.c (rs6000_builtin_mask_set_p): New
function.
* config/rs6000/rs6000-internal.h (rs6000_builtin_mask_set_p): New
declaration.
* config/rs6000/rs6000.c (TARGET_NEED_IPA_FN_TARGET_INFO): New macro.
(TARGET_UPDATE_IPA_FN_TARGET_INFO): Likewise.
(rs6000_need_ipa_fn_target_info): New function.
(rs6000_update_ipa_fn_target_info): Likewise.
(rs6000_can_inline_p): Adjust for ipa function summary target info.
* ipa-fnsummary.c (ipa_dump_fn_summary): Adjust for ipa function
summary target info.
(analyze_function_body): Adjust for ipa function summary target
info and call hook rs6000_need_ipa_fn_target_info and
rs6000_update_ipa_fn_target_info.
(ipa_merge_fn_summary_after_inlining): Adjust for ipa function
summary target info.
(inline_read_section): Likewise.
(ipa_fn_summary_write): Likewise.
* ipa-fnsummary.h (ipa_fn_summary::target_info): New member.
* doc/tm.texi: Regenerate.
* doc/tm.texi.in (TARGET_UPDATE_IPA_FN_TARGET_INFO): Document new
hook.
(TARGET_NEED_IPA_FN_TARGET_INFO): Likewise.
* target.def (update_ipa_fn_target_info): New hook.
(need_ipa_fn_target_info): Likewise.
* targhooks.c (default_need_ipa_fn_target_info): New function.
(default_update_ipa_fn_target_info): Likewise.
* targhooks.h (default_update_ipa_fn_target_info): New declaration.
(default_need_ipa_fn_target_info): Likewise.

gcc/testsuite/ChangeLog:

PR ipa/102059
* gcc.dg/lto/pr102059_0.c: New test.
* gcc.dg/lto/pr102059_1.c: New test.
* gcc.dg/lto/pr102059_2.c: New test.
* gcc.target/powerpc/pr102059-5.c: New test.
* gcc.target/powerpc/pr102059-6.c: New test.

---
 gcc/config/rs6000/rs6000-call.c   | 11 +++
 gcc/config/rs6000/rs6000-internal.h   |  1 +
 gcc/config/rs6000/rs6000.c| 94 ++-
 gcc/doc/tm.texi   | 31 ++
 gcc/doc/tm.texi.in|  4 +
 gcc/ipa-fnsummary.c   | 53 +++
 gcc/ipa-fnsummary.h   |  6 +-
 gcc/testsuite/gcc.dg/lto/pr102059_0.c | 12 +++
 gcc/testsuite/gcc.dg/lto/pr102059_1.c |  9 ++
 gcc/testsuite/gcc.dg/lto/pr102059_2.c | 11 +++
 gcc/testsuite/gcc.target/powerpc/pr102059-5.c | 21 +
 gcc/testsuite/gcc.target/powerpc/pr102059-6.c | 21 +
 gcc/target.def| 35 +++
 gcc/targhooks.c  

[PATCH] rs6000: Remove useless toc-fusion option

2021-09-01 Thread Kewen.Lin via Gcc-patches
Hi!

The option toc-fusion was intended for Power9 TOC fusion
previously, but Power9 ended up not supporting fusion at all,
so this patch removes the now-useless option.

Is it ok for trunk?

BR,
Kewen
-
gcc/ChangeLog:

* config/rs6000/rs6000.opt (-mtoc-fusion): Remove.
---
 gcc/config/rs6000/rs6000.opt | 4 
 1 file changed, 4 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
index 0538db387dc..a104ffa6558 100644
--- a/gcc/config/rs6000/rs6000.opt
+++ b/gcc/config/rs6000/rs6000.opt
@@ -557,10 +557,6 @@ mpower9-minmax
 Target Undocumented Mask(P9_MINMAX) Var(rs6000_isa_flags)
 Use the new min/max instructions defined in ISA 3.0.
 
-mtoc-fusion
-Target Undocumented Mask(TOC_FUSION) Var(rs6000_isa_flags)
-Fuse medium/large code model toc references with the memory instruction.
-
 mmodulo
 Target Undocumented Mask(MODULO) Var(rs6000_isa_flags)
 Generate the integer modulo instructions.
-- 
2.17.1



[PATCH] rs6000: Fix some issues in rs6000_can_inline_p [PR102059]

2021-09-01 Thread Kewen.Lin via Gcc-patches
Hi!

This patch fixes the inconsistent behaviors between non-LTO mode
and LTO mode.  As Martin pointed out, currently the function
rs6000_can_inline_p simply makes the callee inlinable if
callee_tree is NULL, but that's wrong; we should use the command
line options from target_option_default_node as the default.  It
also replaces rs6000_isa_flags with the flags from
target_option_default_node when caller_tree is NULL, as
rs6000_isa_flags could have changed since initialization.

It also extends the scope of the check to the case where the
callee has explicitly set options; for test case pr102059-2.c,
inlining could previously happen unexpectedly, and that is fixed
accordingly.

As Richi/Mike pointed out, some tuning flags like MASK_P8_FUSION
can be neglected for inlining, so this patch also excludes them
when the callee is attributed with always_inline.

Bootstrapped and regtested on powerpc64le-linux-gnu Power9.

BR,
Kewen
-
gcc/ChangeLog:

PR ipa/102059
* config/rs6000/rs6000.c (rs6000_can_inline_p): Adjust with
target_option_default_node and consider always_inline_safe flags.

gcc/testsuite/ChangeLog:

PR ipa/102059
* gcc.target/powerpc/pr102059-1.c: New test.
* gcc.target/powerpc/pr102059-2.c: New test.
* gcc.target/powerpc/pr102059-3.c: New test.
* gcc.target/powerpc/pr102059-4.c: New test.
---
 gcc/config/rs6000/rs6000.c| 87 +++--
 gcc/testsuite/gcc.target/powerpc/pr102059-1.c | 24 +
 gcc/testsuite/gcc.target/powerpc/pr102059-2.c | 20 
 gcc/testsuite/gcc.target/powerpc/pr102059-3.c | 95 +++
 gcc/testsuite/gcc.target/powerpc/pr102059-4.c | 22 +
 5 files changed, 221 insertions(+), 27 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr102059-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr102059-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr102059-3.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr102059-4.c

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 46b8909104e..c2582a3efab 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -25058,45 +25058,78 @@ rs6000_generate_version_dispatcher_body (void *node_p)
 static bool
 rs6000_can_inline_p (tree caller, tree callee)
 {
-  bool ret = false;
   tree caller_tree = DECL_FUNCTION_SPECIFIC_TARGET (caller);
   tree callee_tree = DECL_FUNCTION_SPECIFIC_TARGET (callee);
 
-  /* If the callee has no option attributes, then it is ok to inline.  */
+  /* If the caller/callee has option attributes, then use them.
+ Otherwise, use the command line options.  */
   if (!callee_tree)
-ret = true;
+callee_tree = target_option_default_node;
+  if (!caller_tree)
+caller_tree = target_option_default_node;
+
+  struct cl_target_option *caller_opts = TREE_TARGET_OPTION (caller_tree);
+  struct cl_target_option *callee_opts = TREE_TARGET_OPTION (callee_tree);
+  HOST_WIDE_INT caller_isa = caller_opts->x_rs6000_isa_flags;
+  HOST_WIDE_INT callee_isa = callee_opts->x_rs6000_isa_flags;
+
+  bool always_inline =
+(DECL_DISREGARD_INLINE_LIMITS (callee)
+ && lookup_attribute ("always_inline", DECL_ATTRIBUTES (callee)));
+
+  /* Some flags such as fusion can be tolerated for always inlines.  */
+  unsigned HOST_WIDE_INT always_inline_safe_mask =
+(MASK_P8_FUSION | MASK_P10_FUSION | OPTION_MASK_SAVE_TOC_INDIRECT
+ | OPTION_MASK_P8_FUSION_SIGN | OPTION_MASK_P10_FUSION_LD_CMPI
+ | OPTION_MASK_P10_FUSION_2LOGICAL | OPTION_MASK_P10_FUSION_LOGADD
+ | OPTION_MASK_P10_FUSION_ADDLOG | OPTION_MASK_P10_FUSION_2ADD
+ | OPTION_MASK_PCREL_OPT);
+
+  if (always_inline) {
+caller_isa &= ~always_inline_safe_mask;
+callee_isa &= ~always_inline_safe_mask;
+  }
 
-  else
+  /* The callee's options must be a subset of the caller's options, i.e.
+ a vsx function may inline an altivec function, but a no-vsx function
+ must not inline a vsx function.  */
+  if ((caller_isa & callee_isa) != callee_isa)
 {
-  HOST_WIDE_INT caller_isa;
-  struct cl_target_option *callee_opts = TREE_TARGET_OPTION (callee_tree);
-  HOST_WIDE_INT callee_isa = callee_opts->x_rs6000_isa_flags;
-  HOST_WIDE_INT explicit_isa = callee_opts->x_rs6000_isa_flags_explicit;
-
-  /* If the caller has option attributes, then use them.
-Otherwise, use the command line options.  */
-  if (caller_tree)
-   caller_isa = TREE_TARGET_OPTION (caller_tree)->x_rs6000_isa_flags;
-  else
-   caller_isa = rs6000_isa_flags;
+  if (TARGET_DEBUG_TARGET)
+   fprintf (stderr,
+"rs6000_can_inline_p:, caller %s, callee %s, cannot "
+"inline since callee's options set isn't a subset of "
+"caller's options set.\n",
+get_decl_name (caller), get_decl_name (callee));
+  return false;
+}
 
-  /* The callee's options must be a subset of the caller's options, i.e.
-a vsx function 

Re: [PATCH] rs6000: Add missing unsigned info for some P10 bifs

2021-08-29 Thread Kewen.Lin via Gcc-patches
on 2021/8/11 1:44 PM, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> This patch is to make prototypes of some Power10 built-in
> functions consistent with what's in the documentation, as
> well as the vector version.  Otherwise, useless conversions
> can be generated in gimple IR, and the vectorized versions
> will have inconsistent types.
> 
> Bootstrapped & regtested on powerpc64le-linux-gnu P9 and
> powerpc64-linux-gnu P8.
> 
> Is it ok for trunk?
> 

This has been approved by Segher offline, thanks Segher!

Committed in r12-3179.

BR,
Kewen

> BR,
> Kewen
> -
> gcc/ChangeLog:
> 
>   * config/rs6000/rs6000-call.c (builtin_function_type): Add unsigned
>   signedness for some Power10 bifs.
> 




Re: [PATCH, rs6000] Disable gimple fold for float or double vec_minmax when fast-math is not set

2021-08-25 Thread Kewen.Lin via Gcc-patches
Hi Haochen,

on 2021/8/25 3:06 PM, HAO CHEN GUI via Gcc-patches wrote:
> Hi,
> 
>     I refined the patch according to Bill's advice. I pasted the ChangeLog 
> and diff file here. If it doesn't work, please let me know. Thanks.
> 
> 2021-08-25 Haochen Gui 
> 
> gcc/

IIUC, this patch is for PR93127; the line for the PR is missing here.

>     * config/rs6000/rs6000-call.c (rs6000_gimple_fold_builtin):
>     Modify the VSX_BUILTIN_XVMINDP, ALTIVEC_BUILTIN_VMINFP,
>     VSX_BUILTIN_XVMAXDP, ALTIVEC_BUILTIN_VMAXFP expansions.
> 
> gcc/testsuite/

Same, need a PR line.

>     * gcc.target/powerpc/vec-minmax-1.c: New test.
>     * gcc.target/powerpc/vec-minmax-2.c: Likewise.
> 

Maybe it's better to use pr93127-{1,2}.c for case names?

...
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/vec-minmax-1.c
> @@ -0,0 +1,53 @@
> +/* { dg-do compile { target { powerpc64le-*-* } } } */

I guess this "powerpc64le" isn't intentional?  The test case has
the macro to distinguish endianness, so I assume we want this to
be compiled on BE as well?  If so, shall we just put the line
below instead?

/* { dg-do compile } */

And it needs extra testing on BE as well.  :)

Thanks for fixing this!

BR,
Kewen

> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power9" } */
> +/* { dg-final { scan-assembler-times {\mxvmaxdp\M} 1 } } */
> +/* { dg-final { scan-assembler-times {\mxvmaxsp\M} 1 } } */
> +/* { dg-final { scan-assembler-times {\mxvmindp\M} 1 } } */
> +/* { dg-final { scan-assembler-times {\mxvminsp\M} 1 } } */
> +
> +/* This test verifies that float or double vec_min/max are bound to
> +   xv[min|max][d|s]p instructions when fast-math is not set.  */
> +
> +
> +#include 
> +
> +#ifdef _BIG_ENDIAN
> +   const int PREF_D = 0;
> +#else
> +   const int PREF_D = 1;
> +#endif
> +
> +double vmaxd (double a, double b)
> +{
> +  vector double va = vec_promote (a, PREF_D);
> +  vector double vb = vec_promote (b, PREF_D);
> +  return vec_extract (vec_max (va, vb), PREF_D);
> +}
> +
> +double vmind (double a, double b)
> +{
> +  vector double va = vec_promote (a, PREF_D);
> +  vector double vb = vec_promote (b, PREF_D);
> +  return vec_extract (vec_min (va, vb), PREF_D);
> +}
> +
> +#ifdef _BIG_ENDIAN
> +   const int PREF_F = 0;
> +#else
> +   const int PREF_F = 3;
> +#endif
> +
> +float vmaxf (float a, float b)
> +{
> +  vector float va = vec_promote (a, PREF_F);
> +  vector float vb = vec_promote (b, PREF_F);
> +  return vec_extract (vec_max (va, vb), PREF_F);
> +}
> +
> +float vminf (float a, float b)
> +{
> +  vector float va = vec_promote (a, PREF_F);
> +  vector float vb = vec_promote (b, PREF_F);
> +  return vec_extract (vec_min (va, vb), PREF_F);
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/vec-minmax-2.c 
> b/gcc/testsuite/gcc.target/powerpc/vec-minmax-2.c
> new file mode 100644
> index 000..d318b933181
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/vec-minmax-2.c
> @@ -0,0 +1,51 @@
> +/* { dg-do compile { target { powerpc64le-*-* } } } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power9 -ffast-math" } */
> +/* { dg-final { scan-assembler-times {\mxsmaxcdp\M} 2 } } */
> +/* { dg-final { scan-assembler-times {\mxsmincdp\M} 2 } } */
> +
> +/* This test verifies that float or double vec_min/max can be converted
> +   to scalar comparison when fast-math is set.  */
> +
> +
> +#include 
> +
> +#ifdef _BIG_ENDIAN
> +   const int PREF_D = 0;
> +#else
> +   const int PREF_D = 1;
> +#endif
> +
> +double vmaxd (double a, double b)
> +{
> +  vector double va = vec_promote (a, PREF_D);
> +  vector double vb = vec_promote (b, PREF_D);
> +  return vec_extract (vec_max (va, vb), PREF_D);
> +}
> +
> +double vmind (double a, double b)
> +{
> +  vector double va = vec_promote (a, PREF_D);
> +  vector double vb = vec_promote (b, PREF_D);
> +  return vec_extract (vec_min (va, vb), PREF_D);
> +}
> +
> +#ifdef _BIG_ENDIAN
> +   const int PREF_F = 0;
> +#else
> +   const int PREF_F = 3;
> +#endif
> +
> +float vmaxf (float a, float b)
> +{
> +  vector float va = vec_promote (a, PREF_F);
> +  vector float vb = vec_promote (b, PREF_F);
> +  return vec_extract (vec_max (va, vb), PREF_F);
> +}
> +
> +float vminf (float a, float b)
> +{
> +  vector float va = vec_promote (a, PREF_F);
> +  vector float vb = vec_promote (b, PREF_F);
> +  return vec_extract (vec_min (va, vb), PREF_F);
> +}
> 


Re: [PATCH] rs6000: Make some BIFs vectorized on P10

2021-08-24 Thread Kewen.Lin via Gcc-patches
on 2021/8/25 6:14 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Fri, Aug 13, 2021 at 10:34:46AM +0800, Kewen.Lin wrote:
>> on 2021/8/12 11:10 PM, Segher Boessenkool wrote:
>>>> +  && VECTOR_UNIT_ALTIVEC_OR_VSX_P (in_vmode))
>>>> +{
>>>> +  machine_mode exp_mode = DImode;
>>>> +  machine_mode exp_vmode = V2DImode;
>>>> +  enum rs6000_builtins vname = RS6000_BUILTIN_COUNT;
>>>
>>> "name"?  This should be "bif" or similar?
>>
>> Updated with name.
> 
> No, I meant "name" has no meaning other than it is wrong here :-)
> 
> It is an enum for which builtin to use here.  It has nothing to do with
> a name.  So it could be "enum rs6000_builtins bif" or whatever you want;
> short variable names are *good*, for many reasons, but they should not
> egregiously lie :-)
> 

Oops, sorry for the misunderstanding, will update it with "bif".

>>>> +/* { dg-do run } */
>>>> +/* { dg-require-effective-target lp64 } */
>>>
>>> Same here.  I suppose this uses builtins that do not exist on 32-bit?
>>
>> Yeah, those bifs which are guarded with lp64 in their cases are only
>> supported in a 64-bit environment.
> 
> It is a pity we cannot use "powerpc64" here (that selector does not test
> what you would/could/should hope it tests...  Maybe someone can fix it
> some day?  The only real blocker to that is fixing up the current users
> of it, the rest is easy).
> 

If I got it right, there is only one test case using this selector:

gcc/testsuite/gcc.target/powerpc/darwin-longlong.c

The selector checking looks interesting to me: it has the special
option "-mcpu=G5" and seems to exclude all "aix" (I didn't verify
it yet).

I guess there would still be some effort needed to re-direct those
existing cases to use a new "powerpc64_ok" instead of "lp64"?

> Expanding a bit...  You would expect (well, I do!  I did patches
> expecting this several times) this to mean "powerpc64_ok", but it in
> fact means "powerpc64_hw".  Maybe we should have selectors with those
> two names, and get rid of the current "powerpc64"?
> 

Yeah, it sounds good to have those two names, just like some existing ones.

>>>> +#define CHECK(name)   
>>>> \
>>>> +  __attribute__ ((optimize (1))) void check_##name () 
>>>> \
>>>
>>> What is the attribute for, btw?  It seems fragile, but perhaps I do not
>>> understand the intention.
>>
>> It's to stop compiler from optimizing check functions with vectorization,
>> since the test point is to compare the results between scalar and vectorized
>> version.
> 
> So, add a comment for this as well please.
> 
> In general, in testcases you can do the dirtiest things, no problems at
> all, just document what you do why :-)
> 

OK, will add.

>> Thanks, v2 has been attached by addressing Bill's and your comments.  :)
> 
> Looks good.  Just fix that "name" thing, and it is okay for trunk.
> Thanks!
> 

Thanks for the review!

BR,
Kewen


Re: [PATCH] rs6000: Make some BIFs vectorized on P10

2021-08-24 Thread Kewen.Lin via Gcc-patches
on 2021/8/25 5:56 AM, Segher Boessenkool wrote:
> On Fri, Aug 13, 2021 at 11:18:46AM +0800, Kewen.Lin wrote:
>> on 2021/8/12 11:51 PM, Segher Boessenkool wrote:
>>> It is a bad idea to initialise things unnecessary: it hinders many
>>> optimisations, but much more importantly, it silences warnings without
>>> fixing the problem.
>>
>> OK, I've made it uninitialized in v2. :-)  I believe the context here is 
>> simple
>> and the uninit-ed var detector can easily catch and warn the bad thing in 
>> future.
> 
> And those warnings generally are for "MAY BE used uninitialised",
> anyway.  They will warn :-)
> 
> (When the warning says "IS used uninitialised" the compiler should be
> sure about that!)
> 
>> Sorry for chasing dead ends, I don't follow how it can hinder optimizations 
>> here,
>> IIUC it would be optimized as a dead store here?
> 
> When the compiler is not sure if something needs initialisation or not
> it cannot remove actually superfluous initialisation.  Such cases are
> always too complicated code, so that should be fixed, not silenced :-)
> 

aha, you meant complicated code, got it.  :)

>> As to the warning, although
>> there is no warning, I'd expect it causes ICE since the init-ed bif name 
>> isn't
>> reasonable for generation.  Wouldn't it be better than warning?  Sometimes we
>> don't have a proper value for initialization, I agree it should be better to
>> just leave it be, but IMHO it isn't the case here.  :)
> 
> ICEing is always wrong.  A user should never see an ICE (not counting
> "sorry"s as ICEs here -- not that those are good, but they tell the user
> exactly what is going on).
> 

Yeah, but here I was expecting the ICE to happen while GCC developers
are testing the newly added bif support.  :)


BR,
Kewen


Re: [PATCH v2] rs6000: Add vec_unpacku_{hi,lo}_v4si

2021-08-24 Thread Kewen.Lin via Gcc-patches
on 2021/8/24 9:02 PM, Segher Boessenkool wrote:
> Hi Ke Wen,
> 
> On Mon, Aug 09, 2021 at 10:53:00AM +0800, Kewen.Lin wrote:
>> on 2021/8/6 9:10 PM, Bill Schmidt wrote:
>>> On 8/4/21 9:06 PM, Kewen.Lin wrote:
>>>> The existing vec_unpacku_{hi,lo} supports emulated unsigned
>>>> unpacking for short and char but misses the support for int.
>>>> This patch adds the support for vec_unpacku_{hi,lo}_v4si.
> 
>>  * config/rs6000/altivec.md (vec_unpacku_hi_v16qi): Remove.
>>  (vec_unpacku_hi_v8hi): Likewise.
>>  (vec_unpacku_lo_v16qi): Likewise.
>>  (vec_unpacku_lo_v8hi): Likewise.
>>  (vec_unpacku_hi_): New define_expand.
>>  (vec_unpacku_lo_): Likewise.
> 
>> -(define_expand "vec_unpacku_hi_v16qi"
>> -  [(set (match_operand:V8HI 0 "register_operand" "=v")
>> -(unspec:V8HI [(match_operand:V16QI 1 "register_operand" "v")]
>> - UNSPEC_VUPKHUB))]
>> -  "TARGET_ALTIVEC"  
>> -{  
>> -  rtx vzero = gen_reg_rtx (V8HImode);
>> -  rtx mask = gen_reg_rtx (V16QImode);
>> -  rtvec v = rtvec_alloc (16);
>> -  bool be = BYTES_BIG_ENDIAN;
>> -   
>> -  emit_insn (gen_altivec_vspltish (vzero, const0_rtx));
>> -   
>> -  RTVEC_ELT (v,  0) = gen_rtx_CONST_INT (QImode, be ? 16 :  7);
>> -  RTVEC_ELT (v,  1) = gen_rtx_CONST_INT (QImode, be ?  0 : 16);
>> -  RTVEC_ELT (v,  2) = gen_rtx_CONST_INT (QImode, be ? 16 :  6);
>> -  RTVEC_ELT (v,  3) = gen_rtx_CONST_INT (QImode, be ?  1 : 16);
>> -  RTVEC_ELT (v,  4) = gen_rtx_CONST_INT (QImode, be ? 16 :  5);
>> -  RTVEC_ELT (v,  5) = gen_rtx_CONST_INT (QImode, be ?  2 : 16);
>> -  RTVEC_ELT (v,  6) = gen_rtx_CONST_INT (QImode, be ? 16 :  4);
>> -  RTVEC_ELT (v,  7) = gen_rtx_CONST_INT (QImode, be ?  3 : 16);
>> -  RTVEC_ELT (v,  8) = gen_rtx_CONST_INT (QImode, be ? 16 :  3);
>> -  RTVEC_ELT (v,  9) = gen_rtx_CONST_INT (QImode, be ?  4 : 16);
>> -  RTVEC_ELT (v, 10) = gen_rtx_CONST_INT (QImode, be ? 16 :  2);
>> -  RTVEC_ELT (v, 11) = gen_rtx_CONST_INT (QImode, be ?  5 : 16);
>> -  RTVEC_ELT (v, 12) = gen_rtx_CONST_INT (QImode, be ? 16 :  1);
>> -  RTVEC_ELT (v, 13) = gen_rtx_CONST_INT (QImode, be ?  6 : 16);
>> -  RTVEC_ELT (v, 14) = gen_rtx_CONST_INT (QImode, be ? 16 :  0);
>> -  RTVEC_ELT (v, 15) = gen_rtx_CONST_INT (QImode, be ?  7 : 16);
>> -
>> -  emit_insn (gen_vec_initv16qiqi (mask, gen_rtx_PARALLEL (V16QImode, v)));
>> -  emit_insn (gen_vperm_v16qiv8hi (operands[0], operands[1], vzero, mask));
>> -  DONE;
>> -})
> 
> So I wonder if all this still generates good code.  The unspecs cannot
> be optimised properly, the RTL can (in principle, anyway: it is possible
> it makes more opportunities to use unpack etc. insns invisible than that
> it helps over unspec.  This needs to be tested, and the usual idioms
> need testcases, is that what you add here?  (/me reads on...)
> 

Yeah, for the existing char/short, it generates better code with
vector merge high/low instead of permutation, saving the cost of
the permutation control vector (space in the constant area as
well as the cost to initialize it in the prologue).  The iterator
rewrite makes it concise and also adds the missing "int" support.
The associated test cases verify the newly generated assembly and
the runtime result.

>> +  if (BYTES_BIG_ENDIAN)
>> +emit_insn (gen_altivec_vmrgh (res, vzero, op1));
>> +  else
>> +emit_insn (gen_altivec_vmrgl (res, op1, vzero));
> 
> Ah, so it is *not* using unspecs?  Excellent.
> 
> Okay for trunk.  Thank you!
> 

Thanks for the review!  Committed in r12-3134.


BR,
Kewen


Re: [PATCH][v2] Remove --param vect-inner-loop-cost-factor

2021-08-23 Thread Kewen.Lin via Gcc-patches
Hi Richi,

on 2021/8/23 10:33 PM, Richard Biener via Gcc-patches wrote:
> This removes --param vect-inner-loop-cost-factor in favor of looking
> at the estimated number of iterations of the inner loop
> when available and otherwise just assumes a single inner
> iteration which is conservative on the side of not vectorizing.
> 

I may be missing something, but the factor seems to be an
amplifier.  A single inner iteration errs on the side of not
vectorizing only when vector_cost < scalar_cost; if scalar_cost <
vector_cost, the direction will be flipped?  ({vector,scalar}_cost
here refers only to the inner loop part.)

Since we don't calculate/compare the costs for the inner loop
independently and return early if scalar_cost < vector_cost for
the inner loop, I guess it's theoretically possible to have a
"scalar_cost < vector_cost" case, especially when targets cost
something more on the vector side.

> The alternative is to retain the --param for exactly that case,
> not sure if the result is better or not.  The --param is new on
> head, it was static '50' before.
> 

I think the intention of the --param is to offer ports a way to
tweak it (though no ports do so for now :)).  I'm not sure how
sensitive target costing is to this factor, but I also prefer to
make its default value 50, as Honza suggested, to avoid more
possible tweaking.
If targets want more, maybe we can extend it to:

default_hook:
  return estimated or likely_max if either is valid;
  return default value;
  
target hook:
  val = default_hook; // or from scratch
  tweak the val as it wishes;  

I guess there is no this need for now.

> Any strong opinions?
> 
> Richard.
> 
> 2021-08-23  Richard Biener  
> 
>   * doc/invoke.texi (vect-inner-loop-cost-factor): Remove
>   documentation.
>   * params.opt (--param vect-inner-loop-cost-factor): Remove.
>   * tree-vect-loop.c (_loop_vec_info::_loop_vec_info):
>   Initialize inner_loop_cost_factor to 1.
>   (vect_analyze_loop_form): Initialize inner_loop_cost_factor
>   from the estimated number of iterations of the inner loop.
> ---
>  gcc/doc/invoke.texi  |  5 -
>  gcc/params.opt   |  4 
>  gcc/tree-vect-loop.c | 12 +++-
>  3 files changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index c057cc1e4ae..054950132f6 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -14385,11 +14385,6 @@ code to iterate.  2 allows partial vector loads and 
> stores in all loops.
>  The parameter only has an effect on targets that support partial
>  vector loads and stores.
>  
> -@item vect-inner-loop-cost-factor
> -The factor which the loop vectorizer applies to the cost of statements
> -in an inner loop relative to the loop being vectorized.  The default
> -value is 50.
> -
>  @item avoid-fma-max-bits
>  Maximum number of bits for which we avoid creating FMAs.
>  
> diff --git a/gcc/params.opt b/gcc/params.opt
> index f9264887b40..f7b19fa430d 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -1113,8 +1113,4 @@ Bound on number of runtime checks inserted by the 
> vectorizer's loop versioning f
>  Common Joined UInteger Var(param_vect_partial_vector_usage) Init(2) 
> IntegerRange(0, 2) Param Optimization
>  Controls how loop vectorizer uses partial vectors.  0 means never, 1 means 
> only for loops whose need to iterate can be removed, 2 means for all loops.  
> The default value is 2.
>  
> --param=vect-inner-loop-cost-factor=
> -Common Joined UInteger Var(param_vect_inner_loop_cost_factor) Init(50) 
> IntegerRange(1, 99) Param Optimization
> -The factor which the loop vectorizer applies to the cost of statements in an 
> inner loop relative to the loop being vectorized.
> -
>  ; This comment is to ensure we retain the blank line above.
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index c521b43a47c..cb48717f20e 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -841,7 +841,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, 
> vec_info_shared *shared)
>  single_scalar_iteration_cost (0),
>  vec_outside_cost (0),
>  vec_inside_cost (0),
> -inner_loop_cost_factor (param_vect_inner_loop_cost_factor),
> +inner_loop_cost_factor (1),
>  vectorizable (false),
>  can_use_partial_vectors_p (param_vect_partial_vector_usage != 0),
>  using_partial_vectors_p (false),
> @@ -1519,6 +1519,16 @@ vect_analyze_loop_form (class loop *loop, 
> vec_info_shared *shared)
>stmt_vec_info inner_loop_cond_info
>   = loop_vinfo->lookup_stmt (inner_loop_cond);
>STMT_VINFO_TYPE (inner_loop_cond_info) = loop_exit_ctrl_vec_info_type;
> +  /* If we have an estimate on the number of iterations of the inner
> +  loop use that as the scale for costing, otherwise conservatively
> +  assume a single inner iteration.  */
> +  widest_int nit;
> +  if (get_estimated_loop_iterations (loop->inner, &nit))
> + LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo)

Re: [PATCH] more warning code refactoring

2021-08-19 Thread Kewen.Lin via Gcc-patches
Hi Martin,

on 2021/8/20 上午12:30, Martin Sebor wrote:
> On 8/19/21 9:03 AM, Martin Sebor wrote:
>> On 8/18/21 11:56 PM, Kewen.Lin wrote:
>>> Hi David,
>>>
>>> on 2021/8/19 11:26 AM, David Edelsohn via Gcc-patches wrote:
>>>> Hi, Martin
>>>>
>>>> A few PowerPC-specific testcases started failing yesterday on AIX with
>>>> a strange failure mode: the compiler runs out of memory.  As you may
>>>> expect from telling you this in an email reply to your patch, I have
>>>> bisected the failure and landed on your commit.  I can alternate
>>>> between the previous commit and your commit, and the failure
>>>> definitely appears with your patch, although I'm unsure how your patch
>>>> affected memory allocation in the compiler.  Maybe moving the code
>>>> changed a type of allocation or some memory no longer is being freed?
>>>>
>>>
>>>
>>> Getting rid of the GTY variable alloc_object_size_limit looks
>>> suspicious; maybe the tree objects returned by alloc_max_size
>>> after the change are out of the GC's tracking?
>>>
>>> If this suspicion holds, the attached exploratory diff may help.
>>
>> I wouldn't expect that to make a difference.  There are thousands
>> of similar calls to build_int_cst() throughout the middle end.
>>
>> Looking at the original patch, the change that I'm not sure about
>> and that shouldn't have been part of the refactoring is the call
>> to enable_ranger() in pass_waccess::execute().  It's something
>> I was planning to do next.  But even that I wouldn't expect to
>> eat up a whole 1GB or memory.
> 
> I have reproduced the excessive memory consumption with
> the rlwimi-0.c test and a powerpc-linux cross-compiler, and
> confirmed that it is indeed caused by the call to enable_ranger().
> The test defines some six thousand functions, so it seems that
> unless each call to enable_ranger() is paired with some call to
> release the memory it allocates, the memory leaks.
> 
> The removal of the alloc_object_size_limit global variable doesn't
> have any effect on the test case.  The function that used it (and
> now calls build_int_cst () instead) isn't called when the test
> is compiled.  (It's only called for calls to allocation functions
> in the source, and the test case has none.)
> 

Thanks for the clarification, and sorry for the noisy suspicion!

BR,
Kewen

> Let me take care of releasing the ranger memory.
> 
> Martin
> 
> 
>>
>>>
>>> BR,
>>> Kewen
>>>
>>>> Previously, compiler bootstrap and all testcases ran with a data size
>>>> of 1GB.  After your change, the data size required for those
>>>> particular testcases jumped to 2GB.
>>>>
>>>> The testcases are
>>>>
>>>> gcc/testsuite/gcc.target/powerpc/rlwimi-[012].c
>>>>
>>>> The failure is
>>>>
>>>> cc1: out of memory allocating 65536 bytes after a total of 1608979296
>>>>
>>>> This seems like a significant memory use regression.  Any ideas what 
>>>> happened?
>>
>> Not really.  The patch just moved code around.  I didn't make any
>> changes that I'd expect to impact memory allocation to an appreciable
>> extent, at least not intentionally.  Let me look into it and get back
>> to you.
>>
>> Martin
>>
>>>>
>>>> Thanks, David
>>>>
>>
> 



Re: [PATCH] more warning code refactoring

2021-08-18 Thread Kewen.Lin via Gcc-patches
Hi David,

on 2021/8/19 11:26 AM, David Edelsohn via Gcc-patches wrote:
> Hi, Martin
> 
> A few PowerPC-specific testcases started failing yesterday on AIX with
> a strange failure mode: the compiler runs out of memory.  As you may
> expect from telling you this in an email reply to your patch, I have
> bisected the failure and landed on your commit.  I can alternate
> between the previous commit and your commit, and the failure
> definitely appears with your patch, although I'm unsure how your patch
> affected memory allocation in the compiler.  Maybe moving the code
> changed a type of allocation or some memory no longer is being freed?
> 


Getting rid of the GTY variable alloc_object_size_limit looks
suspicious; maybe the tree objects returned by alloc_max_size
after the change are out of the GC's tracking?

If this suspicion holds, the attached exploratory diff may help.

BR,
Kewen

> Previously, compiler bootstrap and all testcases ran with a data size
> of 1GB.  After your change, the data size required for those
> particular testcases jumped to 2GB.
> 
> The testcases are
> 
> gcc/testsuite/gcc.target/powerpc/rlwimi-[012].c
> 
> The failure is
> 
> cc1: out of memory allocating 65536 bytes after a total of 1608979296
> 
> This seems like a significant memory use regression.  Any ideas what happened?
> 
> Thanks, David
> 
diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 6653e9e2142..9aefed47be8 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -2717,7 +2717,7 @@ GTFILES = $(CPPLIB_H) $(srcdir)/input.h 
$(srcdir)/coretypes.h \
   $(srcdir)/sancov.c \
   $(srcdir)/ipa-devirt.c \
   $(srcdir)/internal-fn.h \
-  $(srcdir)/calls.c \
+  $(srcdir)/gimple-ssa-warn-access.cc \
   $(srcdir)/omp-general.h \
   @all_gtfiles@
 
diff --git a/gcc/gimple-ssa-warn-access.cc b/gcc/gimple-ssa-warn-access.cc
index f3efe564af0..267ef987edd 100644
--- a/gcc/gimple-ssa-warn-access.cc
+++ b/gcc/gimple-ssa-warn-access.cc
@@ -2301,6 +2301,9 @@ pass_waccess::gate (function *)
  || warn_mismatched_new_delete);
 }
 
+/* The limit set by -Walloc-larger-than=.  */
+static GTY(()) tree alloc_object_size_limit;
+
 /* Initialize ALLOC_OBJECT_SIZE_LIMIT based on the -Walloc-size-larger-than=
setting if the option is specified, or to the maximum object size if it
is not.  Return the initialized value.  */
@@ -2308,11 +2311,16 @@ pass_waccess::gate (function *)
 static tree
 alloc_max_size (void)
 {
+  if (alloc_object_size_limit)
+return alloc_object_size_limit;
+
   HOST_WIDE_INT limit = warn_alloc_size_limit;
   if (limit == HOST_WIDE_INT_MAX)
 limit = tree_to_shwi (TYPE_MAX_VALUE (ptrdiff_type_node));
 
-  return build_int_cst (size_type_node, limit);
+  alloc_object_size_limit = build_int_cst (size_type_node, limit);
+
+  return alloc_object_size_limit;
 }
 
 /* Diagnose a call EXP to function FN decorated with attribute alloc_size
@@ -3328,3 +3336,6 @@ make_pass_warn_access (gcc::context *ctxt)
 {
   return new pass_waccess (ctxt);
 }
+
+/* Tell the garbage collector about GTY markers in this source file.  */
+#include "gt-gimple-ssa-warn-access.h"


Re: [PATCH] vect: Add extraction cost for slp reduc

2021-08-16 Thread Kewen.Lin via Gcc-patches
Hi Richi,

Thanks for the comments!

on 2021/8/16 2:49 PM, Richard Biener wrote:
> On Mon, Aug 16, 2021 at 8:03 AM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> IIUC, the function vectorizable_bb_reduc_epilogue missed to
>> consider the cost to extract the final value from the vector
>> for reduc operations.  This patch is to add one time of
>> vec_to_scalar cost for extracting.
>>
>> Bootstrapped & regtested on powerpc64le-linux-gnu P9.
>> The testing on x86_64 and aarch64 is ongoing.
>>
>> Is it ok for trunk?
> 
> There's no such instruction necessary, the way the costing works
> the result is in lane zero already.  Note the optabs are defined
> to reduce to a scalar already.  So if your arch implements those and
> requires such move then the backend costing needs to handle that.
> 

Yes, these reduc__scal_ optabs should have made operand[0] the final
scalar result.

> That said, ideally we'd simply cost the IFN_REDUC_* in the backend
> but for BB reductions we don't actually build a SLP node with such
> representative stmt to pass down (yet).
> 

OK, thanks for the explanation.  It explains why we cost the
IFN_REDUC_* as one vect_stmt in loop vect but cost it as conservatively
as possible (shuffle plus reduc_op) here.

> I guess you're running into a integer reduction where there's
> a vector -> gpr move missing in costing?  I suppose costing
> vec_to_scalar works for that but in the end we should maybe
> find a way to cost the IFN_REDUC_* ...

Yeah, it's a reduction on plus.  Initially I wanted to adjust the backend
costing for the various IFN_REDUC_* (since for some variants Power needs
more than one instruction), but then I noticed we currently cost the
reduction as shuffles and a reduc_op during SLP, so I guess it's good to
get vec_to_scalar considered here for consistency?  Then it can be removed
together when we have better modeling in the end?

BR,
Kewen

> 
> Richard.
> 
>> BR,
>> Kewen
>> -
>> gcc/ChangeLog:
>>
>> * tree-vect-slp.c (vectorizable_bb_reduc_epilogue): Add the cost for
>> value extraction.
>>
>> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
>> index b9d88c2d943..841a0872afa 100644
>> --- a/gcc/tree-vect-slp.c
>> +++ b/gcc/tree-vect-slp.c
>> @@ -4845,12 +4845,14 @@ vectorizable_bb_reduc_epilogue (slp_instance 
>> instance,
>>  return false;
>>
>>/* There's no way to cost a horizontal vector reduction via REDUC_FN so
>> - cost log2 vector operations plus shuffles.  */
>> + cost log2 vector operations plus shuffles and one extraction.  */
>>unsigned steps = floor_log2 (vect_nunits_for_cost (vectype));
>>record_stmt_cost (cost_vec, steps, vector_stmt, instance->root_stmts[0],
>> vectype, 0, vect_body);
>>record_stmt_cost (cost_vec, steps, vec_perm, instance->root_stmts[0],
>> vectype, 0, vect_body);
>> +  record_stmt_cost (cost_vec, 1, vec_to_scalar, instance->root_stmts[0],
>> +   vectype, 0, vect_body);
>>return true;
>>  }




[PATCH] vect: Add extraction cost for slp reduc

2021-08-16 Thread Kewen.Lin via Gcc-patches
Hi,

IIUC, the function vectorizable_bb_reduc_epilogue misses considering the
cost of extracting the final value from the vector for reduc operations.
This patch adds one vec_to_scalar cost for the extraction.
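To make the costing scenario concrete, here is a hedged, minimal example (the function name and shape are illustrative, not taken from any testcase) of the kind of basic-block reduction being costed: the additions can be done in a vector register, but the final sum then sits in one vector lane, and on targets that need a vector-to-GPR move the extra vec_to_scalar cost applies.

```c
#include <assert.h>

/* Straight-line plus-reduction over four lanes: a candidate for BB SLP
   vectorization.  After vectorizing, the result lives in one lane of a
   vector register and may need extraction into a scalar register.  */
static int
sum4 (const int *a)
{
  return a[0] + a[1] + a[2] + a[3];
}
```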

Bootstrapped & regtested on powerpc64le-linux-gnu P9.
The testing on x86_64 and aarch64 is ongoing.

Is it ok for trunk?

BR,
Kewen
-
gcc/ChangeLog:

* tree-vect-slp.c (vectorizable_bb_reduc_epilogue): Add the cost for
value extraction.

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index b9d88c2d943..841a0872afa 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -4845,12 +4845,14 @@ vectorizable_bb_reduc_epilogue (slp_instance instance,
 return false;

   /* There's no way to cost a horizontal vector reduction via REDUC_FN so
- cost log2 vector operations plus shuffles.  */
+ cost log2 vector operations plus shuffles and one extraction.  */
   unsigned steps = floor_log2 (vect_nunits_for_cost (vectype));
   record_stmt_cost (cost_vec, steps, vector_stmt, instance->root_stmts[0],
vectype, 0, vect_body);
   record_stmt_cost (cost_vec, steps, vec_perm, instance->root_stmts[0],
vectype, 0, vect_body);
+  record_stmt_cost (cost_vec, 1, vec_to_scalar, instance->root_stmts[0],
+   vectype, 0, vect_body);
   return true;
 }


Re: [PATCH] rs6000: Make some BIFs vectorized on P10

2021-08-12 Thread Kewen.Lin via Gcc-patches
on 2021/8/12 11:51 PM, Segher Boessenkool wrote:
> On Thu, Aug 12, 2021 at 10:10:10AM +0800, Kewen.Lin wrote:
>>> +  enum rs6000_builtins vname = RS6000_BUILTIN_COUNT;
>>>
>>> Using this as a flag value looks unnecessary.  Is this just being done to 
>>> silence a warning?
>>
>> Good question!  I didn't notice there is a warning or not, just get used to 
>> initializing variable
>> with one suitable value if possible.  If you don't mind, may I still keep 
>> it?  Since if some
>> future codes use vname in a path where it's not assigned, one explicitly 
>> wrong enum (bif) seems
>> better than a random one.  Or will this mentioned possibility definitely 
>> never happen since the
>> current uninitialized variables detection and warning scheme is robust and 
>> should not worry about
>> that completely?
> 
> It is a bad idea to initialise things unnecessary: it hinders many
> optimisations, but much more importantly, it silences warnings without
> fixing the problem.
> 

OK, I've made it uninitialized in v2. :-)  I believe the context here is
simple and the uninit-ed var detector can easily catch and warn about the
bad thing in the future.

Sorry for chasing dead ends, but I don't follow how it can hinder
optimizations here; IIUC it would be optimized away as a dead store?  As
for the warning, although there is no warning, I'd expect it to cause an
ICE since the init-ed bif name isn't reasonable for code generation.
Wouldn't that be better than a warning?  Sometimes we don't have a proper
value for initialization, and I agree it's better to just leave it be,
but IMHO that isn't the case here.  :)
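To make the pattern under discussion concrete, here is a minimal stand-alone sketch (with made-up enum names, not the real rs6000 builtin codes) of the early-return shape the v2 patch uses: the mapping variable is deliberately left uninitialized, so -Wmaybe-uninitialized can flag any future path that forgets to assign it, whereas a dummy initial value would silence that warning without fixing the bug.

```c
#include <assert.h>

/* Stand-in enum; not the real rs6000_builtins values.  */
enum demo_builtin { DEMO_DIVWE, DEMO_DIVWEU, DEMO_UNRELATED,
                    DEMO_VEC_DIVES, DEMO_VEC_DIVEU, DEMO_NONE };

enum demo_builtin
demo_map_scalar_to_vector (enum demo_builtin fn)
{
  enum demo_builtin name;  /* deliberately NOT initialized */
  switch (fn)
    {
    case DEMO_DIVWE:
      name = DEMO_VEC_DIVES;
      break;
    case DEMO_DIVWEU:
      name = DEMO_VEC_DIVEU;
      break;
    default:
      /* Early return: no path below reaches `name` unassigned.  */
      return DEMO_NONE;
    }
  return name;
}
```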


BR,
Kewen

>>> +  if (vname != RS6000_BUILTIN_COUNT
>>>
>>> Check is not necessary, as you will have returned by now in that case.
>>
>> Thanks for catching, I put break for "default" initially, didn't noticed the 
>> following condition
>> need an adjustment after updating it to early return.  Will fix it.
> 
> Thanks :-)
> 
> 
> Segher
> 


Re: [PATCH] rs6000: Make some BIFs vectorized on P10

2021-08-12 Thread Kewen.Lin via Gcc-patches
Hi Segher,

Thanks for the review!

on 2021/8/12 11:10 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Aug 11, 2021 at 02:56:11PM +0800, Kewen.Lin wrote:
>>  * config/rs6000/rs6000.c (rs6000_builtin_md_vectorized_function): Add
>>  support for some built-in functions vectorized on Power10.
> 
> Say which, not "some" please?
> 

Done.

>> +  machine_mode in_vmode = TYPE_MODE (type_in);
>> +  machine_mode out_vmode = TYPE_MODE (type_out);
>> +
>> +  /* Power10 supported vectorized built-in functions.  */
>> +  if (TARGET_POWER10
>> +  && in_vmode == out_vmode
>> +  && VECTOR_UNIT_ALTIVEC_OR_VSX_P (in_vmode))
>> +{
>> +  machine_mode exp_mode = DImode;
>> +  machine_mode exp_vmode = V2DImode;
>> +  enum rs6000_builtins vname = RS6000_BUILTIN_COUNT;
> 
> "name"?  This should be "bif" or similar?
> 

Updated with name.

>> +  switch (fn)
>> +{
>> +case MISC_BUILTIN_DIVWE:
>> +case MISC_BUILTIN_DIVWEU:
>> +  exp_mode = SImode;
>> +  exp_vmode = V4SImode;
>> +  if (fn == MISC_BUILTIN_DIVWE)
>> +vname = P10V_BUILTIN_DIVES_V4SI;
>> +  else
>> +vname = P10V_BUILTIN_DIVEU_V4SI;
>> +  break;
>> +case MISC_BUILTIN_DIVDE:
>> +case MISC_BUILTIN_DIVDEU:
>> +  if (fn == MISC_BUILTIN_DIVDE)
>> +vname = P10V_BUILTIN_DIVES_V2DI;
>> +  else
>> +vname = P10V_BUILTIN_DIVEU_V2DI;
>> +  break;
> 
> All of the above should not be builtin functions really, they are all
> simple arithmetic :-(  They should not be UNSPECs either, on RTL level.
> They can and should be optimised in real code as well.  Oh well.
> 
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/dive-vectorize-2.c
>> @@ -0,0 +1,12 @@
>> +/* { dg-do compile } */
>> +/* { dg-require-effective-target lp64 } */
> 
> Please add a comment what this is needed for?  "We scan for dive*d" is
> enough, but without anything, it takes time to figure this out.
> 

Done, same for below requests on lp64 commentary.

>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/dive-vectorize-run-2.c
>> @@ -0,0 +1,53 @@
>> +/* { dg-do run } */
>> +/* { dg-require-effective-target lp64 } */
> 
> Same here.  I suppose this uses builtins that do not exist on 32-bit?
> 

Yeah, those bifs which are guarded with lp64 in their test cases are only
supported in 64-bit environments.

>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/p10-bifs-vectorize-run-1.c
>> @@ -0,0 +1,45 @@
>> +/* { dg-do run } */
>> +/* { dg-require-effective-target lp64 } */
> 
> And another.
> 
>> +#define CHECK(name) 
>>   \
>> +  __attribute__ ((optimize (1))) void check_##name ()   
>>   \
> 
> What is the attribute for, btw?  It seems fragile, but perhaps I do not
> understand the intention.
> 
> 

It's to stop the compiler from optimizing the check functions with
vectorization, since the test point is to compare the results between the
scalar and vectorized versions.
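A hedged sketch of that test idiom (the names below are illustrative, not copied from the actual testcase): the checking loop is forced to -O1 via the optimize attribute so that it is not vectorized itself, while the code under test is compiled with the vectorizing dg-options, and the two result arrays are then compared.

```c
#include <assert.h>
#include <stdlib.h>

#define N 64
static unsigned int scalar_res[N], vector_res[N];

/* Built at -O1 regardless of the file-level options: a plain scalar
   comparison loop that aborts on any mismatch, mirroring the CHECK
   macro described above.  */
__attribute__ ((optimize (1))) static void
check_results (void)
{
  for (int i = 0; i < N; i++)
    if (scalar_res[i] != vector_res[i])
      abort ();
}
```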

> Okay for trunk with whose lp64 things improved.  Thanks!
> 

Thanks, v2 has been attached by addressing Bill's and your comments.  :)


BR,
Kewen
-
gcc/ChangeLog:

* config/rs6000/rs6000.c (rs6000_builtin_md_vectorized_function): Add
support for built-in functions MISC_BUILTIN_DIVWE, MISC_BUILTIN_DIVWEU,
MISC_BUILTIN_DIVDE, MISC_BUILTIN_DIVDEU, P10_BUILTIN_CFUGED,
P10_BUILTIN_CNTLZDM, P10_BUILTIN_CNTTZDM, P10_BUILTIN_PDEPD and
P10_BUILTIN_PEXTD on Power10.
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 279f00cc648..a8b3175ed50 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5785,6 +5785,59 @@ rs6000_builtin_md_vectorized_function (tree fndecl, tree 
type_out,
 default:
   break;
 }
+
+  machine_mode in_vmode = TYPE_MODE (type_in);
+  machine_mode out_vmode = TYPE_MODE (type_out);
+
+  /* Power10 supported vectorized built-in functions.  */
+  if (TARGET_POWER10
+  && in_vmode == out_vmode
+  && VECTOR_UNIT_ALTIVEC_OR_VSX_P (in_vmode))
+{
+  machine_mode exp_mode = DImode;
+  machine_mode exp_vmode = V2DImode;
+  enum rs6000_builtins name;
+  switch (fn)
+   {
+   case MISC_BUILTIN_DIVWE:
+   case MISC_BUILTIN_DIVWEU:
+ exp_mode = SImode;
+ exp_vmode = V4SImode;
+ if (fn == MISC_BUILTIN_DIVWE)
+   name = P10V_BUILTIN_DIVES_V4SI;
+ else
+   name = P10V_BUILTIN_DIVEU_V4SI;
+ break;
+   ca

Re: [PATCH] rs6000: Add missing unsigned info for some P10 bifs

2021-08-12 Thread Kewen.Lin via Gcc-patches
Hi Bill,

on 2021/8/12 12:24 AM, Bill Schmidt wrote:
> Hi Kewen,
> 
> On 8/11/21 12:44 AM, Kewen.Lin wrote:
>> Hi,
>>
>> This patch is to make prototypes of some Power10 built-in
>> functions consistent with what's in the documentation, as
>> well as the vector version.  Otherwise, useless conversions
>> can be generated in gimple IR, and the vectorized versions
>> will have inconsistent types.
>>
>> Bootstrapped & regtested on powerpc64le-linux-gnu P9 and
>> powerpc64-linux-gnu P8.
>>
>> Is it ok for trunk?
> 
> LGTM.  Maintainers, this is necessary in the short term for the old builtins 
> support, but this fragile thing that people always forget will go away with 
> the new support.  What Kewen is proposing here is correct for now.
> 

Thanks for your review, and good to know we won't have this kind of issue
with your new support, nice!!

FWIW, for now the bif vectorization still requires this type consistency
to make the type check happy.


BR,
Kewen

> Thanks,
> Bill
> 
>>
>> BR,
>> Kewen
>> -
>> gcc/ChangeLog:
>>
>> * config/rs6000/rs6000-call.c (builtin_function_type): Add unsigned
>> signedness for some Power10 bifs.


Re: [PATCH] rs6000: Make some BIFs vectorized on P10

2021-08-11 Thread Kewen.Lin via Gcc-patches
Hi Bill,

Thanks for your prompt review!

on 2021/8/12 12:34 AM, Bill Schmidt wrote:
> Hi Kewen,
> 
> FWIW, it's easier on reviewers if you include the patch inline instead of as 
> an attachment.
> 
> On 8/11/21 1:56 AM, Kewen.Lin wrote:
>> Hi,
>>
>> This patch is to add the support to make vectorizer able to
>> vectorize scalar version of some built-in functions with its
>> corresponding vector version with Power10 support.
>>
>> Bootstrapped & regtested on powerpc64le-linux-gnu {P9,P10}
>> and powerpc64-linux-gnu P8.
>>
>> Is it ok for trunk?
>>
>> BR,
>> Kewen
>> -
>> gcc/ChangeLog:
>>
>>  * config/rs6000/rs6000.c (rs6000_builtin_md_vectorized_function): Add
>>  support for some built-in functions vectorized on Power10.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.target/powerpc/dive-vectorize-1.c: New test.
>>  * gcc.target/powerpc/dive-vectorize-1.h: New test.
>>  * gcc.target/powerpc/dive-vectorize-2.c: New test.
>>  * gcc.target/powerpc/dive-vectorize-2.h: New test.
>>  * gcc.target/powerpc/dive-vectorize-run-1.c: New test.
>>  * gcc.target/powerpc/dive-vectorize-run-2.c: New test.
>>  * gcc.target/powerpc/p10-bifs-vectorize-1.c: New test.
>>  * gcc.target/powerpc/p10-bifs-vectorize-1.h: New test.
>>  * gcc.target/powerpc/p10-bifs-vectorize-run-1.c: New test.
> 
> ---
>  gcc/config/rs6000/rs6000.c| 55 +++
>  .../gcc.target/powerpc/dive-vectorize-1.c | 11 
>  .../gcc.target/powerpc/dive-vectorize-1.h | 22 
>  .../gcc.target/powerpc/dive-vectorize-2.c | 12 
>  .../gcc.target/powerpc/dive-vectorize-2.h | 22 
>  .../gcc.target/powerpc/dive-vectorize-run-1.c | 52 ++
>  .../gcc.target/powerpc/dive-vectorize-run-2.c | 53 ++
>  .../gcc.target/powerpc/p10-bifs-vectorize-1.c | 15 +
>  .../gcc.target/powerpc/p10-bifs-vectorize-1.h | 40 ++
>  .../powerpc/p10-bifs-vectorize-run-1.c| 45 +++
>  10 files changed, 327 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/dive-vectorize-1.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/dive-vectorize-1.h
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/dive-vectorize-2.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/dive-vectorize-2.h
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/dive-vectorize-run-1.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/dive-vectorize-run-2.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/p10-bifs-vectorize-1.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/p10-bifs-vectorize-1.h
>  create mode 100644 
> gcc/testsuite/gcc.target/powerpc/p10-bifs-vectorize-run-1.c
> 
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 279f00cc648..3eac1d05101 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -5785,6 +5785,61 @@ rs6000_builtin_md_vectorized_function (tree fndecl, 
> tree type_out,
>  default:
>break;
>  }
> +
> +  machine_mode in_vmode = TYPE_MODE (type_in);
> +  machine_mode out_vmode = TYPE_MODE (type_out);
> +
> +  /* Power10 supported vectorized built-in functions.  */
> +  if (TARGET_POWER10
> +  && in_vmode == out_vmode
> +  && VECTOR_UNIT_ALTIVEC_OR_VSX_P (in_vmode))
> +{
> +  machine_mode exp_mode = DImode;
> +  machine_mode exp_vmode = V2DImode;
> +  enum rs6000_builtins vname = RS6000_BUILTIN_COUNT;
> 
> Using this as a flag value looks unnecessary.  Is this just being done to 
> silence a warning?
> 

Good question!  I didn't notice whether there was a warning or not, I just
got used to initializing variables with a suitable value when possible.
If you don't mind, may I still keep it?  If some future code uses vname on
a path where it's not assigned, one explicitly wrong enum (bif) seems
better than a random one.  Or will this possibility definitely never
happen because the current uninitialized-variable detection and warning
scheme is robust, so we should not worry about that at all?

> +  switch (fn)
> + {
> + case MISC_BUILTIN_DIVWE:
> + case MISC_BUILTIN_DIVWEU:
> +   exp_mode = SImode;
> +   exp_vmode = V4SImode;
> +   if (fn == MISC_BUILTIN_DIVWE)
> + vname = P10V_BUILTIN_DIVES_V4SI;
> +   else
> + vname = P10V_BUILTIN_DIVEU_V4SI;
> +   break;
> + case MISC_BUILTIN_DIVDE:
> + case MISC_BUILTIN_DIVDEU:
> +   if (fn == MISC_BUILTIN_DIVDE)
> + vname = P10V_BUILTIN_DIVES_V2DI;
> +

[PATCH] rs6000: Make some BIFs vectorized on P10

2021-08-11 Thread Kewen.Lin via Gcc-patches
Hi,

This patch adds support to make the vectorizer able to vectorize the
scalar versions of some built-in functions with their corresponding
vector versions on Power10.

Bootstrapped & regtested on powerpc64le-linux-gnu {P9,P10}
and powerpc64-linux-gnu P8.

Is it ok for trunk?

BR,
Kewen
-
gcc/ChangeLog:

* config/rs6000/rs6000.c (rs6000_builtin_md_vectorized_function): Add
support for some built-in functions vectorized on Power10.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/dive-vectorize-1.c: New test.
* gcc.target/powerpc/dive-vectorize-1.h: New test.
* gcc.target/powerpc/dive-vectorize-2.c: New test.
* gcc.target/powerpc/dive-vectorize-2.h: New test.
* gcc.target/powerpc/dive-vectorize-run-1.c: New test.
* gcc.target/powerpc/dive-vectorize-run-2.c: New test.
* gcc.target/powerpc/p10-bifs-vectorize-1.c: New test.
* gcc.target/powerpc/p10-bifs-vectorize-1.h: New test.
* gcc.target/powerpc/p10-bifs-vectorize-run-1.c: New test.
---
 gcc/config/rs6000/rs6000.c| 55 +++
 .../gcc.target/powerpc/dive-vectorize-1.c | 11 
 .../gcc.target/powerpc/dive-vectorize-1.h | 22 
 .../gcc.target/powerpc/dive-vectorize-2.c | 12 
 .../gcc.target/powerpc/dive-vectorize-2.h | 22 
 .../gcc.target/powerpc/dive-vectorize-run-1.c | 52 ++
 .../gcc.target/powerpc/dive-vectorize-run-2.c | 53 ++
 .../gcc.target/powerpc/p10-bifs-vectorize-1.c | 15 +
 .../gcc.target/powerpc/p10-bifs-vectorize-1.h | 40 ++
 .../powerpc/p10-bifs-vectorize-run-1.c| 45 +++
 10 files changed, 327 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/dive-vectorize-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/dive-vectorize-1.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/dive-vectorize-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/dive-vectorize-2.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/dive-vectorize-run-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/dive-vectorize-run-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p10-bifs-vectorize-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p10-bifs-vectorize-1.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p10-bifs-vectorize-run-1.c

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 279f00cc648..3eac1d05101 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5785,6 +5785,61 @@ rs6000_builtin_md_vectorized_function (tree fndecl, tree 
type_out,
 default:
   break;
 }
+
+  machine_mode in_vmode = TYPE_MODE (type_in);
+  machine_mode out_vmode = TYPE_MODE (type_out);
+
+  /* Power10 supported vectorized built-in functions.  */
+  if (TARGET_POWER10
+  && in_vmode == out_vmode
+  && VECTOR_UNIT_ALTIVEC_OR_VSX_P (in_vmode))
+{
+  machine_mode exp_mode = DImode;
+  machine_mode exp_vmode = V2DImode;
+  enum rs6000_builtins vname = RS6000_BUILTIN_COUNT;
+  switch (fn)
+   {
+   case MISC_BUILTIN_DIVWE:
+   case MISC_BUILTIN_DIVWEU:
+ exp_mode = SImode;
+ exp_vmode = V4SImode;
+ if (fn == MISC_BUILTIN_DIVWE)
+   vname = P10V_BUILTIN_DIVES_V4SI;
+ else
+   vname = P10V_BUILTIN_DIVEU_V4SI;
+ break;
+   case MISC_BUILTIN_DIVDE:
+   case MISC_BUILTIN_DIVDEU:
+ if (fn == MISC_BUILTIN_DIVDE)
+   vname = P10V_BUILTIN_DIVES_V2DI;
+ else
+   vname = P10V_BUILTIN_DIVEU_V2DI;
+ break;
+   case P10_BUILTIN_CFUGED:
+ vname = P10V_BUILTIN_VCFUGED;
+ break;
+   case P10_BUILTIN_CNTLZDM:
+ vname = P10V_BUILTIN_VCLZDM;
+ break;
+   case P10_BUILTIN_CNTTZDM:
+ vname = P10V_BUILTIN_VCTZDM;
+ break;
+   case P10_BUILTIN_PDEPD:
+ vname = P10V_BUILTIN_VPDEPD;
+ break;
+   case P10_BUILTIN_PEXTD:
+ vname = P10V_BUILTIN_VPEXTD;
+ break;
+   default:
+ return NULL_TREE;
+   }
+
+  if (vname != RS6000_BUILTIN_COUNT
+ && in_mode == exp_mode
+ && in_vmode == exp_vmode)
+   return rs6000_builtin_decls[vname];
+}
+
   return NULL_TREE;
 }
 
diff --git a/gcc/testsuite/gcc.target/powerpc/dive-vectorize-1.c 
b/gcc/testsuite/gcc.target/powerpc/dive-vectorize-1.c
new file mode 100644
index 000..84f1b0a88f2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/dive-vectorize-1.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target power10_ok } */
+/* { dg-options "-mdejagnu-cpu=power10 -O2 -ftree-vectorize 
-fno-vect-cost-model -fno-unroll-loops -fdump-tree-vect-details" } */
+
+/* Test if signed/unsigned int extended divisions get vectorized.  */
+
+#include "dive-vectorize-1.h"
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" } } */

[PATCH] rs6000: Add missing unsigned info for some P10 bifs

2021-08-10 Thread Kewen.Lin via Gcc-patches
Hi,

This patch makes the prototypes of some Power10 built-in functions
consistent with what's in the documentation, as well as with the vector
versions.  Otherwise, useless conversions can be generated in GIMPLE IR,
and the vectorized versions will have inconsistent types.
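A hedged, plain-C illustration of that mismatch (no real rs6000 builtins are used; the function names are invented): when a scalar operation is declared with signed types while it conceptually works on unsigned values, every unsigned use gets wrapped in value-preserving but useless casts in GIMPLE, and its element type disagrees with the unsigned vector form.

```c
#include <assert.h>

/* Mis-declared scalar prototype: signed where unsigned is meant.  */
static long long misdeclared_op (long long a, long long b) { return a & b; }

/* Consistent prototype, matching the documented unsigned semantics.  */
static unsigned long long
consistent_op (unsigned long long a, unsigned long long b) { return a & b; }

static unsigned long long
use_misdeclared (unsigned long long x, unsigned long long y)
{
  /* With the mis-declared prototype, GIMPLE for this call carries two
     casts going in and one cast coming out, all value-preserving.  */
  return (unsigned long long) misdeclared_op ((long long) x, (long long) y);
}
```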

Bootstrapped & regtested on powerpc64le-linux-gnu P9 and
powerpc64-linux-gnu P8.

Is it ok for trunk?

BR,
Kewen
-
gcc/ChangeLog:

* config/rs6000/rs6000-call.c (builtin_function_type): Add unsigned
signedness for some Power10 bifs.
---
 gcc/config/rs6000/rs6000-call.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/gcc/config/rs6000/rs6000-call.c b/gcc/config/rs6000/rs6000-call.c
index 904e104c058..b92928c891a 100644
--- a/gcc/config/rs6000/rs6000-call.c
+++ b/gcc/config/rs6000/rs6000-call.c
@@ -14674,6 +14674,11 @@ builtin_function_type (machine_mode mode_ret, 
machine_mode mode_arg0,
 case P8V_BUILTIN_ORC_V4SI_UNS:
 case P8V_BUILTIN_ORC_V2DI_UNS:
 case P8V_BUILTIN_ORC_V1TI_UNS:
+case P10_BUILTIN_CFUGED:
+case P10_BUILTIN_CNTLZDM:
+case P10_BUILTIN_CNTTZDM:
+case P10_BUILTIN_PDEPD:
+case P10_BUILTIN_PEXTD:
 case P10V_BUILTIN_VCFUGED:
 case P10V_BUILTIN_VCLZDM:
 case P10V_BUILTIN_VCTZDM:
-- 
2.17.1



[PATCH v2] rs6000: Add vec_unpacku_{hi,lo}_v4si

2021-08-08 Thread Kewen.Lin via Gcc-patches
Hi Bill,

Thanks for the comments!

on 2021/8/6 9:10 PM, Bill Schmidt wrote:
> Hi Kewen,
> 
> On 8/4/21 9:06 PM, Kewen.Lin wrote:
>> Hi,
>>
>> The existing vec_unpacku_{hi,lo} supports emulated unsigned
>> unpacking for short and char but misses the support for int.
>> This patch adds the support for vec_unpacku_{hi,lo}_v4si.
>>
>> Meanwhile, the current implementation uses vector permutation
>> way, which requires one extra customized constant vector as
>> the permutation control vector.  It's better to use vector
>> merge high/low with zero constant vector, to save the space
>> in constant area as well as the cost to initialize pcv in
>> prologue.  This patch updates it with vector merging and
>> simplify it with iterators.
>>
>> Bootstrapped & regtested on powerpc64le-linux-gnu P9 and
>> powerpc64-linux-gnu P8.
>>
>> btw, the loop in unpack-vectorize-2.c doesn't get vectorized
>> without this patch, unpack-vectorize-[13]* is to verify
>> the vector merging and simplification works expectedly.
>>
>> Is it ok for trunk?
>>
>> BR,
>> Kewen
>> -
...
>> diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
>> index d70c17e6bc2..0e8b66cd6a5 100644
>> --- a/gcc/config/rs6000/altivec.md
>> +++ b/gcc/config/rs6000/altivec.md
>> @@ -134,10 +134,8 @@ (define_c_enum "unspec"
>>     UNSPEC_VMULWLUH
>>     UNSPEC_VMULWHSH
>>     UNSPEC_VMULWLSH
>> -   UNSPEC_VUPKHUB
>> -   UNSPEC_VUPKHUH
>> -   UNSPEC_VUPKLUB
>> -   UNSPEC_VUPKLUH
>> +   UNSPEC_VUPKHUBHW
>> +   UNSPEC_VUPKLUBHW
> 
> 
> Up to you, but... maybe just UNSPEC_VUPKHU and UNSPEC_VUPKLU, in case we 
> extend this later to other types.  Fine either way.
> 

Good point!  Fixed.

>>     UNSPEC_VPERMSI
>>     UNSPEC_VPERMHI
>>     UNSPEC_INTERHI
>> @@ -3885,143 +3883,45 @@ (define_insn "xxeval"
>>     [(set_attr "type" "vecsimple")
>>  (set_attr "prefixed" "yes")])
>>
...
>> diff --git a/gcc/testsuite/gcc.target/powerpc/unpack-vectorize-1.c 
>> b/gcc/testsuite/gcc.target/powerpc/unpack-vectorize-1.c
>> new file mode 100644
>> index 000..2621d753baa
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/unpack-vectorize-1.c
>> @@ -0,0 +1,18 @@
>> +/* { dg-do compile } */
>> +/* { dg-require-effective-target powerpc_altivec_ok } */
> 
> 
> I guess powerpc_altivec_ok is fine.  I was initially concerned since 
> unpack-vectorize.h mentions vector long long, but the types aren't actually 
> used here.  OK.
> 

Yeah, I think it's fine since unpack-vectorize.h only typedefs long long
and doesn't even have the type vector long long.

>> +/* { dg-options "-maltivec -O2 -ftree-vectorize -fno-vect-cost-model 
>> -fdump-tree-vect-details" } */
>> +
>> +/* Test if unpack vectorization succeeds for type signed/unsigned
>> +   short and char.  */
>> +
>> +#include "unpack-vectorize-1.h"
>> +
>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 4 "vect" } } */
>> +/* { dg-final { scan-assembler {\mvupkhsb\M} } } */
>> +/* { dg-final { scan-assembler {\mvupklsb\M} } } */
>> +/* { dg-final { scan-assembler {\mvupkhsh\M} } } */
>> +/* { dg-final { scan-assembler {\mvupklsh\M} } } */
>> +/* { dg-final { scan-assembler {\mvmrghb\M} } } */
>> +/* { dg-final { scan-assembler {\mvmrglb\M} } } */
>> +/* { dg-final { scan-assembler {\mvmrghh\M} } } */
>> +/* { dg-final { scan-assembler {\mvmrglh\M} } } */
> 
> 
> Suggest that you consider scan-assembler-times 1 to make the tests more 
> robust, here and for other tests.
> 

Updated, thanks!  I was worried that possible future unrolling tweaks
could make the hardcoded times fragile, and thought it might be trivial to
check the times.  "-fno-unroll-loops" has been added to disable unrolling
explicitly as well.

Re-tested on BE and LE, the test results look fine.

BR,
Kewen
-
gcc/ChangeLog:

* config/rs6000/altivec.md (vec_unpacku_hi_v16qi): Remove.
(vec_unpacku_hi_v8hi): Likewise.
(vec_unpacku_lo_v16qi): Likewise.
(vec_unpacku_lo_v8hi): Likewise.
(vec_unpacku_hi_): New define_expand.
(vec_unpacku_lo_): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/unpack-vectorize-1.c: New test.
* gcc.target/powerpc/unpack-vectorize-1.h: New test.
* gcc.target/powerpc/unpack-vectorize-2.c: New test.
* gcc.target/powerpc/unpack-vectorize-2.h: New test.
* gcc.target/powerpc/unpack-vectorize-3.c: New test.
   

Re: [PATCH v3] Make loops_list support an optional loop_p root

2021-08-05 Thread Kewen.Lin via Gcc-patches
on 2021/8/4 8:04 PM, Richard Biener wrote:
> On Wed, Aug 4, 2021 at 12:47 PM Kewen.Lin  wrote:
>>
>> on 2021/8/4 6:01 PM, Richard Biener wrote:
>>> On Wed, Aug 4, 2021 at 4:36 AM Kewen.Lin  wrote:
>>>>
>>>> on 2021/8/3 8:08 PM, Richard Biener wrote:
>>>>> On Fri, Jul 30, 2021 at 7:20 AM Kewen.Lin  wrote:
>>>>>>
>>>>>> on 2021/7/29 4:01 PM, Richard Biener wrote:
>>>>>>> On Fri, Jul 23, 2021 at 10:41 AM Kewen.Lin  wrote:
>>>>>>>>
>>>>>>>> on 2021/7/22 8:56 PM, Richard Biener wrote:
>>>>>>>>> On Tue, Jul 20, 2021 at 4:37
>>>>>>>>> PM Kewen.Lin  wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> This v2 has addressed some review comments/suggestions:
>>>>>>>>>>
>>>>>>>>>>   - Use "!=" instead of "<" in function operator!= (const Iter )
>>>>>>>>>>   - Add new CTOR loops_list (struct loops *loops, unsigned flags)
>>>>>>>>>> to support loop hierarchy tree rather than just a function,
>>>>>>>>>> and adjust to use loops* accordingly.
>>>>>>>>>
>>>>>>>>> I actually meant struct loop *, not struct loops * ;)  At the point
>>>>>>>>> we pondered to make loop invariant motion work on single
>>>>>>>>> loop nests we gave up not only but also because it iterates
>>>>>>>>> over the loop nest but all the iterators only ever can process
>>>>>>>>> all loops, not say, all loops inside a specific 'loop' (and
>>>>>>>>> including that 'loop' if LI_INCLUDE_ROOT).  So the
>>>>>>>>> CTOR would take the 'root' of the loop tree as argument.
>>>>>>>>>
>>>>>>>>> I see that doesn't trivially fit how loops_list works, at least
>>>>>>>>> not for LI_ONLY_INNERMOST.  But I guess FROM_INNERMOST
>>>>>>>>> could be adjusted to do ONLY_INNERMOST as well?
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for the clarification!  I just realized that the previous
>>>>>>>> version with struct loops* is problematic, all traversal is
>>>>>>>> still bounded with outer_loop == NULL.  I think what you expect
>>>>>>>> is to respect the given loop_p root boundary.  Since we just
>>>>>>>> record the loops' nums, I think we still need the function* fn?
>>>>>>>
>>>>>>> Would it simplify things if we recorded the actual loop *?
>>>>>>>
>>>>>>
>>>>>> I'm afraid it's unsafe to record the loop*.  I had the same
>>>>>> question why the loop iterator uses index rather than loop* when
>>>>>> I read this at the first time.  I guess the design of processing
>>>>>> loops allows its user to update or even delete the following
>>>>>> loops to be visited.  For example, when the user does some tricks
>>>>>> on one loop, then it duplicates the loop and its children to
>>>>>> somewhere and then removes the loop and its children, when
>>>>>> iterating onto its children later, the "index" way will check its
>>>>>> validity by get_loop at that point, but the "loop *" way will
>>>>>> have some recorded pointers to become dangling, can't do the
>>>>>> validity check on itself, seems to need a side linear search to
>>>>>> ensure the validity.
>>>>>>
>>>>>>> There's still the to_visit reserve which needs a bound on
>>>>>>> the number of loops for efficiency reasons.
>>>>>>>
>>>>>>
>>>>>> Yes, I still keep the fn in the updated version.
>>>>>>
>>>>>>>> So I add one optional argument loop_p root and update the
>>>>>>>> visiting codes accordingly.  Before this change, the previous
>>>>>>>> visiting uses the outer_loop == NULL as the termination condition,
>>>>>>>> it perfectly includes the root itself, but with this given root,

[PATCH] rs6000: Add vec_unpacku_{hi,lo}_v4si

2021-08-04 Thread Kewen.Lin via Gcc-patches
Hi,

The existing vec_unpacku_{hi,lo} supports emulated unsigned
unpacking for short and char but misses the support for int.
This patch adds the support for vec_unpacku_{hi,lo}_v4si.

Meanwhile, the current implementation uses the vector permutation way,
which requires one extra customized constant vector as the permutation
control vector.  It's better to use vector merge high/low with a zero
constant vector, saving space in the constant area as well as the cost of
initializing the pcv in the prologue.  This patch updates it to use vector
merging and simplifies it with iterators.

Bootstrapped & regtested on powerpc64le-linux-gnu P9 and
powerpc64-linux-gnu P8.

btw, the loop in unpack-vectorize-2.c doesn't get vectorized
without this patch; unpack-vectorize-[13]* is to verify that
the vector merging and simplification work as expected.

Is it ok for trunk?

BR,
Kewen
-
gcc/ChangeLog:

* config/rs6000/altivec.md (vec_unpacku_hi_v16qi): Remove.
(vec_unpacku_hi_v8hi): Likewise.
(vec_unpacku_lo_v16qi): Likewise.
(vec_unpacku_lo_v8hi): Likewise.
(vec_unpacku_hi_): New define_expand.
(vec_unpacku_lo_): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/unpack-vectorize-1.c: New test.
* gcc.target/powerpc/unpack-vectorize-1.h: New test.
* gcc.target/powerpc/unpack-vectorize-2.c: New test.
* gcc.target/powerpc/unpack-vectorize-2.h: New test.
* gcc.target/powerpc/unpack-vectorize-3.c: New test.
* gcc.target/powerpc/unpack-vectorize-3.h: New test.
* gcc.target/powerpc/unpack-vectorize-run-1.c: New test.
* gcc.target/powerpc/unpack-vectorize-run-2.c: New test.
* gcc.target/powerpc/unpack-vectorize-run-3.c: New test.
* gcc.target/powerpc/unpack-vectorize.h: New test.
---
 gcc/config/rs6000/altivec.md  | 158 --
 .../gcc.target/powerpc/unpack-vectorize-1.c   |  18 ++
 .../gcc.target/powerpc/unpack-vectorize-1.h   |  14 ++
 .../gcc.target/powerpc/unpack-vectorize-2.c   |  12 ++
 .../gcc.target/powerpc/unpack-vectorize-2.h   |   7 +
 .../gcc.target/powerpc/unpack-vectorize-3.c   |  11 ++
 .../gcc.target/powerpc/unpack-vectorize-3.h   |   7 +
 .../powerpc/unpack-vectorize-run-1.c  |  24 +++
 .../powerpc/unpack-vectorize-run-2.c  |  16 ++
 .../powerpc/unpack-vectorize-run-3.c  |  16 ++
 .../gcc.target/powerpc/unpack-vectorize.h |  42 +
 11 files changed, 196 insertions(+), 129 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-1.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-2.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-3.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-3.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-run-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-run-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-run-3.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize.h

diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
index d70c17e6bc2..0e8b66cd6a5 100644
--- a/gcc/config/rs6000/altivec.md
+++ b/gcc/config/rs6000/altivec.md
@@ -134,10 +134,8 @@ (define_c_enum "unspec"
UNSPEC_VMULWLUH
UNSPEC_VMULWHSH
UNSPEC_VMULWLSH
-   UNSPEC_VUPKHUB
-   UNSPEC_VUPKHUH
-   UNSPEC_VUPKLUB
-   UNSPEC_VUPKLUH
+   UNSPEC_VUPKHUBHW
+   UNSPEC_VUPKLUBHW
UNSPEC_VPERMSI
UNSPEC_VPERMHI
UNSPEC_INTERHI
@@ -3885,143 +3883,45 @@ (define_insn "xxeval"
[(set_attr "type" "vecsimple")
 (set_attr "prefixed" "yes")])
 
-(define_expand "vec_unpacku_hi_v16qi"
-  [(set (match_operand:V8HI 0 "register_operand" "=v")
-(unspec:V8HI [(match_operand:V16QI 1 "register_operand" "v")]
- UNSPEC_VUPKHUB))]
-  "TARGET_ALTIVEC"  
-{  
-  rtx vzero = gen_reg_rtx (V8HImode);
-  rtx mask = gen_reg_rtx (V16QImode);
-  rtvec v = rtvec_alloc (16);
-  bool be = BYTES_BIG_ENDIAN;
-   
-  emit_insn (gen_altivec_vspltish (vzero, const0_rtx));
-   
-  RTVEC_ELT (v,  0) = gen_rtx_CONST_INT (QImode, be ? 16 :  7);
-  RTVEC_ELT (v,  1) = gen_rtx_CONST_INT (QImode, be ?  0 : 16);
-  RTVEC_ELT (v,  2) = gen_rtx_CONST_INT (QImode, be ? 16 :  6);
-  RTVEC_ELT (v,  3) = gen_rtx_CONST_INT (QImode, be ?  1 : 16);
-  RTVEC_ELT (v,  4) = gen_rtx_CONST_INT (QImode, be ? 16 :  5);
-  RTVEC_ELT (v,  5) = gen_rtx_CONST_INT (QImode, be ?  2 : 16);
-  RTVEC_ELT (v,  6) = gen_rtx_CONST_INT (QImode, be ? 16 :  4);
-  RTVEC_ELT (v,  7) = gen_rtx_CONST_INT (QImode, be ?  3 : 16);
-  RTVEC_ELT (v,  8) = gen_rtx_CONST_INT (QImode, be ? 16 :  3);
-  RTVEC_ELT (v,  9) = gen_rtx_CONST_INT (QImode, be ?  4 : 16);
-  RTVEC_ELT (v, 10) = gen_rtx_CONST_INT (QImode, be ? 16 :  2);
-  RTVEC_ELT 

Re: [PATCH v3] Make loops_list support an optional loop_p root

2021-08-04 Thread Kewen.Lin via Gcc-patches
on 2021/8/4 6:01 PM, Richard Biener wrote:
> On Wed, Aug 4, 2021 at 4:36 AM Kewen.Lin  wrote:
>>
>> on 2021/8/3 8:08 PM, Richard Biener wrote:
>>> On Fri, Jul 30, 2021 at 7:20 AM Kewen.Lin  wrote:
>>>>
>>>> on 2021/7/29 4:01 PM, Richard Biener wrote:
>>>>> On Fri, Jul 23, 2021 at 10:41 AM Kewen.Lin  wrote:
>>>>>>
>>>>>> on 2021/7/22 8:56 PM, Richard Biener wrote:
>>>>>>> On Tue, Jul 20, 2021 at 4:37
>>>>>>> PM Kewen.Lin  wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> This v2 has addressed some review comments/suggestions:
>>>>>>>>
>>>>>>>>   - Use "!=" instead of "<" in function operator!= (const Iter )
>>>>>>>>   - Add new CTOR loops_list (struct loops *loops, unsigned flags)
>>>>>>>> to support loop hierarchy tree rather than just a function,
>>>>>>>> and adjust to use loops* accordingly.
>>>>>>>
>>>>>>> I actually meant struct loop *, not struct loops * ;)  At the point
>>>>>>> we pondered to make loop invariant motion work on single
>>>>>>> loop nests we gave up not only but also because it iterates
>>>>>>> over the loop nest but all the iterators only ever can process
>>>>>>> all loops, not say, all loops inside a specific 'loop' (and
>>>>>>> including that 'loop' if LI_INCLUDE_ROOT).  So the
>>>>>>> CTOR would take the 'root' of the loop tree as argument.
>>>>>>>
>>>>>>> I see that doesn't trivially fit how loops_list works, at least
>>>>>>> not for LI_ONLY_INNERMOST.  But I guess FROM_INNERMOST
>>>>>>> could be adjusted to do ONLY_INNERMOST as well?
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks for the clarification!  I just realized that the previous
>>>>>> version with struct loops* is problematic, all traversal is
>>>>>> still bounded with outer_loop == NULL.  I think what you expect
>>>>>> is to respect the given loop_p root boundary.  Since we just
>>>>>> record the loops' nums, I think we still need the function* fn?
>>>>>
>>>>> Would it simplify things if we recorded the actual loop *?
>>>>>
>>>>
>>>> I'm afraid it's unsafe to record the loop*.  I had the same
>>>> question why the loop iterator uses index rather than loop* when
>>>> I read this at the first time.  I guess the design of processing
>>>> loops allows its user to update or even delete the following
>>>> loops to be visited.  For example, when the user does some tricks
>>>> on one loop, then it duplicates the loop and its children to
>>>> somewhere and then removes the loop and its children, when
>>>> iterating onto its children later, the "index" way will check its
>>>> validity by get_loop at that point, but the "loop *" way will
>>>> have some recorded pointers to become dangling, can't do the
>>>> validity check on itself, seems to need a side linear search to
>>>> ensure the validity.
>>>>
>>>>> There's still the to_visit reserve which needs a bound on
>>>>> the number of loops for efficiency reasons.
>>>>>
>>>>
>>>> Yes, I still keep the fn in the updated version.
>>>>
>>>>>> So I add one optional argument loop_p root and update the
>>>>>> visiting codes accordingly.  Before this change, the previous
>>>>>> visiting uses the outer_loop == NULL as the termination condition,
>>>>>> it perfectly includes the root itself, but with this given root,
>>>> we have to use it as the termination condition to avoid iterating
>>>> onto its possibly existing next.
>>>>>>
>>>>>> For LI_ONLY_INNERMOST, I was thinking whether we can use the
>>>>>> code like:
>>>>>>
>>>> vec<loop_p, va_gc> *fn_loops = loops_for_fn (fn)->larray;
>>>> for (i = 0; vec_safe_iterate (fn_loops, i, &aloop); i++)
>>>>>> if (aloop != NULL
>>>>>> && aloop->inner == NULL
>>>>>>

[PATCH v3] Make loops_list support an optional loop_p root

2021-08-03 Thread Kewen.Lin via Gcc-patches
on 2021/8/3 8:08 PM, Richard Biener wrote:
> On Fri, Jul 30, 2021 at 7:20 AM Kewen.Lin  wrote:
>>
>> on 2021/7/29 4:01 PM, Richard Biener wrote:
>>> On Fri, Jul 23, 2021 at 10:41 AM Kewen.Lin  wrote:
>>>>
>>>> on 2021/7/22 8:56 PM, Richard Biener wrote:
>>>>> On Tue, Jul 20, 2021 at 4:37
>>>>> PM Kewen.Lin  wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This v2 has addressed some review comments/suggestions:
>>>>>>
>>>>>>   - Use "!=" instead of "<" in function operator!= (const Iter )
>>>>>>   - Add new CTOR loops_list (struct loops *loops, unsigned flags)
>>>>>> to support loop hierarchy tree rather than just a function,
>>>>>> and adjust to use loops* accordingly.
>>>>>
>>>>> I actually meant struct loop *, not struct loops * ;)  At the point
>>>>> we pondered to make loop invariant motion work on single
>>>>> loop nests we gave up not only but also because it iterates
>>>>> over the loop nest but all the iterators only ever can process
>>>>> all loops, not say, all loops inside a specific 'loop' (and
>>>>> including that 'loop' if LI_INCLUDE_ROOT).  So the
>>>>> CTOR would take the 'root' of the loop tree as argument.
>>>>>
>>>>> I see that doesn't trivially fit how loops_list works, at least
>>>>> not for LI_ONLY_INNERMOST.  But I guess FROM_INNERMOST
>>>>> could be adjusted to do ONLY_INNERMOST as well?
>>>>>
>>>>
>>>>
>>>> Thanks for the clarification!  I just realized that the previous
>>>> version with struct loops* is problematic, all traversal is
>>>> still bounded with outer_loop == NULL.  I think what you expect
>>>> is to respect the given loop_p root boundary.  Since we just
>>>> record the loops' nums, I think we still need the function* fn?
>>>
>>> Would it simplify things if we recorded the actual loop *?
>>>
>>
>> I'm afraid it's unsafe to record the loop*.  I had the same
>> question why the loop iterator uses index rather than loop* when
>> I read this at the first time.  I guess the design of processing
>> loops allows its user to update or even delete the following
>> loops to be visited.  For example, when the user does some tricks
>> on one loop, then it duplicates the loop and its children to
>> somewhere and then removes the loop and its children, when
>> iterating onto its children later, the "index" way will check its
>> validity by get_loop at that point, but the "loop *" way will
>> have some recorded pointers to become dangling, can't do the
>> validity check on itself, seems to need a side linear search to
>> ensure the validity.
>>
>>> There's still the to_visit reserve which needs a bound on
>>> the number of loops for efficiency reasons.
>>>
>>
>> Yes, I still keep the fn in the updated version.
>>
>>>> So I add one optional argument loop_p root and update the
>>>> visiting codes accordingly.  Before this change, the previous
>>>> visiting uses the outer_loop == NULL as the termination condition,
>>>> it perfectly includes the root itself, but with this given root,
>>>> we have to use it as the termination condition to avoid iterating
>>>> onto its possibly existing next.
>>>>
>>>> For LI_ONLY_INNERMOST, I was thinking whether we can use the
>>>> code like:
>>>>
>>>> vec<loop_p, va_gc> *fn_loops = loops_for_fn (fn)->larray;
>>>> for (i = 0; vec_safe_iterate (fn_loops, i, &aloop); i++)
>>>> if (aloop != NULL
>>>> && aloop->inner == NULL
>>>> && flow_loop_nested_p (tree_root, aloop))
>>>>  this->to_visit.quick_push (aloop->num);
>>>>
>>>> it has a stable bound, but if the given root has only a few
>>>> child loops, it can be much worse when there are many loops in fn.
>>>> It seems impossible to predict the given root's loop hierarchy size,
>>>> so maybe we can still use the original linear search for the case
>>>> loops_for_fn (fn) == root?  But since this visiting doesn't seem
>>>> performance critical, I chose to share the code originally used
>>>> for FROM_INNERMOST, hope it can have better readability and
