from:"Feng Xue OS via Gcc-patches"

[PATCH] arm/aarch64: Add bti for all functions [PR106671]

2023-08-02 Thread Feng Xue OS via Gcc-patches

This patch extends -mbranch-protection=bti with an optional argument, written
bti[+all], that forces the compiler to insert a bti at the start of every
function unconditionally. This is needed because a call that is direct at
compile time may later be rewritten into an indirect call that goes through a
linker-generated thunk stub. One instance: if a direct callee is placed too far
from its caller for a direct BL {imm} instruction to encode the distance, an
indirect BLR {reg} must be used instead. In that case a bti is required at the
beginning of the callee.

   caller() {
   bl callee
   }

=>

   caller() {
   adrp   reg, 
   add    reg, reg, #constant
   blr    reg
   }
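
To make the range limit concrete: the A64 BL instruction encodes a signed
26-bit word offset, so a direct call reaches roughly +/-128 MiB from the call
site. A minimal sketch of that check follows. The name fits_bl_range is
hypothetical and is not part of the patch or of any linker.

```cpp
#include <cstdint>

// Hypothetical helper, for illustration only. BL holds a signed 26-bit
// immediate that is scaled by 4, so the reachable byte offsets are the
// multiples of 4 in [-2^27, 2^27 - 4].
constexpr bool fits_bl_range(std::int64_t byte_offset) {
  return byte_offset % 4 == 0
         && byte_offset >= -(std::int64_t{1} << 27)
         && byte_offset <= (std::int64_t{1} << 27) - 4;
}
```

When the check fails for a caller/callee pair, the call has to be routed
through a stub such as the sequence above, and the callee is then reached by
an indirect branch.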

Although the issue can be fixed by a sufficiently recent version of ld, this
patch provides another option for users who must rely on an old ld or on a
non-ld linker. I also checked LLVM: by default, it implements bti exactly as
the proposed -mbranch-protection=bti+all does.
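
The intended decision can be summarized in a small standalone model. This is
only a sketch: BtiMode and needs_bti_c are hypothetical names that mirror the
AARCH_BTI_FUNCTION_* values used in the patch, and the model ignores the case
where the function entry is already covered by PACIASP/PACIBSP.

```cpp
// Hypothetical mirror of the patch's enum values (UNSET is assumed to have
// been resolved to None or Function before this point).
enum class BtiMode { None, Function, All };

// Should a BTI C be placed at the entry of a function?
//   None:     never.
//   Function: only if the function may be called indirectly.
//   All:      always, even if the compiler sees only direct calls.
constexpr bool needs_bti_c(BtiMode mode, bool only_called_directly) {
  return mode == BtiMode::All
         || (mode == BtiMode::Function && !only_called_directly);
}
```

The only behavioral difference between bti and bti+all in this model is the
row where only_called_directly is true, which is exactly the case a linker
stub can invalidate.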

Feng

---
 gcc/config/aarch64/aarch64.cc| 12 +++-
 gcc/config/aarch64/aarch64.opt   |  2 +-
 gcc/config/arm/aarch-bti-insert.cc   |  3 ++-
 gcc/config/arm/aarch-common.cc   | 22 ++
 gcc/config/arm/aarch-common.h| 18 ++
 gcc/config/arm/arm.cc|  4 ++--
 gcc/config/arm/arm.opt   |  2 +-
 gcc/doc/invoke.texi  | 16 ++--
 gcc/testsuite/gcc.target/aarch64/bti-5.c | 17 +
 9 files changed, 76 insertions(+), 20 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/bti-5.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 71215ef9fee..a404447c8d0 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8997,7 +8997,8 @@ void aarch_bti_arch_check (void)
 bool
 aarch_bti_enabled (void)
 {
-  return (aarch_enable_bti == 1);
+  gcc_checking_assert (aarch_enable_bti != AARCH_BTI_FUNCTION_UNSET);
+  return (aarch_enable_bti != AARCH_BTI_FUNCTION_NONE);
 }
 
 /* Check if INSN is a BTI J insn.  */
@@ -18454,12 +18455,12 @@ aarch64_override_options (void)
 
   selected_tune = tune ? tune->ident : cpu->ident;
 
-  if (aarch_enable_bti == 2)
+  if (aarch_enable_bti == AARCH_BTI_FUNCTION_UNSET)
 {
 #ifdef TARGET_ENABLE_BTI
-  aarch_enable_bti = 1;
+  aarch_enable_bti = AARCH_BTI_FUNCTION;
 #else
-  aarch_enable_bti = 0;
+  aarch_enable_bti = AARCH_BTI_FUNCTION_NONE;
 #endif
 }
 
@@ -22881,7 +22882,8 @@ aarch64_print_patchable_function_entry (FILE *file,
   basic_block bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb;
 
   if (!aarch_bti_enabled ()
-  || cgraph_node::get (cfun->decl)->only_called_directly_p ())
+  || (aarch_enable_bti != AARCH_BTI_FUNCTION_ALL
+ && cgraph_node::get (cfun->decl)->only_called_directly_p ()))
 {
   /* Emit the patchable_area at the beginning of the function.  */
   rtx_insn *insn = emit_insn_before (pa, BB_HEAD (bb));
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 025e52d40e5..5571f7e916d 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -37,7 +37,7 @@ TargetVariable
 aarch64_feature_flags aarch64_isa_flags = 0
 
 TargetVariable
-unsigned aarch_enable_bti = 2
+enum aarch_bti_function_type aarch_enable_bti = AARCH_BTI_FUNCTION_UNSET
 
 TargetVariable
 enum aarch_key_type aarch_ra_sign_key = AARCH_KEY_A
diff --git a/gcc/config/arm/aarch-bti-insert.cc b/gcc/config/arm/aarch-bti-insert.cc
index 71a77e29406..babd2490c9f 100644
--- a/gcc/config/arm/aarch-bti-insert.cc
+++ b/gcc/config/arm/aarch-bti-insert.cc
@@ -164,7 +164,8 @@ rest_of_insert_bti (void)
  functions that are already protected by Return Address Signing (PACIASP/
  PACIBSP).  For all other cases insert a BTI C at the beginning of the
  function.  */
-  if (!cgraph_node::get (cfun->decl)->only_called_directly_p ())
+  if (aarch_enable_bti == AARCH_BTI_FUNCTION_ALL
+  || !cgraph_node::get (cfun->decl)->only_called_directly_p ())
 {
   bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb;
   insn = BB_HEAD (bb);
diff --git a/gcc/config/arm/aarch-common.cc b/gcc/config/arm/aarch-common.cc
index 5b96ff4c2e8..7751d40f909 100644
--- a/gcc/config/arm/aarch-common.cc
+++ b/gcc/config/arm/aarch-common.cc
@@ -666,7 +666,7 @@ static enum aarch_parse_opt_result
 aarch_handle_no_branch_protection (char* str, char* rest)
 {
   aarch_ra_sign_scope = AARCH_FUNCTION_NONE;
-  aarch_enable_bti = 0;
+  aarch_enable_bti = AARCH_BTI_FUNCTION_NONE;
   if (rest)
 {
   error ("unexpected %<%s%> after %<%s%>", rest, str);
@@ -680,7 +680,7 @@ aarch_handle_standard_branch_protection (char* str, char* rest)
 {
   aarch_ra_sign_scope = AARCH_FUNCTION_NON_LEAF;
   aarch_ra_sign_key = AARCH_KEY_A;
-  aarch_enable_bti = 1;
+  aarch_enable_bti = AARCH_BTI_FUNCTION;
   if (rest)
 {
   error ("unexpected %<%s%> after %<%s%>",

PING^2: [PATCH/RFC 2/2] WPD: Enable whole program devirtualization at LTRANS

2021-10-14 Thread Feng Xue OS via Gcc-patches

Thanks,
Feng


From: Feng Xue OS
Sent: Thursday, September 16, 2021 5:26 PM
To: Jan Hubicka; mjam...@suse.cz; Richard Biener; gcc-patches@gcc.gnu.org
Cc: JiangNing OS
Subject: [PATCH/RFC 2/2] WPD: Enable whole program devirtualization at LTRANS

This patch extends the applicability of full devirtualization to the LTRANS
stage. Normally, the whole-program assumption does not hold once WPA splits the
compilation into more than one LTRANS partition. To avoid losing information
needed for WPD at LTRANS, we record all vtable nodes and the related member
function references in each partition.
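
As a rough illustration of the idea (not the code in compute_ltrans_boundary),
the rule can be modeled as a predicate on symbols. The Symbol struct and the
in_ltrans_boundary function are hypothetical simplifications.

```cpp
// Simplified, hypothetical view of one symbol-table entry.
struct Symbol {
  int home_partition;      // partition that WPA assigned the symbol to
  bool is_defined_vtable;  // true for a vtable variable with a definition
};

// Decide whether SYM must be visible in the boundary of PARTITION.
// A real boundary computation also pulls in referenced symbols and callees;
// that part is left out of this sketch.
constexpr bool in_ltrans_boundary(const Symbol& sym, int partition,
                                  bool devirtualize_fully) {
  return sym.home_partition == partition
         || (devirtualize_fully && sym.is_defined_vtable);
}
```

Under this model every partition sees every defined vtable when full
devirtualization is on, which is what lets each LTRANS unit reason about the
whole class hierarchy.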

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Thanks,
Feng


2021-09-07  Feng Xue  

gcc/
* tree.h (TYPE_CXX_LOCAL): New macro for type using
base.nothrow_flag.
* tree-core.h (tree_base): Update comment on using
base.nothrow_flag to represent TYPE_CXX_LOCAL.
* ipa-devirt.c (odr_type_d::whole_program_local): Removed.
(odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag
on type, and enable WPD at LTRANS when flag_devirtualize_fully
is true.
(get_odr_type): Remove setting whole_program_local flag on type.
(identify_whole_program_local_types): Replace whole_program_local
in odr_type_d by TYPE_CXX_LOCAL on type.
(maybe_record_node): Enable WPD at LTRANS when
flag_devirtualize_fully is true.
* ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS
stage under full devirtualization.
* lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables
to boundary of each LTRANS partition.
* lto-streamer-out.c (get_symbol_initial_value): Stream out
initial value of vtable even if its class is optimized away.
* lto-lang.c (lto_post_options): Disable full devirtualization
if flag_ltrans_devirtualize is false.
* tree-streamer-in.c (unpack_ts_base_value_fields): Unpack value
of TYPE_CXX_LOCAL for a type from streaming data.
* tree-streamer-out.c (pack_ts_base_value_fields): Pack value
of TYPE_CXX_LOCAL for a type into streaming data.
---
From 624aef44d72799ae488a431b4dce730f4b0fc28e Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 6 Sep 2021 20:34:50 +0800
Subject: [PATCH 2/2] WPD: Enable whole program devirtualization at LTRANS

The whole-program assumption does not hold once WPA splits the compilation
into more than one LTRANS partition. To avoid losing information needed for
WPD at LTRANS, we record all vtable nodes and the related member function
references in each partition.

2021-09-07  Feng Xue  

gcc/
	* tree.h (TYPE_CXX_LOCAL): New macro for type using
	base.nothrow_flag.
	* tree-core.h (tree_base): Update comment on using
	base.nothrow_flag to represent TYPE_CXX_LOCAL.
	* ipa-devirt.c (odr_type_d::whole_program_local): Removed.
	(odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag
	on type, and enable WPD at LTRANS when flag_devirtualize_fully
	is true.
	(get_odr_type): Remove setting whole_program_local flag on type.
	(identify_whole_program_local_types): Replace whole_program_local
	in odr_type_d by TYPE_CXX_LOCAL on type.
	(maybe_record_node): Enable WPD at LTRANS when
	flag_devirtualize_fully is true.
	* ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS
	stage under full devirtualization.
	* lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables
	to boundary of each LTRANS partition.
	* lto-streamer-out.c (get_symbol_initial_value): Stream out
	initial value of vtable even if its class is optimized away.
	* lto-streamer-in.c (lto_input_tree): There might be more than
	one decl in dref_queue, register debuginfo for all of them.
	* lto-lang.c (lto_post_options): Disable full devirtualization
	if flag_ltrans_devirtualize is false.
	* tree-streamer-in.c (unpack_ts_base_value_fields): Unpack value
	of TYPE_CXX_LOCAL for a type from streaming data.
	* tree-streamer-out.c (pack_ts_base_value_fields): Pack value
	of TYPE_CXX_LOCAL for a type into streaming data.
---
 gcc/ipa-devirt.c| 33 ++---
 gcc/ipa.c   |  7 ++-
 gcc/lto-cgraph.c| 19 +++
 gcc/lto-streamer-in.c   |  3 +--
 gcc/lto-streamer-out.c  | 12 +++-
 gcc/lto/lto-lang.c  |  6 ++
 gcc/tree-core.h |  3 +++
 gcc/tree-streamer-in.c  | 11 ---
 gcc/tree-streamer-out.c | 16 +---
 gcc/tree.h  |  5 +
 10 files changed, 90 insertions(+), 25 deletions(-)

diff --git a/gcc/ipa-devirt.c b/gcc/ipa-devirt.c
index 284c449c6c1..bb929f016f8 100644
--- a/gcc/ipa-devirt.c
+++ b/gcc/ipa-devirt.c
@@ -216,8 +216,6 @@ struct GTY(()) odr_type_d
   int id;
   /* Is it in anonymous namespace? */
   bool anonymous_namespace;
-  /* Set when type is not used outside of program.  */
-  bool whole_program_local;
   /* Did we

PING^2: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization

2021-10-14 Thread Feng Xue OS via Gcc-patches

Hi, Honza & Martin,

  Would you please take some time to review the proposal and patches for whole
program devirtualization? We have to say that this feature is not 100% safe,
but it gives us a way to deploy correct WPD on a C++ program if we carefully
prepare the linked libraries so that they contain rtti symbols, which is always
the case for libstdc++ and for well-composed third-party C++ libraries built
with default gcc options. If a library does not, we can rebuild it with the
desired options; this does not require invasive modification of source code,
which is an advantage over the LLVM visibility-based scheme.

The gcc-12 dev branch is now at a late stage, as November is approaching. We
are not sure whether the feature is acceptable. But if it is, getting it in
before code freeze would be good timing.

I also made some minor changes to the patches, and post the RFC link here for
your convenience (https://gcc.gnu.org/pipermail/gcc/2021-August/237132.html).

Thanks,
Feng


From: Feng Xue OS 
Sent: Saturday, September 18, 2021 5:38 PM
To: Jason Merrill; Jan Hubicka; mjam...@suse.cz; Richard Biener; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization


PING: [PATCH/RFC 2/2] WPD: Enable whole program devirtualization at LTRANS

2021-09-29 Thread Feng Xue OS via Gcc-patches

Made some minor changes.

Thanks,
Feng


From: Feng Xue OS
Sent: Thursday, September 16, 2021 5:26 PM
To: Jan Hubicka; mjam...@suse.cz; Richard Biener; gcc-patches@gcc.gnu.org
Cc: JiangNing OS
Subject: [PATCH/RFC 2/2] WPD: Enable whole program devirtualization at LTRANS

This patch extends the applicability of full devirtualization to the LTRANS
stage. Normally, the whole-program assumption does not hold once WPA splits the
compilation into more than one LTRANS partition. To avoid losing information
needed for WPD at LTRANS, we record all vtable nodes and the related member
function references in each partition.

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Thanks,
Feng


2021-09-07  Feng Xue  

gcc/
* tree.h (TYPE_CXX_LOCAL): New macro for type using
base.nothrow_flag.
* tree-core.h (tree_base): Update comment on using
base.nothrow_flag to represent TYPE_CXX_LOCAL.
* ipa-devirt.c (odr_type_d::whole_program_local): Removed.
(odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag
on type, and enable WPD at LTRANS when flag_devirtualize_fully
is true.
(get_odr_type): Remove setting whole_program_local flag on type.
(identify_whole_program_local_types): Replace whole_program_local
in odr_type_d by TYPE_CXX_LOCAL on type.
(maybe_record_node): Enable WPD at LTRANS when
flag_devirtualize_fully is true.
* ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS
stage under full devirtualization.
* lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables
to boundary of each LTRANS partition.
* lto-streamer-out.c (get_symbol_initial_value): Stream out
initial value of vtable even if its class is optimized away.
* lto-lang.c (lto_post_options): Disable full devirtualization
if flag_ltrans_devirtualize is false.
* tree-streamer-in.c (unpack_ts_base_value_fields): Unpack value
of TYPE_CXX_LOCAL for a type from streaming data.
* tree-streamer-out.c (pack_ts_base_value_fields): Pack value
of TYPE_CXX_LOCAL for a type into streaming data.
---
From 2c0d243b0c092585561c732bac490700f41001fb Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 6 Sep 2021 20:34:50 +0800
Subject: [PATCH 2/2] WPD: Enable whole program devirtualization at LTRANS

The whole-program assumption does not hold once WPA splits the compilation
into more than one LTRANS partition. To avoid losing information needed for
WPD at LTRANS, we record all vtable nodes and the related member function
references in each partition.

2021-09-07  Feng Xue  

gcc/
	* tree.h (TYPE_CXX_LOCAL): New macro for type using
	base.nothrow_flag.
	* tree-core.h (tree_base): Update comment on using
	base.nothrow_flag to represent TYPE_CXX_LOCAL.
	* ipa-devirt.c (odr_type_d::whole_program_local): Removed.
	(odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag
	on type, and enable WPD at LTRANS when flag_devirtualize_fully
	is true.
	(get_odr_type): Remove setting whole_program_local flag on type.
	(identify_whole_program_local_types): Replace whole_program_local
	in odr_type_d by TYPE_CXX_LOCAL on type.
	(maybe_record_node): Enable WPD at LTRANS when
	flag_devirtualize_fully is true.
	* ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS
	stage under full devirtualization.
	* lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables
	to boundary of each LTRANS partition.
	* lto-streamer-out.c (get_symbol_initial_value): Stream out
	initial value of vtable even if its class is optimized away.
	* lto-streamer-in.c (lto_input_tree): There might be more than
	one decl in dref_queue, register debuginfo for all of them.
	* lto-lang.c (lto_post_options): Disable full devirtualization
	if flag_ltrans_devirtualize is false.
	* tree-streamer-in.c (unpack_ts_base_value_fields): Unpack value
	of TYPE_CXX_LOCAL for a type from streaming data.
	* tree-streamer-out.c (pack_ts_base_value_fields): Pack value
	of TYPE_CXX_LOCAL for a type into streaming data.

temp
---
 gcc/ipa-devirt.c| 29 ++---
 gcc/ipa.c   |  7 ++-
 gcc/lto-cgraph.c| 18 ++
 gcc/lto-streamer-in.c   |  3 +--
 gcc/lto-streamer-out.c  | 12 +++-
 gcc/lto/lto-lang.c  |  6 ++
 gcc/tree-core.h |  3 +++
 gcc/tree-streamer-in.c  | 11 ---
 gcc/tree-streamer-out.c | 11 ---
 gcc/tree.h  |  5 +
 10 files changed, 84 insertions(+), 21 deletions(-)

diff --git a/gcc/ipa-devirt.c b/gcc/ipa-devirt.c
index a7d04388dab..4ff551bace8 100644
--- a/gcc/ipa-devirt.c
+++ b/gcc/ipa-devirt.c
@@ -216,8 +216,6 @@ struct GTY(()) odr_type_d
   int id;
   /* Is it in anonymous namespace? */
   bool anonymous_namespace;
-  /* Set when type is not used outside of program.  */
-  bool

PING: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization

2021-09-29 Thread Feng Xue OS via Gcc-patches

Minor update with some bug fixes and comment wording changes.

Thanks,
Feng


From: Feng Xue OS 
Sent: Saturday, September 18, 2021 5:38 PM
To: Jason Merrill; Jan Hubicka; mjam...@suse.cz; Richard Biener; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization


[PATCH] Fix value uninitialization in vn_reference_insert_pieces [PR102400]

2021-09-22 Thread Feng Xue OS via Gcc-patches

Bootstrapped/regtested on x86_64-linux.

Thanks,
Feng
---
2021-09-23  Feng Xue  

gcc/ChangeLog
PR tree-optimization/102400
* tree-ssa-sccvn.c (vn_reference_insert_pieces): Initialize
result_vdef to zero value.
---
 gcc/tree-ssa-sccvn.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c
index a901f51a025..e8b1c39184d 100644
--- a/gcc/tree-ssa-sccvn.c
+++ b/gcc/tree-ssa-sccvn.c
@@ -3811,6 +3811,7 @@ vn_reference_insert_pieces (tree vuse, alias_set_type set,
   if (result && TREE_CODE (result) == SSA_NAME)
 result = SSA_VAL (result);
   vr1->result = result;
+  vr1->result_vdef = NULL_TREE;
 
   slot = valid_info->references->find_slot_with_hash (vr1, vr1->hashcode,
  INSERT);
-- 
2.17.1

[PATCH] Fix null-pointer dereference in delete_dead_or_redundant_call [PR102451]

2021-09-22 Thread Feng Xue OS via Gcc-patches

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Thanks,
Feng

---
2021-09-23  Feng Xue  

gcc/ChangeLog:
PR tree-optimization/102451
* tree-ssa-dse.c (delete_dead_or_redundant_call): Record bb of stmt
before removal.
---
 gcc/tree-ssa-dse.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-ssa-dse.c b/gcc/tree-ssa-dse.c
index 98daa8ab24c..27287fe88ee 100644
--- a/gcc/tree-ssa-dse.c
+++ b/gcc/tree-ssa-dse.c
@@ -978,6 +978,7 @@ delete_dead_or_redundant_call (gimple_stmt_iterator *gsi, const char *type)
   fprintf (dump_file, "\n");
 }
 
+  basic_block bb = gimple_bb (stmt);
   tree lhs = gimple_call_lhs (stmt);
   if (lhs)
 {
@@ -985,7 +986,7 @@ delete_dead_or_redundant_call (gimple_stmt_iterator *gsi, const char *type)
   gimple *new_stmt = gimple_build_assign (lhs, ptr);
   unlink_stmt_vdef (stmt);
   if (gsi_replace (gsi, new_stmt, true))
-bitmap_set_bit (need_eh_cleanup, gimple_bb (stmt)->index);
+   bitmap_set_bit (need_eh_cleanup, bb->index);
 }
   else
 {
@@ -994,7 +995,7 @@ delete_dead_or_redundant_call (gimple_stmt_iterator *gsi, const char *type)
 
   /* Remove the dead store.  */
   if (gsi_remove (gsi, true))
-   bitmap_set_bit (need_eh_cleanup, gimple_bb (stmt)->index);
+   bitmap_set_bit (need_eh_cleanup, bb->index);
   release_defs (stmt);
 }
 }
-- 
2.17.1
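
The pattern behind this fix can be shown with a tiny standalone model. It
assumes, as the PR title suggests, that a statement no longer points at its
basic block once it has been removed or replaced. Block, Stmt, kDemoBlock and
index_after_removal are hypothetical names, not GCC types or functions.

```cpp
// Hypothetical stand-ins for a basic block and a statement.
struct Block { int index; };
struct Stmt  { const Block* bb; };

// Demo data with static storage so its address can be used in tests.
constexpr Block kDemoBlock{7};

// Record the owning block first, then "remove" the statement, and only
// afterwards use the recorded block. Reading stmt.bb after the removal
// step would dereference a null pointer in this model.
constexpr int index_after_removal(Stmt stmt) {
  const Block* bb = stmt.bb;  // record bb of stmt before removal
  stmt.bb = nullptr;          // what removal does to the statement here
  return bb->index;           // safe: uses the saved pointer
}
```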

Re: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization

2021-09-18 Thread Feng Xue OS via Gcc-patches

>On 9/16/21 22:29, Feng Xue OS wrote:
>>> On 9/16/21 05:25, Feng Xue OS via Gcc-patches wrote:
>>>> This and following patches are composed to enable full devirtualization
>>>> under whole program assumption (so also called whole-program
>>>> devirtualization, WPD for short), which is an enhancement to current
>>>> speculative devirtualization. The base of the optimization is how to
>>>> identify class type that is local in terms of whole-program scope, at
>>>> least  those class types in libstdc++ must be excluded in some way.
>>>> Our means is to use typeinfo symbol as identity marker of a class since
>>>> it is unique and always generated once the class or its derived type
>>>> is instantiated somewhere, and rely on symbol resolution by
>>>> lto-linker-plugin to detect whether  a typeinfo is referenced by regular
>>>> object/library, which indirectly tells class types are escaped or not.
>>>> The RFC at https://gcc.gnu.org/pipermail/gcc/2021-August/237132.html
>>>> gives more details on that.
>>>>
>>>> Bootstrapped/regtested on x86_64-linux and aarch64-linux.
>>>>
>>>> Thanks,
>>>> Feng
>>>>
>>>> 
>>>> 2021-09-07  Feng Xue  
>>>>
>>>> gcc/
>>>>* common.opt (-fdevirtualize-fully): New option.
>>>>* class.c (build_rtti_vtbl_entries): Force generation of typeinfo
>>>>even if -fno-rtti is specified under full devirtualization.
>>>
>>> This makes -fno-rtti useless; rather than this, you should warn about
>>> the combination of flags and force flag_rtti on.  It also sounds like
>>> you depend on the library not being built with -fno-rtti.
>>
>> Although rtti is generated by front-end, we will remove it after lto symtab
>> merge, which is meant to keep same behavior as -fno-rtti.
>
> Ah, the cp/ change is OK, then, with a comment about that.
>
>> Yes, regular library to be linked with should contain rtti data, otherwise
>> WPD could not deduce class type usage safely. By default, we can think
>> that it should work for libstdc++, but it probably becomes a problem for
>> user library, which might be avoided if we properly document this
>> requirement and suggest user doing that when using WPD.
>
> Yes, I would expect that external libraries would be built with RTTI on
> to allow users to use RTTI features even if they aren't used within the
> library.  But it's good to document it as a requirement.
>
>> +   /* If a class with virtual base is only instantiated as
>> +  subobjects of derived classes, and has no complete object in
>> +  compilation unit, merely construction vtables will be involved,
>> +  its primary vtable is really not needed, and subject to being
>> +  removed.  So once a vtable node is encountered, for all
>> +  polymorphic base classes of the vtable's context class, always
>> +  force generation of primary vtable nodes when full
>> +  devirtualization is enabled.  */
>
> Why do you need the primary vtable if you're relying on RTTI info?
> Construction vtables will point to the same RTTI node.

In the middle end, the easiest way to get the vtable of a type is via
TYPE_BINFO(type), which gives the primary one. WPD relies on the existence of
the varpool_node of the vtable decl to determine whether the type has been
removed (when it is never instantiated), so we force generation of the vtable
node at a very early stage. Additionally, a construction vtable (C-in-D)
belongs to the class (D) of the complete object, not to the class (C) of the
subobject actually being constructed, so it is not easy to correlate a
construction vtable with the subobject class (C) after the front end.
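
A minimal model of the two facts used per class may help here. It is only a
sketch under the assumptions stated in this thread; ClassFacts and the three
predicates are hypothetical and do not reproduce odr_type_d.

```cpp
// Hypothetical per-class facts gathered at link time.
struct ClassFacts {
  // True if symbol resolution shows the class's typeinfo symbol is
  // referenced by a regular (non-LTO) object or library.
  bool typeinfo_seen_by_regular_object;
  // True if the varpool node of the primary vtable is still present.
  bool vtable_node_present;
};

// The class does not escape the LTO-visible program.
constexpr bool whole_program_local(const ClassFacts& c) {
  return !c.typeinfo_seen_by_regular_object;
}

// A local class counts as instantiated only while its vtable node exists;
// an escaped class must conservatively be assumed instantiated.
constexpr bool possibly_instantiated(const ClassFacts& c) {
  return !whole_program_local(c) || c.vtable_node_present;
}

// Methods of a class can be dropped from the list of virtual call targets
// only when the class is local and never instantiated.
constexpr bool can_drop_as_target(const ClassFacts& c) {
  return whole_program_local(c) && !possibly_instantiated(c);
}
```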

>
>> +   /* Public class w/o key member function (or local class in a public
>> +  inline function) requires COMDAT-like vtable so as to be shared
>> +  among units.  But C++ privatizing via -fno-weak would introduce
>> +  multiple static vtable copies for one class in merged lto symbol
>> +  table.  This breaks one-to-one correspondence between class and
>> +  vtable, and makes class liveness check become not that easy.  To
>> +  be simple, we exclude such kind of class from our choice list.
>
> Same question.  Also, why would you use -fno-weak?  Forcing multiple
> copies of things we're perfectly capable of combining seems like a
> strange choice.  You can privatize things with the symbol visibility
> controls or RTLD_LOCAL.

We expect that users do not specify -fno-weak with WPD. But if it is
specified, we should handle it correctly and bypass the type. And
indeed there is no need to force generation of the vtable in this
situation. But if a vtable is not keyed to any compilation unit, we might
never have any copy of it in an ordinary build, even though its class type is
meaningful to whole-program analysis, as with an abstract root class.
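
For readers less familiar with the last case, here is a small illustrative
hierarchy. It is an assumption-laden example, not taken from the patch: Shape
has no key function under the Itanium C++ ABI (all of its virtual functions
are pure or inline), so its vtable is not tied to any single translation unit,
yet Shape is the type through which the virtual call is made.

```cpp
#include <type_traits>

// Abstract root class: no non-inline, non-pure virtual function, hence
// no key function that would anchor the vtable to one translation unit.
struct Shape {
  virtual ~Shape() = default;
  virtual int area() const = 0;
};

// The only class in this toy program that is ever instantiated.
struct Square final : Shape {
  explicit Square(int s) : side(s) {}
  int area() const override { return side * side; }
  int side;
};

// A virtual call through the root class. If the whole program is known to
// contain Square as the only instantiated Shape, this call has one target.
int area_through_base(const Shape& s) { return s.area(); }

static_assert(std::is_abstract<Shape>::value,
              "Shape cannot be instantiated on its own");
```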

Thanks,
Feng

Re: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization

2021-09-16 Thread Feng Xue OS via Gcc-patches

>On 9/16/21 05:25, Feng Xue OS via Gcc-patches wrote:
>> This and following patches are composed to enable full devirtualization
>> under whole program assumption (so also called whole-program
>> devirtualization, WPD for short), which is an enhancement to current
>> speculative devirtualization. The base of the optimization is how to
>> identify class type that is local in terms of whole-program scope, at
>> least  those class types in libstdc++ must be excluded in some way.
>> Our means is to use typeinfo symbol as identity marker of a class since
>> it is unique and always generated once the class or its derived type
>> is instantiated somewhere, and rely on symbol resolution by
>> lto-linker-plugin to detect whether  a typeinfo is referenced by regular
>> object/library, which indirectly tells class types are escaped or not.
>> The RFC at https://gcc.gnu.org/pipermail/gcc/2021-August/237132.html
>> gives more details on that.
>>
>> Bootstrapped/regtested on x86_64-linux and aarch64-linux.
>>
>> Thanks,
>> Feng
>>
>> 
>> 2021-09-07  Feng Xue  
>>
>> gcc/
>>   * common.opt (-fdevirtualize-fully): New option.
>>   * class.c (build_rtti_vtbl_entries): Force generation of typeinfo
>>   even if -fno-rtti is specified under full devirtualization.
>
>This makes -fno-rtti useless; rather than this, you should warn about
>the combination of flags and force flag_rtti on.  It also sounds like
>you depend on the library not being built with -fno-rtti.

Although RTTI is generated by the front end, we remove it after the LTO symtab
merge, which is meant to keep the same behavior as -fno-rtti.

Yes, the regular libraries to be linked with must contain RTTI data; otherwise
WPD cannot deduce class type usage safely. By default we can assume this
holds for libstdc++, but it may become a problem for user libraries. That can
be avoided if we properly document the requirement and advise users to follow
it when using WPD.
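
To make the requirement concrete, here is a minimal hand-written C++ sketch
(not taken from the patch) of why a class that escapes to a regular library
blocks full devirtualization. The names Base, Local, LibDerived, lib_make and
call_id are hypothetical; LibDerived and lib_make stand in for code that would
live in a pre-compiled library and so be invisible to the LTO unit.

```cpp
#include <cassert>
#include <memory>

// Visible to the LTO unit.
struct Base {
  virtual ~Base() = default;
  virtual int id() const { return 0; }
};

// The only derived class that whole-program analysis can see.
struct Local : Base {
  int id() const override { return 1; }
};

// Assumption for illustration: pretend this part was compiled separately
// into a regular library, so this derivation of Base is not visible.
struct LibDerived : Base {
  int id() const override { return 2; }
};
std::unique_ptr<Base> lib_make() { return std::make_unique<LibDerived>(); }

// If the compiler assumed Base and Local were the only possible dynamic
// types, replacing b.id() with a direct call would give the wrong answer
// for an object created by lib_make().
int call_id(const Base &b) { return b.id(); }
```

In this picture the library references the typeinfo of Base, which is the
signal the approach relies on to treat Base as escaped rather than
whole-program local.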

Thanks
Feng
>
>>   * cgraph.c (cgraph_update_edges_for_call_stmt): Add an assertion
>>   to check node to be traversed.
>>   * cgraphclones.c (cgraph_node::find_replacement): Record
>>   former_clone_of on replacement node.
>>   * cgraphunit.c (symtab_node::needed_p): Always output vtable for
>>   full devirtualization.
>>   (analyze_functions): Force generation of primary vtables for all
>>   base classes.
>>   * ipa-devirt.c (odr_type_d::whole_program_local): New field.
>>   (odr_type_d::has_virtual_base): Likewise.
>>   (odr_type_d::all_derivations_known): Removed.
>>   (odr_type_d::whole_program_local_p): New member function.
>>   (odr_type_d::all_derivations_known_p): Likewise.
>>   (odr_type_d::possibly_instantiated_p): Likewise.
>>   (odr_type_d::set_has_virtual_base): Likewise.
>>   (get_odr_type): Set "whole_program_local" and "has_virtual_base"
>>   when adding a type.
>>   (type_all_derivations_known_p): Replace implementation by a call
>>   to odr_type_d::all_derivations_known_p.
>>   (type_possibly_instantiated_p): Replace implementation by a call
>>   to odr_type_d::possibly_instantiated_p.
>>   (type_known_to_have_no_derivations_p): Replace call to
>>   type_possibly_instantiated_p with call to
>>   odr_type_d::possibly_instantiated_p.
>>   (type_all_ctors_visible_p): Removed.
>>   (type_whole_program_local_p): New function.
>>   (get_type_vtable): Likewise.
>>   (extract_typeinfo_in_vtable): Likewise.
>>   (identify_whole_program_local_types): Likewise.
>>   (dump_odr_type): Dump has_virtual_base and whole_program_local_p()
>>   of type.
>>   (maybe_record_node): Resort to type_whole_program_local_p to
>>   check whether a class has been optimized away.
>>   (record_target_from_binfo): Remove parameter "anonymous", add
>>   a new parameter "possibly_instantiated", and adjust code
>>   accordingly.
>>   (devirt_variable_node_removal_hook): Replace call to
>>   "type_in_anonymous_namespace_p" with "type_whole_program_local_p".
>>   (possible_polymorphic_call_targets): Replace call to
>>   "type_possibly_instantiated_p" with "possibly_instantiated_p",
>>   replace flag check on "all_derivations_known" with call to
>>"all_derivations_known_p".
>>   * ipa-icf.c (filter_removed_items): Disable folding on vtable
>>   under full devirtualization.
>>   * ipa-

[PATCH/RFC 2/2] WPD: Enable whole program devirtualization at LTRANS

2021-09-16 Thread Feng Xue OS via Gcc-patches

This patch extends the applicability of full devirtualization to the LTRANS
stage. Normally, the whole-program assumption does not hold once WPA splits the
compilation into more than one LTRANS partition. To avoid losing information
needed by WPD at LTRANS, we record all vtable nodes and related member
function references in each partition.

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Thanks,
Feng


2021-09-07  Feng Xue  

gcc/
* tree.h (TYPE_CXX_LOCAL): New macro for type using
base.nothrow_flag.
* tree-core.h (tree_base): Update comment on using
base.nothrow_flag to represent TYPE_CXX_LOCAL.
* ipa-devirt.c (odr_type_d::whole_program_local): Removed.
(odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag
on type, and enable WPD at LTRANS when flag_devirtualize_fully
is true.
(get_odr_type): Remove setting whole_program_local flag on type.
(identify_whole_program_local_types): Replace whole_program_local
in odr_type_d by TYPE_CXX_LOCAL on type.
(maybe_record_node): Enable WPD at LTRANS when
flag_devirtualize_fully is true.
* ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS
stage under full devirtualization.
* lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables
to boundary of each LTRANS partition.
* lto-streamer-out.c (get_symbol_initial_value): Streaming out
initial value of vtable even its class is optimized away.
* lto-lang.c (lto_post_options): Disable full devirtualization
if flag_ltrans_devirtualize is false.
* tree-streamer-in.c (unpack_ts_base_value_fields): unpack value
of TYPE_CXX_LOCAL for a type from streaming data.
* tree-streamer-out.c (pack_ts_base_value_fields): pack value
of TYPE_CXX_LOCAL for a type into streaming data.
---
From 3af32b9aadff23d339750ada4541386b3d358edc Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 6 Sep 2021 20:34:50 +0800
Subject: [PATCH 2/2] WPD: Enable whole program devirtualization at LTRANS

The whole-program assumption does not hold once WPA splits the compilation
into more than one LTRANS partition. To avoid losing information needed by
WPD at LTRANS, we record all vtable nodes and related member function
references in each partition.

2021-09-07  Feng Xue  

gcc/
	* tree.h (TYPE_CXX_LOCAL): New macro for type using
	base.nothrow_flag.
	* tree-core.h (tree_base): Update comment on using
	base.nothrow_flag to represent TYPE_CXX_LOCAL.
	* ipa-devirt.c (odr_type_d::whole_program_local): Removed.
	(odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag
	on type, and enable WPD at LTRANS when flag_devirtualize_fully
	is true.
	(get_odr_type): Remove setting whole_program_local flag on type.
	(identify_whole_program_local_types): Replace whole_program_local
	in odr_type_d by TYPE_CXX_LOCAL on type.
	(maybe_record_node): Enable WPD at LTRANS when
	flag_devirtualize_fully is true.
	* ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS
	stage under full devirtualization.
	* lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables
	to boundary of each LTRANS partition.
	* lto-streamer-out.c (get_symbol_initial_value): Streaming out
	initial value of vtable even its class is optimized away.
	* lto-lang.c (lto_post_options): Disable full devirtualization
	if flag_ltrans_devirtualize is false.
	* tree-streamer-in.c (unpack_ts_base_value_fields): unpack value
	of TYPE_CXX_LOCAL for a type from streaming data.
	* tree-streamer-out.c (pack_ts_base_value_fields): pack value
	of TYPE_CXX_LOCAL for a type into streaming data.
---
 gcc/ipa-devirt.c| 29 ++---
 gcc/ipa.c   |  7 ++-
 gcc/lto-cgraph.c| 18 ++
 gcc/lto-streamer-out.c  | 12 +++-
 gcc/lto/lto-lang.c  |  6 ++
 gcc/tree-core.h |  3 +++
 gcc/tree-streamer-in.c  | 11 ---
 gcc/tree-streamer-out.c | 11 ---
 gcc/tree.h  |  5 +
 9 files changed, 83 insertions(+), 19 deletions(-)

diff --git a/gcc/ipa-devirt.c b/gcc/ipa-devirt.c
index fcb097d7156..65e9ebbfb59 100644
--- a/gcc/ipa-devirt.c
+++ b/gcc/ipa-devirt.c
@@ -216,8 +216,6 @@ struct GTY(()) odr_type_d
   int id;
   /* Is it in anonymous namespace? */
   bool anonymous_namespace;
-  /* Set when type is not used outside of program.  */
-  bool whole_program_local;
   /* Did we report ODR violation here?  */
   bool odr_violated;
   /* Set when virtual table without RTTI prevailed table with.  */
@@ -290,10 +288,18 @@ get_type_vtable (tree type)
 bool
 odr_type_d::whole_program_local_p ()
 {
-  if (flag_ltrans)
+  if (flag_ltrans && !flag_devirtualize_fully)
     return false;
 
-  return whole_program_local;
+  if (in_lto_p)
+    return TYPE_CXX_LOCAL (type);
+
+  /* Although a local class is always considered as whole program

[PATCH/RFC 1/2] WPD: Enable whole program devirtualization

2021-09-16 Thread Feng Xue OS via Gcc-patches

This and the following patches enable full devirtualization
under the whole-program assumption (also called whole-program
devirtualization, WPD for short), which is an enhancement to the current
speculative devirtualization. The basis of the optimization is how to
identify class types that are local in terms of whole-program scope; at
least the class types in libstdc++ must be excluded in some way.
Our approach is to use the typeinfo symbol as the identity marker of a
class, since it is unique and always generated once the class or a type
derived from it is instantiated somewhere, and to rely on symbol resolution
by lto-linker-plugin to detect whether a typeinfo is referenced by a regular
object/library, which indirectly tells whether class types have escaped.
The RFC at https://gcc.gnu.org/pipermail/gcc/2021-August/237132.html
gives more details on that.

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Thanks,
Feng


2021-09-07  Feng Xue  

gcc/
* common.opt (-fdevirtualize-fully): New option.
* class.c (build_rtti_vtbl_entries): Force generation of typeinfo
even if -fno-rtti is specified under full devirtualization.
* cgraph.c (cgraph_update_edges_for_call_stmt): Add an assertion
to check node to be traversed.
* cgraphclones.c (cgraph_node::find_replacement): Record
former_clone_of on replacement node.
* cgraphunit.c (symtab_node::needed_p): Always output vtable for
full devirtualization.
(analyze_functions): Force generation of primary vtables for all
base classes.
* ipa-devirt.c (odr_type_d::whole_program_local): New field.
(odr_type_d::has_virtual_base): Likewise.
(odr_type_d::all_derivations_known): Removed.
(odr_type_d::whole_program_local_p): New member function.
(odr_type_d::all_derivations_known_p): Likewise.
(odr_type_d::possibly_instantiated_p): Likewise.
(odr_type_d::set_has_virtual_base): Likewise.
(get_odr_type): Set "whole_program_local" and "has_virtual_base"
when adding a type.
(type_all_derivations_known_p): Replace implementation by a call
to odr_type_d::all_derivations_known_p.
(type_possibly_instantiated_p): Replace implementation by a call
to odr_type_d::possibly_instantiated_p.
(type_known_to_have_no_derivations_p): Replace call to
type_possibly_instantiated_p with call to
odr_type_d::possibly_instantiated_p.
(type_all_ctors_visible_p): Removed.
(type_whole_program_local_p): New function.
(get_type_vtable): Likewise.
(extract_typeinfo_in_vtable): Likewise.
(identify_whole_program_local_types): Likewise.
(dump_odr_type): Dump has_virtual_base and whole_program_local_p()
of type.
(maybe_record_node): Resort to type_whole_program_local_p to
check whether a class has been optimized away.
(record_target_from_binfo): Remove parameter "anonymous", add
a new parameter "possibly_instantiated", and adjust code
accordingly.
(devirt_variable_node_removal_hook): Replace call to
"type_in_anonymous_namespace_p" with "type_whole_program_local_p".
(possible_polymorphic_call_targets): Replace call to
"type_possibly_instantiated_p" with "possibly_instantiated_p",
replace flag check on "all_derivations_known" with call to
 "all_derivations_known_p".
* ipa-icf.c (filter_removed_items): Disable folding on vtable
under full devirtualization.
* ipa-polymorphic-call.c (restrict_to_inner_class): Move odr
type check to type_known_to_have_no_derivations_p.
* ipa-utils.h (identify_whole_program_local_types): New
declaration.
(type_all_derivations_known_p): Parameter type adjustment.
* ipa.c (walk_polymorphic_call_targets): Do not mark vcall
targets as reachable for full devirtualization.
(can_remove_vtable_if_no_refs_p): New function.
(symbol_table::remove_unreachable_nodes): Add defined vtables
to reachable list under full devirtualization.
* lto-symtab.c (lto_symtab_merge_symbols): Identify whole
program local types after symbol table merge.
---
From 2632d8e7ea8f96cb545e57dedd9e4148b5a2cae4 Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 6 Sep 2021 15:03:31 +0800
Subject: [PATCH 1/2] WPD: Enable whole program devirtualization

Enable full devirtualization under the whole-program assumption (also
called whole-program devirtualization, WPD for short). The basis of the
optimization is how to identify class types that are local in terms of
whole-program scope. But "whole program" does not ensure that the class
hierarchy of a type never spans dependent C++ libraries (libstdc++ is
one), which would result in incorrect devirtualization. An
example is given below to demonstrate the problem.

// Has been pre-compiled to a library
class Base {

Re: [PATCH] Fix loop split incorrect count and probability

2021-08-10 Thread Feng Xue OS via Gcc-patches

Any transformation that alters the cfg faces the same problem: it is not
easy to update the new cfg with reasonable, seemingly correct
profile counts. We can adjust probabilities for the affected condition bbs,
but we lack a utility that does what the static profile estimation pass does
while propagating counts only partially.

Thanks,
Feng


From: Richard Biener 
Sent: Tuesday, August 10, 2021 10:47 PM
To: Xionghu Luo
Cc: gcc-patches@gcc.gnu.org; seg...@kernel.crashing.org; Feng Xue OS; 
wschm...@linux.ibm.com; guoji...@linux.ibm.com; li...@gcc.gnu.org; 
hubi...@ucw.cz
Subject: Re: [PATCH] Fix loop split incorrect count and probability

On Mon, 9 Aug 2021, Xionghu Luo wrote:

> Thanks,
>
> On 2021/8/6 19:46, Richard Biener wrote:
> > On Tue, 3 Aug 2021, Xionghu Luo wrote:
> >
> >> loop split condition is moved between loop1 and loop2, the split bb's
> >> count and probability should also be duplicated instead of (100% vs INV),
> >> secondly, the original loop1 and loop2 count need be proportional from the
> >> original loop.
> >>
> >>
> >> diff base/loop-cond-split-1.c.151t.lsplit  
> >> patched/loop-cond-split-1.c.151t.lsplit:
> >> ...
> >> int prephitmp_16;
> >> int prephitmp_25;
> >>
> >>  [local count: 118111600]:
> >> if (n_7(D) > 0)
> >>   goto ; [89.00%]
> >> else
> >>   goto ; [11.00%]
> >>
> >>  [local count: 118111600]:
> >> return;
> >>
> >>  [local count: 105119324]:
> >> pretmp_3 = ga;
> >>
> >> -   [local count: 955630225]:
> >> +   [local count: 315357973]:
> >> # i_13 = PHI 
> >> # prephitmp_12 = PHI 
> >> if (prephitmp_12 != 0)
> >>   goto ; [33.00%]
> >> else
> >>   goto ; [67.00%]
> >>
> >> -   [local count: 315357972]:
> >> +   [local count: 104068130]:
> >> _2 = do_something ();
> >> ga = _2;
> >>
> >> -   [local count: 955630225]:
> >> +   [local count: 315357973]:
> >> # prephitmp_5 = PHI 
> >> i_10 = inc (i_13);
> >> if (n_7(D) > i_10)
> >>   goto ; [89.00%]
> >> else
> >>   goto ; [11.00%]
> >>
> >>  [local count: 105119324]:
> >> goto ; [100.00%]
> >>
> >> -   [local count: 850510901]:
> >> +   [local count: 280668596]:
> >> if (prephitmp_12 != 0)
> >> -goto ; [100.00%]
> >> +goto ; [33.00%]
> >> else
> >> -goto ; [INV]
> >> +goto ; [67.00%]
> >>
> >> -   [local count: 850510901]:
> >> +   [local count: 280668596]:
> >> goto ; [100.00%]
> >>
> >> -   [count: 0]:
> >> +   [local count: 70429947]:
> >> # i_23 = PHI 
> >> # prephitmp_25 = PHI 
> >>
> >> -   [local count: 955630225]:
> >> +   [local count: 640272252]:
> >> # i_15 = PHI 
> >> # prephitmp_16 = PHI 
> >> i_22 = inc (i_15);
> >> if (n_7(D) > i_22)
> >>   goto ; [89.00%]
> >> else
> >>   goto ; [11.00%]
> >>
> >> -   [local count: 850510901]:
> >> +   [local count: 569842305]:
> >> goto ; [100.00%]
> >>
> >>   }
> >>
> >> gcc/ChangeLog:
> >>
> >>* tree-ssa-loop-split.c (split_loop): Fix incorrect probability.
> >>(do_split_loop_on_cond): Likewise.
> >> ---
> >>   gcc/tree-ssa-loop-split.c | 16 
> >>   1 file changed, 8 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
> >> index 3a09bbc39e5..8e5a7ded0f7 100644
> >> --- a/gcc/tree-ssa-loop-split.c
> >> +++ b/gcc/tree-ssa-loop-split.c
> >> @@ -583,10 +583,10 @@ split_loop (class loop *loop1)
> >>basic_block cond_bb;
>
>   if (!initial_true)
> -   cond = fold_build1 (TRUTH_NOT_EXPR, boolean_type_node, cond);
> +   cond = fold_build1 (TRUTH_NOT_EXPR, boolean_type_node, cond);
> +
> + edge true_edge = EDGE_SUCC (bbs[i], 0)->flags & EDGE_TRUE_VALUE
> +? EDGE_SUCC (bbs[i], 0)
> +: EDGE_SUCC (bbs[i], 1);
>
> >>
> >>class loop *loop2 = loop_version (loop1, cond, &cond_bb,
> >> - profile_probability::always (),
> >> - profile_probability::always (),
> >> - profile_probability::always (),
> >> - profile_probability::always (),
> >> + true_edge->probability,
> >> + true_edge->probability.invert (),
> >> + true_edge->probability,
> >> + true_edge->probability.invert (),
> >>   true);
> >
> > there is no 'true_edge' variable at this point.
>
> Sorry, missed the above hunk when split the patch.
>
> >
> >>gcc_assert (loop2);
> >>
> >> @@ -1486,10 +1486,10 @@ do_split_loop_on_cond (struct loop *loop1, edge 
> >> invar_branch)
> >> initialize_original_copy_tables ();
> >>
> >> struct loop *loop2 = loop_version (loop1, boolean_true_node, NULL,
> >> -   profile_probability::always (),
> >>

Re: [PATCH] Fix loop split incorrect count and probability

2021-08-08 Thread Feng Xue OS via Gcc-patches

Yes. The condition used to switch between the two versioned loops is "true",
so the first two arguments should be 100% and 0%.

Unlike normal loop split, we cannot deduce an exactly precise probability for
condition-based loop split, since the cfg inside loop2 is changed
(invar-branch is replaced by "true", as shown in the comment on
do_split_loop_on_cond). Anyway, your way of scaling the two loops'
probabilities according to that of invar-branch seems to be a better
heuristic than the original one, and gives a more reasonable execution count,
at least for the loop header bb.
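
As a rough illustration of the scaling heuristic discussed here, the sketch
below distributes a loop header count between the two versioned loops
according to a branch probability. It is a plain-integer stand-in written
for this mail; scale_count, split_counts and SplitCounts are hypothetical
names, and GCC's profile_count and profile_probability classes are not used.

```cpp
#include <cassert>
#include <cstdint>

// Scale a profile count by the probability num/den, rounding to nearest.
std::uint64_t scale_count(std::uint64_t count, std::uint64_t num,
                          std::uint64_t den) {
  assert(den != 0 && num <= den);
  return (count * num + den / 2) / den;
}

struct SplitCounts {
  std::uint64_t loop1;  // count given to the first versioned loop
  std::uint64_t loop2;  // count given to the second versioned loop
};

// Distribute the original header count between the two loops according to
// the probability (num/den) of the branch the split is based on.
SplitCounts split_counts(std::uint64_t header_count, std::uint64_t num,
                         std::uint64_t den) {
  return {scale_count(header_count, num, den),
          scale_count(header_count, den - num, den)};
}
```

With the header count 955630225 and the 33%/67% branch from the quoted dump,
this yields values close to, though not bit-identical with, the 315357973
and 640272252 shown there, since the rounding here differs from GCC's own
probability representation.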

Thanks,
Feng


From: Gcc-patches  
on behalf of Richard Biener via Gcc-patches 
Sent: Friday, August 6, 2021 7:46 PM
To: Xionghu Luo
Cc: seg...@kernel.crashing.org; wschm...@linux.ibm.com; li...@gcc.gnu.org; 
gcc-patches@gcc.gnu.org; hubi...@ucw.cz; dje@gmail.com
Subject: Re: [PATCH] Fix loop split incorrect count and probability

On Tue, 3 Aug 2021, Xionghu Luo wrote:

> loop split condition is moved between loop1 and loop2, the split bb's
> count and probability should also be duplicated instead of (100% vs INV),
> secondly, the original loop1 and loop2 count need be proportional from the
> original loop.
>
> Regression tested pass, OK for master?
>
> diff base/loop-cond-split-1.c.151t.lsplit  
> patched/loop-cond-split-1.c.151t.lsplit:
> ...
>int prephitmp_16;
>int prephitmp_25;
>
> [local count: 118111600]:
>if (n_7(D) > 0)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
> [local count: 118111600]:
>return;
>
> [local count: 105119324]:
>pretmp_3 = ga;
>
> -   [local count: 955630225]:
> +   [local count: 315357973]:
># i_13 = PHI 
># prephitmp_12 = PHI 
>if (prephitmp_12 != 0)
>  goto ; [33.00%]
>else
>  goto ; [67.00%]
>
> -   [local count: 315357972]:
> +   [local count: 104068130]:
>_2 = do_something ();
>ga = _2;
>
> -   [local count: 955630225]:
> +   [local count: 315357973]:
># prephitmp_5 = PHI 
>i_10 = inc (i_13);
>if (n_7(D) > i_10)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
> [local count: 105119324]:
>goto ; [100.00%]
>
> -   [local count: 850510901]:
> +   [local count: 280668596]:
>if (prephitmp_12 != 0)
> -goto ; [100.00%]
> +goto ; [33.00%]
>else
> -goto ; [INV]
> +goto ; [67.00%]
>
> -   [local count: 850510901]:
> +   [local count: 280668596]:
>goto ; [100.00%]
>
> -   [count: 0]:
> +   [local count: 70429947]:
># i_23 = PHI 
># prephitmp_25 = PHI 
>
> -   [local count: 955630225]:
> +   [local count: 640272252]:
># i_15 = PHI 
># prephitmp_16 = PHI 
>i_22 = inc (i_15);
>if (n_7(D) > i_22)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
> -   [local count: 850510901]:
> +   [local count: 569842305]:
>goto ; [100.00%]
>
>  }
>
> gcc/ChangeLog:
>
>   * tree-ssa-loop-split.c (split_loop): Fix incorrect probability.
>   (do_split_loop_on_cond): Likewise.
> ---
>  gcc/tree-ssa-loop-split.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
> index 3a09bbc39e5..8e5a7ded0f7 100644
> --- a/gcc/tree-ssa-loop-split.c
> +++ b/gcc/tree-ssa-loop-split.c
> @@ -583,10 +583,10 @@ split_loop (class loop *loop1)
>   basic_block cond_bb;
>
>   class loop *loop2 = loop_version (loop1, cond, &cond_bb,
> -profile_probability::always (),
> -profile_probability::always (),
> -profile_probability::always (),
> -profile_probability::always (),
> +true_edge->probability,
> +true_edge->probability.invert (),
> +true_edge->probability,
> +true_edge->probability.invert (),
>  true);

there is no 'true_edge' variable at this point.

>   gcc_assert (loop2);
>
> @@ -1486,10 +1486,10 @@ do_split_loop_on_cond (struct loop *loop1, edge invar_branch)
>initialize_original_copy_tables ();
>
>struct loop *loop2 = loop_version (loop1, boolean_true_node, NULL,
> -  profile_probability::always (),
> -  profile_probability::never (),
> -  profile_probability::always (),
> -  profile_probability::always (),
> +  invar_branch->probability.invert (),
> +  invar_branch->probability,
> +  invar_branch->probability.invert (),
> +  invar_branch->probability,
>true);
>if (!loop2)
>  {

Question about non-POD class type

2021-05-14 Thread Feng Xue OS via Gcc-patches

For an instance of a non-POD class, can I always assume that any
operation on it must be type-safe, and that any wrong or even tricky code
that violates this is UB per the C++ spec? For example, here are some ways:

union {
    Type1 *p1;
    Type2 *p2;
};

or

union {
    Type1 t1;
    Type2 t2;
};

or

Type1 *p1;
void *p = p1;
Type2 *p2 = (Type2 *) p;
p2->xxx;
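
For reference, a minimal sketch of the third case, assuming two unrelated
polymorphic classes (Type1, Type2, tag and round_trip are hypothetical names
used only for illustration). Only the well-defined conversion is executed;
the problematic access is left in a comment.

```cpp
#include <cassert>

struct Type1 {
  virtual ~Type1() = default;
  virtual int tag() const { return 1; }
};

struct Type2 {
  virtual ~Type2() = default;
  virtual int tag() const { return 2; }
};

// Converting an object pointer to void * and back to the *same* type is
// well-defined and yields the original pointer.
int round_trip(Type1 *p1) {
  void *p = p1;
  Type1 *q = static_cast<Type1 *>(p);
  return q->tag();
}

// By contrast, the pattern in the question,
//   Type2 *p2 = static_cast<Type2 *>(p);  p2->tag();
// calls a member function through an unrelated class type on a Type1
// object, which is undefined behavior, so it is deliberately not executed.
```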

Feng

Re: [PATCH/RFC] Add a new memory gathering optimization for loop (PR98598)

2021-05-06 Thread Feng Xue OS via Gcc-patches

>> gcc/
>> PR tree-optimization/98598
>> * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
>> * common.opt (-ftree-loop-mgo): New option.
> 
> Just a quick comment - -ftree-loop-mgo is user-facing and it isn't really a
> good name.  -floop-mgo would be better but still I'd have no idea what this
> would do.
> 
> I don't have a good suggestion here other than to expand it to
> -floop-gather-memory (?!).

OK. That is better than "mgo"; this abbreviation is only a term for development use.

> The option documentation isn't informative either.
> 
> From:
> 
>   outer-loop ()
> {
>   inner-loop (iter, iter_count)
> {
>   Type1 v1 = LOAD (iter);
>   Type2 v2 = LOAD (v1);
>   Type3 v3 = LOAD (v2);
>   ...
>   iter = NEXT (iter);
> }
> }
> 
> To:
> 
>   typedef struct cache_elem
> {
>   bool   init;
>   Type1  c_v1;
>   Type2  c_v2;
>   Type3  c_v3;
> } cache_elem;
> 
>   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));
> 
>   outer-loop ()
> {
>   size_t cache_idx = 0;
> 
>   inner-loop (iter, iter_count)
> {
>   if (!cache_arr[cache_idx]->init)
> {
>   v1 = LOAD (iter);
>   v2 = LOAD (v1);
>   v3 = LOAD (v2);
> 
>   cache_arr[cache_idx]->init = true;
>   cache_arr[cache_idx]->c_v1 = v1;
>   cache_arr[cache_idx]->c_v2 = v2;
>   cache_arr[cache_idx]->c_v3 = v3;
> }
>   else
> {
>   v1 = cache_arr[cache_idx]->c_v1;
>   v2 = cache_arr[cache_idx]->c_v2;
>   v3 = cache_arr[cache_idx]->c_v3;
> }
>   ...
>   cache_idx++;
>   iter = NEXT (iter);
> }
> }
> 
>   free (cache_arr);
> 
> This is a _very_ special transform.  What it seems to do is
> optimize the dependent loads for outer loop iteration n > 1
> by caching the result(s).  If that's possible then you should
> be able to distribute the outer loop to one doing the caching
> and one using the cache.  Then this transform would be more
> like a traditional array expansion of scalars?  In some cases
> also loop interchange could remove the need for the caching.
> 
> Doing MGO as the very first loop pass thus looks bad, I think
> MGO should be much later, for example after interchange.
> I also think that MGO should work in concert with loop
> distribution (which could have an improved cost model)
> rather than being a separate pass.
> 
> Your analysis phase looks quite expensive, building sth
> like a on-the side representation very closely matching SSA.
> It seems to work from PHI defs to uses, which looks backwards.

I did not quite catch this point. Would you please explain it in more detail?

> You seem to roll your own dependence analysis code :/  Please
> have a look at loop distribution.
> 
> Also you build an actual structure type for reasons that escape
> me rather than simply accessing the allocated storage at
> appropriate offsets.
> 
> I think simply calling 'calloc' isn't OK because you might need
> aligned storage and because calloc might not be available.
> Please at least use 'malloc' and make sure MALLOC_ABI_ALIGNMENT
> is large enough for the data you want to place (or perform
> dynamic re-alignment yourself).  We probably want some generic
> middle-end utility to obtain aligned allocated storage at some
> point.
> 
> As said above I think you want to re-do this transform as
> a loop distribution transform.  I think if caching works then
> the loads should be distributable and the loop distribution
> transform should be enhanced to expand the scalars to arrays.

I checked the code of loop distribution, and its trigger strategy seems
to be very conservative: it currently targets only simple and regular
index-based loops and cannot handle linked-list traversal, which
consists of a series of discrete memory accesses and is where MGO would
matter a lot. Additionally, for some complicated cases, we cannot
completely decompose MGO into two separate loops for
"do caching" and "use caching" respectively. An example:

for (i = 0; i < N; i++)
  {
    for (j = 0; j < i; j++)
      {
        Type1 v1 = LOAD_FN1 (j);
        Type2 v2 = LOAD_FN2 (v1);
        Type3 v3 = LOAD_FN3 (v2);

        ...

        condition = ...
      }

    if (condition)
      break;
  }

We should not cache all loads (N in total) in one step, since some
of them might be invalid after "condition" breaks out of the loops. We have
to mix "do caching" and "use caching" and let them be switched dynamically
based on the "init" flag. But loop distribution does have some
overlap with MGO in analysis and transformation, so we will try to
see if there is a way to unify them.
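
To show what mixing "do caching" and "use caching" looks like at source
level, here is a hand-applied sketch on a linked-list traversal whose inner
trip count grows with the outer iteration. It is only an illustration, not
output of the pass; Item, Node, CacheElem, sum_plain and sum_cached are
hypothetical names, and the list is assumed to hold at most `count` nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Item { int value; };
struct Node { Item *item; Node *next; };

// Cache element with the "init" flag scheme described above.
struct CacheElem {
  bool init = false;
  int c_value = 0;
};

// Original form: outer iteration i visits the first i + 1 nodes and chases
// node->item->value again every time.
long sum_plain(const Node *head, std::size_t outer) {
  long sum = 0;
  for (std::size_t i = 0; i < outer; i++) {
    std::size_t j = 0;
    for (const Node *n = head; n && j <= i; n = n->next, j++)
      sum += n->item->value;
  }
  return sum;
}

// Hand-applied caching: a slot is filled the first time it is reached and
// reused afterwards, so filling and reusing share one loop nest.
long sum_cached(const Node *head, std::size_t count, std::size_t outer) {
  std::vector<CacheElem> cache(count);  // all "init" flags start false
  long sum = 0;
  for (std::size_t i = 0; i < outer; i++) {
    std::size_t j = 0;
    for (const Node *n = head; n && j <= i; n = n->next, j++) {
      if (!cache[j].init) {
        cache[j].c_value = n->item->value;  // dependent loads happen once
        cache[j].init = true;
      }
      sum += cache[j].c_value;  // sequential access on later visits
    }
  }
  return sum;
}
```

Because each outer iteration reaches one more node than the previous one,
no single up-front "do caching" loop could fill exactly the slots that are
needed, which is the situation described above.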

Thanks,
Feng

Re: [PATCH/RFC] Add a new memory gathering optimization for loop (PR98598)

2021-04-29 Thread Feng Xue OS via Gcc-patches

>> This patch implements a new loop optimization according to the proposal
>> in RFC given at
>> https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html.
>> So do not repeat the idea in this mail. Hope your comments on it.
> 
> With the caveat that I'm not an optimization expert (but no one else
> seems to have replied), here are some thoughts.
> 
> [...snip...]
> 
>> Subject: [PATCH 1/3] mgo: Add a new memory gathering optimization for loop
>>  [PR98598]
> 
> BTW, did you mean to also post patches 2 and 3?
>

Not yet, but they are ready. Since this is a kind of special optimization that
uses the heap as temporary storage, which is not a common approach in gcc, we
do not know the basic attitude of the community towards it. So only the first
patch was sent out for initial comments, in that it implements a generic MGO
framework and is complete and self-contained. The other 2 patches just add some
enhancements for a specific code pattern and a dynamic alias check. If this
proposal is accepted in principle, we will submit the other 2 for review.

> 
>> In nested loops, if scattered memory accesses inside inner loop remain
>> unchanged in outer loop, we can sequentialize these loads by caching
>> their values into a temporary memory region at the first time, and
>> reuse the caching data in following iterations. This way can improve
>> efficiency of cpu cache subsystem by reducing its unpredictable activities.
> 
> I don't think you've cited any performance numbers so far.  Does the
> optimization show a measurable gain on some benchmark(s)?  e.g. is this
> ready to run SPEC yet, and how does it do?

Yes, we have done that. Some real applications gain a minor improvement of
several percent. To be specific, we also get a major
improvement of more than 30% for a certain benchmark in SPEC2017.

> 
>> To illustrate what the optimization will do, two pieces of pseudo code,
>> before and after transformation, are given. Suppose all loads and
>> "iter_count" are invariant in outer loop.
>>
>> From:
>>
>>   outer-loop ()
>> {
>>   inner-loop (iter, iter_count)
>> {
>>   Type1 v1 = LOAD (iter);
>>   Type2 v2 = LOAD (v1);
>>   Type3 v3 = LOAD (v2);
>>   ...
>>   iter = NEXT (iter);
>> }
>> }
>>
>> To:
>>
>>   typedef struct cache_elem
>> {
>>   bool   init;
>>   Type1  c_v1;
>>   Type2  c_v2;
>>   Type3  c_v3;
> 
> Putting the "bool init;" at the front made me think "what about
> packing?" but presumably the idea is that every element is accessed in
> order, so it presumably benefits speed to have "init" at the top of the
> element, right?

Yes, the layout of the struct could be optimized in terms of size by
some means, such as:
  o. packing "init" into a padding hole after a certain field
  o. if a certain field is of pointer type, the field can take the role of "init"
  (Non-NULL implies "initialized")
For now this simple scheme is straightforward, and it would be enhanced
in various aspects later.
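
The two size ideas can be sketched as follows. This is only an illustration
under assumed field types (one pointer and one int); Payload, ElemFlagFirst,
ElemFlagPacked, ElemNoFlag and is_init are hypothetical names, and the actual
sizes depend on the target ABI.

```cpp
#include <cassert>
#include <cstddef>

struct Payload;  // opaque, hypothetical pointee type

// Straightforward layout: the flag comes first and is typically followed
// by padding up to the alignment of the pointer field.
struct ElemFlagFirst {
  bool init;
  Payload *c_v1;
  int c_v2;
};

// The flag moved behind the int, where it can occupy tail padding.
struct ElemFlagPacked {
  Payload *c_v1;
  int c_v2;
  bool init;
};

// No separate flag: a non-null pointer means "initialized".  This is only
// valid if a cached c_v1 can never legitimately be null.
struct ElemNoFlag {
  Payload *c_v1;
  int c_v2;
};

inline bool is_init(const ElemNoFlag &e) { return e.c_v1 != nullptr; }
```

On a typical LP64 target the first layout takes 24 bytes and the other two
take 16, but that is a property of the ABI, not something the sketch
guarantees.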

>> } cache_elem;
>>
>>   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));

> What if the allocation fails at runtime?  Do we keep an unoptimized
> copy of the nested loops around as a fallback and have an unlikely
> branch to that copy?

Yes, we should. But we do it in a different way: a flag is added to the
original nested loop to control the runtime switch between optimized and
unoptimized execution. This definitely incurs a runtime cost, but
avoids possible code size bloat. A better handling, as a TODO, is
to apply the dynamic switch to large loops and loop cloning to small ones.
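
A minimal sketch of the flag-controlled switch, written by hand for
illustration only: one loop nest serves both modes, and the flag decides at
run time whether the cache is consulted. gather_sum, Slot and use_cache are
hypothetical names, and the force_alloc_failure parameter exists only so the
fallback path can be exercised.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Sum table[index[k]] for every k, repeated `rounds` times.
long gather_sum(const int *table, const std::size_t *index, std::size_t n,
                int rounds, bool force_alloc_failure = false) {
  struct Slot { bool init; int value; };

  // calloc clears every "init" flag; a null result means no cache.
  Slot *cache = force_alloc_failure
                    ? nullptr
                    : static_cast<Slot *>(std::calloc(n, sizeof(Slot)));
  const bool use_cache = cache != nullptr;  // the runtime switch

  long sum = 0;
  for (int r = 0; r < rounds; r++)
    for (std::size_t k = 0; k < n; k++) {
      int v;
      if (use_cache && cache[k].init) {
        v = cache[k].value;   // optimized path: reuse cached value
      } else {
        v = table[index[k]];  // original scattered load
        if (use_cache) {
          cache[k].value = v;
          cache[k].init = true;
        }
      }
      sum += v;
    }

  std::free(cache);  // free(nullptr) is a no-op
  return sum;
}
```

The extra test on use_cache in every iteration is the runtime cost mentioned
above; cloning the loop nest instead would remove it at the price of code
size.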

> I notice that you're using calloc, presumably to clear all of the
> "init" flags (and the whole buffer).
> 
> FWIW, this feels like a case where it would be nice to have a thread-
> local heap allocation, perhaps something like an obstack implemented in
> the standard library - but that's obviously scope creep for this.

Yes, that would be good, especially for heavily multi-threaded applications.

> Could it make sense to use alloca for small allocations?  (or is that
> scope creep?)

We did consider using alloca as you said. But if we cannot determine an
upper limit for a non-constant size, we have to place the alloca inside a loop
that encloses the nested loop. Without a corresponding free operation, this
kind of alloca-in-loop might cause stack overflow. So it becomes another
TODO.

>>   outer-loop ()
>> {
>>   size_t cache_idx = 0;
>>
>>   inner-loop (iter, iter_count)
>> {
>>   if (!cache_arr[cache_idx]->init)
>> {
>>   v1 = LOAD (iter);
>>   v2 = LOAD (v1);
>>   v3 = LOAD (v2);
>>
>>   cache_arr[cache_idx]->init = true;
>>   cache_arr[cache_idx]->c_v1 = v1;
>>   cache_arr[cache_idx]->c_v2 = v2;
>>   cache_arr[cache_idx]->c_v3 = v3;
>> }
>> else
>> {
>>   v1 = cache_arr[cache_idx]->c_v1;
>>

[PATCH] Fix testcases to avoid plusminus-with-convert pattern (PR 97066)

2020-09-16 Thread Feng Xue OS via Gcc-patches

With the new pattern rule (T)(A) +- (T)(B) -> (T)(A +- B),
some testcases are simplified and no longer keep the code pattern
expected by their test checks. Minor changes are made to those
cases to avoid the simplification effect of the rule.
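
For readers unfamiliar with the rule, the sketch below shows the two sides
of (T)(A) + (T)(B) -> (T)(A + B) for one case where they agree, a narrowing
conversion between unsigned types. It is an illustration only; the function
names are hypothetical, the exact conditions checked in match.pd are not
reproduced here, and the identity does not hold for widening conversions in
general.

```cpp
#include <cassert>
#include <cstdint>

// Left-hand side of the rule: convert each operand, then add.
std::uint32_t convert_then_add(std::uint64_t a, std::uint64_t b) {
  return static_cast<std::uint32_t>(a) + static_cast<std::uint32_t>(b);
}

// Right-hand side of the rule: add in the inner type, then convert once.
std::uint32_t add_then_convert(std::uint64_t a, std::uint64_t b) {
  return static_cast<std::uint32_t>(a + b);
}
```

Both functions compute the sum modulo 2^32, so they return the same value
for every pair of inputs, including ones where the 64-bit addition wraps.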

Tested on x86_64-linux and aarch64-linux.

Feng
---
2020-09-16  Feng Xue  

gcc/testsuite/
PR testsuite/97066
* gcc.dg/ifcvt-3.c: Modified to suppress simplification.
* gcc.dg/tree-ssa/20030807-10.c: Likewise.
From ac768c385f1332e276260c6de83b12929180fbfb Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Wed, 16 Sep 2020 16:21:14 +0800
Subject: [PATCH] testsuite/97066 - minor change to bypass
 plusminus-with-convert rule

The following testcases would be simplified by the new rule
(T)(A) +- (T)(B) -> (T)(A +- B), and so would no longer keep the code pattern
expected by their test checks. Adjust the test code to suppress the
simplification.

2020-09-16  Feng Xue  

gcc/testsuite/
	PR testsuite/97066
	* gcc.dg/ifcvt-3.c: Modified to suppress simplification.
	* gcc.dg/tree-ssa/20030807-10.c: Likewise.
---
 gcc/testsuite/gcc.dg/ifcvt-3.c  | 2 +-
 gcc/testsuite/gcc.dg/tree-ssa/20030807-10.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/ifcvt-3.c b/gcc/testsuite/gcc.dg/ifcvt-3.c
index b250bc15e08..56fdd753a0a 100644
--- a/gcc/testsuite/gcc.dg/ifcvt-3.c
+++ b/gcc/testsuite/gcc.dg/ifcvt-3.c
@@ -11,7 +11,7 @@ foo (s64 a, s64 b, s64 c)
   if (d == 0)
 return a + c;
   else
-return b + d + c;
+return b + c + d;
 }
 
 /* This test can be reduced to just return a + c;  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20030807-10.c b/gcc/testsuite/gcc.dg/tree-ssa/20030807-10.c
index 0903f3c4321..0e01e511b78 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20030807-10.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20030807-10.c
@@ -7,7 +7,7 @@ unsigned int
 subreg_highpart_offset (outermode, innermode)
  int outermode, innermode;
 {
-  unsigned int offset = 0;
+  unsigned int offset = 1;
   int difference = (mode_size[innermode] - mode_size[outermode]);
   if (difference > 0)
 {
-- 
2.17.1

Re: [PATCH 2/2 V4] Add plusminus-with-convert pattern (PR 94234)

2020-09-15 Thread Feng Xue OS via Gcc-patches

>> Add a rule (T)(A) +- (T)(B) -> (T)(A +- B), which works only when (A +- B)
>> could be folded to a simple value. By this rule, a plusminus-mult-with-convert
>> expression could be handed over to the rule (A * C) +- (B * C) -> (A +- B).
>
>Please use INTEGRAL_TYPE_P () instead of TREE_CODE == INTEGER_TYPE
>in all three cases.  It's enough to check for INTEGRAL_TYPE_P on one operand,
>the types_match will take care of the other.

I did consider using INTEGRAL_TYPE_P (), but if the inner type is bool or
enum, can we do a plus/minus operation on it?

Feng

>
>OK with those changes.
>
>Thanks,
>Richard.
>
>
> Bootstrapped/regtested on x86_64-linux and aarch64-linux.
>
> Feng
> ---
> 2020-09-15  Feng Xue  
>
> gcc/
> PR tree-optimization/94234
> * match.pd (T)(A) +- (T)(B) -> (T)(A +- B): New simplification.
>
> gcc/testsuite/
> PR tree-optimization/94234
> * gcc.dg/pr94234-3.c: New test.

Re: Ping: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)

2020-09-15 Thread Feng Xue OS via Gcc-patches

>> This patch is to handle simplification of plusminus-mult-with-convert expression
>> as ((T) X) +- ((T) Y), in which at least one of (X, Y) is result of multiplication.
>> This is done in forwprop pass. We try to transform it to (T) (X +- Y), and resort
>> to gimple-matcher to fold (X +- Y) instead of manually code pattern recognition.
>
>I still don't like the complete new function with all its correctness
>issues - the existing
>fold_plusminus_mult_expr was difficult enough to get correct for
>corner cases and
>we do have a set of match.pd patterns (partly?) implementing its transforms.
>
>Looking at
>
>+unsigned goo (unsigned m_param, unsigned n_param)
>+{
>+  unsigned b1 = m_param * (n_param + 2);
>+  unsigned b2 = m_param * (n_param + 1);
>+  int r = (int)(b1) - (int)(b2);
>
>it seems we want to simplify (signed)A - (signed)B to
>(signed)(A - B) if A - B "simplifies"?  I guess
>
>(simplify
>  (plusminus (nop_convert @0) (nop_convert? @1))
>  (convert (plusminus! @0 @1)))
>
>probably needs a swapped pattern or not iterate over plus/minus
>to handle at least one converted operand and avoid adding
>a (plus @0 @1) -> (convert (plus! @0 @1)) rule.
>
>Even
>
>(simplify
> (minus (nop_convert @0) (nop_convert @1))
> (convert (minus! @0 @1)))
>
>seems to handle all your testcases already (which means
>they are all the same and not very exhaustive...)
Yes. This is much simpler.

Thanks,
Feng

>Richard.
>
>
>> Regards,
>> Feng
>> ---
>> 2020-09-03  Feng Xue  
>>
>> gcc/
>> PR tree-optimization/94234
>> * tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New
>> function.
>> (fwprop_ssa_val): Move it before its new caller.
>> (pass_forwprop::execute): Add call to
>> simplify_plusminus_mult_with_convert.
>>
>> gcc/testsuite/
>> PR tree-optimization/94234
>> * gcc.dg/pr94234-3.c: New test.
>

[PATCH 2/2 V4] Add plusminus-with-convert pattern (PR 94234)

2020-09-15 Thread Feng Xue OS via Gcc-patches

Add a rule (T)(A) +- (T)(B) -> (T)(A +- B), which works only when (A +- B)
could be folded to a simple value. By this rule, a plusminus-mult-with-convert
expression could be handed over to the rule (A * C) +- (B * C) -> (A +- B).
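
The rule moves the conversion outside the addition or subtraction. The C sketch below is not part of the patch. It assumes fixed-width unsigned types (uint32_t as the inner type, uint16_t as the outer type T) so that all arithmetic is modular and well defined, and the helper names are made up for illustration. It suggests why the transform is value-preserving when T is no wider than the inner type.

```c
#include <stdint.h>

/* (T)(A) - (T)(B): truncate each operand first, then subtract.
   T is uint16_t here, the inner type is uint32_t (both assumed).  */
static uint16_t sub_convert_first(uint32_t a, uint32_t b)
{
    return (uint16_t)((uint16_t)a - (uint16_t)b);
}

/* (T)(A - B): subtract in the wider inner type, then truncate once.  */
static uint16_t sub_convert_last(uint32_t a, uint32_t b)
{
    return (uint16_t)(a - b);
}
```

Both helpers keep only the low 16 bits of the difference, so they agree for every input. The second form exposes (A - B) to the other folding rules.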

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Feng
---
2020-09-15  Feng Xue  

gcc/
PR tree-optimization/94234
* match.pd (T)(A) +- (T)(B) -> (T)(A +- B): New simplification.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234-3.c: New test.

From f7c7483bd61fe1e3d6888f84d718fb4be4ea9e14 Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 17 Aug 2020 23:00:35 +0800
Subject: [PATCH] tree-optimization/94234 - add plusminus-with-convert pattern

Add a rule (T)(A) +- (T)(B) -> (T)(A +- B), which works only when (A +- B)
could be folded to a simple value. By this rule, a plusminus-mult-with-convert
expression could be handed over to the rule (A * C) +- (B * C) -> (A +- B).

2020-09-15  Feng Xue  

gcc/
	PR tree-optimization/94234
	* match.pd (T)(A) +- (T)(B) -> (T)(A +- B): New simplification.

gcc/testsuite/
	PR tree-optimization/94234
 	* gcc.dg/pr94234-3.c: New test.
---
 gcc/match.pd | 16 
 gcc/testsuite/gcc.dg/pr94234-3.c | 42 
 2 files changed, 58 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/pr94234-3.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 46fd880bd37..d8c59fad9c1 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2397,6 +2397,22 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(plus (convert @0) (op @2 (convert @1))
 #endif
 
+/* (T)(A) +- (T)(B) -> (T)(A +- B) only when (A +- B) could be simplified
+   to a simple value.  */
+#if GIMPLE
+  (for op (plus minus)
+   (simplify
+(op (convert @0) (convert @1))
+ (if (TREE_CODE (type) == INTEGER_TYPE
+	  && TREE_CODE (TREE_TYPE (@0)) == INTEGER_TYPE
+	  && TREE_CODE (TREE_TYPE (@1)) == INTEGER_TYPE
+	  && TYPE_PRECISION (type) <= TYPE_PRECISION (TREE_TYPE (@0))
+	  && types_match (TREE_TYPE (@0), TREE_TYPE (@1))
+	  && !TYPE_OVERFLOW_TRAPS (type)
+	  && !TYPE_OVERFLOW_SANITIZED (type))
+  (convert (op! @0 @1)))))
+#endif
+
   /* ~A + A -> -1 */
   (simplify
(plus:c (bit_not @0) @0)
diff --git a/gcc/testsuite/gcc.dg/pr94234-3.c b/gcc/testsuite/gcc.dg/pr94234-3.c
new file mode 100644
index 000..9bb9b46bd96
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94234-3.c
@@ -0,0 +1,42 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop1" } */
+
+typedef __SIZE_TYPE__ size_t;
+typedef __PTRDIFF_TYPE__ ptrdiff_t;
+
+ptrdiff_t foo1 (char *a, size_t n)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n - 1);
+
+  return b1 - b2;
+}
+
+int use_ptr (char *a, char *b);
+
+ptrdiff_t foo2 (char *a, size_t n)
+{
+  char *b1 = a + 8 * (n - 1);
+  char *b2 = a + 8 * n;
+
+  use_ptr (b1, b2);
+
+  return b1 - b2;
+}
+
+int use_int (int i);
+
+unsigned goo (unsigned m_param, unsigned n_param)
+{
+  unsigned b1 = m_param * (n_param + 2);
+  unsigned b2 = m_param * (n_param + 1);
+  int r = (int)(b1) - (int)(b2);
+
+  use_int (r);
+
+  return r;
+}
+
+/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return -8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return m_param" 1 "forwprop1" } } */
-- 
2.17.1

Re: Ping: [PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR 94234)

2020-09-14 Thread Feng Xue OS via Gcc-patches

>@@ -3426,8 +3426,16 @@ dt_simplify::gen_1 (FILE *f, int indent, bool gimple, operand *result)
>  /* Re-fold the toplevel result.  It's basically an embedded
> gimple_build w/o actually building the stmt.  */
>  if (!is_predicate)
>-   fprintf_indent (f, indent,
>-   "res_op->resimplify (lseq, valueize);\n");
>+   {
>+ fprintf_indent (f, indent,
>+ "res_op->resimplify (lseq, valueize);\n");
>+ if (e->force_leaf)
>+   {
>+ fprintf_indent (f, indent,
>+ "if (!maybe_push_res_to_seq (res_op, NULL))\n");
>+ fprintf_indent (f, indent + 2, "return false;\n");
>
>please use "goto %s;\n", fail_label)  here.  OK with that change.
Ok.
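
The requested change makes the generated code jump to a failure label instead of returning. As a rough sketch only, the emitter could look like the C below. It is not genmatch code: emit_leaf_check is a hypothetical helper, snprintf stands in for genmatch's fprintf_indent, and the label name passed in is an assumption.

```c
#include <stdio.h>

/* Hypothetical emitter: writes the force-leaf check into BUF, indented by
   INDENT spaces, with a jump to FAIL_LABEL when the result is not a leaf.
   Returns what snprintf returns.  */
static int emit_leaf_check(char *buf, size_t size, int indent,
                           const char *fail_label)
{
    return snprintf(buf, size,
                    "%*sif (!maybe_push_res_to_seq (res_op, NULL))\n"
                    "%*sgoto %s;\n",
                    indent, "", indent + 2, "", fail_label);
}
```

Jumping to a label lets the generated function fall through to whatever follows the failed pattern, which a plain "return false" would skip.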

>
>I've tried again to think about sth prettier to cover these kind of
>single-use checks but failed to come up with sth.
Maybe we need a smart combiner that can deduce cost globally, and
remove these single-use specifiers from rule description.

Feng


From: Richard Biener 
Sent: Monday, September 14, 2020 9:39 PM
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Ping: [PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR 94234)

On Mon, Sep 14, 2020 at 5:17 AM Feng Xue OS via Gcc-patches
 wrote:
>
> Thanks,

@@ -3426,8 +3426,16 @@ dt_simplify::gen_1 (FILE *f, int indent, bool gimple, operand *result)
  /* Re-fold the toplevel result.  It's basically an embedded
 gimple_build w/o actually building the stmt.  */
  if (!is_predicate)
-   fprintf_indent (f, indent,
-   "res_op->resimplify (lseq, valueize);\n");
+   {
+ fprintf_indent (f, indent,
+ "res_op->resimplify (lseq, valueize);\n");
+ if (e->force_leaf)
+   {
+ fprintf_indent (f, indent,
+ "if (!maybe_push_res_to_seq (res_op, NULL))\n");
+ fprintf_indent (f, indent + 2, "return false;\n");

please use "goto %s;\n", fail_label)  here.  OK with that change.

I've tried again to think about sth prettier to cover these kind of
single-use checks but failed to come up with sth.

Thanks and sorry for the delay,
Richard.

> Feng
>
> 
> From: Feng Xue OS
> Sent: Thursday, September 3, 2020 2:06 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR 94234)
>
> For pattern A * C +- B * C -> (A +- B) * C, simplification is disabled
> when A and B are not single-use. This patch is a minor enhancement
> on the pattern, which allows folding if final result is found to be a
> simple gimple value (constant/existing SSA).
>
> Bootstrapped/regtested on x86_64-linux and aarch64-linux.
>
> Feng
> ---
> 2020-09-03  Feng Xue  
>
> gcc/
> PR tree-optimization/94234
> * genmatch.c (dt_simplify::gen_1): Emit check on final simplification
> result when "!" is specified on toplevel output expr.
> * match.pd ((A * C) +- (B * C) -> (A +- B) * C): Allow folding for
> expr with multi-use operands if final result is a simple gimple value.
>
> gcc/testsuite/
> PR tree-optimization/94234
> * gcc.dg/pr94234-2.c: New test.
> ---

Ping: [PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR 94234)

2020-09-13 Thread Feng Xue OS via Gcc-patches

Thanks,
Feng


From: Feng Xue OS
Sent: Thursday, September 3, 2020 2:06 PM
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR 94234)

For pattern A * C +- B * C -> (A +- B) * C, simplification is disabled
when A and B are not single-use. This patch is a minor enhancement
on the pattern, which allows folding if final result is found to be a
simple gimple value (constant/existing SSA).
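
For illustration, the arithmetic behind the new testcase can be checked with the C sketch below. It is not part of the patch. The function names are made up, and unsigned types are used throughout (an assumption made here) so that wrap-around is well defined.

```c
/* 8 * (n + 1) - 8 * n: (A - B) folds to the constant 1, so the result is 8.  */
static unsigned diff_const(unsigned n)
{
    unsigned b1 = 8u * (n + 1u);
    unsigned b2 = 8u * n;
    return b1 - b2;
}

/* m * (n + 2) - m * (n + 1): (A - B) folds to 1, so the result is m.  */
static unsigned diff_param(unsigned m, unsigned n)
{
    unsigned b1 = m * (n + 2u);
    unsigned b2 = m * (n + 1u);
    return b1 - b2;
}

/* (k * 28 - k * 15) - k * 12: equals k, but only through the intermediate
   value k * 13, which is neither a constant nor an existing SSA name.  */
static unsigned diff_three(unsigned k)
{
    unsigned b1 = k * 28u;
    unsigned b2 = k * 15u;
    unsigned b3 = k * 12u;
    return (b1 - b2) - b3;
}
```

The first two cases correspond to the "return 8;" and "return m_param" checks in the test. The third corresponds to the scan-tree-dump-not check: the value is mathematically k, yet the test expects forwprop1 not to produce it, presumably because the intermediate result is not a simple value.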

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Feng
---
2020-09-03  Feng Xue  

gcc/
PR tree-optimization/94234
* genmatch.c (dt_simplify::gen_1): Emit check on final simplification
result when "!" is specified on toplevel output expr.
* match.pd ((A * C) +- (B * C) -> (A +- B) * C): Allow folding for
expr with multi-use operands if final result is a simple gimple value.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234-2.c: New test.
---
From e247eb0d9a43856cc0b46f98414ed58d13796d62 Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Tue, 1 Sep 2020 17:17:58 +0800
Subject: [PATCH] tree-optimization/94234 - Fold plusminus_mult expr with
 multi-use operands

2020-09-03  Feng Xue  

gcc/
	PR tree-optimization/94234
	* genmatch.c (dt_simplify::gen_1): Emit check on final simplification
	result when "!" is specified on toplevel output expr.
	* match.pd ((A * C) +- (B * C) -> (A +- B) * C): Allow folding for
	expr with multi-use operands if final result is a simple gimple value.

gcc/testsuite/
	PR tree-optimization/94234
	* gcc.dg/pr94234-2.c: New test.
---
 gcc/genmatch.c   | 12 --
 gcc/match.pd | 22 ++
 gcc/testsuite/gcc.dg/pr94234-2.c | 39 
 3 files changed, 62 insertions(+), 11 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr94234-2.c

diff --git a/gcc/genmatch.c b/gcc/genmatch.c
index 906d842c4d8..d4f01401964 100644
--- a/gcc/genmatch.c
+++ b/gcc/genmatch.c
@@ -3426,8 +3426,16 @@ dt_simplify::gen_1 (FILE *f, int indent, bool gimple, operand *result)
 	  /* Re-fold the toplevel result.  It's basically an embedded
 	 gimple_build w/o actually building the stmt.  */
 	  if (!is_predicate)
-	fprintf_indent (f, indent,
-			"res_op->resimplify (lseq, valueize);\n");
+	{
+	  fprintf_indent (f, indent,
+			  "res_op->resimplify (lseq, valueize);\n");
+	  if (e->force_leaf)
+		{
+		  fprintf_indent (f, indent,
+		  "if (!maybe_push_res_to_seq (res_op, NULL))\n");
+		  fprintf_indent (f, indent + 2, "return false;\n");
+		}
+	}
 	}
   else if (result->type == operand::OP_CAPTURE
 	   || result->type == operand::OP_C_EXPR)
diff --git a/gcc/match.pd b/gcc/match.pd
index 6e45836e32b..46fd880bd37 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2570,15 +2570,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (for plusminus (plus minus)
   (simplify
(plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
-   (if ((!ANY_INTEGRAL_TYPE_P (type)
-	 || TYPE_OVERFLOW_WRAPS (type)
-	 || (INTEGRAL_TYPE_P (type)
-	 && tree_expr_nonzero_p (@0)
-	 && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)
-	/* If @1 +- @2 is constant require a hard single-use on either
-	   original operand (but not on both).  */
-	&& (single_use (@3) || single_use (@4)))
-(mult (plusminus @1 @2) @0)))
+   (if (!ANY_INTEGRAL_TYPE_P (type)
+	|| TYPE_OVERFLOW_WRAPS (type)
+	|| (INTEGRAL_TYPE_P (type)
+	&& tree_expr_nonzero_p (@0)
+	&& expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)
+(if (single_use (@3) || single_use (@4))
+ /* If @1 +- @2 is constant require a hard single-use on either
+	original operand (but not on both).  */
+ (mult (plusminus @1 @2) @0)
+#if GIMPLE
+ (mult! (plusminus @1 @2) @0)
+#endif
+  )))
   /* We cannot generate constant 1 for fract.  */
   (if (!ALL_FRACT_MODE_P (TYPE_MODE (type)))
(simplify
diff --git a/gcc/testsuite/gcc.dg/pr94234-2.c b/gcc/testsuite/gcc.dg/pr94234-2.c
new file mode 100644
index 000..1f4b194dd43
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94234-2.c
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop1" } */ 
+
+int use_fn (int a);
+
+int foo (int n)
+{
+  int b1 = 8 * (n + 1);
+  int b2 = 8 * n;
+
+  use_fn (b1 ^ b2);
+
+  return b1 - b2;
+}
+
+unsigned goo (unsigned m_param, unsigned n_param)
+{
+  unsigned b1 = m_param * (n_param + 2);
+  unsigned b2 = m_param * (n_param + 1);
+
+  use_fn (b1 ^ b2);
+
+  return b1 - b2;
+}
+
+unsigned hoo (unsigned k_param)
+{
+  unsigned b1 = k_param * 28;
+  unsigned b2 = k_param * 15;
+  unsigned b3 = k_param * 12;
+
+  use_fn (b1 ^ b2 ^ b3);
+
+  return (b1 - b2) - b3;
+}
+
+/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return m_param" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-not "return k_param" "forwprop1" } } */

Ping: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)

2020-09-13 Thread Feng Xue OS via Gcc-patches

Thanks,
Feng


From: Feng Xue OS 
Sent: Thursday, September 3, 2020 5:29 PM
To: Richard Biener; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)

Attach patch file.

Feng

From: Gcc-patches on behalf of Feng Xue OS via Gcc-patches
Sent: Thursday, September 3, 2020 5:27 PM
To: Richard Biener; gcc-patches@gcc.gnu.org
Subject: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)

This patch is to handle simplification of plusminus-mult-with-convert expression
as ((T) X) +- ((T) Y), in which at least one of (X, Y) is result of multiplication.
This is done in forwprop pass. We try to transform it to (T) (X +- Y), and resort
to gimple-matcher to fold (X +- Y) instead of manually code pattern recognition.

Regards,
Feng
---
2020-09-03  Feng Xue  

gcc/
PR tree-optimization/94234
* tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New
function.
(fwprop_ssa_val): Move it before its new caller.
(pass_forwprop::execute): Add call to
simplify_plusminus_mult_with_convert.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234-3.c: New test.
From 98c4b97989207dcef5742e9cb451799feafd125e Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 17 Aug 2020 23:00:35 +0800
Subject: [PATCH] tree-optimization/94234 - simplify
 plusminus-mult-with-convert in forwprop

For an expression of the form ((T) X) +- ((T) Y), where at least one of (X, Y)
is the result of a multiplication, try to transform it to (T) (X +- Y), and apply
simplification on (X +- Y) if possible. In this way, we can avoid creating an almost
duplicated rule to handle the plusminus-mult-with-convert variant.

2020-09-03  Feng Xue  

gcc/
	PR tree-optimization/94234
	* tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New
	function.
	(fwprop_ssa_val): Move it before its new caller.
	(pass_forwprop::execute): Add call to
	simplify_plusminus_mult_with_convert.

gcc/testsuite/
	PR tree-optimization/94234
 	* gcc.dg/pr94234-3.c: New test.
---
 gcc/testsuite/gcc.dg/pr94234-3.c |  42 
 gcc/tree-ssa-forwprop.c  | 168 +++
 2 files changed, 191 insertions(+), 19 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr94234-3.c

diff --git a/gcc/testsuite/gcc.dg/pr94234-3.c b/gcc/testsuite/gcc.dg/pr94234-3.c
new file mode 100644
index 000..9bb9b46bd96
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94234-3.c
@@ -0,0 +1,42 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop1" } */
+
+typedef __SIZE_TYPE__ size_t;
+typedef __PTRDIFF_TYPE__ ptrdiff_t;
+
+ptrdiff_t foo1 (char *a, size_t n)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n - 1);
+
+  return b1 - b2;
+}
+
+int use_ptr (char *a, char *b);
+
+ptrdiff_t foo2 (char *a, size_t n)
+{
+  char *b1 = a + 8 * (n - 1);
+  char *b2 = a + 8 * n;
+
+  use_ptr (b1, b2);
+
+  return b1 - b2;
+}
+
+int use_int (int i);
+
+unsigned goo (unsigned m_param, unsigned n_param)
+{
+  unsigned b1 = m_param * (n_param + 2);
+  unsigned b2 = m_param * (n_param + 1);
+  int r = (int)(b1) - (int)(b2);
+
+  use_int (r);
+
+  return r;
+}
+
+/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return -8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return m_param" 1 "forwprop1" } } */
diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
index e2d008dfb92..7b9d46ec919 100644
--- a/gcc/tree-ssa-forwprop.c
+++ b/gcc/tree-ssa-forwprop.c
@@ -338,6 +338,25 @@ remove_prop_source_from_use (tree name)
   return cfg_changed;
 }
 
+/* Primitive "lattice" function for gimple_simplify.  */
+
+static tree
+fwprop_ssa_val (tree name)
+{
+  /* First valueize NAME.  */
+  if (TREE_CODE (name) == SSA_NAME
+  && SSA_NAME_VERSION (name) < lattice.length ())
+{
+  tree val = lattice[SSA_NAME_VERSION (name)];
+  if (val)
+	name = val;
+}
+  /* We continue matching along SSA use-def edges for SSA names
+ that are not single-use.  Currently there are no patterns
+ that would cause any issues with that.  */
+  return name;
+}
+
 /* Return the rhs of a gassign *STMT in a form of a single tree,
converted to type TYPE.
 
@@ -1821,6 +1840,133 @@ simplify_rotate (gimple_stmt_iterator *gsi)
   return true;
 }
 
+/* Given ((T) X) +- ((T) Y), and at least one of (X, Y) is result of
+   multiplication, if the expr can be transformed to (T) (X +- Y) in terms of
+   two's complement computation, apply simplification on (X +- Y) if it is
+   possible.  As a prerequisite, outer result type (T) has precision not more
+   than that of inner operand type.  */
+
+static bool
+simplify_plusminus_mult_with_convert (gimple_stmt_iterator *gsi)
+{
+  gimple *stmt = gsi_stmt (*g

Re: [PATCH] Fix ICE in ipa-cp due to cost addition overflow (PR 96806)

2020-09-04 Thread Feng Xue OS via Gcc-patches

>> Hi,
>>
>> On Mon, Aug 31 2020, Feng Xue OS wrote:
>> > This patch is to fix a bug that cost that is used to evaluate clone candidate
>> > becomes negative due to integer overflow.
>> >
>> > Feng
>> > ---
>> > 2020-08-31  Feng Xue  
>> >
>> > gcc/
>> > PR tree-optimization/96806
>>
>> the component is "ipa," please change that when you commit the patch.
>>
>> > * ipa-cp.c (decide_about_value): Use safe_add to avoid cost addition
>> > overflow.
>>
>> assuming you have bootstrapped and tested it, it is OK for both trunk
>> and all affected release branches.
>
>I have already added caps on things that come from profile counts so
>things do not overflow, but I think in longer run we want to simply use
>sreals here..
>> >&& !good_cloning_opportunity_p (node,
>> > - val->local_time_benefit
>> > - + val->prop_time_benefit,
>> > + safe_add (val->local_time_benefit,
>> > +   val->prop_time_benefit),
>> >   freq_sum, count_sum,
>> > - val->local_size_cost
>> > - + val->prop_size_cost))
>> > + safe_add (val->local_size_cost,
>> > +   val->prop_size_cost)))
>
>Is it also size cost that may overflow? That seem bit odd ;)
>

Yes. prop_size_cost accumulates the size_cost of all callees. Since
there are two recursive calls, this value grows exponentially, as a
power of 2, and easily exceeds the range of an integer.

This is actually a defect in the cost computation for recursive cloning.
But I think we need to consider thoroughly how to adjust the
cost model for recursive cloning, including profile estimation,
threshold, size_cost...

As a quick fix, a cap is added here to avoid overflow.
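
To illustrate the idea of capping, here is a minimal C sketch. It is not GCC's safe_add. The names capped_add and doubled_cost are hypothetical, and the sketch assumes both operands are non-negative, as costs are.

```c
#include <limits.h>

/* Hypothetical capped addition for non-negative costs: saturate at
   INT_MAX instead of overflowing.  Requires a >= 0 and b >= 0.  */
static int capped_add(int a, int b)
{
    if (a > INT_MAX - b)
        return INT_MAX;
    return a + b;
}

/* Models a cost that doubles at every recursion level, as with two
   recursive calls; the cap keeps the total from wrapping negative.  */
static int doubled_cost(int levels)
{
    int cost = 1;
    for (int i = 0; i < levels; i++)
        cost = capped_add(cost, cost);
    return cost;
}
```

With a plain addition, the doubling would overflow after about as many levels as an int has value bits. With the cap, the cost stays at INT_MAX and remains positive.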

Feng

>Honza
>> >  return false;
>> >
>> >if (dump_file)
>>
>> [...]
>

Re: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)

2020-09-03 Thread Feng Xue OS via Gcc-patches

Attach patch file.

Feng

From: Gcc-patches on behalf of Feng Xue OS via Gcc-patches
Sent: Thursday, September 3, 2020 5:27 PM
To: Richard Biener; gcc-patches@gcc.gnu.org
Subject: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)

This patch is to handle simplification of plusminus-mult-with-convert expression
as ((T) X) +- ((T) Y), in which at least one of (X, Y) is result of multiplication.
This is done in forwprop pass. We try to transform it to (T) (X +- Y), and resort
to gimple-matcher to fold (X +- Y) instead of manually code pattern recognition.

Regards,
Feng
---
2020-09-03  Feng Xue  

gcc/
PR tree-optimization/94234
* tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New
function.
(fwprop_ssa_val): Move it before its new caller.
(pass_forwprop::execute): Add call to
simplify_plusminus_mult_with_convert.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234-3.c: New test.
From 98c4b97989207dcef5742e9cb451799feafd125e Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 17 Aug 2020 23:00:35 +0800
Subject: [PATCH] tree-optimization/94234 - simplify
 plusminus-mult-with-convert in forwprop

For an expression of the form ((T) X) +- ((T) Y), where at least one of (X, Y)
is the result of a multiplication, try to transform it to (T) (X +- Y), and apply
simplification on (X +- Y) if possible. In this way, we can avoid creating an almost
duplicated rule to handle the plusminus-mult-with-convert variant.

2020-09-03  Feng Xue  

gcc/
	PR tree-optimization/94234
	* tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New
	function.
	(fwprop_ssa_val): Move it before its new caller.
	(pass_forwprop::execute): Add call to
	simplify_plusminus_mult_with_convert.

gcc/testsuite/
	PR tree-optimization/94234
 	* gcc.dg/pr94234-3.c: New test.
---
 gcc/testsuite/gcc.dg/pr94234-3.c |  42 
 gcc/tree-ssa-forwprop.c  | 168 +++
 2 files changed, 191 insertions(+), 19 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr94234-3.c

diff --git a/gcc/testsuite/gcc.dg/pr94234-3.c b/gcc/testsuite/gcc.dg/pr94234-3.c
new file mode 100644
index 000..9bb9b46bd96
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94234-3.c
@@ -0,0 +1,42 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop1" } */
+
+typedef __SIZE_TYPE__ size_t;
+typedef __PTRDIFF_TYPE__ ptrdiff_t;
+
+ptrdiff_t foo1 (char *a, size_t n)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n - 1);
+
+  return b1 - b2;
+}
+
+int use_ptr (char *a, char *b);
+
+ptrdiff_t foo2 (char *a, size_t n)
+{
+  char *b1 = a + 8 * (n - 1);
+  char *b2 = a + 8 * n;
+
+  use_ptr (b1, b2);
+
+  return b1 - b2;
+}
+
+int use_int (int i);
+
+unsigned goo (unsigned m_param, unsigned n_param)
+{
+  unsigned b1 = m_param * (n_param + 2);
+  unsigned b2 = m_param * (n_param + 1);
+  int r = (int)(b1) - (int)(b2);
+
+  use_int (r);
+
+  return r;
+}
+
+/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return -8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return m_param" 1 "forwprop1" } } */
diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
index e2d008dfb92..7b9d46ec919 100644
--- a/gcc/tree-ssa-forwprop.c
+++ b/gcc/tree-ssa-forwprop.c
@@ -338,6 +338,25 @@ remove_prop_source_from_use (tree name)
   return cfg_changed;
 }
 
+/* Primitive "lattice" function for gimple_simplify.  */
+
+static tree
+fwprop_ssa_val (tree name)
+{
+  /* First valueize NAME.  */
+  if (TREE_CODE (name) == SSA_NAME
+  && SSA_NAME_VERSION (name) < lattice.length ())
+{
+  tree val = lattice[SSA_NAME_VERSION (name)];
+  if (val)
+	name = val;
+}
+  /* We continue matching along SSA use-def edges for SSA names
+ that are not single-use.  Currently there are no patterns
+ that would cause any issues with that.  */
+  return name;
+}
+
 /* Return the rhs of a gassign *STMT in a form of a single tree,
converted to type TYPE.
 
@@ -1821,6 +1840,133 @@ simplify_rotate (gimple_stmt_iterator *gsi)
   return true;
 }
 
+/* Given ((T) X) +- ((T) Y), and at least one of (X, Y) is result of
+   multiplication, if the expr can be transformed to (T) (X +- Y) in terms of
+   two's complement computation, apply simplification on (X +- Y) if it is
+   possible.  As a prerequisite, outer result type (T) has precision not more
+   than that of inner operand type.  */
+
+static bool
+simplify_plusminus_mult_with_convert (gimple_stmt_iterator *gsi)
+{
+  gimple *stmt = gsi_stmt (*gsi);
+  tree lhs = gimple_assign_lhs (stmt);
+  tree rtype = TREE_TYPE (lhs);
+  tree ctype = NULL_TREE;
+  enum tree_code code = gimple_assign_rhs_code (stmt);
+
+  if (code != PLUS_EXPR && code != MINUS_EXPR)
+return false;
+

[PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)

2020-09-03 Thread Feng Xue OS via Gcc-patches

This patch is to handle simplification of plusminus-mult-with-convert expression
as ((T) X) +- ((T) Y), in which at least one of (X, Y) is result of multiplication.
This is done in forwprop pass. We try to transform it to (T) (X +- Y), and resort
to gimple-matcher to fold (X +- Y) instead of manually code pattern recognition.
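
A pointer difference is one source-level shape that can lead to such an expression, as in the foo1 and foo2 functions of the new test. The C sketch below is only an illustration. The function names are made up, and it assumes n >= 1 and a buffer of at least 8 * n bytes so that every pointer stays within the object.

```c
#include <stddef.h>

/* (a + 8 * n) - (a + 8 * (n - 1)): the offsets are computed in size_t
   and the difference is taken in ptrdiff_t; the result is always 8.  */
static ptrdiff_t gap_forward(char *a, size_t n)
{
    char *b1 = a + 8 * n;
    char *b2 = a + 8 * (n - 1);
    return b1 - b2;
}

/* Same pointers subtracted the other way round; the result is -8.  */
static ptrdiff_t gap_backward(char *a, size_t n)
{
    char *b1 = a + 8 * (n - 1);
    char *b2 = a + 8 * n;
    return b1 - b2;
}
```

These match the "return 8;" and "return -8;" checks in the test: once the two multiplications are combined, the dependence on n disappears.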

Regards,
Feng
---
2020-09-03  Feng Xue  

gcc/
PR tree-optimization/94234
* tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New
function.
(fwprop_ssa_val): Move it before its new caller.
(pass_forwprop::execute): Add call to
simplify_plusminus_mult_with_convert.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234-3.c: New test.

[PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR 94234)

2020-09-03 Thread Feng Xue OS via Gcc-patches

For pattern A * C +- B * C -> (A +- B) * C, simplification is disabled
when A and B are not single-use. This patch is a minor enhancement
on the pattern, which allows folding if final result is found to be a
simple gimple value (constant/existing SSA).

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Feng
---
2020-09-03  Feng Xue  

gcc/
PR tree-optimization/94234
* genmatch.c (dt_simplify::gen_1): Emit check on final simplification
result when "!" is specified on toplevel output expr.
* match.pd ((A * C) +- (B * C) -> (A +- B) * C): Allow folding for
expr with multi-use operands if final result is a simple gimple value.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234-2.c: New test.
---
From e247eb0d9a43856cc0b46f98414ed58d13796d62 Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Tue, 1 Sep 2020 17:17:58 +0800
Subject: [PATCH] tree-optimization/94234 - Fold plusminus_mult expr with
 multi-use operands

2020-09-03  Feng Xue  

gcc/
	PR tree-optimization/94234
	* genmatch.c (dt_simplify::gen_1): Emit check on final simplification
	result when "!" is specified on toplevel output expr.
	* match.pd ((A * C) +- (B * C) -> (A +- B) * C): Allow folding for
	expr with multi-use operands if final result is a simple gimple value.

gcc/testsuite/
	PR tree-optimization/94234
	* gcc.dg/pr94234-2.c: New test.
---
 gcc/genmatch.c   | 12 --
 gcc/match.pd | 22 ++
 gcc/testsuite/gcc.dg/pr94234-2.c | 39 
 3 files changed, 62 insertions(+), 11 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr94234-2.c

diff --git a/gcc/genmatch.c b/gcc/genmatch.c
index 906d842c4d8..d4f01401964 100644
--- a/gcc/genmatch.c
+++ b/gcc/genmatch.c
@@ -3426,8 +3426,16 @@ dt_simplify::gen_1 (FILE *f, int indent, bool gimple, operand *result)
 	  /* Re-fold the toplevel result.  It's basically an embedded
 	 gimple_build w/o actually building the stmt.  */
 	  if (!is_predicate)
-	fprintf_indent (f, indent,
-			"res_op->resimplify (lseq, valueize);\n");
+	{
+	  fprintf_indent (f, indent,
+			  "res_op->resimplify (lseq, valueize);\n");
+	  if (e->force_leaf)
+		{
+		  fprintf_indent (f, indent,
+		  "if (!maybe_push_res_to_seq (res_op, NULL))\n");
+		  fprintf_indent (f, indent + 2, "return false;\n");
+		}
+	}
 	}
   else if (result->type == operand::OP_CAPTURE
 	   || result->type == operand::OP_C_EXPR)
diff --git a/gcc/match.pd b/gcc/match.pd
index 6e45836e32b..46fd880bd37 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2570,15 +2570,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (for plusminus (plus minus)
   (simplify
(plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
-   (if ((!ANY_INTEGRAL_TYPE_P (type)
-	 || TYPE_OVERFLOW_WRAPS (type)
-	 || (INTEGRAL_TYPE_P (type)
-	 && tree_expr_nonzero_p (@0)
-	 && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)
-	/* If @1 +- @2 is constant require a hard single-use on either
-	   original operand (but not on both).  */
-	&& (single_use (@3) || single_use (@4)))
-(mult (plusminus @1 @2) @0)))
+   (if (!ANY_INTEGRAL_TYPE_P (type)
+	|| TYPE_OVERFLOW_WRAPS (type)
+	|| (INTEGRAL_TYPE_P (type)
+	&& tree_expr_nonzero_p (@0)
+	&& expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)
+(if (single_use (@3) || single_use (@4))
+ /* If @1 +- @2 is constant require a hard single-use on either
+	original operand (but not on both).  */
+ (mult (plusminus @1 @2) @0)
+#if GIMPLE
+ (mult! (plusminus @1 @2) @0)
+#endif
+  )))
   /* We cannot generate constant 1 for fract.  */
   (if (!ALL_FRACT_MODE_P (TYPE_MODE (type)))
(simplify
diff --git a/gcc/testsuite/gcc.dg/pr94234-2.c b/gcc/testsuite/gcc.dg/pr94234-2.c
new file mode 100644
index 000..1f4b194dd43
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94234-2.c
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop1" } */ 
+
+int use_fn (int a);
+
+int foo (int n)
+{
+  int b1 = 8 * (n + 1);
+  int b2 = 8 * n;
+
+  use_fn (b1 ^ b2);
+
+  return b1 - b2;
+}
+
+unsigned goo (unsigned m_param, unsigned n_param)
+{
+  unsigned b1 = m_param * (n_param + 2);
+  unsigned b2 = m_param * (n_param + 1);
+
+  use_fn (b1 ^ b2);
+
+  return b1 - b2;
+}
+
+unsigned hoo (unsigned k_param)
+{
+  unsigned b1 = k_param * 28;
+  unsigned b2 = k_param * 15;
+  unsigned b3 = k_param * 12;
+
+  use_fn (b1 ^ b2 ^ b3);
+
+  return (b1 - b2) - b3;
+}
+
+/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return m_param" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-not "return k_param" "forwprop1" } } */
-- 
2.17.1

Re: [PATCH V2] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

2020-09-01 Thread Feng Xue OS via Gcc-patches

>> >> gcc/
>> >> PR tree-optimization/94234
>> >> * tree-ssa-forwprop.c (simplify_binary_with_convert): New function.
>> >> * (fwprop_ssa_val): Move it before its new caller.
>>
>> > No * at this line.  There's an entry for (pass_forwprop::execute) missing.
>> OK.
>>
>> > I don't think the transform as implemented, ((T) X) OP ((T) Y) to
>> > (T) (X OP Y) is useful to do in tree-ssa-forwprop.c.  Instead what I
>> > suggested was to do the original
>> >
>> > +/* (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) and
>> > +   (T)(A * C) +- (T)(A) -> (T)(A * (C +- 1)). */
>> >
>> > but realize we already do this for GENERIC in fold_plusminus_mult_expr, 
>> > just
>> > without the conversions (also look at the conditions in the callers).  This
>> > function takes great care for handling overflow correctly and thus I 
>> > suggested
>> > to take that over to GIMPLE in tree-ssa-forwprop.c and try extend it to 
>> > cover
>> > the conversions you need for the specific cases.
>> But this way would introduce duplicate handling. Is it more concise to reuse
>> existing rule?

> Sure moving the GENERIC folding to match.pd so it covers both GENERIC
> and GIMPLE would be nice to avoid duplication.

>> And different from GENERIC, we might need to check whether operand is 
>> single-use
>> or not, and have distinct actions accordingly.
>>
>>(T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C)
>>
>> Suppose both A and B are multiple-used, in most situations, the transform
>> is unprofitable and avoided. But if (A +- B) could be folded to a constant, 
>> we
>> can still allow the transform. For this, we have to recursively fold (A 
>> +-B), either
>> handle it manually or resort to gimple-matcher to tell result. The latter is 
>> a
>> natural choice. If so, why not do it on the top.
>
> I don't understand.  From the comments in your patch you are just
> hoisting conversions in the transform.  I don't really see the connection
> to the originally desired transform here?

Consider a code sequence such as:

  t1 = (T)(A * C)
  t2 = (T)(B * C)

  ... = use (t1)
  ... = use (t2)

  t3 = t1 - t2

Since t1 and t2 are not single-use, we do not expect the transform on t3 to
happen, because in most situations it incurs more (add/mul) operations.
But if (A - B) * C can be folded to a constant or an existing SSA name, the
transform is OK. That is to say, we need to try folding (A - B) and
(A - B) * C to peek at the final result. To do this, it is natural to use
gimple-matcher instead of manual pattern matching as in
fold_plusminus_mult_expr, which cannot cover all the cases that the gimple
rules do.

 Some examples:
 A = n + 2,  B = n + 1,  C=m
 A = n - m,  B = n,  C = -1
 A = 3 * n,  B = 2 * n,  C = 1

This approach can also be easily generalized to handle ((T) X) OP ((T) Y).
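
To make the first example concrete (A = n + 2, B = n + 1, C = m), here is a
minimal C sketch. It is not part of the patch: the function names are made up
for illustration, and unsigned arithmetic is assumed so that overflow wraps.
Both functions return the same value for every input, which is why folding
(A - B) to a constant makes the transform worthwhile even though t1 and t2
have other uses.

```c
#include <stdint.h>

/* Hypothetical illustration, not GCC code: t1 - t2 with
   t1 = m * (n + 2) and t2 = m * (n + 1), i.e. A * C - B * C.  */
uint32_t diff_unfactored (uint32_t m, uint32_t n)
{
  uint32_t t1 = m * (n + 2);
  uint32_t t2 = m * (n + 1);
  return t1 - t2;
}

/* The factored form (A - B) * C.  Here A - B folds to the constant 1,
   so the whole expression folds to m, a value that already exists.  */
uint32_t diff_factored (uint32_t m, uint32_t n)
{
  (void) n;  /* n cancels out of (n + 2) - (n + 1).  */
  return m;
}
```

This mirrors the shape of goo () in the new test, where the dump is expected
to contain "return m_param".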

>> > Alternatively one could move the GENERIC bits to match.pd, leaving a
>> > worker in fold-const.c.  Then try to extend that there.
>> This worker function is meant to be used by both GENERIC and GIMPLE?

> Yes, for both.

Thanks,
Feng

Re: [PATCH V2] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

2020-09-01 Thread Feng Xue OS via Gcc-patches


>> gcc/
>> PR tree-optimization/94234
>> * tree-ssa-forwprop.c (simplify_binary_with_convert): New function.
>> * (fwprop_ssa_val): Move it before its new caller.

> No * at this line.  There's an entry for (pass_forwprop::execute) missing.
OK.

> I don't think the transform as implemented, ((T) X) OP ((T) Y) to
> (T) (X OP Y) is useful to do in tree-ssa-forwprop.c.  Instead what I
> suggested was to do the original
> 
> +/* (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) and
> +   (T)(A * C) +- (T)(A) -> (T)(A * (C +- 1)). */
> 
> but realize we already do this for GENERIC in fold_plusminus_mult_expr, just
> without the conversions (also look at the conditions in the callers).  This
> function takes great care for handling overflow correctly and thus I suggested
> to take that over to GIMPLE in tree-ssa-forwprop.c and try extend it to cover
> the conversions you need for the specific cases.
But this way would introduce duplicate handling. Would it be more concise to
reuse the existing rule?

Also, unlike GENERIC, we might need to check whether an operand is single-use
or not, and act differently in each case.

   (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C)

Suppose both A and B have multiple uses; in most situations the transform
is unprofitable and is avoided. But if (A +- B) can be folded to a constant,
we can still allow the transform. For this, we have to recursively fold
(A +- B), either handling it manually or resorting to gimple-matcher to tell
us the result. The latter is the natural choice. If so, why not do it at the
top level?

> Alternatively one could move the GENERIC bits to match.pd, leaving a
> worker in fold-const.c.  Then try to extend that there.
Is this worker function meant to be used by both GENERIC and GIMPLE?

> I just remember this is a very fragile area with respect to overflow
> correctness.

Thanks,
Feng

PING: [PATCH V2] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

2020-08-31 Thread Feng Xue OS via Gcc-patches

Thanks,
Feng

From: Feng Xue OS 
Sent: Wednesday, August 19, 2020 5:17 PM
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org; Marc Glisse
Subject: [PATCH V2] Add pattern for pointer-diff on addresses with same 
base/offset (PR 94234)

Following Richard's comment, this patch simplifies the generalized
binary-with-convert pattern ((T) X) OP ((T) Y). Instead of adding nearly
duplicate rules to match.pd, we try to transform it to (T) (X OP Y) and
apply simplification to (X OP Y) in the forwprop pass.

Regards,
Feng
---
2020-08-19  Feng Xue  

gcc/
PR tree-optimization/94234
* tree-ssa-forwprop.c (simplify_binary_with_convert): New function.
* (fwprop_ssa_val): Move it before its new caller.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/ifcvt-3.c: Modified to suppress forward propagation.
* gcc.dg/tree-ssa/20030807-10.c: Likewise.
* gcc.dg/pr94234-2.c: New test.

> 
> From: Richard Biener 
> Sent: Monday, June 15, 2020 3:41 PM
> To: Feng Xue OS
> Cc: gcc-patches@gcc.gnu.org; Marc Glisse
> Subject: Re: [PATCH] Add pattern for pointer-diff on addresses with same 
> base/offset (PR 94234)
>
> On Fri, Jun 5, 2020 at 11:20 AM Feng Xue OS  
> wrote:
>>
>>  As Marc suggested, removed the new pointer_diff rule, and add another rule 
>> to fold
>>  convert-add expression. This new rule is:
>>
>>(T)(A * C) +- (T)(B * C) -> (T) ((A +- B) * C)
>>
>>  Regards,
>>  Feng
>>
>>  ---
>> 2020-06-01  Feng Xue  
>>
>>  gcc/
>>  PR tree-optimization/94234
>>  * match.pd ((T)(A * C) +- (T)(B * C)) -> (T)((A +- B) * C): New
>>  simplification.
>>  * ((PTR_A + OFF) - (PTR_B + OFF)) -> (PTR_A - PTR_B): New
>>  simplification.
>>
>>  gcc/testsuite/
>>  PR tree-optimization/94234
>>  * gcc.dg/pr94234.c: New test.
>>  ---
>>   gcc/match.pd   | 28 
>>   gcc/testsuite/gcc.dg/pr94234.c | 24 
>>   2 files changed, 52 insertions(+)
>>   create mode 100644 gcc/testsuite/gcc.dg/pr94234.c
>>
>>  diff --git a/gcc/match.pd b/gcc/match.pd
>>  index 33ee1a920bf..4f340bfe40a 100644
>>  --- a/gcc/match.pd
>>  +++ b/gcc/match.pd
>>  @@ -2515,6 +2515,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>   && TREE_CODE (@2) == INTEGER_CST
>>   && tree_int_cst_sign_bit (@2) == 0))
>>(minus (convert @1) (convert @2)
>>  +   (simplify
>>  +(pointer_diff (pointer_plus @0 @2) (pointer_plus @1 @2))
>>  + (pointer_diff @0 @1))
>
> This new pattern is OK.  Please commit it separately.
>
>>  (simplify
>>   (pointer_diff (pointer_plus @@0 @1) (pointer_plus @0 @2))
>>   /* The second argument of pointer_plus must be interpreted as signed, 
>> and
>>  @@ -2526,6 +2529,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>(minus (convert (view_convert:stype @1))
>>  (convert (view_convert:stype @2)))
>>
>>  +/* (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) and
>>  +   (T)(A * C) +- (T)(A) -> (T)(A * (C +- 1)). */
>>  +(if (INTEGRAL_TYPE_P (type))
>>  + (for plusminus (plus minus)
>>  +  (simplify
>>  +   (plusminus (convert:s (mult:cs @0 @1)) (convert:s (mult:cs @0 @2)))
>>  +   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
>>  +   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
>>  +   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
>>  +(convert (mult (plusminus @1 @2) @0
>>  +  (simplify
>>  +   (plusminus (convert @0) (convert@2 (mult:c@3 @0 @1)))
>>  +   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
>>  +   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
>>  +   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
>>  +   && single_use (@2) && single_use (@3))
>>  +(convert (mult (plusminus { build_one_cst (TREE_TYPE (@1)); } @1) 
>> @0
>>  +  (simplify
>>  +   (plusminus (convert@2 (mult:c@3 @0 @1)) (convert @0))
>>  +   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
>>  +   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
>>  +   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
>>  +   && single_use (@2) && single_use (@3))
>>  +(convert (mult (plusminus @1 { build_one_cst (TREE_TYPE (@1)); }) 
>> @0))
>>  +
>
> This shows the limit of pattern matching IMHO.  I'm also not convinced
> it gets the
> overflow cases correct (but I didn't spend too much time here).  Note we have
> similar functionality implemented in fold_plusminus_mult_expr.  IMHO instead
> of doing the above moving fold_plusminus_mult_expr to GIMPLE by executing
> it from inside the forwprop pass would make more sense.  Or finally biting the
> bullet and try to teach reassociation about how to handle signed arithmetic
> with non-wrapping overflow behavior.
>
> Richard.

Re: [PATCH] Fix ICE in ipa-cp due to cost addition overflow (PR 96806)

2020-08-31 Thread Feng Xue OS via Gcc-patches

>>> the component is "ipa," please change that when you commit the patch.
>> Mistake has been made, I'v pushed it. Is there a way to correct it? git push 
>> --force?
>
> There is.  You need to wait until tomorrow (after the commit message
> gets copied to gcc/ChangeLog by a script) and then push a commit that
> modifies nothing else but the ChangeLog. IIUC.
> 
> Thanks again for taking care of this,

I will. Thanks.

Feng

Re: [PATCH] Fix ICE in ipa-cp due to cost addition overflow (PR 96806)

2020-08-31 Thread Feng Xue OS via Gcc-patches

>> gcc/
>> PR tree-optimization/96806

> the component is "ipa," please change that when you commit the patch.
The mistake has already been made; I've pushed it. Is there a way to correct
it? git push --force?

Thanks,
Feng

[PATCH] Fix ICE in ipa-cp due to cost addition overflow (PR 96806)

2020-08-31 Thread Feng Xue OS via Gcc-patches

This patch fixes a bug in which the cost used to evaluate a clone candidate
becomes negative due to integer overflow.
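
The failure mode can be sketched in plain C. The helper below is a
hypothetical saturating addition written for illustration only; it is not a
copy of GCC's safe_add, whose definition is not shown in this message. It
assumes both inputs are non-negative, as costs and benefits are expected to
be.

```c
#include <limits.h>

/* Hypothetical helper: add two non-negative ints, clamping at INT_MAX
   instead of overflowing.  Signed overflow is undefined in C and in
   practice can produce a negative sum.  */
int saturating_add (int a, int b)
{
  if (a > INT_MAX - b)
    return INT_MAX;
  return a + b;
}
```

With a helper of this kind, two large costs sum to a large cost rather than
to a negative one, so later comparisons against thresholds stay meaningful.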

Feng
---
2020-08-31  Feng Xue  

gcc/
PR tree-optimization/96806
* ipa-cp.c (decide_about_value): Use safe_add to avoid cost addition
overflow.

gcc/testsuite/
PR tree-optimization/96806
* g++.dg/ipa/pr96806.C: New test.

From 8d92b4ca4be2303a73f0a2441e57564488ca1c23 Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 31 Aug 2020 15:00:52 +0800
Subject: [PATCH] ipa/96806 - Fix ICE in ipa-cp due to integer addition
 overflow

2020-08-31  Feng Xue  

gcc/
PR tree-optimization/96806
* ipa-cp.c (decide_about_value): Use safe_add to avoid cost addition
	overflow.

gcc/testsuite/
PR tree-optimization/96806
* g++.dg/ipa/pr96806.C: New test.
---
 gcc/ipa-cp.c   |  8 ++---
 gcc/testsuite/g++.dg/ipa/pr96806.C | 53 ++
 2 files changed, 57 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/ipa/pr96806.C

diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index e4910a04ffa..8e5d6e2a393 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -5480,11 +5480,11 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
    freq_sum, count_sum,
    val->local_size_cost)
   && !good_cloning_opportunity_p (node,
-  val->local_time_benefit
-  + val->prop_time_benefit,
+  safe_add (val->local_time_benefit,
+		val->prop_time_benefit),
   freq_sum, count_sum,
-  val->local_size_cost
-  + val->prop_size_cost))
+  safe_add (val->local_size_cost,
+		val->prop_size_cost)))
 return false;
 
   if (dump_file)
diff --git a/gcc/testsuite/g++.dg/ipa/pr96806.C b/gcc/testsuite/g++.dg/ipa/pr96806.C
new file mode 100644
index 000..28fdf7787a1
--- /dev/null
+++ b/gcc/testsuite/g++.dg/ipa/pr96806.C
@@ -0,0 +1,53 @@
+/* { dg-do compile } */
+/* { dg-options "-std=c++11 -O -fipa-cp -fipa-cp-clone --param=ipa-cp-max-recursive-depth=94 --param=logical-op-non-short-circuit=0" } */
+
+enum a {};
+struct m;
+struct n {
+  a d;
+};
+int o(int, int);
+struct p {
+  char d;
+  char aa;
+  p *ab;
+  bool q() const {
+int h = d & 4;
+return h;
+  }
+  char r() const { return aa; }
+  int s(const m *, bool) const;
+} l;
+struct t {
+  p *ac;
+  p *u() { return ac; }
+  p *v(int);
+};
+int w(const p *, const p *, const m *, int = 0);
+struct m : n {
+  struct {
+t *ad;
+  } ae;
+  char x() const;
+  p *y(int z) const { return ae.ad ? nullptr : ae.ad->v(z); }
+} j;
+int w(const p *z, const p *af, const m *ag, int ah) {
+  int a, g = z->s(ag, true), i = af->s(ag, true);
+  if (af->q()) {
+if (ag->x())
+  return 0;
+ah++;
+char b = af->r();
+p *c = ag->y(b), *e = ag->ae.ad->u();
+int d = w(z, c, ag, ah), f = w(z, af ? e : af->ab, ag, ah);
+a = f ? d : f;
+return a;
+  }
+  if (g || i == 1)
+return ag->d ? o(g, i) : o(g, i);
+  return 0;
+}
+void ai() {
+  for (p k;;)
+w(, , );
+}
-- 
2.17.1

[PATCH V2] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

2020-08-19 Thread Feng Xue OS via Gcc-patches

Following Richard's comment, this patch simplifies the generalized
binary-with-convert pattern ((T) X) OP ((T) Y). Instead of adding nearly
duplicate rules to match.pd, we try to transform it to (T) (X OP Y) and
apply simplification to (X OP Y) in the forwprop pass.
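
As a rough illustration of why hoisting the conversion can be valid, here is
a C sketch with T = uint32_t and 64-bit unsigned operands. It assumes T is no
wider than the operand type and that arithmetic wraps, which is the kind of
condition the earlier match.pd rule in this thread also checked. The function
names are invented for this example; the patch itself works on GIMPLE, not on
C source.

```c
#include <stdint.h>

/* ((T) X) - ((T) Y): convert each operand, then subtract in T.  */
uint32_t narrow_then_sub (uint64_t x, uint64_t y)
{
  return (uint32_t) x - (uint32_t) y;
}

/* (T) (X - Y): subtract in the wide type, then convert once.  */
uint32_t sub_then_narrow (uint64_t x, uint64_t y)
{
  return (uint32_t) (x - y);
}
```

Because both forms agree modulo 2^32, any simplification found for (X - Y)
carries over to the original expression.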

Regards,
Feng
---
2020-08-19  Feng Xue  

gcc/
PR tree-optimization/94234
* tree-ssa-forwprop.c (simplify_binary_with_convert): New function.
* (fwprop_ssa_val): Move it before its new caller.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/ifcvt-3.c: Modified to suppress forward propagation.
* gcc.dg/tree-ssa/20030807-10.c: Likewise.
* gcc.dg/pr94234-2.c: New test.

> 
> From: Richard Biener 
> Sent: Monday, June 15, 2020 3:41 PM
> To: Feng Xue OS
> Cc: gcc-patches@gcc.gnu.org; Marc Glisse
> Subject: Re: [PATCH] Add pattern for pointer-diff on addresses with same 
> base/offset (PR 94234)
> 
> On Fri, Jun 5, 2020 at 11:20 AM Feng Xue OS  
> wrote:
>>
>>  As Marc suggested, removed the new pointer_diff rule, and add another rule 
>> to fold
>>  convert-add expression. This new rule is:
>> 
>>(T)(A * C) +- (T)(B * C) -> (T) ((A +- B) * C)
>> 
>>  Regards,
>>  Feng
>> 
>>  ---
>> 2020-06-01  Feng Xue  
>> 
>>  gcc/
>>  PR tree-optimization/94234
>>  * match.pd ((T)(A * C) +- (T)(B * C)) -> (T)((A +- B) * C): New
>>  simplification.
>>  * ((PTR_A + OFF) - (PTR_B + OFF)) -> (PTR_A - PTR_B): New
>>  simplification.
>> 
>>  gcc/testsuite/
>>  PR tree-optimization/94234
>>  * gcc.dg/pr94234.c: New test.
>>  ---
>>   gcc/match.pd   | 28 
>>   gcc/testsuite/gcc.dg/pr94234.c | 24 
>>   2 files changed, 52 insertions(+)
>>   create mode 100644 gcc/testsuite/gcc.dg/pr94234.c
>> 
>>  diff --git a/gcc/match.pd b/gcc/match.pd
>>  index 33ee1a920bf..4f340bfe40a 100644
>>  --- a/gcc/match.pd
>>  +++ b/gcc/match.pd
>>  @@ -2515,6 +2515,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>   && TREE_CODE (@2) == INTEGER_CST
>>   && tree_int_cst_sign_bit (@2) == 0))
>>(minus (convert @1) (convert @2)
>>  +   (simplify
>>  +(pointer_diff (pointer_plus @0 @2) (pointer_plus @1 @2))
>>  + (pointer_diff @0 @1))
> 
> This new pattern is OK.  Please commit it separately.
> 
>>  (simplify
>>   (pointer_diff (pointer_plus @@0 @1) (pointer_plus @0 @2))
>>   /* The second argument of pointer_plus must be interpreted as signed, 
>> and
>>  @@ -2526,6 +2529,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>(minus (convert (view_convert:stype @1))
>>  (convert (view_convert:stype @2)))
>> 
>>  +/* (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) and
>>  +   (T)(A * C) +- (T)(A) -> (T)(A * (C +- 1)). */
>>  +(if (INTEGRAL_TYPE_P (type))
>>  + (for plusminus (plus minus)
>>  +  (simplify
>>  +   (plusminus (convert:s (mult:cs @0 @1)) (convert:s (mult:cs @0 @2)))
>>  +   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
>>  +   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
>>  +   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
>>  +(convert (mult (plusminus @1 @2) @0
>>  +  (simplify
>>  +   (plusminus (convert @0) (convert@2 (mult:c@3 @0 @1)))
>>  +   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
>>  +   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
>>  +   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
>>  +   && single_use (@2) && single_use (@3))
>>  +(convert (mult (plusminus { build_one_cst (TREE_TYPE (@1)); } @1) 
>> @0
>>  +  (simplify
>>  +   (plusminus (convert@2 (mult:c@3 @0 @1)) (convert @0))
>>  +   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
>>  +   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
>>  +   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
>>  +   && single_use (@2) && single_use (@3))
>>  +(convert (mult (plusminus @1 { build_one_cst (TREE_TYPE (@1)); }) 
>> @0))
>>  +
> 
> This shows the limit of pattern matching IMHO.  I'm also not convinced
> it gets the
> overflow cases correct (but I didn't spend too much time here).  Note we have
> similar functionality implemented in fold_plusminus_mult_expr.  IMHO instead
> of doing the above moving fold_plusminus_mult_expr to GIMPLE by executing
> it from inside the forwprop pass would make more sense.  Or finally biting the
> bullet and try to teach reassociation about how to handle signed arithmetic
> with non-wrapping overflow behavior.
> 
> Richard.

From 68bba2edb714f741ef6e7f4a7814869cb99e938c Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 17 Aug 2020 23:00:35 +0800
Subject: [PATCH] Simplify binary-with-convert expression in forwprop pass (PR
 94234)

For expression as ((T) X) OP ((T) Y), try to transform it to (T) (X OP Y),
and apply simplification

Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-08-12 Thread Feng Xue OS via Gcc-patches

> Hello,
> with Martin we spent some time looking into exchange2 and my
> understanding of the problem is the following:
> 
> There is the self recursive function digits_2 with the property that it
> has 10 nested loops and calls itself from the innermost.
> Now we do not do amazing job on guessing the profile since it is quite
> atypical. First observation is that the callback frequencly needs to be
> less than 1 otherwise the program never terminates, however with 10
> nested loops one needs to predict every loop to iterate just few times
> and conditionals guarding them as not very likely. For that we added
> PRED_LOOP_GUARD_WITH_RECURSION some time ago and I fixed it yesterday
> (causing regression in exhange since the bad profile turned out to
> disable some harmful vectorization) and I also now added a cap to the
> self recursive frequency so things to not get mispropagated by ipa-cp.

With the default setting of PRED_LOOP_GUARD_WITH_RECURSION, static profile
estimation for exchange2 is far from accurate: the hottest recursive function
is predicted as infrequent. However, this low execution estimate works fine
with IRA. Like you, I tried tweaking the likelihood of the predictor, and
performance degraded as the estimated profile increased. This regression also
turns out to be correlated with IRA, which produces many more register spills
than with the default. In the presence of deep loops and high register
pressure, IRA is more sensitive to profile estimation, which exposes an
unwanted property of the current IRA algorithm. I've described it in a bug
report (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174).

Feng

> 
> Now if ipa-cp decides to duplicate digits few times we have a new
> problem.  The tree of recursion is orgnaized in a way that the depth is
> bounded by 10 (which GCC does not know) and moreover most time is not
> spent on very deep levels of recursion.
> 
> For that you have the patch which increases frequencies of recursively
> cloned nodes, however it still seems to me as very specific hack for
> exchange: I do not see how to guess where most of time is spent.
> Even for very regular trees, by master theorem, it depends on very
> little differences in the estimates of recursion frequency whether most
> of time is spent on the top of tree, bottom or things are balanced.
> 
> With algorithms doing backtracing, like exhchange, the likelyness of
> recusion reduces with deeper recursion level, but we do not know how
> quickly and what the level is.
> 
>> From: Xiong Hu Luo 
>> 
>>  For SPEC2017 exchange2, there is a large recursive functiondigits_2(function
>>  size 1300) generates specialized node from digits_2.1 to digits_2.8 with 
>> added
>>  build option:
>> 
>>  --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
>> 
>>  ipa-inline pass will consider inline these nodes called only once, but these
>>  large functions inlined too deeply will cause serious register spill and
>>  performance down as followed.
>> 
>>  inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 
>> 2.6, 2.7, 2.8)
>>  inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline 
>> digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
>>  inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 
>> 2.5 -> 2.6 (inline 2.7 ) -> 2.8
>>  inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 
>> 2.5 -> 2.6 -> 2.7 -> 2.8
>> 
>>  Performance diff:
>>  inlineB is ~25% faster than inlineA;
>>  inlineC is ~20% faster than inlineB;
>>  inlineD is ~30% faster than inlineC.
>> 
>>  The master GCC code now generates inline sequence like inlineB, this patch
>>  makes the ipa-inline pass behavior like inlineD by:
>>   1) The growth acumulation for recursive calls by adding the growth data
>>  to the edge when edge's caller is inlined into another function to avoid
>>  inline too deeply;
>>   2) And if caller and callee are both specialized from same node, the edge
>>  should also be considered as recursive edge.
>> 
>>  SPEC2017 test shows GEOMEAN improve +2.75% in total(+0.56% without 
>> exchange2).
>>  Any comments?  Thanks.
>> 
>>  523.xalancbmk_r +1.32%
>>  541.leela_r +1.51%
>>  548.exchange2_r +31.87%
>>  507.cactuBSSN_r +0.80%
>>  526.blender_r   +1.25%
>>  538.imagick_r   +1.82%
>> 
>>  gcc/ChangeLog:
>> 
>>  2020-08-12  Xionghu Luo  
>> 
>>* cgraph.h (cgraph_edge::recursive_p): Return true if caller and
>>callee and specialized from same node.
>>* ipa-inline-analysis.c (do_estimate_growth_1): Add caller's
>>inlined_to growth to edge whose caller is inlined.
>>  ---
>>   gcc/cgraph.h  | 2 ++
>>   gcc/ipa-inline-analysis.c | 3 +++
>>   2 files changed, 5 insertions(+)
>> 
>>  diff --git a/gcc/cgraph.h b/gcc/cgraph.h
>>  index 0211f08964f..11903ac1960 100644
>>  --- a/gcc/cgraph.h
>>  +++ b/gcc/cgraph.h
>>  @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
>> cgraph_node *c =

Re: [PATCH] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

2020-06-16 Thread Feng Xue OS via Gcc-patches

Here is a question about pointer operations:

Pointers are treated as unsigned in comparison operations, while the distance
between pointers is signed. Does that mean we cannot assume the conclusion
below is true?

 (ptr_a > ptr_b) => (ptr_a - ptr_b) >= 0
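
A minimal sketch of the concern, modelling addresses as 32-bit integers rather
than real pointers, so it says nothing about what pointer_diff itself
guarantees. The helper names are invented for illustration. For two arbitrary
addresses more than half the address space apart, the unsigned comparison and
the sign of the wrapped difference disagree.

```c
#include <stdint.h>

/* Reinterpret a 32-bit pattern as a two's-complement signed value,
   without relying on implementation-defined conversions.  */
int64_t as_signed32 (uint32_t v)
{
  return (int64_t) (v ^ 0x80000000u) - 0x80000000LL;
}

/* Hypothetical model of a pointer difference: subtract as unsigned
   (wrapping), then interpret the result as signed.  */
int64_t model_diff (uint32_t a, uint32_t b)
{
  return as_signed32 (a - b);
}
```

For a = 0xF0000000 and b = 0x10000000, a > b holds as unsigned values while
model_diff (a, b) is negative. Whether such a pair can come from valid
pointers depends on the rule, raised elsewhere in this thread, that both
pointers must stay inside one object no larger than half the address space.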

Thanks,
Feng

From: Marc Glisse 
Sent: Wednesday, June 3, 2020 10:32 PM
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Add pattern for pointer-diff on addresses with same 
base/offset (PR 94234)

On Wed, 3 Jun 2020, Feng Xue OS via Gcc-patches wrote:

>> Ah, looking at the PR, you decided to perform the operation as unsigned
>> because that has fewer NOP conversions, which, in that particular testcase
>> where the offsets are originally unsigned, means we simplify better. But I
>> would expect it to regress other testcases (in particular if the offsets
>> were originally signed). Also, changing the second argument of
>> pointer_plus to be signed, as is supposed to eventually happen, would
>> break your testcase again.
> The old rule might produce overflow result (offset_a = (signed_int_max)UL,
> offset_b = 1UL).

signed_int_max-1 does not overflow. But the point is that pointer_plus /
pointer_diff are defined in a way that if that subtraction would overflow,
then one of the pointer_plus or pointer_diff would have been undefined
already. In particular, you cannot have objects larger than half the
address space, and pointer_plus/pointer_diff have to remain inside an
object. Doing the subtraction in a signed type keeps (part of) that
information.

> Additionally, (stype)(offset_a - offset_b) is more compact,

Not if offset_a comes from (utype)a and offset_b from (utype)b with a and
b signed. Using size_t indices as in the bugzilla testcase is not
recommended practice. Change it to ssize_t, and we do optimize the
testcase in CCP1 already.

> there might be
> further simplification opportunities on offset_a - offset_b, even it is not
> in form of (A * C - B * C), for example (~A - 1 -> -A). But for old rule, we 
> have
> to introduce another rule as (T)A - (T)(B) -> (T)(A - B), which seems to
> be too generic to benefit performance in all situations.

Sadly, conversions complicate optimizations and are all over the place, we
need to handle them in more places. I sometimes dream of getting rid of
NOP conversions, and having a single PLUS_EXPR with some kind of flag
saying if it can wrap/saturate/trap when seen as a signed/unsigned
operation, i.e. push the information on the operations instead of objects.

> If the 2nd argument is signed, we can add a specific rule as your suggestion
> (T)(A * C) - (T)(B * C) -> (T) (A - B) * C.
>
>> At the very least we want to keep a comment next to the transformation
>> explaining the situation.
>
>> If there are platforms where the second argument of pointer_plus is a
>> smaller type than the result of pointer_diff (can this happen? I keep
>> forgetting all the weird things some platforms do), this version may do an
>> unsafe zero-extension.
> If the 2nd argument is a smaller type, this might bring confuse semantic to
> pointer_plus operator. Suppose the type is a (unsigned) char, the expression
> "ptr + ((char) -1)" represents ptr + 255 or ptr - 1?

(pointer_plus ptr 255) would mean ptr - 1 on a platform where the second
argument of pointer_plus has size 1 byte.
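
A small C sketch of that reading, assuming a hypothetical platform where the
offset operand is 8 bits wide. The function name is made up; it only shows how
an 8-bit offset is reinterpreted as a signed displacement before being added
to the pointer.

```c
#include <stdint.h>

/* Hypothetical: interpret an 8-bit offset as a signed displacement,
   the way the second argument of pointer_plus is to be read.  */
int offset8_as_signed (uint8_t off)
{
  return (int) (off ^ 0x80) - 0x80;
}
```

Under this model an offset of 255 yields a displacement of -1, while offsets
up to 127 are unchanged.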

Do note that I am not a reviewer, what I say isn't final.

--
Marc Glisse

Ping: [PATCH V2] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

2020-06-14 Thread Feng Xue OS via Gcc-patches

Thanks,
Feng


From: Feng Xue OS 
Sent: Friday, June 5, 2020 5:20 PM
To: Richard Biener; gcc-patches@gcc.gnu.org; Marc Glisse
Subject: Re: [PATCH] Add pattern for pointer-diff on addresses with same 
base/offset (PR 94234)

As Marc suggested, I removed the new pointer_diff rule and added another rule
to fold the convert-add expression. The new rule is:

  (T)(A * C) +- (T)(B * C) -> (T) ((A +- B) * C)

Regards,
Feng

---
2020-06-01  Feng Xue  

gcc/
PR tree-optimization/94234
* match.pd ((T)(A * C) +- (T)(B * C)) -> (T)((A +- B) * C): New
simplification.
* ((PTR_A + OFF) - (PTR_B + OFF)) -> (PTR_A - PTR_B): New
simplification.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234.c: New test.
---
 gcc/match.pd   | 28 
 gcc/testsuite/gcc.dg/pr94234.c | 24 
 2 files changed, 52 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/pr94234.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 33ee1a920bf..4f340bfe40a 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2515,6 +2515,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 && TREE_CODE (@2) == INTEGER_CST
 && tree_int_cst_sign_bit (@2) == 0))
  (minus (convert @1) (convert @2)
+   (simplify
+(pointer_diff (pointer_plus @0 @2) (pointer_plus @1 @2))
+ (pointer_diff @0 @1))
(simplify
 (pointer_diff (pointer_plus @@0 @1) (pointer_plus @0 @2))
 /* The second argument of pointer_plus must be interpreted as signed, and
@@ -2526,6 +2529,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (minus (convert (view_convert:stype @1))
(convert (view_convert:stype @2)))

+/* (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) and
+   (T)(A * C) +- (T)(A) -> (T)(A * (C +- 1)). */
+(if (INTEGRAL_TYPE_P (type))
+ (for plusminus (plus minus)
+  (simplify
+   (plusminus (convert:s (mult:cs @0 @1)) (convert:s (mult:cs @0 @2)))
+   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
+   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
+   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
+(convert (mult (plusminus @1 @2) @0
+  (simplify
+   (plusminus (convert @0) (convert@2 (mult:c@3 @0 @1)))
+   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
+   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
+   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
+   && single_use (@2) && single_use (@3))
+(convert (mult (plusminus { build_one_cst (TREE_TYPE (@1)); } @1) @0
+  (simplify
+   (plusminus (convert@2 (mult:c@3 @0 @1)) (convert @0))
+   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
+   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
+   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
+   && single_use (@2) && single_use (@3))
+(convert (mult (plusminus @1 { build_one_cst (TREE_TYPE (@1)); }) @0))
+
 /* (A * C) +- (B * C) -> (A+-B) * C and (A * C) +- A -> A * (C+-1).
 Modeled after fold_plusminus_mult_expr.  */
 (if (!TYPE_SATURATING (type)
diff --git a/gcc/testsuite/gcc.dg/pr94234.c b/gcc/testsuite/gcc.dg/pr94234.c
new file mode 100644
index 000..3f7c7a5e58f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94234.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop1" } */
+
+typedef __SIZE_TYPE__ size_t;
+typedef __PTRDIFF_TYPE__ ptrdiff_t;
+
+ptrdiff_t foo (char *a, size_t n)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n - 1);
+
+  return b1 - b2;
+}
+
+ptrdiff_t goo (char *a, size_t n, size_t m)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n + 1);
+
+  return (b1 + m) - (b2 + m);
+}
+
+/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return -8;" 1 "forwprop1" } } */
--



From: Richard Biener 
Sent: Thursday, June 4, 2020 4:30 PM
To: gcc-patches@gcc.gnu.org
Cc: Feng Xue OS
Subject: Re: [PATCH] Add pattern for pointer-diff on addresses with same 
base/offset (PR 94234)

On Wed, Jun 3, 2020 at 4:33 PM Marc Glisse  wrote:
>
> On Wed, 3 Jun 2020, Feng Xue OS via Gcc-patches wrote:
>
> >> Ah, looking at the PR, you decided to perform the operation as unsigned
> >> because that has fewer NOP conversions, which, in that particular testcase
> >> where the offsets are originally unsigned, means we simplify better. But I
> >> would expect it to regress other testcases (in particular if the offsets
> >> were originally signed). Also, changing the second argument of
> >> pointer_plus to be signed, as is supposed to eventually happen, would
>

Re: [PATCH] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

2020-06-05 Thread Feng Xue OS via Gcc-patches

As Marc suggested, I removed the new pointer_diff rule and added another rule
to fold the convert-add expression. The new rule is:

  (T)(A * C) +- (T)(B * C) -> (T) ((A +- B) * C)

Regards,
Feng

---
2020-06-01  Feng Xue  

gcc/
PR tree-optimization/94234
* match.pd ((T)(A * C) +- (T)(B * C)) -> (T)((A +- B) * C): New
simplification.
* ((PTR_A + OFF) - (PTR_B + OFF)) -> (PTR_A - PTR_B): New
simplification.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234.c: New test.
---
 gcc/match.pd   | 28 
 gcc/testsuite/gcc.dg/pr94234.c | 24 
 2 files changed, 52 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/pr94234.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 33ee1a920bf..4f340bfe40a 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2515,6 +2515,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 && TREE_CODE (@2) == INTEGER_CST
 && tree_int_cst_sign_bit (@2) == 0))
  (minus (convert @1) (convert @2)
+   (simplify
+(pointer_diff (pointer_plus @0 @2) (pointer_plus @1 @2))
+ (pointer_diff @0 @1))
(simplify
 (pointer_diff (pointer_plus @@0 @1) (pointer_plus @0 @2))
 /* The second argument of pointer_plus must be interpreted as signed, and
@@ -2526,6 +2529,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (minus (convert (view_convert:stype @1))
(convert (view_convert:stype @2)))

+/* (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) and
+   (T)(A * C) +- (T)(A) -> (T)(A * (C +- 1)). */
+(if (INTEGRAL_TYPE_P (type))
+ (for plusminus (plus minus)
+  (simplify
+   (plusminus (convert:s (mult:cs @0 @1)) (convert:s (mult:cs @0 @2)))
+   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
+   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
+   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
+(convert (mult (plusminus @1 @2) @0
+  (simplify
+   (plusminus (convert @0) (convert@2 (mult:c@3 @0 @1)))
+   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
+   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
+   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
+   && single_use (@2) && single_use (@3))
+(convert (mult (plusminus { build_one_cst (TREE_TYPE (@1)); } @1) @0
+  (simplify
+   (plusminus (convert@2 (mult:c@3 @0 @1)) (convert @0))
+   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
+   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
+   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
+   && single_use (@2) && single_use (@3))
+(convert (mult (plusminus @1 { build_one_cst (TREE_TYPE (@1)); }) @0))
+
 /* (A * C) +- (B * C) -> (A+-B) * C and (A * C) +- A -> A * (C+-1).
 Modeled after fold_plusminus_mult_expr.  */
 (if (!TYPE_SATURATING (type)
diff --git a/gcc/testsuite/gcc.dg/pr94234.c b/gcc/testsuite/gcc.dg/pr94234.c
new file mode 100644
index 000..3f7c7a5e58f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94234.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop1" } */
+
+typedef __SIZE_TYPE__ size_t;
+typedef __PTRDIFF_TYPE__ ptrdiff_t;
+
+ptrdiff_t foo (char *a, size_t n)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n - 1);
+
+  return b1 - b2;
+}
+
+ptrdiff_t goo (char *a, size_t n, size_t m)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n + 1);
+
+  return (b1 + m) - (b2 + m);
+}
+
+/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return -8;" 1 "forwprop1" } } */
--



From: Richard Biener 
Sent: Thursday, June 4, 2020 4:30 PM
To: gcc-patches@gcc.gnu.org
Cc: Feng Xue OS
Subject: Re: [PATCH] Add pattern for pointer-diff on addresses with same 
base/offset (PR 94234)

On Wed, Jun 3, 2020 at 4:33 PM Marc Glisse  wrote:
>
> On Wed, 3 Jun 2020, Feng Xue OS via Gcc-patches wrote:
>
> >> Ah, looking at the PR, you decided to perform the operation as unsigned
> >> because that has fewer NOP conversions, which, in that particular testcase
> >> where the offsets are originally unsigned, means we simplify better. But I
> >> would expect it to regress other testcases (in particular if the offsets
> >> were originally signed). Also, changing the second argument of
> >> pointer_plus to be signed, as is supposed to eventually happen, would
> >> break your testcase again.
> > The old rule might produce overflow result (offset_a = (signed_int_max)UL,
> > offset_b = 1UL).
>
> signed_int_max-1 does not overflow. But the point is that pointer_plus /
> pointer_diff are defined in a way tha

Re: [PATCH] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

2020-06-03 Thread Feng Xue OS via Gcc-patches

>>   * match.pd ((PTR + A) - (PTR + B)) -> (ptrdiff_t)(A - B): New
>>   simplification.

> Not new, modified.
OK.

>>   * ((PTR_A + O) - (PTR_B + O)) -> (PTR_A - PTR_B): New simplification.

> O might not be the best choice because of how close it looks to 0.
OK.

> What don't you like about the existing transformation? You are replacing a
> transformation that always folds by one that folds only in some cases, and
> looses the information that some overflows cannot happen. That looks like
> it is making things worse from an optimization point of view. Do you
> consider the transformation as unsafe with -fsanitize=pointer-overflow
> (does that correspond to the case where TYPE_OVERFLOW_UNDEFINED is true
> for a pointer type?)?
Yes. We should check !TYPE_OVERFLOW_SANITIZED, not TYPE_OVERFLOW_UNDEFINED.
But even for !TYPE_OVERFLOW_SANITIZED, some ptr_diff rules have the check and
some do not. Could we also drop it here?

> Ah, looking at the PR, you decided to perform the operation as unsigned
> because that has fewer NOP conversions, which, in that particular testcase
> where the offsets are originally unsigned, means we simplify better. But I
> would expect it to regress other testcases (in particular if the offsets
> were originally signed). Also, changing the second argument of
> pointer_plus to be signed, as is supposed to eventually happen, would
> break your testcase again.
The old rule might produce an overflowed result (offset_a = (signed_int_max)UL,
offset_b = 1UL).

Additionally, (stype)(offset_a - offset_b) is more compact, and there might be
further simplification opportunities on offset_a - offset_b even when it is
not in the form (A * C - B * C), for example (~A + 1 -> -A). With the old
rule, we would have to introduce another rule, (T)A - (T)(B) -> (T)(A - B),
which seems too generic to benefit performance in all situations.

If the second argument is signed, we can add a specific rule as you suggested:
(T)(A * C) - (T)(B * C) -> (T)((A - B) * C).

> At the very least we want to keep a comment next to the transformation
> explaining the situation.

> If there are platforms where the second argument of pointer_plus is a
> smaller type than the result of pointer_diff (can this happen? I keep
> forgetting all the weird things some platforms do), this version may do an
> unsafe zero-extension.
If the second argument has a smaller type, the semantics of the pointer_plus
operator could become confusing. Suppose the type is (unsigned) char: does the
expression "ptr + ((char) -1)" represent ptr + 255 or ptr - 1?

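The two readings of such a narrow offset can be made concrete with a small
sketch. The helper names are hypothetical, and the conversion of an
out-of-range value to signed char is implementation-defined in C (GCC wraps
modulo 2^8), so this only illustrates the ambiguity under that assumption:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration only, not GCC code.  */

/* Offset read as unsigned: zero-extended to ptrdiff_t.  */
static ptrdiff_t
offset_zero_extended (unsigned char off)
{
  return (ptrdiff_t) off;
}

/* Same bits read as signed: sign-extended to ptrdiff_t.
   Assumes GCC's modulo behavior for the conversion to signed char.  */
static ptrdiff_t
offset_sign_extended (unsigned char off)
{
  return (ptrdiff_t) (signed char) off;
}
```

For the byte 0xff, the first helper yields 255 and the second yields -1, which
are exactly the two candidate meanings of the expression in question.
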
Regards,
Feng

Re: [PATCH] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

2020-06-03 Thread Feng Xue OS via Gcc-patches

>> This patch is meant to add match rules to simplify patterns as:
>>
>> o. (pointer + offset_a) - (pointer + offset_b)   ->   (ptrdiff_t) (offset_a 
>> - offset_b)
>> o. (pointer_a + offset) - (pointer_b + offset)   ->   (pointer_a - pointer_b)

> You are also changing the existing pattern which IIRC tries to
> preserve the undefinedness of overflow in offset_a - offset_b.
> Without an explanation it's hard to guess why you think eliding
> this conversion is correct.
The old rule adds a signed cast to both offset_a and offset_b, and then
subtracts them, as

   (stype)offset_a - (stype)offset_b

Suppose that offset_a = (signed_int_max)UL and offset_b = 1UL: the
transformation will generate an overflowed result, while the original
computation will not.
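
Whether the signed subtraction in the old form overflows for a particular pair
of offsets can be probed with GCC's __builtin_sub_overflow. This is only a
sketch: the helper name is hypothetical, it assumes size_t and ptrdiff_t have
the same width, and it relies on GCC's implementation-defined wrapping
conversion from size_t to ptrdiff_t:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only, not GCC code.
   Returns nonzero if (stype)offset_a - (stype)offset_b overflows
   ptrdiff_t, where stype is modeled as ptrdiff_t.  */
static int
old_form_overflows (size_t offset_a, size_t offset_b)
{
  ptrdiff_t result;
  return __builtin_sub_overflow ((ptrdiff_t) offset_a,
                                 (ptrdiff_t) offset_b, &result);
}
```

For example, the pair (PTRDIFF_MAX, SIZE_MAX) is reported as overflowing,
since SIZE_MAX reads as -1 when viewed as signed.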

An alternative is to apply the type cast after the subtraction of offset_a and
offset_b is done. And to avoid an unsigned overflow warning, we should use
view_convert instead of convert, which I missed in my patch.

> Adding the TYPE_OVERFLOW_UNDEFINED guard also looks
> odd - AFAICS overflow of the pointer type does not matter
> but overflow of the generated minus?  Thus at least
> a || !TYPE_OVERFLOW_UNDEFINED (type) would be
> in order?
Yes, it is. I will remove the TYPE_OVERFLOW_UNDEFINED check on the pointer type.

Thanks,
Feng

[PATCH] Fix some improper debug dump in clone materialization

2020-06-01 Thread Feng Xue OS via Gcc-patches

Clone materialization might produce some improper debug output as:

Original--

cloning foo/271 to foo.constprop/334
   replace map: 0 -> xxx1->yyy
m_always_copy_start: 1
IPA adjusted parameters: foo (...)
{
...
}

And a better output could be:

cloning foo/271 to foo.constprop/334
replace map: 0 -> xxx, 1->yyy /* separate 1 from xxx with a comma */
m_always_copy_start: 1   /* Align with replace map */
IPA adjusted parameters:/* If no adjusted parameter, start a new line or omit this line */
foo (...)
{
...
}
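
The separator idiom used for the replace map can be sketched in isolation.
The struct and function below are hypothetical stand-ins, not the GCC data
structures, and the sketch writes into a buffer rather than a dump file so
the result is easy to inspect:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one replace-map entry, not the GCC type.  */
struct replace_entry
{
  int parm_num;
  const char *new_value;
};

/* Format the map into BUF.  Every entry except the first is preceded by
   a comma.  Returns 0 on success, -1 if BUF is too small.  */
static int
format_replace_map (char *buf, size_t size,
                    const struct replace_entry *map, size_t n)
{
  int w = snprintf (buf, size, "replace map:");
  if (w < 0 || (size_t) w >= size)
    return -1;
  size_t used = (size_t) w;

  for (size_t i = 0; i < n; i++)
    {
      w = snprintf (buf + used, size - used, "%s %i -> %s",
                    i ? "," : "", map[i].parm_num, map[i].new_value);
      if (w < 0 || (size_t) w >= size - used)
        return -1;
      used += (size_t) w;
    }
  return 0;
}
```

Emitting the comma before each entry other than the first avoids both a
trailing separator and a special case after the loop.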

Feng
---
2020-06-01  Feng Xue  

gcc/
* cgraphclones.c (materialize_all_clones): Adjust replace map dump.
* ipa-param-manipulation.c (ipa_dump_adjusted_parameters): Do not
dump information if there is no adjusted parameter.
* (ipa_param_adjustments::dump): Adjust prefix spaces for dump string.
---
 gcc/cgraphclones.c   | 6 +++---
 gcc/ipa-param-manipulation.c | 5 -
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/gcc/cgraphclones.c b/gcc/cgraphclones.c
index e4f1c1d4b5e..db61c218297 100644
--- a/gcc/cgraphclones.c
+++ b/gcc/cgraphclones.c
@@ -1160,15 +1160,15 @@ symbol_table::materialize_all_clones (void)
  if (node->clone.tree_map)
{
  unsigned int i;
- fprintf (symtab->dump_file, "   replace map: ");
+ fprintf (symtab->dump_file, "replace map:");
  for (i = 0;
   i < vec_safe_length (node->clone.tree_map);
   i++)
{
  ipa_replace_map *replace_info;
  replace_info = (*node->clone.tree_map)[i];
- fprintf (symtab->dump_file, "%i -> ",
-  (*node->clone.tree_map)[i]->parm_num);
+ fprintf (symtab->dump_file, "%s %i -> ",
+  i ? "," : "", replace_info->parm_num);
  print_generic_expr (symtab->dump_file,
  replace_info->new_tree);
}
diff --git a/gcc/ipa-param-manipulation.c b/gcc/ipa-param-manipulation.c
index 978916057f0..2cc4bc79dc1 100644
--- a/gcc/ipa-param-manipulation.c
+++ b/gcc/ipa-param-manipulation.c
@@ -111,6 +111,9 @@ ipa_dump_adjusted_parameters (FILE *f,
   unsigned i, len = vec_safe_length (adj_params);
   bool first = true;
 
+  if (!len)
+return;
+
   fprintf (f, "IPA adjusted parameters: ");
   for (i = 0; i < len; i++)
 {
@@ -899,7 +902,7 @@ ipa_param_adjustments::dump (FILE *f)
   fprintf (f, "m_always_copy_start: %i\n", m_always_copy_start);
   ipa_dump_adjusted_parameters (f, m_adj_params);
   if (m_skip_return)
-fprintf (f, " Will SKIP return.\n");
+fprintf (f, "Will SKIP return.\n");
 }
 
 /* Dump information contained in the object in textual form to stderr.  */
--

[PATCH] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

2020-06-01 Thread Feng Xue OS via Gcc-patches

This patch is meant to add match rules to simplify patterns as:

o. (pointer + offset_a) - (pointer + offset_b)   ->   (ptrdiff_t) (offset_a - offset_b)
o. (pointer_a + offset) - (pointer_b + offset)   ->   (pointer_a - pointer_b)
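
At the source level, the two patterns correspond to expressions like the ones
below. This is a minimal sketch with hypothetical function names; it assumes
all pointers stay within one array, as C requires for pointer subtraction:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration only, not part of the patch.  */

/* (pointer + offset_a) - (pointer + offset_b): same base.  */
static ptrdiff_t
same_base_diff (char *p, size_t offset_a, size_t offset_b)
{
  return (p + offset_a) - (p + offset_b);
}

/* (pointer_a + offset) - (pointer_b + offset): same offset.  */
static ptrdiff_t
same_offset_diff (char *pa, char *pb, size_t offset)
{
  return (pa + offset) - (pb + offset);
}
```

The first helper depends only on the two offsets and the second only on the
two pointers, which is what the proposed rules exploit.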

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Feng
---
2020-06-01  Feng Xue  

gcc/
PR tree-optimization/94234
* match.pd ((PTR + A) - (PTR + B)) -> (ptrdiff_t)(A - B): New
simplification.
* ((PTR_A + O) - (PTR_B + O)) -> (PTR_A - PTR_B): New simplification.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234.c: New test.
---
 gcc/match.pd   | 19 +--
 gcc/testsuite/gcc.dg/pr94234.c | 24 
 2 files changed, 33 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr94234.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 33ee1a920bf..6553be4822e 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2515,16 +2515,15 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 && TREE_CODE (@2) == INTEGER_CST
 && tree_int_cst_sign_bit (@2) == 0))
  (minus (convert @1) (convert @2)
-   (simplify
-(pointer_diff (pointer_plus @@0 @1) (pointer_plus @0 @2))
-/* The second argument of pointer_plus must be interpreted as signed, and
-   thus sign-extended if necessary.  */
-(with { tree stype = signed_type_for (TREE_TYPE (@1)); }
- /* Use view_convert instead of convert here, as POINTER_PLUS_EXPR
-   second arg is unsigned even when we need to consider it as signed,
-   we don't want to diagnose overflow here.  */
- (minus (convert (view_convert:stype @1))
-   (convert (view_convert:stype @2)))
+  (simplify
+   (pointer_diff (pointer_plus@3 @0 @1) (pointer_plus @0 @2))
+(if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@3)))
+  (convert (minus @1 @2
+  (simplify
+   (pointer_diff (pointer_plus@3 @0 @2) (pointer_plus @1 @2))
+(if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@3))
+&& !integer_zerop (@2))
+ (pointer_diff @0 @1)
 
 /* (A * C) +- (B * C) -> (A+-B) * C and (A * C) +- A -> A * (C+-1).
 Modeled after fold_plusminus_mult_expr.  */
diff --git a/gcc/testsuite/gcc.dg/pr94234.c b/gcc/testsuite/gcc.dg/pr94234.c
new file mode 100644
index 000..ef9076c80da
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94234.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ccp1" } */ 
+
+typedef __SIZE_TYPE__ size_t;
+typedef __PTRDIFF_TYPE__ ptrdiff_t;
+
+ptrdiff_t foo (char *a, size_t n)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n - 1);
+
+  return b1 - b2;
+}
+
+ptrdiff_t goo (char *a, size_t n, size_t m)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n + 1);
+
+  return (b1 + m) - (b2 + m);
+}
+
+/* { dg-final { scan-tree-dump-times "return 8;" 1 "ccp1" } } */
+/* { dg-final { scan-tree-dump-times "return -8;" 1 "ccp1" } } */
From 160eaeb151197844005837dc4b8e1e27bb6dfadf Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 1 Jun 2020 11:57:35 +0800
Subject: [PATCH] tree-optimization/94234 - add ptr-diff pattern for addresses
 with same base or offset

