[PATCH] Update email in MAINTAINERS file.

2024-09-23 Thread Aldy Hernandez via Gcc
From: Aldy Hernandez 

ChangeLog:

* MAINTAINERS: Update email and add myself to DCO.
---
 MAINTAINERS | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index cfd96c9f33e..e9fafaf45a7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -116,7 +116,7 @@ riscv port  Jim Wilson  

 rs6000/powerpc port David Edelsohn  
 rs6000/powerpc port Segher Boessenkool  
 rs6000/powerpc port Kewen Lin   
-rs6000 vector extns Aldy Hernandez  
+rs6000 vector extns Aldy Hernandez  
 rx port Nick Clifton
 s390 port   Ulrich Weigand  
 s390 port   Andreas Krebbel 
@@ -213,7 +213,7 @@ c++ runtime libsJonathan Wakely 

 c++ runtime libs special modes  François Dumont 
 fixincludes Bruce Korb  
 *gimpl* Jakub Jelinek   
-*gimpl* Aldy Hernandez  
+*gimpl* Aldy Hernandez  
 *gimpl* Jason Merrill   
 gcse.cc Jeff Law
 global opt frameworkJeff Law
@@ -240,7 +240,7 @@ option handling Joseph Myers

 middle-end  Jeff Law
 middle-end  Ian Lance Taylor
 middle-end  Richard Biener  
-*vrp, rangerAldy Hernandez  
+*vrp, rangerAldy Hernandez  
 *vrp, rangerAndrew MacLeod  
 tree-ssaAndrew MacLeod  
 tree browser/unparser   Sebastian Pop   
@@ -518,7 +518,7 @@ Daniel Hellstromdanielh 

 Fergus Henderson-   
 Richard Henderson   rth 
 Stuart Hendersonshenders
-Aldy Hernandez  aldyh   
+Aldy Hernandez  aldyh   
 Philip Herron   redbrain
 Marius Hillenbrand  -   
 Matthew Hiller  -   
@@ -948,3 +948,4 @@ Jonathan Wakely 

 Alexander Westbrooks
 Chung-Ju Wu 
 Pengxuan Zheng  
+Aldy Hernandez  
-- 
2.43.0



Re: 1.76% performance loss in VRP due to inlining

2024-05-03 Thread Aldy Hernandez via Gcc
After some very painful analysis, I was able to reduce the degradation
we are experiencing in VRP to a handful of lines in the new
implementation of prange.

What happens is that any series of small changes to a new prange class
causes changes in the inlining of wide_int_storage elsewhere.  With
the attached patch, one difference lies in irange::singleton_p(tree
*).  Note that this is in irange, which is completely unrelated to the
new (unused) code.

Using trunk as the stage1 compiler, we can see the assembly for
irange::singleton_p(tree *) in value-range.cc is different with and
without my patch.

The number of calls into wide_int within irange::singleton_p(tree *) changes:

awk '/^_ZNK6irange11singleton_pEPP9tree_node/,/endproc/' value-range.s | grep call.*wide_int

With mainline sources:

call    _ZN16wide_int_storageC2ERKS_
call    _Z16wide_int_to_treeP9tree_nodeRK8poly_intILj1E16generic_wide_intI20wide_int_ref_storageILb0ELb1

With the attached patch:

call    _ZN16wide_int_storageC2ERKS_
call    _ZN16wide_int_storageC2ERKS_
call    _Z16wide_int_to_treeP9tree_nodeRK8poly_intILj1E16generic_wide_intI20wide_int_ref_storageILb0ELb1
call    _ZN16wide_int_storageC2ERKS_

The additional calls correspond to the wide_int_storage constructor:

$ c++filt _ZN16wide_int_storageC2ERKS_
wide_int_storage::wide_int_storage(wide_int_storage const&)

Using -fno-semantic-interposition makes no difference.

Here are the relevant bits in the difference from -Winline with and
without my patch:

> inlined from ‘virtual bool irange::singleton_p(tree_node**) const’ at 
> /home/aldyh/src/gcc/gcc/value-range.cc:1254:40:
> /home/aldyh/src/gcc/gcc/wide-int.h:1196:8: warning: inlining failed in call 
> to ‘wide_int_storage::wide_int_storage(const wide_int_storage&)’: --param 
> inline-unit-growth limit reached [-Winline]
>  1196 | inline wide_int_storage::wide_int_storage (const wide_int_storage &x)
>   |^~~~
> /home/aldyh/src/gcc/gcc/wide-int.h:775:7: note: called from here
>   775 | class GTY(()) generic_wide_int : public storage
>   |   ^~~~
> /home/aldyh/src/gcc/gcc/wide-int.h:1196:8: warning: inlining failed in call 
> to ‘wide_int_storage::wide_int_storage(const wide_int_storage&)’: --param 
> inline-unit-growth limit reached [-Winline]
>  1196 | inline wide_int_storage::wide_int_storage (const wide_int_storage &x)
>   |^~~~
> /home/aldyh/src/gcc/gcc/wide-int.h:775:7: note: called from here
>   775 | class GTY(()) generic_wide_int : public storage
>   |   ^~~~
> In copy constructor 
> ‘generic_wide_int::generic_wide_int(const 
> generic_wide_int&)’,
> inlined from ‘wide_int irange::lower_bound(unsigned int) const’ at 
> /home/aldyh/src/gcc/gcc/value-range.h:1122:25,

Note that this is just one example.  There are also inlining
differences to irange::get_bitmask(), irange::union_bitmask(),
irange::operator=, among others.  Most of the inlining failures seem
to be related to wide_int_storage.  I am attaching the difference in
-Winline for the curious.

Tracking this down is tricky because the slightest change in the patch
causes different inlining in irange.  Even using a slightly different
stage1 compiler produces different changes.  For example, using GCC 13
as the stage1 compiler, VRP exhibits a slowdown of 2% with the full
prange class.  Although this is virtually identical to the slowdown
for using trunk as the stage1 compiler, the inlining failures are a
tad different.

I am tempted to commit the attached to mainline, which slows down VRP
by only 0.3% but is measurable enough to analyze, just so we have a base
commit-point from which to do the analysis.  My wife is about to give
birth any day now, so I'm afraid that if I drop off for a few months,
we'll lose the analysis and the point in time from which to do it.

One final thing.  The full prange class patch, even when disabled,
slows VRP by 2%.  I tried to implement the class in small increments,
and every small change caused a further slowdown.  I don't know if
this 2% is final, or if further tweaks in this space will slow us down
more.

On a positive note, with the entirety of prange implemented (not just
the base class, but with range-ops implemented and prange enabled), there is
no overall change to VRP, and IPA-cp speeds up by 7%.  This is because
holding pointers in prange is a net win that overcomes the 2% handicap
the inliner is hitting us with.

I would love to hear thoughts, and whether y'all agree that committing a
small skeleton now can help us track this down in the future.

Aldy

On Tue, Apr 30, 2024 at 11:37 PM Jason Merrill  wrote:
>
> On 4/30/24 12:22, Jakub Jelinek wrote:
> > On Tue, Apr 30, 2024 at 03:09:51PM -0400, Jason Merrill via Gcc wrote:
> >> On Fri, Apr 26, 2024

Re: 1.76% performance loss in VRP due to inlining

2024-04-30 Thread Aldy Hernandez via Gcc
On Tue, Apr 30, 2024 at 9:58 AM Richard Biener
 wrote:
>
> On Fri, Apr 26, 2024 at 11:45 AM Aldy Hernandez via Gcc  
> wrote:
> >
> > Hi folks!
> >
> > In implementing prange (pointer ranges), I have found a 1.74% slowdown
> > in VRP, even without any code path actually using the code.  I have
> > tracked this down to irange::get_bitmask() being compiled differently
> > with and without the bare bones patch.  With the patch,
> > irange::get_bitmask() has a lot of code inlined into it, particularly
> > get_bitmask_from_range() and consequently the wide_int_storage code.
> >
> > I don't know whether this is expected behavior, and if it is, how to
> > mitigate it.  I have tried declaring get_bitmask_from_range() inline,
> > but that didn't help.  OTOH, using __attribute__((always_inline))
> > helps a bit, but not entirely.  What does help is inlining
> > irange::get_bitmask() entirely, but that seems like a big hammer.
>
> You can use -Winline to see why we don't inline an inline declared
> function.  I would guess the unit-growth limit kicks in?

Ah, will do.  Thanks.

>
> Did you check a release checking compiler?  That might still
> inline things.

Yes, we only measure performance with release builds.

Aldy



1.76% performance loss in VRP due to inlining

2024-04-26 Thread Aldy Hernandez via Gcc
Hi folks!

In implementing prange (pointer ranges), I have found a 1.74% slowdown
in VRP, even without any code path actually using the code.  I have
tracked this down to irange::get_bitmask() being compiled differently
with and without the bare bones patch.  With the patch,
irange::get_bitmask() has a lot of code inlined into it, particularly
get_bitmask_from_range() and consequently the wide_int_storage code.

I don't know whether this is expected behavior, and if it is, how to
mitigate it.  I have tried declaring get_bitmask_from_range() inline,
but that didn't help.  OTOH, using __attribute__((always_inline))
helps a bit, but not entirely.  What does help is inlining
irange::get_bitmask() entirely, but that seems like a big hammer.
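
As a minimal, self-contained sketch of the difference between the two
mitigations tried above (hypothetical code, not from value-range.cc;
the helper names are made up):

static inline int
helper_suggested (int x)   // plain "inline" is only a suggestion;
{                          // -Winline reports why it was refused
  return x * 2 + 1;
}

__attribute__ ((always_inline))
static inline int
helper_forced (int x)      // inlined even when growth limits kick in
{
  return x * 2 + 1;
}

int
caller (int x)
{
  return helper_suggested (x) + helper_forced (x);
}

With plain "inline" the call can still fail to inline once --param
inline-unit-growth is exhausted; always_inline overrides those
heuristics for that one function, which matches the partial improvement
described above.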

The overall slowdown in compilation is 0.26%, because VRP is a
relatively fast pass, but a measurable pass slowdown is something we'd
like to avoid.

What's the recommended approach here?

For the curious, I am attaching before and after copies of
value-range.s.  I am also attaching the two patches needed to
reproduce the problem on mainline.  The first patch is merely setup.
It is the second patch that exhibits the problem.  Notice there are no
uses of prange yet.

Thanks.
Aldy
From ee63833c5f56064ef47c2bb9debd485f77d00171 Mon Sep 17 00:00:00 2001
From: Aldy Hernandez 
Date: Tue, 19 Mar 2024 18:04:55 +0100
Subject: [PATCH] Move get_bitmask_from_range out of irange class.

---
 gcc/value-range.cc | 52 +++---
 gcc/value-range.h  |  1 -
 2 files changed, 26 insertions(+), 27 deletions(-)

diff --git a/gcc/value-range.cc b/gcc/value-range.cc
index 70375f7abf9..0f81ce32615 100644
--- a/gcc/value-range.cc
+++ b/gcc/value-range.cc
@@ -31,6 +31,30 @@ along with GCC; see the file COPYING3.  If not see
 #include "fold-const.h"
 #include "gimple-range.h"
 
+// Return the bitmask inherent in a range.
+
+static irange_bitmask
+get_bitmask_from_range (tree type,
+			const wide_int &min, const wide_int &max)
+{
+  unsigned prec = TYPE_PRECISION (type);
+
+  // All the bits of a singleton are known.
+  if (min == max)
+    {
+      wide_int mask = wi::zero (prec);
+      wide_int value = min;
+      return irange_bitmask (value, mask);
+    }
+
+  wide_int xorv = min ^ max;
+
+  if (xorv != 0)
+    xorv = wi::mask (prec - wi::clz (xorv), false, prec);
+
+  return irange_bitmask (wi::zero (prec), min | xorv);
+}
+
 void
 irange::accept (const vrange_visitor &v) const
 {
@@ -1832,31 +1856,6 @@ irange::invert ()
 verify_range ();
 }
 
-// Return the bitmask inherent in the range.
-
-irange_bitmask
-irange::get_bitmask_from_range () const
-{
-  unsigned prec = TYPE_PRECISION (type ());
-  wide_int min = lower_bound ();
-  wide_int max = upper_bound ();
-
-  // All the bits of a singleton are known.
-  if (min == max)
-    {
-      wide_int mask = wi::zero (prec);
-      wide_int value = lower_bound ();
-      return irange_bitmask (value, mask);
-    }
-
-  wide_int xorv = min ^ max;
-
-  if (xorv != 0)
-    xorv = wi::mask (prec - wi::clz (xorv), false, prec);
-
-  return irange_bitmask (wi::zero (prec), min | xorv);
-}
-
 // Remove trailing ranges that this bitmask indicates can't exist.
 
 void
@@ -1978,7 +1977,8 @@ irange::get_bitmask () const
   // in the mask.
   //
   // See also the note in irange_bitmask::intersect.
-  irange_bitmask bm = get_bitmask_from_range ();
+  irange_bitmask bm
+    = get_bitmask_from_range (type (), lower_bound (), upper_bound ());
   if (!m_bitmask.unknown_p ())
     bm.intersect (m_bitmask);
   return bm;
diff --git a/gcc/value-range.h b/gcc/value-range.h
index 9531df56988..dc5b153a83e 100644
--- a/gcc/value-range.h
+++ b/gcc/value-range.h
@@ -351,7 +351,6 @@ private:
   bool varying_compatible_p () const;
   bool intersect_bitmask (const irange &r);
   bool union_bitmask (const irange &r);
-  irange_bitmask get_bitmask_from_range () const;
   bool set_range_from_bitmask ();
 
   bool intersect (const wide_int& lb, const wide_int& ub);
-- 
2.44.0
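
For reference, here is a standalone analogue of the mask computation in
get_bitmask_from_range above (hypothetical: fixed at 32 bits, with
__builtin_clz standing in for wi::clz and uint32_t for wide_int):

#include <cassert>
#include <cstdint>

// Bits set in the result are unknown; clear bits are known to be zero.
static uint32_t
unknown_bits (uint32_t min, uint32_t max)
{
  if (min == max)
    return 0;                   // singleton: every bit is known
  uint32_t xorv = min ^ max;
  // Mask covering the highest differing bit and everything below it,
  // mirroring wi::mask (prec - wi::clz (xorv), false, prec).
  xorv = 0xffffffffu >> __builtin_clz (xorv);
  return min | xorv;            // same "min | xorv" as in the patch
}

int
main ()
{
  assert (unknown_bits (0x10, 0x13) == 0x13);   // low two bits differ
  assert (unknown_bits (7, 7) == 0);            // singleton: all known
}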

From 03c70de43177a97ec5e9c243aafc798c1f37e6d8 Mon Sep 17 00:00:00 2001
From: Aldy Hernandez 
Date: Wed, 20 Mar 2024 06:25:52 +0100
Subject: [PATCH] Implement minimum prange class exhibiting VRP slowdown.

---
 gcc/value-range-pretty-print.cc |  25 +++
 gcc/value-range-pretty-print.h  |   1 +
 gcc/value-range.cc  | 274 
 gcc/value-range.h   | 196 +++
 4 files changed, 496 insertions(+)

diff --git a/gcc/value-range-pretty-print.cc b/gcc/value-range-pretty-print.cc
index c75cbea3955..154253e047f 100644
--- a/gcc/value-range-pretty-print.cc
+++ b/gcc/value-range-pretty-print.cc
@@ -113,6 +113,31 @@ vrange_printer::print_irange_bitmasks (const irange &r) const
   pp_string (pp, p);
 }
 
+void
+vrange_printer::visit (const prange &r) const
+{
+  pp_string (pp, "[prange] ");
+  if (r.undefined_p ())
+    {
+      pp_string (pp, "UNDEFINED");
+      return;
+    }
+  dump_generic_node (pp, r.type (), 0

Upcoming removal of legacy ranges and conversion to wide_ints.

2023-04-25 Thread Aldy Hernandez via Gcc

After GCC 13 is released we will remove legacy range support from the
compiler, and convert irange's to wide_ints.  I want to give everyone
a heads up, to help understand what's involved and what the end result is.

Legacy ranges are basically the value_range type (int_range<1>) where
the internal representation has anti-ranges and allows symbolics, etc.
It is a holdout from the old VRP, and was left in place because it was
pervasive and too risky to remove late in the GCC 13 cycle.

There will be lots of patches (about 30), mostly because legacy
touches a lot, and also because it was necessary to rip out things in
order to double check the work.  Furthermore, having small pieces
makes it easier to bisect any possible regressions.

I will give a bird's eye view of what follows with details in the
patches themselves.

First the good news.  VRP improves by 13.22%, jump threading by 11.6%
and overall compilation improves by 1.5%.  Andrew has some
improvements to the cache that should provide significant additional
speedups.

Here are the main parts of the work:

1. Converting users of the old API to the new irange API.

   There are a few holdouts, most notably the middle end warnings,
   some of which have strong dependencies on VR_ANTI_RANGE and the old
   API.  I have provided a transitional function (get_legacy_range)
   which translates an irange to whatever min/max/kind concoction they
   rely on.  I have no plans for converting these passes, as I don't
   understand the code well enough to fix it.  Naive attempts broke
   the tests, even though conceptually the changes were correct.  At
   least the damage will be limited to users of one function:
   get_legacy_range().

   The IPA passes also use the old API, but I have plans (and patches)
   for revamping all of the IPA ranges.  Details below.

2. Repurposing int_range<1> to its obvious meaning (a range with two
   endpoints).  This will reduce the memory footprint for small
   ranges.

3. Overhauling the vrange storage mechanism used for global ranges
   (SSA_NAME_RANGE_INFO) so it can be shared with the ranger cache.

   The reason for this is because the ranger cache used a tree range
   allocator, which didn't exactly scale to the eventual conversion to
   wide ints.  It also allows us to use one lean and efficient
   allocator for everything instead of two incomplete implementations.

   The storage remains slim (no trees, plus a trailing_wide_int like
   implementation).  The storage is also quite fast, since without
   legacy or trees, we can copy host integers arrays back and forth
   between the storage and the wide_ints.

4. Conversion of trees to wide ints in irange.

5. Various performance optimizations now possible because we have
   moved away from both legacy and trees.

A few notes.

The value_range typedef has been renamed to int_range<2>, since
int_range<1> now means two endpoints which can't represent the inverse
of a range (say not-zero) for anything but a handful of ranges.  The
(future) plan is to mechanically convert the definitions to
int_range<2> and finally rename Value_Range (the type agnostic range
class) to value_range for use in passes that must work in a variety of
different types (floats, integers, etc).
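
To make the int_range<1> limitation concrete, here is a toy model (not
GCC's int_range; the names are made up) of why inverting a singleton
needs two pairs of endpoints:

#include <cassert>
#include <climits>
#include <utility>
#include <vector>

using sub_range = std::pair<int, int>;

// The inverse of [v, v] over int is [INT_MIN, v-1] U [v+1, INT_MAX]:
// two sub-ranges, so a single endpoint pair (int_range<1>) cannot
// represent "not v", while int_range<2> can.
static std::vector<sub_range>
invert_singleton (int v)
{
  std::vector<sub_range> inv;
  if (v > INT_MIN)
    inv.push_back ({INT_MIN, v - 1});
  if (v < INT_MAX)
    inv.push_back ({v + 1, INT_MAX});
  return inv;
}

int
main ()
{
  assert (invert_singleton (0).size () == 2);   // "non-zero" needs two pairs
}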

IPA has been rolling their own ranges forever.  They use a combination
of the legacy API along with handcrafted pairs of wide_ints (ipa_vr).
The passes must be divorced from their legacy dependency, and cleaned up a
bit.  IPA is also very tied to integers and pointers, and I see no
reason why we can't keep track of float arguments, etc.

I am sitting on a lot of additional patches to do 90% of the
conversion, but need to consult with the IPA experts on various issues
before proceeding (for instance, the lifetime of various structures).
Among these patches are generic vrange LTO streaming functions and
vrange hashing.  I think it's time we make vrange a first class
citizen for LTO/hashing and a few other things folks were doing in an
ad-hoc manner.  This should alleviate the maintenance burden on the
IPA maintainers going forward.  Note, that the IPA work is a
follow-up, and only after careful consultation with the relevant
maintainers.

In addition to the IPA/LTO work described above, future work this
cycle will include converting irange to the value/mask mechanism we
use elsewhere in the compiler (CCP, etc).  With wide ints in place, this
should be relatively straightforward, plus it will give us additional
optimization opportunities in the ranger ecosystem.

Comments welcome.

Aldy and Andrew



Re: Stepping down as gcov maintainer and callgraph reviewer

2023-02-17 Thread Aldy Hernandez via Gcc
Woah.  Sad to see you leave!

It's always been a pleasure to work with you.

Thanks for all the great work, and good luck on your next endeavors.

Aldy

On Thu, Feb 16, 2023 at 4:55 PM Martin Liška  wrote:
>
> Hello GCC community.
>
> After spending the last decade (even more, counting my diploma thesis)
> of my life working on GCC, I decided to leave the project and try
> a different job. I would like to thank all the community members I had
> the chance to interact with; I learned so much and enjoyed the journey!
> I'll be leaving sometime at the beginning of May.
>
> That said, I'm stepping down from my 2 positions as I won't have time
> for proper patch review and bugs in the area I'm responsible for.
>
> I wish the project all the best!
>
> Cheers,
> Martin



Re: get_range_query vs NULL argument

2023-02-16 Thread Aldy Hernandez via Gcc




On 2/16/23 08:57, Richard Biener wrote:

On Wed, Feb 15, 2023 at 11:31 PM Andrew MacLeod via Gcc  wrote:



On 2/15/23 14:50, Andrew Pinski wrote:

Hi,
While fixing PR 108354, I came across that
ssa_name_has_boolean_range calls get_range_query with cfun as the
argument but sometimes while in IPA passes cfun is currently nullptr.
The question should we check the argument before calling
get_range_query or is it a valid thing to call it with a nullptr (and
have it return global_ranges in that case)?


That might be ok...  personally I see nothing wrong with:

diff --git a/gcc/value-query.h b/gcc/value-query.h
index 63878968118..2d7bf8fcf33 100644
--- a/gcc/value-query.h
+++ b/gcc/value-query.h
@@ -140,7 +140,7 @@ get_global_range_query ()
   ATTRIBUTE_RETURNS_NONNULL inline range_query *
   get_range_query (const struct function *fun)
   {
-  return fun->x_range_query ? fun->x_range_query : &global_ranges;
+  return (fun && fun->x_range_query) ? fun->x_range_query : &global_ranges;
   }

   // Query the global range of NAME in function F.  Default to cfun.


The client is probably going to do that anyway.


But if there's no 'fun', what is 'global_ranges' initialized for?  Or
is 'global_ranges'
usable in IPA context?


If there is no fun, global_ranges will just return what get_range_info() 
used to return (i.e. SSA_NAME_RANGE_INFO).
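
A self-contained sketch of that fallback behavior (simplified types,
not the real value-query.h declarations):

#include <cassert>

struct range_query {};

static range_query global_ranges;

struct function
{
  range_query *x_range_query = nullptr;
};

// Analogue of the patched get_range_query: a null function context
// (e.g. cfun == NULL in an IPA pass) falls back to the global query.
inline range_query *
get_range_query (const function *fun)
{
  return (fun && fun->x_range_query) ? fun->x_range_query : &global_ranges;
}

int
main ()
{
  assert (get_range_query (nullptr) == &global_ranges);  // IPA: no cfun
  function f;
  assert (get_range_query (&f) == &global_ranges);  // no per-function query
}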


Aldy



Re: are most floating point cases in tree_call_nonnegative_warnv_p() wrong for HONOR_NANS?

2022-11-13 Thread Aldy Hernandez via Gcc
I suppose most of the calls to tree_expr_nonnegative_p that apply to
floats in match.pd and tree-ssa-math-opts.cc are guarded by
!HONOR_NANS || tree_expr_nonnegative_p.  Still feels kinda delicate...

Aldy

On Sun, Nov 13, 2022 at 4:56 PM Aldy Hernandez  wrote:
>
> Based on discussions in the last few weeks, aren't most of the cases in
> tree_call_nonnegative_warnv_p() wrong when honoring NANS?
>
> For example:
>  CASE_CFN_ACOS:
>  CASE_CFN_ACOS_FN:
>  CASE_CFN_ACOSH:
>  CASE_CFN_ACOSH_FN:
> ...
> ...
>/* Always true.  */
>return true;
>
> But are we guaranteed a +NAN for any NAN input?  I thought we were only
> guaranteed the NAN sign for abs, copysign, assignment, etc?  Similarly
> for most other cases in this function.
>
> Hmmm.  I really think a good chunk of fold-const.cc should live in
> range-ops.  It seems we're duplicating a lot of functionality.
> Similarly to bit-CCP as I've mentioned.
>
> Aldy



are most floating point cases in tree_call_nonnegative_warnv_p() wrong for HONOR_NANS?

2022-11-13 Thread Aldy Hernandez via Gcc
Based on discussions in the last few weeks, aren't most of the cases in 
tree_call_nonnegative_warnv_p() wrong when honoring NANS?


For example:
CASE_CFN_ACOS:
CASE_CFN_ACOS_FN:
CASE_CFN_ACOSH:
CASE_CFN_ACOSH_FN:
...
...
  /* Always true.  */
  return true;

But are we guaranteed a +NAN for any NAN input?  I thought we were only 
guaranteed the NAN sign for abs, copysign, assignment, etc?  Similarly 
for most other cases in this function.
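
A small program illustrating the concern (a hypothetical check; the
standard does not pin down the sign of the NAN that acos returns):

#include <cmath>
#include <cstdio>

int
main ()
{
  double r = std::acos (std::nan (""));
  // signbit (r) is unspecified here, so a fold that treats acos as
  // "always nonnegative" cannot rely on it when NANs are honored.
  std::printf ("acos(NAN) signbit: %d\n", (int) std::signbit (r));
}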


Hmmm.  I really think a good chunk of fold-const.cc should live in 
range-ops.  It seems we're duplicating a lot of functionality. 
Similarly to bit-CCP as I've mentioned.


Aldy



Re: spaceship_replacement cannot see through simplified set of FP conditionals

2022-07-29 Thread Aldy Hernandez via Gcc
Swt!

Would you like me to XFAIL the test or leave it as a failure?

Thanks.
Aldy

On Fri, Jul 29, 2022 at 1:50 PM Jakub Jelinek  wrote:
>
> On Fri, Jul 29, 2022 at 01:40:12PM +0200, Aldy Hernandez wrote:
> > With my upcoming patch enabling floating point VRP,
> > g++.dg/opt/pr94589-2.C is failing:
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598788.html
> >
> > The problem is that phiopt no longer sees the following snippet,
> > because we have folded away the final conditional as true:
>
> > Could someone give me a hand here? Can spaceship_replacement easily be
> > adapted to handle this case, or should we be simplifying this
> > elsewhere?
>
> Leave it to me, once you commit, file a PR and I'll work on it.
>
> Jakub
>



spaceship_replacement cannot see through simplified set of FP conditionals

2022-07-29 Thread Aldy Hernandez via Gcc
Hi.

With my upcoming patch enabling floating point VRP,
g++.dg/opt/pr94589-2.C is failing:

https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598788.html

The problem is that phiopt no longer sees the following snippet,
because we have folded away the final conditional as true:

bool f5 (double i, double j)
{
  signed char __v$_M_value;
  signed char c$_M_value;
  bool D.8519;
  struct partial_ordering __v;
  struct partial_ordering c;
  unsigned int _11;
  bool _16;

   :
  if (i_2(D) != j_4(D))
goto ; [INV]
  else
goto ; [INV]

   :
  if (i_2(D) >= j_4(D))
goto ; [INV]
  else
goto ; [INV]

   :
  if (i_2(D) > j_4(D))
goto ; [INV]
  else
goto ; [INV]

   :

   :
  # c$_M_value_3 = PHI <-1(3), 0(2), 2(5), 1(4)>
  _11 = (unsigned int) c$_M_value_3;
  _16 = _11 <= 1;
  return _16;
}

This means that spaceship_replacement() now sees one less argument to the PHI:

   :
  if (i_2(D) != j_4(D))
goto ; [INV]
  else
goto ; [INV]

   :
  if (i_2(D) >= j_4(D))
goto ; [INV]
  else
goto ; [INV]

   :

   :
  # c$_M_value_3 = PHI <-1(3), 0(2), 1(4)>

...and we bail because we are expecting 4 arguments:

  if (EDGE_COUNT (phi_bb->preds) != 4)
return false;

Could someone give me a hand here? Can spaceship_replacement easily be
adapted to handle this case, or should we be simplifying this
elsewhere?

Thanks.
Aldy



Re: is_a<> with references

2022-06-01 Thread Aldy Hernandez via Gcc
Issue resolved.  Typedef no longer needed.

Sorry for the noise.
Aldy

On Wed, Jun 1, 2022 at 4:29 PM Aldy Hernandez  wrote:
>
> Hi folks.
>
> I rolled our own is_a<> implementation for vrange, because it was
> trivial.  We needed it to work on references, and GCC's is-a.h
> implementation is pointer-only. However, I now realize it confuses
> gengtype when adding additional types:
>
> template <typename T>
> struct vrange_traits
> {
>   // Default to something unusable.
>   typedef void range_type;
> };
>
> template<>
> struct vrange_traits<irange>
> {
>   typedef irange range_type;
> };
>
> template<>
> struct vrange_traits<frange>
> {
>   typedef frange range_type;  <-- BOO! HISS!
> };
>
> build/gengtype -S /home/aldyh/src/gcc/gcc -I gtyp-input.list -w tmp-gtype.state
> /home/aldyh/src/gcc/gcc/value-range.h:291: type `range_type' previously 
> defined
> /home/aldyh/src/gcc/gcc/value-range.h:285: previously defined here
>
> Can some gengtype guru offer guidance?
>
> Thanks.
> Aldy



is_a<> with references

2022-06-01 Thread Aldy Hernandez via Gcc
Hi folks.

I rolled our own is_a<> implementation for vrange, because it was
trivial.  We needed it to work on references, and GCC's is-a.h
implementation is pointer-only. However, I now realize it confuses
gengtype when adding additional types:

template <typename T>
struct vrange_traits
{
  // Default to something unusable.
  typedef void range_type;
};

template<>
struct vrange_traits<irange>
{
  typedef irange range_type;
};

template<>
struct vrange_traits<frange>
{
  typedef frange range_type;  <-- BOO! HISS!
};

build/gengtype -S /home/aldyh/src/gcc/gcc -I gtyp-input.list -w tmp-gtype.state
/home/aldyh/src/gcc/gcc/value-range.h:291: type `range_type' previously defined
/home/aldyh/src/gcc/gcc/value-range.h:285: previously defined here

Can some gengtype guru offer guidance?

Thanks.
Aldy
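
For context, a self-contained sketch of an is_a<> that works on
references, which is the motivation described above.  It uses
dynamic_cast for brevity; GCC avoids RTTI and would dispatch on a
discriminator instead, so treat this as an illustration rather than the
vrange code:

#include <cassert>

struct vrange { virtual ~vrange () {} };
struct irange : vrange {};
struct frange : vrange {};

template <typename T>
inline bool
is_a (const vrange &v)
{
  return dynamic_cast<const T *> (&v) != nullptr;
}

int
main ()
{
  irange r;
  const vrange &v = r;
  assert (is_a<irange> (v));    // v really is an irange
  assert (!is_a<frange> (v));
}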



Re: Announcement: gcobol

2022-03-30 Thread Aldy Hernandez via Gcc
I paid for high school and university writing Cobol, so I'm strangely
interested in the project.  Seems like life has come full circle for
me :).

Aldy

On Mon, Mar 14, 2022 at 9:35 PM James K. Lowden
 wrote:
>
> https://git.symas.net:443/cobolworx/gcc-cobol/
> https://github.com/Apress/beg-cobol-for-programmers
>
> Greetings, gcc!  We come bearing gifts!
>
> When you set your clock ahead an hour yesterday, you might not have
> realized you set your calendar back to 1985.  There's a new gcc COBOL
> compiler. We call it: gcobol.
>
> On the books, we have 1 man-year invested: two full-time
> programmers since October 2021.
>
> We have so far compiled just over 100 programs from the examples in
> "Beginning COBOL for Programmers", by Michael Coughlin. We are near the
> end of that phase of the project, and expect to have ISAM and
> Object-Oriented Cobol features implemented in the next few weeks.  We
> are working on compiling the NIST COBOL test suite, which we expect
> will take a few months to complete.  We have begun work on  gdb, too,
> and hope to have it working by year end.
>
> Our project should not be confused with GnuCOBOL
> (https://savannah.gnu.org/projects/gnucobol).  That project is a Cobol
> translator: it compiles Cobol to C, and invokes gcc to produce
> executable code.  Our gcobol compiler is (currently) a fork of gcc.  It
> implements a gcc frontend for Cobol and (obviously) invokes the gcc
> backend to produce executables.  (We have a friendly relationship with
> GnuCOBOL, and its maintainer supports our undertaking.)
>
> Ours should also not be confused with prior efforts to create a gcc
> Cobol compiler.  Others have tried and failed.  Failure wasn't an
> option for us.  I won't say it was easy, but here we are.
>
> Eventually, if the gcc maintainers are interested, we would like to
> pursue full integration with gcc.  For the moment, we have questions
> we're hoping can be answered here by those who ran the gauntlet
> before us.  Given the state of the internals documentation, that seems
> like our best option. We've been rummaging around in the odd sock
> drawer for too long.
>
> If you're like me (like I was), your visceral response to this
> announcement can be summed up in one word: Why?
>
> The answer is as easy as it is trite: the right tool for the job.
>
> I wouldn't write an operating system in Cobol.  But I wouldn't write
> one in Python or Java, either.  Cobol has a niche no other language
> occupies: a compiled language for record-oriented I/O.
>
> That might sound strangely specialized, but it's not.  Record-oriented
> I/O describes, I would argue, nearly *all* applications.  Yet, since the
> advent of C, nearly all applications have relegated I/O to an external
> library, and adopted the Unix byte-stream definition of a "file".
>
> If you've written a CGI web application, you know what I'm talking
> about.  Cobol eliminates a lot of gobbledygook by reducing free-form
> run-time variables to compile-time constants.
>
> That's the rationale, and it's not just a theory.  Cobol is alive and
> kicking.  Estimates vary, but they all say north of 100 billion lines
> of Cobol are still in use, with millions more written every year, even
> now, in the 21st century.  Odds are your last ATM transaction or credit
> card purchase went through a Cobol application.
>
> There's another answer to Why: because a free Cobol compiler is an
> essential component to any effort to migrate mainframe applications to
> what mainframe folks still call "distributed systems".  Our goal is a
> Cobol compiler that will compile mainframe applications on Linux.  Not
> a toy: a full-blooded replacement that solves problems.  One that runs
> fast and whose output runs fast, and has native gdb support.
>
> I am happy to debate the lunacy of this project and the viability of
> Cobol, either here or off-list.  Today, we want to make the project
> known to those in the technical community who might most want to know
> what we're up to, and explain why we'll be asking the questions we're
> asking.
>
> Also, if you want to give it a whirl, please don't hesitate.  We're
> happy to help, and expect to learn something in the process.
>
> Thank you for your kind attention.
>
> --jkl
>
>



-Wuninitialized false positives and threading knobs

2021-10-31 Thread Aldy Hernandez via Gcc
After Jeff's explanation of the symbiosis between jump threading and
the uninit pass, I'm beginning to see that (almost) every
Wuninitialized warning is cause for reflection.  It usually hides a
missing jump thread.  I investigated one such false positive
(uninit-pred-7_a.c) and indeed, there's a missing thread.  The
question is what to do about it.

This seemingly simple test is now regressing as can be seen by the
xfail I added.

What happens is that we now thread far more than before, causing the
distance from definition to use to expand.  The threading candidate
that would make the Wuninitialized go away is there, and the backward
threader can see it, but it refuses to thread it because the number of
statements would be too large.

This is interesting because it means threading is causing larger IL
that in turn keeps us from threading some unreachable paths later on
because the paths are too large.

If you look at the *.threadfull2 dump for the attached simplified
test, you can see that the 3->5->6->8->10->13 path would elide the
unreachable read, but alas we can't look past BB5, because it would
thread too many statements:

Checking profitability of path (backwards):  bb:10 (2 insns) bb:8 (2
insns) bb:6 (6 insns) bb:5
  Control statement insns: 2
  Overall: 8 insns
  FAIL: Did not thread around loop and would copy too many statements.

The "problem" we have is that if there's a path in the IL, the new
threader *will* exploit it (ranger dependent).  This in turns opens up
opportunities for other threaders (even DOM) creating a cascading
effect.

For the attached test, we can squelch the warning with a mere:

--param=max-jump-thread-duplication-stmts=19

I don't know how we decided on the default 15 param, and if it makes
sense to tweak this, but our current threading passes, and how they
relate to VRP  and the uninit pass look like this:

$ ls a.c.* | grep -e thread -e dom[23] -e vrp[12] -e uninit
a.c.034t.ethread
a.c.111t.threadfull1
a.c.112t.vrp1
a.c.126t.thread1
a.c.127t.dom2
a.c.191t.thread2
a.c.192t.dom3
a.c.194t.threadfull2
a.c.195t.vrp2
a.c.209t.uninit1

Perhaps we could turn down the knobs for thread[12] and increase them
for threadfull[12]?  I really don't know.  For this particular test,
we could even turn off thread1, increase the duplication statements,
and eliminate the warning.  This would leave DOM2 without the threader
that runs before it though.

I'm out of my depth here, plus I'm a bit hesitant to make performance
decisions to improve warnings.  On the other hand, it's sad that
improved threading is causing regressions on a test as simple as this
one.  That being said, I generally don't mention it, but the threading
improvements so far solve more problems than they introduce, so
perhaps we should do nothing??

I'd be curious to hear what others think.  Perhaps others could play
with the different knobs and see if there's a better combination that
could keep warnings and optimizers in better equilibrium.

[FWIW Martin, you could revisit some of the uninit regressions and see
if tweaking the above --param would silence the bogus warning.  In
which case, it's a hint that the regression may not be due to the
uninit code itself.  FTR, I'm not saying we _should_ thread more, just
that this could be a tool to help diagnose].

Aldy
/* { dg-do compile } */
/* { dg-options "-Wuninitialized -O2" } */

int g;
void blah1(int);
void blah2(int);
void crapola(int);

int foo (int n, int l, int m, int r)
{
  int v;

  if (n || l)
    v = r;

  if (m)
    g++;

  if ( n && l)
    blah1(v); /* { dg-bogus "uninitialized" "bogus warning" } */

  if ( n )
    blah2(v); /* { dg-bogus "uninitialized" "bogus warning" } */

  if ( l )
    crapola(v); /* { dg-bogus "uninitialized" "bogus warning" { xfail *-*-* } } */

  return 0;
}


Re: [TCWG CI] 471.omnetpp slowed down by 8% after gcc: Avoid invalid loop transformations in jump threading registry.

2021-09-27 Thread Aldy Hernandez via Gcc

[CCing Jeff and list for broader audience]

On 9/27/21 2:53 PM, Maxim Kuvyrkov wrote:

Hi Aldy,

Your patch seems to slow down 471.omnetpp by 8% at -O3.  Could you please take 
a look if this is something that could be easily fixed?


First of all, thanks for chasing this down.  It's incredibly useful to 
have these types of bug reports.


Jeff and I have been discussing the repercussions of adjusting the loop 
crossing restrictions in the various threaders.  He's seen some 
regressions in embedded targets when disallowing certain corner cases of 
loop crossing threads causes all sorts of grief.


Out of curiosity, does the attached (untested) patch fix the regression?

Aldy



Regards,

--
Maxim Kuvyrkov
https://www.linaro.org


On 27 Sep 2021, at 02:52, ci_not...@linaro.org wrote:

After gcc commit 4a960d548b7d7d942f316c5295f6d849b74214f5
Author: Aldy Hernandez 

Avoid invalid loop transformations in jump threading registry.

the following benchmarks slowed down by more than 2%:
- 471.omnetpp slowed down by 8% from 6348 to 6828 perf samples

Below reproducer instructions can be used to re-build both "first_bad" and 
"last_good" cross-toolchains used in this bisection.  Naturally, the scripts will fail 
when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.

For your convenience, we have uploaded tarballs with pre-processed source and 
assembly files at:
- First_bad save-temps: 
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-4a960d548b7d7d942f316c5295f6d849b74214f5/save-temps/
- Last_good save-temps: 
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-29c92857039d0a105281be61c10c9e851aaeea4a/save-temps/
- Baseline save-temps: 
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-baseline/save-temps/

Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: GCC + Glibc + GNU Linker
- Version: all components were built from their tip of trunk
- Target: arm-linux-gnueabihf
- Compiler flags: -O3 -marm
- Hardware: NVidia TK1 4x Cortex-A15

This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at 
linaro-toolch...@lists.linaro.org .  In our improvement plans is to add support for SPEC 
CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.

THIS IS THE END OF INTERESTING STUFF.  BELOW ARE LINKS TO BUILDS, REPRODUCTION 
INSTRUCTIONS, AND THE RAW COMMIT.

This commit has regressed these CI configurations:
- tcwg_bmk_gnu_tk1/gnu-master-arm-spec2k6-O3

First_bad build: 
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-4a960d548b7d7d942f316c5295f6d849b74214f5/
Last_good build: 
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-29c92857039d0a105281be61c10c9e851aaeea4a/
Baseline build: 
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-baseline/
Even more details: 
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/

Reproduce builds:

mkdir investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5
cd investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh 
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/manifests/build-baseline.sh
 --fail
curl -o artifacts/manifests/build-parameters.sh 
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/manifests/build-parameters.sh
 --fail
curl -o artifacts/test.sh 
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/test.sh
 --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ 
--exclude /gcc/ ./ ./bisect/baseline/

cd gcc

# Reproduce first_bad build
git checkout --detach 4a960d548b7d7d942f316c5295f6d849b74214f5
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach 29c92857039d0a105281be61c10c9e851aaeea4a
../artifacts/test.sh

cd ..


Full commit (up to 1000 lines):

commit 4a960d548b7d7d942f316c5295f6d849b74214f5
Author: Aldy Hernandez 
Date:   Thu Sep 23 10:59:24 2021 +0200

Avoid invalid loop transformations in jump threading registry.

My upcoming improvements to the forward jump threader

Re: Can gcc.dg/torture/pr67828.c be an infinite loop?

2021-09-24 Thread Aldy Hernandez via Gcc




On 9/24/21 1:37 PM, David Brown wrote:

On 24/09/2021 10:03, Aldy Hernandez via Gcc wrote:

Hi folks.

My upcoming threading improvements turn the test below into an infinite
runtime loop:

int a, b;
short c;

int
main ()
{
   int j, d = 1;
   for (; c >= 0; c++)
     {
BODY:
   a = d;
   d = 0;
   if (b)
 {
   xprintf (0);
   if (j)
     xprintf (0);
 }
     }
   xprintf (d);
   exit (0);
}

On the false edge out of if(b) we thread directly to BODY, eliding the
loop conditional, because we know that c>=0 because it could never
overflow.

Since B is globally initialized to 0, this has the effect of turning the
test into an infinite loop.

Is this correct, or did I miss something?
Aldy




I am wondering about the claim that you can use "b" being 0 to optimise
the conditional.  If the compiler knows that this is the complete
program, that is fair enough.  But since "b" is not static, another
compilation unit could modify "b" before "main" is called.  (In C, it is
legal for another part of the code to call main() - perhaps the
implementation of xprintf does so.  And in C++, a global constructor
could change "b" before main() starts.)


In this case, it was on a thread where we knew we came through the c>=0 
conditional and we hadn't left the function.


Be that as it may, this was something else entirely.  The problem was a 
typo in the path solver that was causing it to use an incorrect range, 
which was then causing the elision.  It had nothing to do with promotion 
rules.  My bad.


Aldy



Re: Can gcc.dg/torture/pr67828.c be an infinite loop?

2021-09-24 Thread Aldy Hernandez via Gcc

On 9/24/21 11:38 AM, Andrew Pinski wrote:

On Fri, Sep 24, 2021 at 2:35 AM Aldy Hernandez  wrote:




On 9/24/21 11:29 AM, Andrew Pinski wrote:

On Fri, Sep 24, 2021 at 1:05 AM Aldy Hernandez via Gcc  wrote:


Hi folks.

My upcoming threading improvements turn the test below into an infinite
runtime loop:

int a, b;
short c;

int
main ()
{
 int j, d = 1;
 for (; c >= 0; c++)
   {
BODY:
 a = d;
 d = 0;
 if (b)
  {
xprintf (0);
if (j)
  xprintf (0);
  }
   }
 xprintf (d);
 exit (0);
}

On the false edge out of if(b) we thread directly to BODY, eliding the
loop conditional, because we know that c>=0 because it could never overflow.


Huh about c>=0 being always true? the expression, "c++" is really c=
(short)(((int)c)+1).  So it will definitely wrap over when c is
SHRT_MAX.


I see.

Is this only for C++ or does it affect C as well?


This is standard C code; promotion rules; that is if a type is less
than int, it will be promoted to int if all of the values fit into
int; otherwise it will be promoted to unsigned int.


Well, *that* was embarrassing.

Thanks, and sorry for the noise.
Aldy
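
For the record, a minimal program demonstrating the promotion rule
described above (assuming a typical target with a 16-bit short):

#include <climits>
#include <cstdio>

int
main ()
{
  short c = SHRT_MAX;
  // "c++" behaves like c = (short)((int)c + 1): the addition happens
  // in int and never overflows; only the implementation-defined
  // truncation back to short wraps the value.
  c++;
  std::printf ("%d\n", (int) c);  // -32768 on typical targets
}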



Re: Can gcc.dg/torture/pr67828.c be an infinite loop?

2021-09-24 Thread Aldy Hernandez via Gcc




On 9/24/21 11:29 AM, Andrew Pinski wrote:

On Fri, Sep 24, 2021 at 1:05 AM Aldy Hernandez via Gcc  wrote:


Hi folks.

My upcoming threading improvements turn the test below into an infinite
runtime loop:

int a, b;
short c;

int
main ()
{
int j, d = 1;
for (; c >= 0; c++)
  {
BODY:
a = d;
d = 0;
if (b)
 {
   xprintf (0);
   if (j)
 xprintf (0);
 }
  }
xprintf (d);
exit (0);
}

On the false edge out of if(b) we thread directly to BODY, eliding the
loop conditional, because we know that c>=0 because it could never overflow.


Huh about c>=0 being always true? the expression, "c++" is really c=
(short)(((int)c)+1).  So it will definitely wrap over when c is
SHRT_MAX.


I see.

Is this only for C++ or does it affect C as well?

Aldy



Re: Can gcc.dg/torture/pr67828.c be an infinite loop?

2021-09-24 Thread Aldy Hernandez via Gcc




On 9/24/21 10:08 AM, Richard Biener wrote:

On Fri, Sep 24, 2021 at 10:04 AM Aldy Hernandez via Gcc  wrote:


Hi folks.

My upcoming threading improvements turn the test below into an infinite
runtime loop:

int a, b;
short c;

int
main ()
{
int j, d = 1;
for (; c >= 0; c++)
  {
BODY:
a = d;
d = 0;
if (b)
 {
   xprintf (0);
   if (j)
 xprintf (0);
 }
  }
xprintf (d);
exit (0);
}

On the false edge out of if(b) we thread directly to BODY, eliding the
loop conditional, because we know that c>=0 because it could never overflow.

Since B is globally initialized to 0, this has the effect of turning the
test into an infinite loop.

Is this correct, or did I miss something?


Yes, 'c' will wrap to negative SHORT_MIN and terminate the loop via
the c>=0 test.


Huh, so SHORT_MAX + 1 = SHORT_MIN?  I thought that was an overflow, and 
therefore undefined.


Aldy



Mind c++ is really (short)(((int)c)++) and signed integer truncation
is implementation
defined.

Richard.


Aldy







Can gcc.dg/torture/pr67828.c be an infinite loop?

2021-09-24 Thread Aldy Hernandez via Gcc

Hi folks.

My upcoming threading improvements turn the test below into an infinite 
runtime loop:


int a, b;
short c;

int
main ()
{
  int j, d = 1;
  for (; c >= 0; c++)
    {
BODY:
      a = d;
      d = 0;
      if (b)
        {
          xprintf (0);
          if (j)
            xprintf (0);
        }
    }
  xprintf (d);
  exit (0);
}

On the false edge out of if(b) we thread directly to BODY, eliding the 
loop conditional, because we know that c>=0 because it could never overflow.


Since B is globally initialized to 0, this has the effect of turning the 
test into an infinite loop.


Is this correct, or did I miss something?
Aldy



Re: replacing the VRP threader

2021-09-22 Thread Aldy Hernandez via Gcc
On Tue, Sep 21, 2021 at 12:48 PM Aldy Hernandez  wrote:

> SUMMARY
> ===
>
> The hybrid threader gets 14.5% more jump threads than the legacy code,
> but most of these threads are at the expense of other threading passes.
> The more it gets, the less DOM and the backward threader get.  That
> being said, there is a total improvement of 0.87% jump threads in the
> total compilation.

Well, it turns out we're considerably better than reported.

Andrew just found a one-line change in the path solver that improves
our VRP threading goodness to 18.5% and our overall jump threading
gains to 1.28%.

Yay!
Aldy



replacing the VRP threader

2021-09-21 Thread Aldy Hernandez via Gcc
In order to replace VRP, we need an alternative for the jump threader 
embedded within it.  Currently, this threader is a forward threader 
client that uses ASSERT_EXPRs and the avail/const framework to resolve 
statements along a path.


As I have mentioned in the past weeks, I am proposing a hybrid 
replacement, where we still use the forward threader, but with the path 
solver used by the backward threader.  This provides an incremental path 
to remove the forward threader with minimal disruption.


As a reminder, the reason we can't just use the backward threader for a 
VRP-threader replacement right now is because the forward and backward 
threaders have different cost models and block copiers that have not 
been merged.  I'd like to address these differences for the next 
release. This, coupled with support for floats in ranger, would allow us 
to nuke the forward threader entirely.  But for now, a more gradual 
approach is needed.


I have patches to the path solver that address all the shortcomings it 
had in my initial rewrite.  That is, the ability to use the ranger and 
the relation oracle to fully resolve ranges and relations not known 
within the path.  With this enhanced solver, we are able to get 
everything the forward threader can get (both VRP and DOM clients, with 
the exception of floats for DOM).  This path solver can then be used 
either with the backward threader (disabled by default), or with with 
the forward threader in the hybrid approach I propose for VRP.


I'd like to discuss how I tested this, what is missing, and alternatives 
on going forward.


SUMMARY
===

The hybrid threader gets 14.5% more jump threads than the legacy code, 
but most of these threads are at the expense of other threading passes. 
The more it gets, the less DOM and the backward threader get.  That 
being said, there is a total improvement of 0.87% jump threads in the 
total compilation.


This comes with no runtime penalty.  In our callgrind testing harness 
derived from the .ii files from a bootstrap, I measured a 0.62% 
performance drop, well within the noise of the tool.


However, there is a 1.5% performance penalty for splitting the VRP 
threader from outside of VRP (for either hybrid or legacy).  I would 
prefer divorcing embedded jump threading passes from their carriers, but 
if others disagree, we could piggy back on the ranger used in the 
upcoming VRP replacement (evrp has a ranger we could share).


TESTING
===

The presence of ASSERT_EXPRs made it difficult to do a clean comparison, 
as ranger obviously doesn't use these.  What I did was move the VRP 
threader into its own pass (regenerating ASSERT_EXPRs and its 
environment), and then run my hybrid threader iteratively until no more 
changes (excluding ping-pongs).  Then I ran the legacy code, and 
anything it could find, was something worth investigating.


BTW, the reason for the iterative approach, is because any threading 
pass creates opportunities for subsequent threading passes.


MISSING CASES
=

There were a few things we missed, which can be summarized in broad 
categories:


a) A few missing range-op entries for things like RROTATE_EXPR.  These 
are trivial to address.


b) Some paths which the solver could clearly solve, but it never got the 
chance, because of limitations in the path discovery code in the forward 
threader.


I fixed a few of these (upcoming patches), but I mostly avoided fixing 
the core forward threader code, as it would incur too much work for very 
little return.  Besides, my plan is to eventually nuke it.


One example are paths whose first block ends in a conditional, but whose 
second is empty.  These are usually empty loop pre-headers, which are 
not empty in legacy mode because the ASSERT_EXPR mechanism has put 
asserts in there.  However, since the ranger uses the pristine IL, these 
blocks are empty, which throws off the threader.  In practice, this 
didn't matter, because since the CFG had been cleaned up, these empty 
blocks were predominantly loop pre-headers which the threader was trying 
to thread through, crossing loop boundaries in the process.  As has been 
discussed earlier, we shouldn't be threading through this.


c) The path solver returns UNDEFINED for unreachable paths, so we avoid 
threading them altogether.  However, the legacy code is perfectly happy 
to give answers and thread through the paths.


d) I saw an odd missed propagation/recalculation here and there, where 
ranger returned varying, when it could've done better.  This was very rare.


e) Finally, conditionals when either operand is a MEM_REF are not 
handled.  Interestingly, this was uncommon as most everything had been 
simplified to an SSA.


As the numbers show, the above is probably noise.  The only thing worth 
addressing may be the MEM_REF business.  If this becomes a problem, it 
is precisely what the pointer equivalence tracking we use in
evrp was designed for.  It could be easily adapted.


PATH F

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-10 Thread Aldy Hernandez via Gcc




On 9/10/21 6:21 PM, Jeff Law wrote:



On 9/10/2021 10:05 AM, Aldy Hernandez wrote:



On 9/10/21 5:43 PM, Jeff Law wrote:



On 9/9/2021 3:21 AM, Aldy Hernandez wrote:




   /* If this path does not thread through the loop latch, then we are
  using the FSM threader to find old style jump threads. This
  is good, except the FSM threader does not re-use an existing
  threading path to reduce code duplication.

  So for that case, drastically reduce the number of statements
  we are allowed to copy.  */


*blink*

Woah.  The backward threader has been using FSM threads 
indiscriminately as far as I can remember.  I wonder what would 
break if we "fixed it".
?!?  I'm not sure what you're suggesting here.  If you s/FSM 
threader/backwards threader/ in the comment does it make more sense? 
The term FSM really should largely have been dropped as the backwards 
threader was improved to handle more cases.


back_threader_registry::register_path() uses EDGE_FSM_THREAD as the 
thread type to register threads.  I was wondering if it should have 
been some combination of EDGE_START_JUMP_THREAD / 
EDGE_*_COPY_SRC_BLOCK, etc.  I (purposely) know nothing about the 
underlying threading types ;-). But if the backwards threader has been 
improved, then perhaps we should just remove the confusing FSM 
references.
No we shouldn't change it to any of the other types. EDGE_FSM_THREAD 
means a thread found by the backwards threader and it's the key used to 
determine which of the two CFG updating mechanisms should be used, the 
generic copier in the case of EDGE_FSM_THREAD.



Changing the name, yes, absolutely.  I probably should have done that 
when the original FSM threader was tweaked to handle generic threading.


I'll put that on my list.



As you've probably heard me mention before, all the EDGE_FSM_THREAD 
stuff in the registry really could be pulled out.   The registry's 
purpose is to deal with the two stage nature of jump threading in DOM 
(find threads, then optimize them later).  I don't think any of the 
backwards threading bits need that two stage handling.


Yeah yeah.  But you said that a year ago, and all I heard was *mumble 
mumble/complicated things*.  That was before I had much exposure to the 
code.  Now I feel a lot more comfortable ;-).


I'll also put this on my list, but it may not get done this release, 
cause I'm running against the clock with the VRP/threader replacement, 
which is what's keeping us back from replacing VRP with an evrp instance 
right now :).




My current thinking is that replacing the forward VRP threader with a 
hybrid one is a gentler approach to the longer term goal of replacing 
the forward threader altogether.  However, all the work I've been 
doing could go either way-- we could try the forward/VRP replacement 
or a hybrid approach.  It will all use the path solver underneath.
And that's probably a reasonable intermediate step on the way towards 
removing the VRP threading.




My main problem with replacing the forward/VRP with a backward client 
is that the cost models are so different that it was difficult to 
compare how we fared.  I constantly ran into threads the solver could 
handle just fine, but profitable_path_p was holding it back.

Yea.  Sorry about that tangle of insanity



FWIW, we get virtually everything the forward threader gets, minus a 
very few things.  At least when I plug in the solver to the 
DOM/forwarder threader, it can solve everything it can (minus noise 
and floats).
So once you plug those bits in, we don't have to carry around the 
avail/copies tables for the threader anymore, right?  That's a nice 
cleanup in and of itself.


Correct.  For the VRP/hybrid approach I'm working on, there are no 
copies/avails.  The solver can do everything they did.  After all, it's 
an easier target, since VRP threading only happens on ints and without 
the IL changing constantly.






If you prefer a backward threader instance to replace the VRP/forward 
threader, I'm game.  It's just harder to compare. Either way (backward 
threader or a hybrid forward+solver) uses the same underlying solver 
which is solid.
I think we can go hybrid, then look at the next step, which could well 
be bringing some consistency into the costing models.




c) DOM changes the IL as it goes.  Though we could conceivably divorce
it and do the threading after DOM is done.
The only reason threading runs in parallel with DOM is so that it can 
use the context sensitive equivalences.  With the infrastructure 
you're building, there's a reasonable chance we can move to a model 
where we run DOM (and in the long term a simpler DOM) and threading 
as distinct, independent passes.


Andrew mumbled something about replacing all of DOM eventually :-). 
Well, except that value-numbering business I bet.
Essentially a realization of Bodik's work in GCC.   The nugget in there 
is it's a path sensitive optimizer.  That's kindof what I've envisioned 
DOM turning into.


1. We 

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-10 Thread Aldy Hernandez via Gcc




On 9/10/21 5:51 PM, Jeff Law wrote:



On 9/9/2021 4:15 AM, Richard Biener wrote:



b) Even though we can seemingly fold everything DOM/threader does, in
order to replace it with a backward threader instance we'd have to merge
the cost/profitability code scattered throughout the forward threader,
as well as the EDGE_FSM* / EDGE_COPY* business.

c) DOM changes the IL as it goes.  Though we could conceivably divorce
it and do the threading after DOM is done.
Yeah, it does not actually process/simplify the blocks copied by 
threading.
In fact I think it only registers jump threading opportunities during 
the DOM
walk and commits them only later.  But it of course uses its avail / 
copies

stack to find them - that part you cannot easily divorce.
Well, divorcing from using the context sensitive avail/copies is part of 
what Aldy & Andrew have been working on.  All indications I've seen are 
they're on track to be able to do that.


And yes, it only registers the threads and waits until after DOM is done 
to transform the CFG.  That in and of itself introduces all kinds of 
complexity.  If we can get to the point where we don't need the context 
sensitive avail/copies, then we've got a real shot at untangling DOM and 
threading which would be a huge maintainability win in my mind.




DOM is also yet another value-numbering framework - one could think
of stripping it down from that side, keeping the threading bits only
(but you'd still have the avail / copies bits).
Yes.  I think you and I touched on this a while back.    At a high level 
I'd prefer to have FRE rather than DOM doing the bulk of the redundant 
expression elimination.  The big blocker there was the tight integration 
of DOM and threading.  But if Aldy can untangle that we can then 
evaluate replacing DOM with FRE.


Once ranger does floats, I can't think of anything the forward threader 
could get that the backward threader couldn't.







That said, it has one nice property it can leverage due to its incredibly
simple memory redundancy handling, in that it usually performs way fewer
alias queries than FRE (even when you run the latter in non-iterative 
mode).
DOM as an infrastructure for optimization is probably reaching the end 
of its useful life.  FRE has a lot more going for it.




But the same way DOM can register jump threading opportunities FRE
could do as well.
I'd advise against that and instead look towards a model where no pass 
has integrated jump threading and the only jump threading module we have 
is the backwards threader.


Yes, please.  We need to separate jump threading from all passes.  One 
thing, and do it well.


Aldy



Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-10 Thread Aldy Hernandez via Gcc




On 9/10/21 5:43 PM, Jeff Law wrote:



On 9/9/2021 3:21 AM, Aldy Hernandez wrote:




   /* If this path does not thread through the loop latch, then we are
  using the FSM threader to find old style jump threads. This
  is good, except the FSM threader does not re-use an existing
  threading path to reduce code duplication.

  So for that case, drastically reduce the number of statements
  we are allowed to copy.  */


*blink*

Woah.  The backward threader has been using FSM threads 
indiscriminately as far as I can remember.  I wonder what would break 
if we "fixed it".
?!?  I'm not sure what you're suggesting here.  If you s/FSM 
threader/backwards threader/ in the comment does it make more sense? The 
term FSM really should largely have been dropped as the backwards 
threader was improved to handle more cases.


back_threader_registry::register_path() uses EDGE_FSM_THREAD as the 
thread type to register threads.  I was wondering if it should have been 
some combination of EDGE_START_JUMP_THREAD / EDGE_*_COPY_SRC_BLOCK, etc. 
 I (purposely) know nothing about the underlying threading types ;-). 
But if the backwards threader has been improved, then perhaps we should 
just remove the confusing FSM references.











so these cases should use the "old style" validity/costing metrics 
and thus

classify threading opportunities in a different way?


Jeff, do you have any insight here?

This is precisely what you're cleaning up.






I think today "backwards" vs. "forwards" only refers to the way we find
threading opportunities.


Yes, it's a mess.

I ran some experiments a while back, and my current work on the 
enhanced solver/threader can fold virtually everything the 
DOM/threader gets (even with its use of const_and_copies, avail_exprs, 
and evrp_range_analyzer), while getting 5% more DOM threads and 1% 
more overall threads.  That is, I've been testing if the path solver 
can solve everything the DOM threader needs (the hybrid approach I 
mentioned).


Unfortunately, replacing the forward threader right now is not 
feasible for a few reasons:
Right.  But I thought the short term goal was to replace/remove the 
forward threading from VRP.   Dropping from DOM is going to be tougher.


My current thinking is that replacing the forward VRP threader with a 
hybrid one is a gentler approach to the longer term goal of replacing 
the forward threader altogether.  However, all the work I've been doing 
could go either way-- we could try the forward/VRP replacement or a 
hybrid approach.  It will all use the path solver underneath.


My main problem with replacing the forward/VRP with a backward client is 
that the cost models are so different that it was difficult to compare 
how we fared.  I constantly ran into threads the solver could handle 
just fine, but profitable_path_p was holding it back.


FWIW, we get virtually everything the forward threader gets, minus a 
very few things.  At least when I plug the solver into the 
DOM/forward threader, it can solve everything the latter can (minus noise and 
floats).


If you prefer a backward threader instance to replace the VRP/forward 
threader, I'm game.  It's just harder to compare.  Either way (backward 
threader or a hybrid forward+solver) uses the same underlying solver 
which is solid.


a) The const_and_copies/avail_exprs relation framework can do floats, 
and that's next year's ranger work.
Right.  I'd actually run into this as well when I wanted to drop all the 
range bits out of DOM and rely exclusively on EVRP.   It'd still be a 
step forward to rip out the EVRP engine from DOM and simplify all the 
code that derives one equivalence from another so that it's only working 
on FP values.


Sure.





b) Even though we can seemingly fold everything DOM/threader does, in 
order to replace it with a backward threader instance we'd have to 
merge the cost/profitability code scattered throughout the forward 
threader, as well as the EDGE_FSM* / EDGE_COPY* business.
Right.  This is a prerequisite.  Though some of the costing will need to 
be conditional on the threader being used.  Refer back to the discussion 
around how the forward threader can commonize thread paths that lead to 
the same destination while the backwards threader cannot.


Yup, yup.





c) DOM changes the IL as it goes.  Though we could conceivably divorce 
the two and do the threading after DOM is done.
The only reason threading runs in parallel with DOM is so that it can 
use the context sensitive equivalences.  With the infrastructure you're 
building, there's a reasonable chance we can move to a model where we 
run DOM (and in the long term a simpler DOM) and threading as distinct, 
independent passes.


Andrew mumbled something about replacing all of DOM eventually :-). 
Well, except that value-numbering business I bet.


Aldy



Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-10 Thread Aldy Hernandez via Gcc
I think things are clear enough to propose a patch.  Thanks for 
everyone's input.


I have added some comments and tweaked Michael's patch to include, in 
the check, the final BB we're threading to.  Without this last 
check, we fail in tree-ssa/cunroll-15.c because the latch becomes 
polluted with PHI nodes.
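
For the archive, the amended check might look roughly like this (an
untested sketch pieced together from the discussion and the ChangeLog
below; taken_edge and the exact predicate spelling are assumptions, not
the posted patch):

  /* Sketch only: reject paths that thread through the latch, or whose
     final destination is the latch, until loop optimizations are done.  */
  if ((threaded_through_latch
       || (taken_edge && taken_edge->dest == loop->latch))
      && !loop_optimizers_done_p (cfun))
    {
      if (dump_file && (dump_flags & TDF_DETAILS))
	fprintf (dump_file,
		 "  FAIL: path threads through the latch before loop "
		 "optimizations have run\n");
      return false;
    }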


OK for trunk?
Aldy

commit e735a2cd01485773635bd66c97af21059992d5a7 (HEAD -> 
pending/global-ranges)

Author: Aldy Hernandez 
Date:   Thu Sep 9 20:30:28 2021 +0200

Disable threading through latches until after loop optimizations.

The motivation for this patch was enabling the use of global ranges in
the path solver, but this caused certain properties of loops to be
destroyed, which made subsequent loop optimizations fail.
Consequently, this patch's main goal is to disable jump threading
involving the latch until after loop optimizations have run.

I have added a bit in cfun to indicate when the "loopdone" optimization
has completed.  This helps the path threader determine when it's safe to
thread through loop structures.  I can adapt the patch if there is an
alternate way of determining this.

As can be seen in the test adjustments, we mostly shift the threading
from the early threaders (ethread, thread[12]) to the late threaders
(thread[34]).  I have nuked some of the early notes in the testcases
that came as part of the jump threader rewrite.  They're mostly noise
now.

Note that we could probably relax some other restrictions in
profitable_path_p when loop optimizations have completed, but it would
require more testing, and I'm hesitant to touch more things than needed
at this point.  I have added a reminder to the function to keep this
in mind.

Finally, perhaps as a follow-up, we should apply the same restrictions
to the forward threader.  At some point I'd like to combine the cost
models.


Tested on x86-64 Linux.

p.s. There is a thorough discussion involving the limitations of jump
threading involving loops here:

https://gcc.gnu.org/pipermail/gcc/2021-September/237247.html
From e735a2cd01485773635bd66c97af21059992d5a7 Mon Sep 17 00:00:00 2001
From: Aldy Hernandez 
Date: Thu, 9 Sep 2021 20:30:28 +0200
Subject: [PATCH] Disable threading through latches until after loop
 optimizations.

The motivation for this patch was enabling the use of global ranges in
the path solver, but this caused certain properties of loops to be
destroyed, which made subsequent loop optimizations fail.
Consequently, this patch's main goal is to disable jump threading
involving the latch until after loop optimizations have run.

I have added a bit in cfun to indicate when the "loopdone" optimization
has completed.  This helps the path threader determine when it's safe to
thread through loop structures.  I can adapt the patch if there is an
alternate way of determining this.

As can be seen in the test adjustments, we mostly shift the threading
from the early threaders (ethread, thread[12]) to the late threaders
(thread[34]).  I have nuked some of the early notes in the testcases
that came as part of the jump threader rewrite.  They're mostly noise
now.

Note that we could probably relax some other restrictions in
profitable_path_p when loop optimizations have completed, but it would
require more testing, and I'm hesitant to touch more things than needed
at this point.  I have added a reminder to the function to keep this
in mind.

Finally, perhaps as a follow-up, we should apply the same restrictions to
the forward threader.  At some point I'd like to combine the cost models.

Tested on x86-64 Linux.

p.s. There is a thorough discussion involving the limitations of jump
threading involving loops here:

	https://gcc.gnu.org/pipermail/gcc/2021-September/237247.html

gcc/ChangeLog:

	* function.h (struct function): Add loop_optimizers_done.
	(set_loop_optimizers_done): New.
	(loop_optimizers_done_p): New.
	* gimple-range-path.cc (path_range_query::internal_range_of_expr):
	Intersect with global range.
	* tree-ssa-loop.c (tree_ssa_loop_done): Call set_loop_optimizers_done.
	* tree-ssa-threadbackward.c
	(back_threader_profitability::profitable_path_p): Disable
	threading through latches until after loop optimizations have run.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/ssa-dom-thread-2b.c: Adjust for disabling of
	threading through latches.
	* gcc.dg/tree-ssa/ssa-dom-thread-6.c: Same.
	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Same.

Co-authored-by: Michael Matz 
---
 gcc/function.h                              | 15
 gcc/gimple-range-path.cc                    |  3 ++
 .../gcc.dg/tree-ssa/ssa-dom-thread-2b.c     |  4 +-
 .../gcc.dg/tree-ssa/ssa-dom-thread-6.c      | 37 +--
 .../gcc.dg/tree-ssa/ssa-dom-thread-7.c      | 17 +
 gcc/tree-ssa-loop.c                         |  1 +
 gcc/tree-ssa-threadbackward.c               | 28 +-
 7 files changed
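
The function.h hunk itself is elided above; judging from the ChangeLog
entries, the interface is presumably along these lines (a sketch of the
idea, not the actual hunk):

  /* In struct function: nonzero once the "loopdone" pass has run.  */
  unsigned int loop_optimizers_done : 1;

  static inline void
  set_loop_optimizers_done (struct function *fn)
  {
    fn->loop_optimizers_done = 1;
  }

  static inline bool
  loop_optimizers_done_p (const struct function *fn)
  {
    return fn->loop_optimizers_done;
  }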

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Aldy Hernandez via Gcc




On 9/9/21 4:44 PM, Michael Matz wrote:

Hello,

On Thu, 9 Sep 2021, Aldy Hernandez wrote:


Here there's no simple latch block to start with (the backedge comes
directly out of the loop exit block).  So my suggested improvement
(testing if the latch was empty and only then reject the thread), would
solve this.


Well, there's the thing.  Loop discovery is marking BB5 as the latch, so
it's not getting threaded:


Yes, it's a latch, but not an empty one.  So the thread would make it just
even more non-empty, which might not be a problem anymore.  So amending my
patch somewhere with a strategic

   && empty_block_p (latch)

and only then rejecting the thread should make this testcase work again.


Sweet.



(There's still a catch, though: if this non-empty latch, which is also the
exit test block, is threaded through and is followed by actual code, then
that code will be inserted onto the back edge, not into the latch block
before the exit test, and so also create a (new) non-empty latch.  That
might or might not create problems downstream, but not as many as
transforming an empty into a non-empty latch would do; but it could
create for instance a second back edge (not in this testcase) and
suchlike)


BTW, I'm not sure your check for the non-last position makes a difference:


diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c
index 449232c7715..528a753b886 100644
--- a/gcc/tree-ssa-threadbackward.c
+++ b/gcc/tree-ssa-threadbackward.c
-   threaded_through_latch = true;
+   {
+ threaded_through_latch = true;
+ if (j != 0)
+   latch_within_path = true;
+ if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file, " (latch)");
+   }
  }


If the last position being considered is a simple latch, it only has a simple
outgoing jump.  There's no need to thread that.  You need a block with >= 2
succ edges to thread anything.


So, you are saying that any candidate thread path wouldn't have the latch
in the last position if it were just an empty forwarder?  I simply wasn't
sure about that, so was conservative and only wanted to reject things I
knew were positively bad (namely code in the path following the latch 
that is in danger of being moved into the latch).


Correct.  In the backwards threader, we don't start the party if the 
last block has < 2 succs:


  FOR_EACH_BB_FN (bb, fun)
    {
      if (EDGE_COUNT (bb->succs) > 1)
	threader.maybe_thread_block (bb);
    }

Thanks.
Aldy



Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Aldy Hernandez via Gcc




On 9/9/21 2:52 PM, Michael Matz wrote:

Hello,

On Thu, 9 Sep 2021, Aldy Hernandez wrote:


The ldist-22 regression is interesting though:

void foo ()
{
   int i;

   <bb 2> :
   goto <bb 6>; [INV]

   <bb 3> :
   a[i_1] = 0;
   if (i_1 > 100)
     goto <bb 4>; [INV]
   else
     goto <bb 5>; [INV]

   <bb 4> :
   b[i_1] = i_1;

   <bb 5> :
   i_8 = i_1 + 1;

   <bb 6> :
   # i_1 = PHI <0(2), i_8(5)>
   if (i_1 <= 1023)
     goto <bb 3>; [INV]
   else
     goto <bb 7>; [INV]


Here there's no simple latch block to start with (the backedge comes
directly out of the loop exit block).  So my suggested improvement
(testing if the latch was empty and only then reject the thread), would
solve this.


Well, there's the thing.  Loop discovery is marking BB5 as the latch, so 
it's not getting threaded:


Checking profitability of path (backwards):  bb:6 (2 insns) bb:5 (latch)




Would it be crazy to suggest that we disable threading through latches
altogether,


I think it wouldn't be crazy, but we can do a bit better as suggested
above (only reject empty latches, and reject it only for the threaders
coming before the loop optims).


BTW, I'm not sure your check for the non-last position makes a difference:


diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c
index 449232c7715..528a753b886 100644
--- a/gcc/tree-ssa-threadbackward.c
+++ b/gcc/tree-ssa-threadbackward.c
@@ -600,6 +600,7 @@ back_threader_profitability::profitable_path_p (const 
vec<basic_block> &m_path,
   loop_p loop = m_path[0]->loop_father;
   bool path_crosses_loops = false;
   bool threaded_through_latch = false;
+  bool latch_within_path = false;
   bool multiway_branch_in_path = false;
   bool threaded_multiway_branch = false;
   bool contains_hot_bb = false;
@@ -725,7 +726,13 @@ back_threader_profitability::profitable_path_p (const 
vec<basic_block> &m_path,
 the last entry in the array when determining if we thread
 through the loop latch.  */
   if (loop->latch == bb)
-   threaded_through_latch = true;
+   {
+ threaded_through_latch = true;
+ if (j != 0)
+   latch_within_path = true;
+ if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file, " (latch)");
+   }
 }


If the last position being considered is a simple latch, it only has a 
simple outgoing jump.  There's no need to thread that.  You need a block 
with >= 2 succ edges to thread anything.
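
The hunk above only sets latch_within_path; its use is not quoted.
Presumably the flag would reject the path along these lines -- a sketch
based on Michael's description, not the actual patch:

  /* Code following the latch in the path would be copied onto the
     back edge, so refuse such paths.  */
  if (latch_within_path)
    {
      if (dump_file && (dump_flags & TDF_DETAILS))
	fprintf (dump_file, "  FAIL: latch not at the end of the path\n");
      return false;
    }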


Aldy



Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Aldy Hernandez via Gcc

On 9/9/21 12:15 PM, Richard Biener wrote:

On Thu, Sep 9, 2021 at 11:21 AM Aldy Hernandez  wrote:




On 9/9/21 10:58 AM, Richard Biener wrote:



I ran some experiments a while back, and my current work on the enhanced
solver/threader can fold virtually everything the DOM/threader gets
(even with its use of const_and_copies, avail_exprs, and
evrp_range_analyzer), while getting 5% more DOM threads and 1% more
overall threads.  That is, I've been testing if the path solver can
solve everything the DOM threader needs (the hybrid approach I mentioned).

Unfortunately, replacing the forward threader right now is not feasible
for a few reasons:

a) The const_and_copies/avail_exprs relation framework can do floats,
and that's next year's ranger work.


Hmm, it sounds like relying fully on ranger is a bit premature then.


For floats, most definitely.  Ranger doesn't do them.  They're for the 
next release.  However, const_and_copies/avail_exprs for integers we can 
do just fine, that's why evrp/VRP are much easier targets.





b) Even though we can seemingly fold everything DOM/threader does, in
order to replace it with a backward threader instance we'd have to merge
the cost/profitability code scattered throughout the forward threader,
as well as the EDGE_FSM* / EDGE_COPY* business.

c) DOM changes the IL as it goes.  Though we could conceivably divorce
the two and do the threading after DOM is done.


Yeah, it does not actually process/simplify the blocks copied by threading.
In fact I think it only registers jump threading opportunities during the DOM
walk and commits them only later.  But it of course uses its avail / copies
stack to find them - that part you cannot easily divorce.


Yes, all the threaders register paths ahead of time and only do the 
actual CFG shuffling at the end with thread_through_all_blocks().
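
In rough outline, the register-then-commit flow common to all the
threaders looks like this (find_thread_path and profitable_p are
illustrative stand-ins; register_jump_thread and
thread_through_all_blocks are the real entry points):

  /* Walk the IL recording candidate paths; the CFG is not touched.  */
  FOR_EACH_BB_FN (bb, fun)
    if (vec<jump_thread_edge *> *path = find_thread_path (bb))
      if (profitable_p (path))
	register_jump_thread (path);

  /* Only here do the recorded threads actually restructure the CFG.  */
  thread_through_all_blocks (/*may_peel_loop_headers=*/true);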




DOM is also yet another value-numbering framework - one could think
of stripping it down from that side, keeping the threading bits only
(but you'd still have the avail / copies bits).


Ughh, yeah.  I've seen some random missed cases because of the 
value-numbering it does :-/.  It's very frustrating that we have various 
ad-hoc VN methods.




That said, it has one nice property it can leverage due to its incredibly
simple memory redundancy handling, in that it usually performs way fewer
alias queries than FRE (even when you run the latter in non-iterative mode).

But the same way DOM can register jump threading opportunities FRE
could do as well.


My plan for DOM/threading for this release is to replace its evrp 
analyzer with a path solver (path_range_query).  We can still use the 
avail/copies, and everything else should just work.  I'm hand-waving on 
a few missed cases due to the IL changing, but last time I checked we got 
way more in return.  Andrew even thought we should get most things even 
in the presence of changing IL, but I haven't distilled testcases for him.


For VRP/threading, the same hybrid thing, but replacing the 
ASSERT_EXPRs, and the avail/copies with a path solver.  There are no 
floats here.  Things are looking good in my tests.
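
Concretely, the hybrid clients would use the path solver roughly like
this (a sketch against the interface being posted at the time; names
and signatures here are assumptions, and register_path stands in for
whatever registration hook the client uses):

  /* PATH is the candidate block sequence, IMPORTS the SSA names that
     are relevant on entry to it.  */
  path_range_query query (/*resolve=*/true);
  query.compute_ranges (path, imports);

  int_range_max r;
  if (query.range_of_stmt (r, exit_cond) && r.singleton_p ())
    /* The conditional folds to a constant along PATH, so it can be
       registered as a jump thread.  */
    register_path (path);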


Once we do floats, if we could merge the EDGE_*_THREAD and the 
cost/profit bits between both threaders, we should be able to replace 
the entire forward threader with a backward client.  Hopefully I don't 
run out of steam by then ;-).


Thanks for your insight on DOM.  It's very helpful.

Aldy



Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Aldy Hernandez via Gcc




On 9/9/21 10:58 AM, Richard Biener wrote:

On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez  wrote:




On 9/9/21 9:45 AM, Richard Biener wrote:

On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez  wrote:




On 9/9/21 8:57 AM, Richard Biener wrote:

On Wed, Sep 8, 2021 at 8:13 PM Michael Matz  wrote:


Hello,

[lame answer to self]

On Wed, 8 Sep 2021, Michael Matz wrote:


The forward threader guards against this by simply disallowing
threadings that involve different loops.  As I see


The thread in question (5->9->3) is all within the same outer loop,
though. BTW, the backward threader also disallows threading across
different loops (see path_crosses_loops variable).

...

Maybe it's possible to not disable threading over latches altogether in
the backward threader (like it's tried now), but I haven't looked at the
specific situation here in depth, so take my view only as opinion from a
large distance :-)


I've now looked at the concrete situation.  So yeah, the whole path is in
the same loop, crosses the latch, _and there's code following the latch
on that path_.  (I.e. the latch isn't the last block in the path).  In
particular, after loop_optimizer_init() (before any threading) we have:

  <bb 3> [local count: 118111600]:
  # j_19 = PHI 
  sum_11 = c[j_19];
  if (n_10(D) > 0)
    goto <bb 8>; [89.00%]
  else
    goto <bb 5>; [11.00%]

  <bb 8> [local count: 105119324]:
...

  <bb 5> [local count: 118111600]:
  # sum_21 = PHI 
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
    goto <bb 9>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 9> [local count: 105119324]:
  goto <bb 3>; [100.00%]

With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
pre-header of inner loop, but more importantly something that's not at the
start of the outer loop.

Now, any thread that includes the backedge 9->3 _including_ its
destination (i.e. where the backedge isn't the last to-be-redirected edge)
necessarily duplicates all code from that destination onto the back edge.
Here it's the load from c[j] into sum_11.

The important part is the code is emitted onto the back edge,
conceptually; in reality it's simply included into the (new) latch block
(the duplicate of bb9, which is bb12 intermediately, then named bb7 after
cfg_cleanup).

That's what we can't have for some of our structural loop optimizers:
there must be no code executed after the exit test (e.g. in the latch
block).  (This requirement makes reasoning about which code is or isn't
executed completely for an iteration trivial; simply everything in the
body is always executed; e.g. loop interchange uses this to check that
there are no memory references after the exit test, because those would
then be only conditional and hence make loop interchange very awkward).
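
As an editorially added, simplified illustration of that requirement:

  /* Fine: the exit test is the last thing an iteration does, the
     latch is empty, and every statement executes on every iteration.  */
  for (i = 0; i < n; i++)
    a[i] = c[i];

  /* What the rotation conceptually produces: the load has moved onto
     the back edge, after the exit test, so it executes only on
     iterations that continue -- exactly the kind of conditional
     memory reference that breaks the structural optimizers.  */
  if (n > 0)
    {
      tmp = c[0];
      for (i = 0;;)
	{
	  a[i] = tmp;
	  if (++i >= n)
	    break;
	  tmp = c[i];
	}
    }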

Note that this situation can't be later rectified anymore: the duplicated
instructions (because they are memory refs) must remain after the exit
test.  Only by rerolling/unrotating the loop (i.e. noticing that the
memory refs on the loop-entry path and on the back edge are equivalent)
would that be possible, but that's something we aren't capable of.  Even
if we were that would simply just revert the whole work that the threader
did, so it's better to not even do that to start with.

I believe something like below would be appropriate, it disables threading
if the path contains a latch at the non-last position (due to being
backwards on the non-first position in the array).  I.e. it disables
rotating the loop if there's danger of polluting the back edge.  It might
be improved if the blocks following (preceding!) the latch are themselves
empty because then no code is duplicated.  It might also be improved if
the latch is already non-empty.  That code should probably only be active
before the loop optimizers, but currently the backward threader isn't
differentiating between before/after loop-optims.

I haven't tested this patch at all, except that it fixes the testcase :)


Lame comment at the current end of the thread - it's not threading through the


I don't know why y'all keep using the word "lame".  On the contrary,
these are incredibly useful explanations.  Thanks.


latch but threading through the loop header that's problematic, at least if the
end of the threading path ends within the loop (threading through the header
to the loop exit is fine).  Because in that situation you effectively created an
alternate loop entry.  Threading through the latch into the loop header is
fine but with simple latches that likely will never happen (if there are no
simple latches then the latch can contain the loop exit test).

See tree-ssa-threadupdate.c:thread_block_1

 e2 = path->last ()->e;
 if (!e2 || noloop_only)
   {
 /* If NOLOOP_ONLY is true, we only allow threading through the
header of a loop to exit edges.  */

 /* One case occurs when there was loop header buried in a jump
threading path that cr

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Aldy Hernandez via Gcc




On 9/9/21 9:45 AM, Richard Biener wrote:

On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez  wrote:




On 9/9/21 8:57 AM, Richard Biener wrote:

On Wed, Sep 8, 2021 at 8:13 PM Michael Matz  wrote:


Hello,

[lame answer to self]

On Wed, 8 Sep 2021, Michael Matz wrote:


The forward threader guards against this by simply disallowing
threadings that involve different loops.  As I see


The thread in question (5->9->3) is all within the same outer loop,
though. BTW, the backward threader also disallows threading across
different loops (see path_crosses_loops variable).

...

Maybe it's possible to not disable threading over latches altogether in
the backward threader (like it's tried now), but I haven't looked at the
specific situation here in depth, so take my view only as opinion from a
large distance :-)


I've now looked at the concrete situation.  So yeah, the whole path is in
the same loop, crosses the latch, _and there's code following the latch
on that path_.  (I.e. the latch isn't the last block in the path).  In
particular, after loop_optimizer_init() (before any threading) we have:

  <bb 3> [local count: 118111600]:
  # j_19 = PHI 
  sum_11 = c[j_19];
  if (n_10(D) > 0)
    goto <bb 8>; [89.00%]
  else
    goto <bb 5>; [11.00%]

  <bb 8> [local count: 105119324]:
...

  <bb 5> [local count: 118111600]:
  # sum_21 = PHI 
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
    goto <bb 9>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 9> [local count: 105119324]:
  goto <bb 3>; [100.00%]

With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
pre-header of inner loop, but more importantly something that's not at the
start of the outer loop.

Now, any thread that includes the backedge 9->3 _including_ its
destination (i.e. where the backedge isn't the last to-be-redirected edge)
necessarily duplicates all code from that destination onto the back edge.
Here it's the load from c[j] into sum_11.

The important part is the code is emitted onto the back edge,
conceptually; in reality it's simply included into the (new) latch block
(the duplicate of bb9, which is bb12 intermediately, then named bb7 after
cfg_cleanup).

That's what we can't have for some of our structural loop optimizers:
there must be no code executed after the exit test (e.g. in the latch
block).  (This requirement makes reasoning about which code is or isn't
executed completely for an iteration trivial; simply everything in the
body is always executed; e.g. loop interchange uses this to check that
there are no memory references after the exit test, because those would
then be only conditional and hence make loop interchange very awkward).

Note that this situation can't be later rectified anymore: the duplicated
instructions (because they are memory refs) must remain after the exit
test.  Only by rerolling/unrotating the loop (i.e. noticing that the
memory refs on the loop-entry path and on the back edge are equivalent)
would that be possible, but that's something we aren't capable of.  Even
if we were that would simply just revert the whole work that the threader
did, so it's better to not even do that to start with.

I believe something like below would be appropriate, it disables threading
if the path contains a latch at the non-last position (due to being
backwards on the non-first position in the array).  I.e. it disables
rotating the loop if there's danger of polluting the back edge.  It might
be improved if the blocks following (preceding!) the latch are themselves
empty because then no code is duplicated.  It might also be improved if
the latch is already non-empty.  That code should probably only be active
before the loop optimizers, but currently the backward threader isn't
differentiating between before/after loop-optims.

I haven't tested this patch at all, except that it fixes the testcase :)


Lame comment at the current end of the thread - it's not threading through the


I don't know why y'all keep using the word "lame".  On the contrary,
these are incredibly useful explanations.  Thanks.


latch but threading through the loop header that's problematic, at least if the
end of the threading path ends within the loop (threading through the header
to the loop exit is fine).  Because in that situation you effectively created an
alternate loop entry.  Threading through the latch into the loop header is
fine but with simple latches that likely will never happen (if there are no
simple latches then the latch can contain the loop exit test).

See tree-ssa-threadupdate.c:thread_block_1

e2 = path->last ()->e;
if (!e2 || noloop_only)
  {
/* If NOLOOP_ONLY is true, we only allow threading through the
   header of a loop to exit edges.  */

/* One case occurs when there was loop header buried in a jump
   threading path that crosses loop boundaries.  We do not try
   and thread this elsewhere, so just cancel the jump threading
 

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Aldy Hernandez via Gcc




On 9/8/21 8:13 PM, Michael Matz wrote:

Hello,

[lame answer to self]

On Wed, 8 Sep 2021, Michael Matz wrote:


The forward threader guards against this by simply disallowing
threadings that involve different loops.  As I see


The thread in question (5->9->3) is all within the same outer loop,
though. BTW, the backward threader also disallows threading across
different loops (see path_crosses_loops variable).

...

Maybe it's possible to not disable threading over latches altogether in
the backward threader (like it's tried now), but I haven't looked at the
specific situation here in depth, so take my view only as opinion from a
large distance :-)


I've now looked at the concrete situation.  So yeah, the whole path is in
the same loop, crosses the latch, _and there's code following the latch
on that path_.  (I.e. the latch isn't the last block in the path).  In
particular, after loop_optimizer_init() (before any threading) we have:

  <bb 3> [local count: 118111600]:
  # j_19 = PHI 
  sum_11 = c[j_19];
  if (n_10(D) > 0)
    goto <bb 8>; [89.00%]
  else
    goto <bb 5>; [11.00%]

  <bb 8> [local count: 105119324]:
...

  <bb 5> [local count: 118111600]:
  # sum_21 = PHI 
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
    goto <bb 9>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 9> [local count: 105119324]:
  goto <bb 3>; [100.00%]

With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
pre-header of inner loop, but more importantly something that's not at the
start of the outer loop.

Now, any thread that includes the backedge 9->3 _including_ its
destination (i.e. where the backedge isn't the last to-be-redirected edge)
necessarily duplicates all code from that destination onto the back edge.
Here it's the load from c[j] into sum_11.

The important part is the code is emitted onto the back edge,
conceptually; in reality it's simply included into the (new) latch block
(the duplicate of bb9, which is bb12 intermediately, then named bb7 after
cfg_cleanup).

That's what we can't have for some of our structural loop optimizers:
there must be no code executed after the exit test (e.g. in the latch
block).  (This requirement makes reasoning about which code is or isn't
executed completely for an iteration trivial; simply everything in the
body is always executed; e.g. loop interchange uses this to check that
there are no memory references after the exit test, because those would
then be only conditional and hence make loop interchange very awkward).

Note that this situation can't be later rectified anymore: the duplicated
instructions (because they are memory refs) must remain after the exit
test.  Only by rerolling/unrotating the loop (i.e. noticing that the
memory refs on the loop-entry path and on the back edge are equivalent)
would that be possible, but that's something we aren't capable of.  Even
if we were that would simply just revert the whole work that the threader
did, so it's better to not even do that to start with.

I believe something like below would be appropriate, it disables threading
if the path contains a latch at the non-last position (due to being
backwards on the non-first position in the array).  I.e. it disables
rotating the loop if there's danger of polluting the back edge.  It might
be improved if the blocks following (preceding!) the latch are themselves
empty because then no code is duplicated.  It might also be improved if
the latch is already non-empty.  That code should probably only be active
before the loop optimizers, but currently the backward threader isn't
differentiating between before/after loop-optims.

I haven't tested this patch at all, except that it fixes the testcase :)


Thanks for looking at this.

I think you're onto something with this approach.  Perhaps in addition 
to the loop header threading Richard mentions.


Your patch causes some regressions, but I think most are noise from FSM 
tests that must be adjusted.  However, there are some other ones that 
are curious:


> FAIL: gcc.dg/tree-ssa/ldist-22.c scan-tree-dump ldist "generated 
memset zero"

> FAIL: gcc.dg/tree-ssa/pr66752-3.c scan-tree-dump-not dce2 "if .flag"
< XFAIL: gcc.dg/shrink-wrap-loop.c scan-rtl-dump pro_and_epilogue 
"Performing shrink-wrapping"
< XFAIL: gcc.dg/Warray-bounds-87.c pr101671 (test for bogus messages, 
line 36)
> FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times 
graphite "1 loops carried no dependency" 1
> FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times 
optimized "loopfn.1" 4
> FAIL: libgomp.graphite/force-parallel-8.c scan-tree-dump-times 
graphite "5 loops carried no dependency" 1


Interestingly your patch is fixing shrink-wrap-loop.c and 
Warray-bounds-87, both of which were introduced by the backward threader 
rewrite.  At least the Warray-bounds was the threader peeling off an 
iteration that caused a bogus warning.


The ldist-22 regression is interesting though:

void foo ()
{
  int i;

   <bb 2> :
  goto <bb 6>; [INV]

   <bb 3> :
  a[i_1] = 0;
  if (i_1 > 

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Aldy Hernandez via Gcc




On 9/9/21 8:57 AM, Richard Biener wrote:

On Wed, Sep 8, 2021 at 8:13 PM Michael Matz  wrote:


Hello,

[lame answer to self]

On Wed, 8 Sep 2021, Michael Matz wrote:


The forward threader guards against this by simply disallowing
threadings that involve different loops.  As I see


The thread in question (5->9->3) is all within the same outer loop,
though. BTW, the backward threader also disallows threading across
different loops (see path_crosses_loops variable).

...

Maybe it's possible to not disable threading over latches altogether in
the backward threader (like it's tried now), but I haven't looked at the
specific situation here in depth, so take my view only as opinion from a
large distance :-)


I've now looked at the concrete situation.  So yeah, the whole path is in
the same loop, crosses the latch, _and there's code following the latch
on that path_.  (I.e. the latch isn't the last block in the path).  In
particular, after loop_optimizer_init() (before any threading) we have:

  <bb 3> [local count: 118111600]:
  # j_19 = PHI 
  sum_11 = c[j_19];
  if (n_10(D) > 0)
    goto <bb 8>; [89.00%]
  else
    goto <bb 5>; [11.00%]

  <bb 8> [local count: 105119324]:
...

  <bb 5> [local count: 118111600]:
  # sum_21 = PHI 
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
    goto <bb 9>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 9> [local count: 105119324]:
  goto <bb 3>; [100.00%]

With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
pre-header of inner loop, but more importantly something that's not at the
start of the outer loop.

Now, any thread that includes the backedge 9->3 _including_ its
destination (i.e. where the backedge isn't the last to-be-redirected edge)
necessarily duplicates all code from that destination onto the back edge.
Here it's the load from c[j] into sum_11.

The important part is the code is emitted onto the back edge,
conceptually; in reality it's simply included into the (new) latch block
(the duplicate of bb9, which is bb12 intermediately, then named bb7 after
cfg_cleanup).

That's what we can't have for some of our structural loop optimizers:
there must be no code executed after the exit test (e.g. in the latch
block).  (This requirement makes reasoning about which code is or isn't
executed completely for an iteration trivial; simply everything in the
body is always executed; e.g. loop interchange uses this to check that
there are no memory references after the exit test, because those would
then be only conditional and hence make loop interchange very awkward).

Note that this situation can't be later rectified anymore: the duplicated
instructions (because they are memory refs) must remain after the exit
test.  Only by rerolling/unrotating the loop (i.e. noticing that the
memory refs on the loop-entry path and on the back edge are equivalent)
would that be possible, but that's something we aren't capable of.  Even
if we were that would simply just revert the whole work that the threader
did, so it's better to not even do that to start with.

I believe something like below would be appropriate, it disables threading
if the path contains a latch at the non-last position (due to being
backwards on the non-first position in the array).  I.e. it disables
rotating the loop if there's danger of polluting the back edge.  It might
be improved if the blocks following (preceding!) the latch are themselves
empty because then no code is duplicated.  It might also be improved if
the latch is already non-empty.  That code should probably only be active
before the loop optimizers, but currently the backward threader isn't
differentiating between before/after loop-optims.

I haven't tested this patch at all, except that it fixes the testcase :)


Lame comment at the current end of the thread - it's not threading through the


I don't know why y'all keep using the word "lame".  On the contrary, 
these are incredibly useful explanations.  Thanks.



latch but threading through the loop header that's problematic, at least if the
end of the threading path ends within the loop (threading through the header
to the loop exit is fine).  Because in that situation you effectively created an
alternate loop entry.  Threading through the latch into the loop header is
fine but with simple latches that likely will never happen (if there are no
simple latches then the latch can contain the loop exit test).

See tree-ssa-threadupdate.c:thread_block_1

   e2 = path->last ()->e;
   if (!e2 || noloop_only)
 {
   /* If NOLOOP_ONLY is true, we only allow threading through the
  header of a loop to exit edges.  */

   /* One case occurs when there was loop header buried in a jump
  threading path that crosses loop boundaries.  We do not try
  and thread this elsewhere, so just cancel the jump threading
  request by clearing the AUX field now.  */
   if (bb->loop_father != e2->src->loop_father
   && (!loop_

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-08 Thread Aldy Hernandez via Gcc

On 9/8/21 3:49 PM, Richard Biener wrote:


On Wed, Sep 8, 2021 at 3:25 PM Aldy Hernandez  wrote:



It would be helpful to have the patch causing the issue to look at the IL.
But as Micha said, there needs to be a perfect loop nest for interchange
to work.

Richard.


Absolutely!  I'm attaching the reduced testcase, as well as the patch.

The problematic thread shows up in the thread2 dump:

Checking profitability of path (backwards):  bb:3 (4 insns) bb:9 (0
insns) bb:5
Control statement insns: 2
Overall: 2 insns
Registering FSM jump thread: (5, 9) incoming edge;  (9, 3)  (3, 8)
nocopy; (3, 8)


Well, so as Micha said, the threading destroys the outer loop by
making it have two entries - we actually _do_ recover from this
but it results in a non-empty latch of the outer loop and thus
the loop is no longer a perfect nest.  The interchange pass has
special-sauce to recover from the loop store motion applied
but not to handle the kind of loop rotation the jump threading
caused.


Understood.  The backedge into BB8 and the non-empty latch seem to cause 
problems.




The forward threader guards against this by simply disallowing
threadings that involve different loops.  As I see


The thread in question (5->9->3) is all within the same outer loop, 
though.  BTW, the backward threader also disallows threading across 
different loops (see path_crosses_loops variable).



the threading done here should be 7->3 (outer loop entry) to bb 8
rather than one involving the backedge.  Also note the condition
is invariant in the loop and in fact subsumed by the condition
outside of the loop and it should have been simplified by
VRP after pass_ch but I see there's a jump threading pass
inbetween pass_ch and the next VRP which is likely the
problem.


A 7->3->8 thread would cross loops though, because 7 is outside the 
outer loop.




Why does jump threading not try to simplify a condition before
attempting to thread through it?  After all ranger should be around?


That's a very good question ;-).  The current code does not use the 
ranger to solve unknowns.  Anything further back than the first block in 
a path will return varying.  The plan was to add this ranger solving 
functionality as a follow-up.


I have a whole slew of pending patches adding precisely this 
functionality.  We should be able to solve ranges outside the path with 
ranger, as well as relationals occurring in a path.


However, even if there are alternate ways of threading this IL, 
something like 5->9->3 could still happen.  We need a way to disallow 
this.  I'm having a hard time determining the hammer for this.  I would 
vote for disabling threading through latches, but it seems the backward 
threader is aware of this scenario and allows it anyhow (see 
threaded_through_latch variable).  Ughh.


Thanks for looking into this.
Aldy



Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-08 Thread Aldy Hernandez via Gcc

It would be helpful to have the patch causing the issue to look at the IL.
But as Micha said, there needs to be a perfect loop nest for interchange
to work.

Richard.


Absolutely!  I'm attaching the reduced testcase, as well as the patch.

The problematic thread shows up in the thread2 dump:

Checking profitability of path (backwards):  bb:3 (4 insns) bb:9 (0 
insns) bb:5

  Control statement insns: 2
  Overall: 2 insns
  Registering FSM jump thread: (5, 9) incoming edge;  (9, 3)  (3, 8) 
nocopy; (3, 8)


Thanks.
Aldy
commit 1bf3f76a5ff075396b5b9f5f88d6b18649dac2ce
Author: Aldy Hernandez 
Date:   Sun Sep 5 16:54:00 2021 +0200

Take into account global ranges when folding statements in path solver.

The path solver used by the backwards threader can refine a folded
range by taking into account any global range known.  This patch
intersects the folded range with whatever global information we have.

gcc/ChangeLog:

* gimple-range-path.cc (path_range_query::internal_range_of_expr):
Intersect with global ranges.

> FAIL: gcc.dg/tree-ssa/loop-interchange-9.c scan-tree-dump-times linterchange "Loop_pair is interchanged" 1
> FAIL: gcc.dg/tree-ssa/cunroll-15.c scan-tree-dump optimized "return 1;"

diff --git a/gcc/gimple-range-path.cc b/gcc/gimple-range-path.cc
index a4fa3b296ff..c616b65756f 100644
--- a/gcc/gimple-range-path.cc
+++ b/gcc/gimple-range-path.cc
@@ -127,6 +127,9 @@ path_range_query::internal_range_of_expr (irange &r, tree name, gimple *stmt)
   basic_block bb = stmt ? gimple_bb (stmt) : exit_bb ();
   if (stmt && range_defined_in_block (r, name, bb))
 {
+  if (TREE_CODE (name) == SSA_NAME)
+	r.intersect (gimple_range_global (name));
+
   set_cache (r, name);
   return true;
 }
/* { dg-do run } */
/* { dg-options "-O2 -floop-interchange -fdump-tree-linterchange-details" } */
/* { dg-require-effective-target size20plus } */
/* { dg-skip-if "too big data segment" { visium-*-* } } */

#define M 256
int a[M][M], b[M][M], c[M], d[M];
void __attribute__((noinline))
simple_reduc_1 (int n)
{
  for (int j = 0; j < n; j++)
{
  int sum = c[j];
  for (int i = 0; i < n; i++)
	sum = sum + a[i][j]*b[i][j];

  c[j] = sum;
}
}


Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-08 Thread Aldy Hernandez via Gcc
First of all, thanks so much for this incredibly useful explanation. 
It helps a lot.


I'm still trying to wrap my head around this, so please bear with me.

On 9/7/21 4:45 PM, Michael Matz wrote:

Hello,

On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote:


The regression comes from the simple_reduc_1() function in
tree-ssa/loop-interchange-9.c, and it involves the following path:

=== BB 5 
Imports: n_10(D)  j_19
Exports: n_10(D)  j_13  j_19
         j_13 : j_19(I)
n_10(D)  int [1, 257]
j_19     int [0, 256]
Relational : (j_13 > j_19)
  <bb 5> [local count: 118111600]:
  # sum_21 = PHI 
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
    goto <bb 9>; [89.00%]
  else
    goto <bb 6>; [11.00%]


So, this is the outer loop's exit block ...


=== BB 9 
n_10(D)  int [2, 257]
j_13     int [1, 256]
Relational : (n_10(D) > j_19)
Relational : (n_10(D) > j_13)
  <bb 9> [local count: 105119324]:
  goto <bb 3>; [100.00%]


... this is the outer loop's latch block ...


=== BB 3 
Imports: n_10(D)
Exports: n_10(D)
n_10(D)  int [1, +INF]
  <bb 3> [local count: 118111600]:
  # j_19 = PHI 
  sum_11 = c[j_19];
  if (n_10(D) > 0)
    goto <bb 8>; [89.00%]
  else
    goto <bb 5>; [11.00%]


... and this is the outer loop's header, as well as the inner loop's pre-header.


Actually, the inner loop's pre-header seems to be BB8, which is empty, 
and leads into BB4, which is the header of the inner loop.



The attempted thread hence involves a back-edge (of the outer loop) and a
loop-entry edge (bb3->bb8) of the inner loop.  Doing that threading would
destroy some properties that our loop optimizers rely on, e.g. that the
loop header of a natural loop is only reached by two edges: entry edge and
back edge, and that the latch blocks are empty and that there's only a
single latch.  (What exactly we require depends on several flags in
loops_state_satisfies_p).
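
For reference, those flags live in cfgloop.h, and a pass queries them
along these lines (illustrative usage):

  if (loops_state_satisfies_p (LOOPS_HAVE_SIMPLE_LATCHES
			       | LOOPS_HAVE_PREHEADERS))
    {
      /* Safe to rely on empty latch blocks and dedicated preheaders.  */
    }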


But threading 3->8 doesn't involve a loop-entry edge.  It involves a new 
edge into the *pre-header* of the inner loop.


I am attaching the IL right after threading and before we clean up the 
CFG (il-after-threading.txt).


After threading we have rewritten the 5->9 edge into 5->11:

   <bb 5> [local count: 118111600]:
  # sum_21 = PHI 
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
    goto <bb 11>; [89.00%]
  else
    goto <bb 6>; [11.00%]

   <bb 11> [local count: 105119324]:

   <bb 12> [local count: 105119324]:
  # j_7 = PHI 
  sum_6 = c[j_19];
  goto <bb 8>; [100.00%]

   <bb 9> [local count: 0]:  ;; This becomes unreachable.
  goto <bb 3>; [100.00%]

Notice BB9 becomes unreachable, so we're basically eliding the latch in BB9.

Question: could BB12 not be considered the loop latch now, and BB8 be the 
outer loop's entry?  This would also mean that BB3 is now outside of 
the outer loop, which means the threader peeled off the first iteration 
of the loop.  Or is it a requirement that the latch be empty?





With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that
the loop cannot legally iterate outside the size of c[256].  So, j_13
lies within [1, 257] and n_10 is [2, +INF] at the end of the path.
This allows us to thread BB3 to BB8.


So, IIUC doing this threading would create a new entry to BB8: it would
then be entered by its natural entry (bb3), by its natural back edge
(whatever bb that is now) and the new entry from the threaded outer back
edge (bb9 or bb5?).


As I mentioned, BB8 looks like the pre-header to the inner loop.  But 
yes, it now has multiple entries: the edge from BB3, and the back edge 
from BB12 (which seems like it could be a new latch, since it's the only 
back edge).




The inner loop wouldn't then be recognized as natural anymore and the
whole nest not as perfect, and hence loop interchange can't easily happen
anymore.  Most other structural loop optimizers of us would have problems
with that situation as well.


If I clean up the CFG and call loop_optimizer_init() again at this 
point, what is destroyed is the outer loop, not the inner loop.  If you 
look at the IL at this point 
(il-after-cleanup-and-rerunning-loop-init.txt), we have a latch in BB7 
going back up to BB4, though the conditional is now in BB6 (eeech):


  
...
...
...


   <bb 6> [local count: 118111600]:
  # sum_21 = PHI 
  # j_18 = PHI 
  c[j_18] = sum_21;
  j_13 = j_18 + 1;
  if (n_10(D) > j_13)
    goto <bb 7>; [89.00%]
  else
    goto ; [11.00%]

   <bb 7> [local count: 105119324]:
  # j_7 = PHI 
  sum_6 = c[j_7];
  goto <bb 4>; [100.00%]

Perhaps this looks sufficiently different that the loop optimizer can't 
recognize it as a loop?





All the blocks lie within the same loop, and the path passes all the
tests in path_profitable_p().

Is there something about the shape of this path that should make it
invalid in the backward threader, or should the test be marked with
-fdisable-tree-threadN (or something else entirely)?


This is a functionality test chec

More aggressive threading causing loop-interchange-9.c regression

2021-09-07 Thread Aldy Hernandez via Gcc

Hi folks.

I have a pending patch for the path solver that pulls in global ranges 
when available (the stuff in SSA_NAME_RANGE_INFO).  In doing so, I have 
run into a regression I was hoping the loop experts could pontificate on.


The regression comes from the simple_reduc_1() function in 
tree-ssa/loop-interchange-9.c, and it involves the following path:


=== BB 5 
Imports: n_10(D)  j_19
Exports: n_10(D)  j_13  j_19
         j_13 : j_19(I)
n_10(D)  int [1, 257]
j_19     int [0, 256]
Relational : (j_13 > j_19)
  <bb 5> [local count: 118111600]:
  # sum_21 = PHI 
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
    goto <bb 9>; [89.00%]
  else
    goto <bb 6>; [11.00%]

j_13 : int [1, 257]
5->9  (T) n_10(D) : int [2, 257]
5->9  (T) j_13 :    int [1, 256]
5->9  (T) j_19 :    int [0, 255]
5->6  (F) n_10(D) : int [1, 257]
5->6  (F) j_13 :    int [1, 257]
5->6  (F) j_19 :    int [0, 256]

=== BB 9 
n_10(D)  int [2, 257]
j_13     int [1, 256]
Relational : (n_10(D) > j_19)
Relational : (n_10(D) > j_13)
  <bb 9> [local count: 105119324]:
  goto <bb 3>; [100.00%]


=== BB 3 
Imports: n_10(D)
Exports: n_10(D)
n_10(D)  int [1, +INF]
  <bb 3> [local count: 118111600]:
  # j_19 = PHI 
  sum_11 = c[j_19];
  if (n_10(D) > 0)
    goto <bb 8>; [89.00%]
  else
    goto <bb 5>; [11.00%]

With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that 
the loop cannot legally iterate outside the size of c[256].  So, j_13 
lies within [1, 257] and n_10 is [2, +INF] at the end of the path.  This 
allows us to thread BB3 to BB8.  Doing so causes the test to fail here:


/* { dg-final { scan-tree-dump-times "Loop_pair is 
interchanged" 1 "linterchange" } } */


All the blocks lie within the same loop, and the path passes all the 
tests in path_profitable_p().


Is there something about the shape of this path that should make it 
invalid in the backward threader, or should the test be marked with 
-fdisable-tree-threadN (or something else entirely)?  Note that the 
backward threader is allowed to peel through loop headers.


I am attaching the gimple dump as well as the graphviz output for easier 
analysis.


Thanks.
Aldy
__attribute__((noinline))
void simple_reduc_1 (int n)
{
  int i;
  int sum;
  int j;
  int _1;
  int _2;
  int _3;

   <bb 2> [local count: 14598063]:
  if (n_10(D) > 0)
    goto <bb 7>; [89.00%]
  else
    goto <bb 6>; [11.00%]

   <bb 7> [local count: 12992276]:

   <bb 3> [local count: 118111600]:
  # j_19 = PHI 
  sum_11 = c[j_19];
  if (n_10(D) > 0)
    goto <bb 8>; [89.00%]
  else
    goto <bb 5>; [11.00%]

   <bb 8> [local count: 105119324]:

   <bb 4> [local count: 955630225]:
  # sum_20 = PHI 
  # i_22 = PHI 
  _1 = a[i_22][j_19];
  _2 = b[i_22][j_19];
  _3 = _1 * _2;
  sum_14 = _3 + sum_20;
  i_15 = i_22 + 1;
  if (n_10(D) > i_15)
    goto <bb 10>; [89.00%]
  else
    goto <bb 5>; [11.00%]

   <bb 10> [local count: 850510901]:
  goto <bb 4>; [100.00%]

   <bb 5> [local count: 118111600]:
  # sum_21 = PHI 
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
    goto <bb 9>; [89.00%]
  else
    goto <bb 6>; [11.00%]

   <bb 9> [local count: 105119324]:
  goto <bb 3>; [100.00%]

   <bb 6> [local count: 14598063]:
  return;

}


graph.ps
Description: PostScript document


Re: Hooks fixed to treat trunk the same as master

2021-08-03 Thread Aldy Hernandez via Gcc




On 8/3/21 9:33 AM, Martin Liška wrote:

On 8/2/21 7:22 PM, Joseph Myers wrote:

Hey.


Some time ago, someone added a git symbolic-ref for refs/heads/trunk
pointing to refs/heads/master.


Great that you found out what caused that.  We were aware of commits that didn't
pass gcc-verify check but for some reason went in.



A side effect of this was to introduce a loophole in the checks run via
commit hooks, some of which are configured to apply only to master and
release branches and didn't apply if commits were pushed instead to trunk
(this also meant that commits pushed to trunk didn't result in Bugzilla
updates, and the emails to gcc-cvs didn't show the r12- short
identifier for the commit).

In particular, the nightly DATESTAMP and ChangeLog updates have been
broken for the past few nights because of a commit pushed to trunk with
bad ChangeLog entries, which only got detected in the nightly cron job
rather than at the time of the original push.


I'm going to fix that with Jakub. The problematic commit is from Aldy:
2e96b5f14e4025691b57d2301d71aa6092ed44bc:

git gcc-verify 2e96b5f14e4025691b57d2301d71aa6092ed44bc

Checking 2e96b5f14e4025691b57d2301d71aa6092ed44bc: FAILED

ERR: unchanged file mentioned in a ChangeLog: "gcc/Makefile.in"

ERR: unchanged file mentioned in a ChangeLog (did you mean 
"gcc/testsuite/gcc.dg/analyzer/pr94851-2.c"?): 
"gcc/testsuite/dg.dg/analyzer/pr94851-2.c"


ERR: changed file not mentioned in a ChangeLog: 
"gcc/testsuite/gcc.dg/analyzer/pr94851-2.c"



Aldy, can you please verify that you pushed the commit to 'trunk' branch 
name?


Sure, that sounds like me :).  I have some local branches based off of 
trunk.  For some reason I thought they were the same.


Sorry for the problems this has caused, and thanks for amending my commit.

I take it it's now ok to treat trunk and master the same?

Aldy



Re: Failures building glibc with mainline GCC

2021-07-30 Thread Aldy Hernandez via Gcc
There's a new jump threader in GCC which is much more aggressive, and
may trigger latent problems with other warning passes, especially
-Warray-bounds, -Woverflow, and -Wuninitialized.

Do your problems go away if you take out commit 2e96b5f14e?

I have notes throughout the commit analyzing variants of the above
warnings (for example my addendum to gcc.c-torture/compile/pr83510.c).
So far, all the warnings I've seen are legitimate jump threading
opportunities that we are now doing, but which throw off the warning
passes.

Before the inclusion of the new threader, I had warned that this could
happen.  Perhaps we'll have to come up with a way to reduce the false
positives.

Aldy

On Fri, Jul 30, 2021 at 5:31 PM Joseph Myers  wrote:
>
> There are a lot of failures building glibc with mainline GCC right now
> 
> (previously, there were ICEs building glibc on various architectures, so
> these might be hard to bisect):
>
>
> * x86_64-linux-gnu: "error: array subscript 0 is outside array bounds of
> '__seg_fs struct pthread * __seg_fs[0]' [-Werror=array-bounds]".  This is
> the one discussed in
> .
>
> In file included from ../sysdeps/generic/libc-tsd.h:44,
>  from ../include/../locale/localeinfo.h:224,
>  from ../include/ctype.h:26,
>  from loadmsgcat.c:29:
> loadmsgcat.c: In function '_nl_load_domain':
> ../sysdeps/x86_64/nptl/tls.h:185:4: error: array subscript 0 is outside array 
> bounds of '__seg_fs struct pthread * __seg_fs[0]' [-Werror=array-bounds]
>   185 |   (*(struct pthread *__seg_fs *) offsetof (struct pthread, 
> header.self))
>   |   
> ~^
> ../sysdeps/nptl/libc-lock.h:92:18: note: in expansion of macro 'THREAD_SELF'
>92 | void *self = THREAD_SELF; 
> \
>   |  ^~~
> loadmsgcat.c:770:3: note: in expansion of macro '__libc_lock_lock_recursive'
>   770 |   __libc_lock_lock_recursive (lock);
>   |   ^~
>
>
> * i686-gnu:
>
> hurdselect.c: In function '_hurd_select':
> hurdselect.c:555:7: error: 'ss' may be used uninitialized in this function 
> [-Werror=maybe-uninitialized]
>   555 |   _hurd_sigstate_unlock (ss);
>   |   ^~
>
> As far as I can tell, this is a false positive from the compiler (this
> code is only reached if (sigport != MACH_PORT_NULL), in which case ss has
> been initialized).
>
>
> * Several architectures (all of them 32-bit), powerpc-linux-gnu for
> example:
>
> In file included from t.61.c:437:
> In function 'from_t_61_single',
> inlined from 'gconv' at ../iconv/skeleton.c:568:15:
> ../iconv/loop.c:440:22: error: writing 1 byte into a region of size 0 
> [-Werror=stringop-overflow=]
>   440 | bytebuf[inlen++] = *inptr++;
>   | ~^~
> ../iconv/loop.c: In function 'gconv':
> ../iconv/loop.c:382:17: note: at offset 2 into destination object 'bytebuf' 
> of size 2
>   382 |   unsigned char bytebuf[MAX_NEEDED_INPUT];
>   | ^~~
>
> I don't know if this is a false positive or not.
>
>
> * powerpc64-linux-gnu:
>
> In file included from ../sysdeps/powerpc/dl-tls.c:20:
> In function '_dl_allocate_tls_init',
> inlined from '_dl_allocate_tls' at ../elf/dl-tls.c:621:10:
> ../elf/dl-tls.c:529:10: error: array subscript -1 is outside array bounds of 
> 'void[9223372036854775807]' [-Werror=array-bounds]
>   529 |   dtv_t *dtv = GET_DTV (result);
>   |  ^~~
> In file included from ../elf/dl-tls.c:28,
>  from ../sysdeps/powerpc/dl-tls.c:20:
> ../sysdeps/powerpc/nptl/tls.h:136:34: error: array subscript -1 is outside 
> array bounds of 'void[9223372036854775807]' [-Werror=array-bounds]
>   136 |   ((tcbhead_t *) (tcbp))[-1].dtv = dtvp + 1
>   |   ~~~^~
> ../elf/dl-tls.c:544:7: note: in expansion of macro 'INSTALL_DTV'
>   544 |   INSTALL_DTV (result, &dtv[-1]);
>   |   ^~~
>
>
> * powerpc64le-linux-gnu: "error: '-mabi=ibmlongdouble' requires
> '-mlong-double-128'".  See
> .
>
>
> --
> Joseph S. Myers
> jos...@codesourcery.com
>



Re: replacing the backwards threader and more

2021-06-29 Thread Aldy Hernandez via Gcc




On 6/30/21 12:16 AM, Martin Sebor wrote:

On 6/28/21 6:32 PM, Aldy Hernandez wrote:
I had changed my approach to passing -Wno-free-nonheap-object in the 
Makefile. Can you try disabling the Makefile bit and bootstrapping, 
cause that was still failing.


I see.  I just tested it and my patch does let the existing #pragma
suppress the warning.  I just pinged the core patch yesterday so
once it and the rest of the series gets approved there will be no
need for the Makefile workaround.


Thanks for taking care of this.
Aldy



Re: replacing the backwards threader and more

2021-06-28 Thread Aldy Hernandez via Gcc
I had changed my approach to passing -Wno-free-nonheap-object in the
Makefile. Can you try disabling the Makefile bit and bootstrapping, cause
that was still failing.

Thanks.
Aldy

On Tue, Jun 29, 2021, 00:29 Martin Sebor  wrote:

> On 6/28/21 12:17 AM, Aldy Hernandez wrote:
> >
> >
> > On 6/25/21 7:19 PM, Martin Sebor wrote:
> >> On 6/25/21 10:20 AM, Aldy Hernandez via Gcc wrote:
> >>> Hi folks.
> >>>
> >>> I'm done with benchmarking, testing and cleanups, so I'd like to post
> >>> my patchset for review.  However, before doing so, I'd like to
> >>> address a handful of meta-issues that may affect how I post these
> >>> patches.
> >>>
> >>> Trapping on differences
> >>> ===
> >>>
> >>> Originally I wanted to contribute verification code that would trap
> >>> if the legacy code threaded any edges the new code couldn't (to be
> >>> removed after a week).  However, after having tested on various
> >>> architectures and only running once into a missing thread, I'm
> >>> leaning towards omitting the verification code, since it's fragile,
> >>> time consuming, and quite hacky.
> >>>
> >>> For the record, I have tested on x86-64, aarch64, ppc64 and ppc64le.
> >>> There is only one case, across bootstrap and regression tests where
> >>> the verification code is ever tripped (discussed below).
> >>>
> >>> Performance
> >>> ===
> >>>
> >>> I re-ran benchmarks as per our callgrind suite, and the penalty with
> >>> the current pipeline is 1.55% of overall compilation time.  As is
> >>> being discussed, we should be able to mitigate this significantly by
> >>> removing other threading passes.
> >>>
> >>> Failing testcases
> >>> =
> >>>
> >>> I have yet to run into incorrect code being generated, but I have had
> >>> to tweak a considerable number of tests.  I have verified every
> >>> single discrepancy and documented my changes in the testsuite when it
> >>> merited doing so.  However, there are a couple tests that trigger
> >>> regressions and I'd like to ask for guidance on how to address them.
> >>>
> >>> 1. gcc.c-torture/compile/pr83510.c
> >>>
> >>> I would like to XFAIL this.
> >>>
> >>> What happens here is that thread1 threads a switch statement such
> >>> that the various cases have been split into different independent
> >>> blocks. One of these blocks exposes an arr[i_27] access which is
> >>> later propagated by VRP to be arr[10].  This is an invalid access,
> >>> but the array bounds code doesn't know it is an unreachable path.
> >>
> >> The test has a bunch of loops that iterate over the 10 array elements.
> >> There have been bug reports about loop unrolling causing false positives
> >> -Warray-bounds (e.g., PR 92539, 92110, or 86341) so this could be
> >> the same issue.
> >>
> >>>
> >>> However, it is not until dom2 that we "know" that the value of the
> >>> switch index is such that the path to arr[10] is unreachable.  For
> >>> that matter, it is not until dom3 that we remove the unreachable path.
> >>
> >> If you do XFAIL it can you please isolate a small test case and open
> >> a bug and make it a -Warray-bounds blocker?
> >
> > Will do.
> >
> >>
> >>>
> >>> 2. -Wfree-nonheap-object
> >>>
> >>> This warning is triggered while cleaning up an auto_vec.  I see that
> >>> the va_heap::release() inline is wrapped with a pragma ignore
> >>> "-Wfree-nonheap-object", but this is not sufficient because jump
> >>> threading may alter uses in such a way that may_emit_free_warning()
> >>> will warn on the *inlined* location, thus bypassing the pragma.
> >>>
> >>> I worked around this with a mere:
> >>>
> >>>  > @@ -13839,6 +13839,7 @@ maybe_emit_free_warning (tree exp)
> >>>>location_t loc = tree_inlined_location (exp);
> >>>> +  loc = EXPR_LOCATION (exp);
> >>>
> >>> but this causes a ton of Wfree-nonheap* tests to fail.  I think
> >>> someone more knowledgeable should address this (msebor??).
> >>
> >> This sounds like the same problem as PR 98871.

Re: replacing the backwards threader and more

2021-06-28 Thread Aldy Hernandez via Gcc




On 6/28/21 10:22 AM, Richard Biener wrote:

On Fri, Jun 25, 2021 at 6:20 PM Aldy Hernandez  wrote:


Hi folks.

I'm done with benchmarking, testing and cleanups, so I'd like to post my
patchset for review.  However, before doing so, I'd like to address a
handful of meta-issues that may affect how I post these patches.


First of all thanks for the detailed analysis and report!


Trapping on differences
===

Originally I wanted to contribute verification code that would trap if
the legacy code threaded any edges the new code couldn't (to be removed
after a week).  However, after having tested on various architectures
and only running once into a missing thread, I'm leaning towards
omitting the verification code, since it's fragile, time consuming, and
quite hacky.


Agreed.


For the record, I have tested on x86-64, aarch64, ppc64 and ppc64le.
There is only one case, across bootstrap and regression tests where the
verification code is ever tripped (discussed below).

Performance
===

I re-ran benchmarks as per our callgrind suite, and the penalty with the
current pipeline is 1.55% of overall compilation time.  As is being
discussed, we should be able to mitigate this significantly by removing
other threading passes.


1.55% for the overall compilation looks quite large - is this suite particularly
demanding on VRP/ranger?  Is the figure for 'cc1files' similar? (collect
preprocessed sources of gcc/*.{c,cc} compiles, like by appending
-save-temps to CFLAGS in a non-bootstrap build)


The figures are for cc1 files.  They are .ii files we run under 
callgrind and compare "instruction counts" as per the tool.  We've found 
it's comparable to -ftime-report, though not exactly.  We've been using 
this for benchmarking, since evrp times were so small that it was hard 
to compare user times.




I'm curious if you tried to look at the worst offenders and if there's
anything new, threading specific.  More threading might result in
larger code (how is cc1 size affected by the change?) and more
code to compile by later passes can easily explain >1% differences.


A while ago, I did look at worst offenders on the branch, but there was 
nothing obvious.  As it stands now, the average cost for the threading 
pass is 858%, with 95% of .ii files falling under 1800%.  I could look 
at the outliers which seem to be c-fold, gcc-ar, gcc-ranlib, gcc-nm, and 
c-opts.


cc1 on x86-64 seems to be 0.135% larger for an --enable-checking build.




Failing testcases
=

I have yet to run into incorrect code being generated, but I have had to
tweak a considerable number of tests.  I have verified every single
discrepancy and documented my changes in the testsuite when it merited
doing so.  However, there are a couple tests that trigger regressions
and I'd like to ask for guidance on how to address them.

1. gcc.c-torture/compile/pr83510.c

I would like to XFAIL this.

What happens here is that thread1 threads a switch statement such that
the various cases have been split into different independent blocks.
One of these blocks exposes an arr[i_27] access which is later
propagated by VRP to be arr[10].  This is an invalid access, but the
array bounds code doesn't know it is an unreachable path.

However, it is not until dom2 that we "know" that the value of the
switch index is such that the path to arr[10] is unreachable.  For that
matter, it is not until dom3 that we remove the unreachable path.


Sounds reasonable.


2. -Wfree-nonheap-object

This warning is triggered while cleaning up an auto_vec.  I see that the
va_heap::release() inline is wrapped with a pragma ignore
"-Wfree-nonheap-object", but this is not sufficient because jump
threading may alter uses in such a way that may_emit_free_warning() will
warn on the *inlined* location, thus bypassing the pragma.

I worked around this with a mere:

  > @@ -13839,6 +13839,7 @@ maybe_emit_free_warning (tree exp)

location_t loc = tree_inlined_location (exp);
+  loc = EXPR_LOCATION (exp);


but this causes a ton of Wfree-nonheap* tests to fail.  I think someone
more knowledgeable should address this (msebor??).


That looks wrong, but see msebors response for maybe some better answer.


Oh, it was just a stop-gap.  I have instead opted for the following 
temporary measure in Makefile.in:


tree-ssa-loop-im.o-warn = -Wno-free-nonheap-object

I will work with Martin on a more appropriate solution.




3. uninit-pred-9_b.c

The uninit code is getting confused with the threading and the bogus
warning in line 24 is back.  I looked at the thread, and it is correct.

I'm afraid all these warnings are quite fragile in the presence of more
aggressive optimizations, and I suspect it will only get worse.


Yep.  Shouldn't be a blocker, just XFAIL ...


4. libphobos/src/std/net/isemail.d

This is a D test where we don't actually fail, but we trigger the
verification code.  It is the only jump threading edge that the new code
fails to get over the old code, and it only happens on ppc64.

Re: replacing the backwards threader and more

2021-06-27 Thread Aldy Hernandez via Gcc




On 6/25/21 7:19 PM, Martin Sebor wrote:

On 6/25/21 10:20 AM, Aldy Hernandez via Gcc wrote:

Hi folks.

I'm done with benchmarking, testing and cleanups, so I'd like to post 
my patchset for review.  However, before doing so, I'd like to address 
a handful of meta-issues that may affect how I post these patches.


Trapping on differences
===

Originally I wanted to contribute verification code that would trap if 
the legacy code threaded any edges the new code couldn't (to be 
removed after a week).  However, after having tested on various 
architectures and only running once into a missing thread, I'm leaning 
towards omitting the verification code, since it's fragile, time 
consuming, and quite hacky.


For the record, I have tested on x86-64, aarch64, ppc64 and ppc64le. 
There is only one case, across bootstrap and regression tests where 
the verification code is ever tripped (discussed below).


Performance
===

I re-ran benchmarks as per our callgrind suite, and the penalty with 
the current pipeline is 1.55% of overall compilation time.  As is 
being discussed, we should be able to mitigate this significantly by 
removing other threading passes.


Failing testcases
=

I have yet to run into incorrect code being generated, but I have had 
to tweak a considerable number of tests.  I have verified every single 
discrepancy and documented my changes in the testsuite when it merited 
doing so.  However, there are a couple tests that trigger regressions 
and I'd like to ask for guidance on how to address them.


1. gcc.c-torture/compile/pr83510.c

I would like to XFAIL this.

What happens here is that thread1 threads a switch statement such that 
the various cases have been split into different independent blocks. 
One of these blocks exposes an arr[i_27] access which is later 
propagated by VRP to be arr[10].  This is an invalid access, but the 
array bounds code doesn't know it is an unreachable path.


The test has a bunch of loops that iterate over the 10 array elements.
There have been bug reports about loop unrolling causing false positives
-Warray-bounds (e.g., PR 92539, 92110, or 86341) so this could be
the same issue.



However, it is not until dom2 that we "know" that the value of the 
switch index is such that the path to arr[10] is unreachable.  For 
that matter, it is not until dom3 that we remove the unreachable path.


If you do XFAIL it can you please isolate a small test case and open
a bug and make it a -Warray-bounds blocker?


Will do.





2. -Wfree-nonheap-object

This warning is triggered while cleaning up an auto_vec.  I see that 
the va_heap::release() inline is wrapped with a pragma ignore 
"-Wfree-nonheap-object", but this is not sufficient because jump 
threading may alter uses in such a way that may_emit_free_warning() 
will warn on the *inlined* location, thus bypassing the pragma.


I worked around this with a mere:

 > @@ -13839,6 +13839,7 @@ maybe_emit_free_warning (tree exp)

   location_t loc = tree_inlined_location (exp);
+  loc = EXPR_LOCATION (exp);


but this causes a ton of Wfree-nonheap* tests to fail.  I think 
someone more knowledgeable should address this (msebor??).


This sounds like the same problem as PR 98871.  Does the patch below
fix it?
https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572515.html
If so, I suggest getting that patch in first to avoid testsuite
failures.  If it doesn't fix it I'll look into it before you commit
your changes.


The latest patch in the above thread does not apply.  Do you have a more 
recent patch?



3. uninit-pred-9_b.c

The uninit code is getting confused with the threading and the bogus 
warning in line 24 is back.  I looked at the thread, and it is correct.


I'm afraid all these warnings are quite fragile in the presence of 
more aggressive optimizations, and I suspect it will only get worse.


From my recent review of open -Wmaybe-uninitialized bugs (and
the code) it does seem to be both fragile and getting worse.  I've
only found a few simple problems so far in the code but nothing that
would make a dramatic difference so I can't say if it's possible to
do much better, but I'm not done or ready to give up.  If you XFAIL
this too please open a bug for it and make it a blocker for
-Wuninitialized?


Will do.

Good to know these are somewhat known issues.

Thanks.
Aldy



Re: replacing the backwards threader and more

2021-06-25 Thread Aldy Hernandez via Gcc

Hi folks.

I'm done with benchmarking, testing and cleanups, so I'd like to post my 
patchset for review.  However, before doing so, I'd like to address a 
handful of meta-issues that may affect how I post these patches.


Trapping on differences
===

Originally I wanted to contribute verification code that would trap if 
the legacy code threaded any edges the new code couldn't (to be removed 
after a week).  However, after having tested on various architectures 
and only running once into a missing thread, I'm leaning towards 
omitting the verification code, since it's fragile, time consuming, and 
quite hacky.


For the record, I have tested on x86-64, aarch64, ppc64 and ppc64le. 
There is only one case, across bootstrap and regression tests where the 
verification code is ever tripped (discussed below).


Performance
===

I re-ran benchmarks as per our callgrind suite, and the penalty with the 
current pipeline is 1.55% of overall compilation time.  As is being 
discussed, we should be able to mitigate this significantly by removing 
other threading passes.


Failing testcases
=

I have yet to run into incorrect code being generated, but I have had to 
tweak a considerable number of tests.  I have verified every single 
discrepancy and documented my changes in the testsuite when it merited 
doing so.  However, there are a couple tests that trigger regressions 
and I'd like to ask for guidance on how to address them.


1. gcc.c-torture/compile/pr83510.c

I would like to XFAIL this.

What happens here is that thread1 threads a switch statement such that 
the various cases have been split into different independent blocks. 
One of these blocks exposes an arr[i_27] access which is later 
propagated by VRP to be arr[10].  This is an invalid access, but the 
array bounds code doesn't know it is an unreachable path.


However, it is not until dom2 that we "know" that the value of the 
switch index is such that the path to arr[10] is unreachable.  For that 
matter, it is not until dom3 that we remove the unreachable path.
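
To make the shape concrete, here is a contrived sketch (not the actual
pr83510.c, which is larger):

int arr[10];
int g (void);

int
f (void)
{
  int x = 0;
  for (int i = 0; i < 10; i++)
    switch (g ())
      {
      case 0:
        x += arr[i];
        break;
      case 1:
        if (i == 10)      /* never true under the loop bound */
          x += arr[i];    /* folds to arr[10] on the dead path */
        break;
      }
  return x;
}

Once the "case 1" block is isolated by threading and i == 10 is
propagated into it, the array bounds code sees a literal arr[10]
without knowing the path is dead.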


2. -Wfree-nonheap-object

This warning is triggered while cleaning up an auto_vec.  I see that the 
va_heap::release() inline is wrapped with a pragma ignore 
"-Wfree-nonheap-object", but this is not sufficient because jump 
threading may alter uses in such a way that may_emit_free_warning() will 
warn on the *inlined* location, thus bypassing the pragma.
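
The shape of the problem, as a self-contained analogue (hypothetical
names, not the actual vec.h code):

#include <cstdlib>

struct stack_buf { char data[16]; };

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfree-nonheap-object"
inline void release (void *p) { std::free (p); }  // suppressed here
#pragma GCC diagnostic pop

void user ()
{
  stack_buf b;
  release (b.data);  // after inlining, the warning may be attributed
                     // here, outside the pragma's scope
}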


I worked around this with a mere:

> @@ -13839,6 +13839,7 @@ maybe_emit_free_warning (tree exp)

   location_t loc = tree_inlined_location (exp);
+  loc = EXPR_LOCATION (exp);


but this causes a ton of Wfree-nonheap* tests to fail.  I think someone 
more knowledgeable should address this (msebor??).


3. uninit-pred-9_b.c

The uninit code is getting confused with the threading and the bogus 
warning in line 24 is back.  I looked at the thread, and it is correct.


I'm afraid all these warnings are quite fragile in the presence of more 
aggressive optimizations, and I suspect it will only get worse.


4. libphobos/src/std/net/isemail.d

This is a D test where we don't actually fail, but we trigger the 
verification code.  It is the only jump threading edge that the new code 
fails to get over the old code, and it only happens on ppc64.


It triggers because a BB4 -> BB5 is too expensive to thread, but a BBn 
-> BB3 -> BB4 -> BB5 is considered safe to thread because BB3 is a latch 
and it alters the profitability equation.  The reason we don't get it, 
is that we assume that if a X->Y is unprofitable, it is not worth 
looking at W->X->Y and so forth.


Jeff had some fancy ideas on how to attack this.  Once such idea was to 
stop looking back, but only for things we were absolutely sure would 
never yield a profitable path.  I tried a subset of this, by allowing 
further looks on this latch test, but my 1.55% overall performance 
penalty turned into an 8.33% penalty.  Personally it looks way too 
expensive for this one isolated case.  Besides, the test where this 
clamping code originally came from still succeeds (commit 
eab2541b860c48203115ac6dca3284e982015d2c).


CONCLUSION
==

That's basically it.

If we agree the above things are not big issues, or can be addressed as 
follow-ups, I'd like to start the ball rolling on the new threader. 
This would allow more extensive testing of the code, and separate it a 
bit from the other big changes coming up :).


Aldy



Re: replacing the backwards threader and more

2021-06-21 Thread Aldy Hernandez via Gcc




On 6/9/21 2:09 PM, Richard Biener wrote:

On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc  wrote:


Hi Jeff.  Hi folks.

What started as a foray into severing the old (forward) threader's
dependency on evrp, turned into a rewrite of the backwards threader
code.  I'd like to discuss the possibility of replacing the current
backwards threader with a new one that gets far more threads and can
potentially subsume all threaders in the future.

I won't include code here, as it will just detract from the high level
discussion.  But if it helps, I could post what I have, which just needs
some cleanups and porting to the latest trunk changes Andrew has made.

Currently the backwards threader works by traversing DEF chains through
PHIs leading to possible paths that start in a constant.  When such a
path is found, it is checked to see if it is profitable, and if so, the
constant path is threaded.  The current implementation is rather limited
since backwards paths must end in a constant.  For example, the
backwards threader can't get any of the tests in
gcc.dg/tree-ssa/ssa-thread-14.c:

if (a && b)
  foo ();
if (!b && c)
  bar ();

etc.

After my refactoring patches to the threading code, it is now possible
to drop in an alternate implementation that shares the profitability
code (is this path profitable?), the jump registry, and the actual jump
threading code.  I have leveraged this to write a ranger-based threader
that gets every single thread the current code gets, plus 90-130% more.

Here are the details from the branch, which should be very similar to
trunk.  I'm presenting the branch numbers because they contain Andrew's
upcoming relational query which significantly juices up the results.

New threader:
   ethread:65043(+3.06%)
   dom:32450  (-13.3%)
   backwards threader:72482   (+89.6%)
   vrp:40532  (-30.7%)
Total threaded:  210507 (+6.70%)

This means that the new code gets 89.6% more jump threading
opportunities than the code I want to replace.  In doing so, it reduces
the amount of DOM threading opportunities by 13.3% and by 30.7% from the
VRP jump threader.  The total  improvement across the jump threading
opportunities in the compiler is 6.70%.

However, these are pessimistic numbers...

I have noticed that some of the threading opportunities that DOM and VRP
now get are not because they're smarter, but because they're picking up
opportunities that the new code exposes.  I experimented with running an
iterative threader, and then seeing what VRP and DOM could actually get.
   This is too expensive to do in real life, but it at least shows what
the effect of the new code is on DOM/VRP's abilities:

Iterative threader:
  ethread:65043(+3.06%)
  dom:31170(-16.7%)
  thread:86717(+127%)
  vrp:33851(-42.2%)
Total threaded:  216781 (+9.90%)

This means that the new code not only gets 127% more cases, but it
reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
respectively).   The end result is that we have the possibility of
getting almost 10% more jump threading opportunities in the entire
compilation run.


Yeah, DOM once was iterating ...

You probably have noticed that we have very many (way too many)
'thread' passes, often in close succession with each other or
DOM or VRP.  So in the above numbers I wonder if you can break
down the numbers individually for the actual passes (in their order)?


As promised.

*** LEGACY:
ethread42:61152 30.1369% (61152 threads for 30.1% of total)
thread117:29646 14.6101%
vrp118:62088 30.5982%
thread132:2232 1.09997%
dom133:31116 15.3346%
thread197:1950 0.960998%
dom198:10661 5.25395%
thread200:587 0.289285%
vrp201:3482 1.716%
Total:  202914

The above is from current trunk with my patches applied, defaulting to 
legacy mode.  It follows the pass number nomenclature in the 
*.statistics files.


New threader code (This is what I envision current trunk to look with my 
patchset):


*** RANGER:
ethread42:64389 30.2242%
thread117:49449 23.2114%
vrp118:46118 21.6478%
thread132:8153 3.82702%
dom133:27168 12.7527%
thread197:5542 2.60141%
dom198:8191 3.84485%
thread200:1038 0.487237%
vrp201:2990 1.40351%
Total:  213038

New threader code running in iterative mode (illustrative purposes only):

*** RANGER: iterative mode
ethread42:64389 28.895%
thread117:76013 34.1113%
vrp118:31823 14.2808%
thread132:7165 3.21534%
dom133:25881 11.6143%
thread197:6226 2.79396%
dom198:7536 3.38183%
thread200:971 0.435743%
vrp201:2834 1.27178%
Total:  222838

The above run in iterative mode is only used to show what the new code 
fails to get since dom* and vrp* are still threading some things despite 
repeated runs of the new code.


As I've mentioned, I haven't dug deep into what VRP* and DOM* are 
threading over the new threader code.  Last time I checked it was some 
combination of:


Re: Aldy Hernandez and Andrew MacLeod as VRP maintainers

2021-06-17 Thread Aldy Hernandez via Gcc



On 6/17/21 12:18 AM, Jeff Law wrote:
I am pleased to announce that the GCC Steering Committee has appointed 
Aldy Hernandez and Ian MacLeod as maintainers for the VRP subsystem 
(EVRP, VRP, Ranger).


I don't know who this Ian is, but I'm sure we'll get along splendidly :).



Aldy/Andrew, please update the MAINTAINERS file appropriately.


Done.

Thanks.
Aldy
From 9f12bd79c0bd162cdbfab528f2e8dac43fb53d68 Mon Sep 17 00:00:00 2001
From: Aldy Hernandez 
Date: Thu, 17 Jun 2021 09:49:43 +0200
Subject: [PATCH] Add amacleod and aldyh as *vrp and ranger maintainers.

ChangeLog:

	* MAINTAINERS (Various Maintainers): Add Andrew and myself
	as *vrp and ranger maintainers.
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index a0e6c718ea5..e04477831eb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -224,6 +224,8 @@ option handling		Joseph Myers		
 middle-end		Jeff Law		
 middle-end		Ian Lance Taylor	
 middle-end		Richard Biener		
+*vrp, ranger		Aldy Hernandez		
+*vrp, ranger		Andrew MacLeod		
 tree-ssa		Andrew MacLeod		
 tree browser/unparser	Sebastian Pop		
 scev, data dependence	Sebastian Pop		
-- 
2.31.1



Re: replacing the backwards threader and more

2021-06-14 Thread Aldy Hernandez via Gcc




On 6/15/21 6:03 AM, Jeff Law wrote:



On 6/14/2021 12:40 AM, Richard Biener wrote:



I bet it's going to be tougher to remove DOM's threader.  It knows how
to do thinks like resolve memory references using temporary equivalences
and such.  But I bet it's enough to drop the VRP based threaders.

Yes.  In fact I am wondering if adding threading to the not iterating FRE
would make it possible to drop DOM, replacing it with instances of FRE.

I'd think so.  I'd approach as:

1. Remove the VRP threader instances.
2. Drop cprop and redundancy elimination from DOM using FRE instead
3. What's left of DOM is just forward jump threading.  Revamp & 
simplify.  Then either make it a distinct pass or as a sub-pass of FRE.


But one could just as easily look at adding threading to FRE and just 
killing DOM and its jump threading.


Andrew will hopefully be contributing the relational work this week. 
Once he does so, I will gather more granular stats for VRP1/VRP2 so we 
can assess the situation.


Also, a bunch of tests needed to be tweaked because the new code picks 
up so many threads.  I'm waiting for the relational work to avoid having 
to adjust the tests again.


I would personally prefer to add an item 0 to the above list: replace 
current backwards threader code with the rewrite.  I would hate to 
replace all threaders at once and deal with the fallout.  Perhaps it's 
better to replace the backward threaders, and once that settles move 
on to VRP[12]?


Aldy



[ranger-tech] How to use ranges within any pass with get_range_query()

2021-06-10 Thread Aldy Hernandez via Gcc
As part of the ranger development we have found that there are various 
pieces of the infrastructure that can be used independently, and we’d 
like to document them in the hopes that they are useful to others.  We 
will be contributing short posts documenting parts of the ranger, which 
we hope can become part of a more persistent documentation (wiki? 
Internal docs?).


This first post is an introduction to accessing ranges throughout the 
compiler that will serve as a basis for the upcoming posts.


Please let us know if anything is not clear, or if you'd like us to 
expand on any particular topic...


Aldy & Andrew

The main goal of ranger is to provide a generic infrastructure for 
accessing ranges (global or on-demand) from anywhere in the compiler, 
with one common API.  This API subsumes the global ranges we accessed 
through SSA_NAME_RANGE_INFO, as well as on-demand ones with the ranger 
proper.


The base class for querying ranges in the compiler is range_query and is 
defined in value-query.h.  Instead of duplicating it here, I will list 
the more typical access points:


bool range_of_expr (irange &r, tree expr, gimple *stmt = NULL);

Returns the range of a tree expression as a range in R, with an optional 
gimple statement providing context.  The tree expression EXPR can be 
anything from an SSA name or constant to a full complex expression such as 
a binary or unary tree.


Upon return R will contain the range of EXPR as it would appear on entry 
to STMT.  If STMT is not specified, then it will be the global range of 
EXPR.  This function always returns true unless the type of EXPR is 
unsupported (we currently support integers and pointers).
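
For example, EXPR need not be an SSA name.  Given SSA names X and Y,
the range of their sum at STMT can be queried directly (a sketch; Q is
any range_query, as described below):

tree sum = build2 (PLUS_EXPR, TREE_TYPE (x), x, y);
int_range_max r;
q->range_of_expr (r, sum, stmt);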


bool range_on_edge (irange &r, edge, tree expr);

Like range_of_expr, but instead of returning the range as it appears on 
entry to a statement, this function returns the range as it would appear 
on an edge.


bool range_of_stmt (irange &r, gimple *, tree name = NULL);

Returns the range of a gimple statement in R.  NAME is an optional 
SSA_NAME on the LHS of the statement.  It is only required if there is 
more than one LHS/output. This query will trigger requests for the range 
of each operand on the statement and will use those to calculate the 
resulting range. (ie, range_of_expr() will be called for each operand 
using this statement as the context)


The above is the core API for anything range related.  It can be used 
with any range_query object; the ranger is one, and even the legacy
vr_values is one.


Every struct function in the compiler has a range_query object 
associated with it.  By default, it is configured such that it returns 
global ranges.  That is, ranges that were globally assigned by previous 
passes, such as evrp or VRP.  These global ranges are what 
SSA_NAME_RANGE_INFO and SSA_NAME_PTR_INFO used to be, but accessible 
with one common API, and flexible in that they can return global ranges 
for SSA names, constants, and even expressions.


To get the range_query object for a given function, use the following 
with the above API:


range_query *get_range_query (struct function *);

The first step in using ranges in a pass is to structure all queries 
with the range API on an object returned by get_range_query ().  That’s 
it.  Your pass will be able to access ranges, albeit initially with 
ranges that apply to the entire function (global ranges).
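
Concretely, a pass that has a statement STMT, an edge E, and an SSA
name NAME in hand could exercise all three entry points like this (a
sketch; use () is a placeholder for whatever the pass does with the
result):

int_range_max r;
range_query *q = get_range_query (cfun);

// Range of NAME as it appears on entry to STMT.
if (q->range_of_expr (r, name, stmt))
  use (r);

// Range of NAME as it appears on edge E.
if (q->range_on_edge (r, e, name))
  use (r);

// Range of the value computed by STMT itself.
if (q->range_of_stmt (r, stmt))
  use (r);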


If your pass can benefit from context-aware ranges (on statements or 
edges) or on-demand ranges (more up to date than global ones), you must 
enable a ranger for your pass.  This can be done by calling the 
following on entry to the pass:


gimple_ranger *enable_ranger (struct function *fun);

And a corresponding call on exit from the pass:

void disable_ranger (struct function *fun);

No other changes are needed in your pass if you’re already using the 
range_query API.  You may continue using get_range_query(fun) since it 
will return the current active range_query object (the enabled ranger in 
this case).


You may notice that enable_ranger() returns a gimple_ranger object 
(which is a derived class of range_query).  This can be used for more 
advanced operations on ranges (see gimple-range.h), but more 
importantly, it can be used to export any ranges found throughout the 
execution of your pass to the global space.  If during range queries 
done in your pass, the ranger discovers any globally applicable ranges, 
they can be exported for use in subsequent passes by calling the 
export_global_ranges() method from a gimple_ranger object:


your_pass ()
{
  gimple_ranger *ranger = enable_ranger (cfun);

  do_stuff ();
  get_range_query (cfun)->range_of_expr (...);
  get_range_query (cfun)->range_on_edge (...);
  do_stuff ();

  // Export any known ranges to the global space.
  ranger->export_global_ranges ();

  disable_ranger (cfun);
}

This means that on exit from the pass, a get_range_query()->range* can 
be used to access globally applicable ranges that were found during 
your_pass ().

Re: replacing the backwards threader and more

2021-06-09 Thread Aldy Hernandez via Gcc




On 6/9/21 9:47 PM, Jeff Law wrote:



On 6/9/2021 9:34 AM, Aldy Hernandez wrote:



On 6/9/21 2:09 PM, Richard Biener wrote:
On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc 
 wrote:


Hi Jeff.  Hi folks.

What started as a foray into severing the old (forward) threader's
dependency on evrp, turned into a rewrite of the backwards threader
code.  I'd like to discuss the possibility of replacing the current
backwards threader with a new one that gets far more threads and can
potentially subsume all threaders in the future.

I won't include code here, as it will just detract from the high level
discussion.  But if it helps, I could post what I have, which just 
needs

some cleanups and porting to the latest trunk changes Andrew has made.

Currently the backwards threader works by traversing DEF chains through
PHIs leading to possible paths that start in a constant.  When such a
path is found, it is checked to see if it is profitable, and if so, the
constant path is threaded.  The current implementation is rather 
limited

since backwards paths must end in a constant.  For example, the
backwards threader can't get any of the tests in
gcc.dg/tree-ssa/ssa-thread-14.c:

    if (a && b)
  foo ();
    if (!b && c)
  bar ();

etc.

After my refactoring patches to the threading code, it is now possible
to drop in an alternate implementation that shares the profitability
code (is this path profitable?), the jump registry, and the actual jump
threading code.  I have leveraged this to write a ranger-based threader
that gets every single thread the current code gets, plus 90-130% more.

Here are the details from the branch, which should be very similar to
trunk.  I'm presenting the branch numbers because they contain Andrew's
upcoming relational query which significantly juices up the results.

New threader:
   ethread:65043    (+3.06%)
   dom:32450  (-13.3%)
   backwards threader:72482   (+89.6%)
   vrp:40532  (-30.7%)
    Total threaded:  210507 (+6.70%)

This means that the new code gets 89.6% more jump threading
opportunities than the code I want to replace.  In doing so, it reduces
the amount of DOM threading opportunities by 13.3% and by 30.7% from 
the

VRP jump threader.  The total  improvement across the jump threading
opportunities in the compiler is 6.70%.

However, these are pessimistic numbers...

I have noticed that some of the threading opportunities that DOM and 
VRP

now get are not because they're smarter, but because they're picking up
opportunities that the new code exposes.  I experimented with 
running an
iterative threader, and then seeing what VRP and DOM could actually 
get.

   This is too expensive to do in real life, but it at least shows what
the effect of the new code is on DOM/VRP's abilities:

    Iterative threader:
  ethread:65043    (+3.06%)
  dom:31170    (-16.7%)
  thread:86717    (+127%)
  vrp:33851    (-42.2%)
    Total threaded:  216781 (+9.90%)

This means that the new code not only gets 127% more cases, but it
reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
respectively).   The end result is that we have the possibility of
getting almost 10% more jump threading opportunities in the entire
compilation run.


Yeah, DOM once was iterating ...

You probably have noticed that we have very many (way too many)
'thread' passes, often in close succession with each other or
DOM or VRP.  So in the above numbers I wonder if you can break
down the numbers individually for the actual passes (in their order)?


Sure, I can do that.  Let me whip up the old branch and gather some info.

I'll save you some time.  The jump threaders in VRP are doing the least 
amount of lifting and are the ones we want to kill first.  IIRC the one from 
vrp2 is doing nearly nothing at this point.


Sure, that was going to be my next target.

What are your thoughts on replacing the current backwards threader, 
though?  That's basically ready to go.


Aldy



Re: replacing the backwards threader and more

2021-06-09 Thread Aldy Hernandez via Gcc




On 6/9/21 2:09 PM, Richard Biener wrote:

On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc  wrote:


Hi Jeff.  Hi folks.

What started as a foray into severing the old (forward) threader's
dependency on evrp, turned into a rewrite of the backwards threader
code.  I'd like to discuss the possibility of replacing the current
backwards threader with a new one that gets far more threads and can
potentially subsume all threaders in the future.

I won't include code here, as it will just detract from the high level
discussion.  But if it helps, I could post what I have, which just needs
some cleanups and porting to the latest trunk changes Andrew has made.

Currently the backwards threader works by traversing DEF chains through
PHIs leading to possible paths that start in a constant.  When such a
path is found, it is checked to see if it is profitable, and if so, the
constant path is threaded.  The current implementation is rather limited
since backwards paths must end in a constant.  For example, the
backwards threader can't get any of the tests in
gcc.dg/tree-ssa/ssa-thread-14.c:

if (a && b)
  foo ();
if (!b && c)
  bar ();

etc.

After my refactoring patches to the threading code, it is now possible
to drop in an alternate implementation that shares the profitability
code (is this path profitable?), the jump registry, and the actual jump
threading code.  I have leveraged this to write a ranger-based threader
that gets every single thread the current code gets, plus 90-130% more.

Here are the details from the branch, which should be very similar to
trunk.  I'm presenting the branch numbers because they contain Andrew's
upcoming relational query which significantly juices up the results.

New threader:
   ethread:65043(+3.06%)
   dom:32450  (-13.3%)
   backwards threader:72482   (+89.6%)
   vrp:40532  (-30.7%)
Total threaded:  210507 (+6.70%)

This means that the new code gets 89.6% more jump threading
opportunities than the code I want to replace.  In doing so, it reduces
the amount of DOM threading opportunities by 13.3% and by 30.7% from the
VRP jump threader.  The total  improvement across the jump threading
opportunities in the compiler is 6.70%.

However, these are pessimistic numbers...

I have noticed that some of the threading opportunities that DOM and VRP
now get are not because they're smarter, but because they're picking up
opportunities that the new code exposes.  I experimented with running an
iterative threader, and then seeing what VRP and DOM could actually get.
   This is too expensive to do in real life, but it at least shows what
the effect of the new code is on DOM/VRP's abilities:

Iterative threader:
  ethread:65043(+3.06%)
  dom:31170(-16.7%)
  thread:86717(+127%)
  vrp:33851(-42.2%)
Total threaded:  216781 (+9.90%)

This means that the new code not only gets 127% more cases, but it
reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
respectively).   The end result is that we have the possibility of
getting almost 10% more jump threading opportunities in the entire
compilation run.


Yeah, DOM once was iterating ...

You probably have noticed that we have very many (way too many)
'thread' passes, often in close succession with each other or
DOM or VRP.  So in the above numbers I wonder if you can break
down the numbers individually for the actual passes (in their order)?


Sure, I can do that.  Let me whip up the old branch and gather some info.




(Note that the new code gets even more opportunities, but I'm only
reporting the profitable ones that made it all the way through to the
threader backend, and actually eliminated a branch.)

The overall compilation hit from this work is currently 1.38% as
measured by callgrind.  We should be able to reduce this a bit, plus we
could get some of that back if we can replace the DOM and VRP threaders
(future work).

My proposed implementation should be able to get any threading
opportunity, and will get more as range-ops and ranger improve.

I can go into the details if necessary, but the gist of it is that we
leverage the import facility in the ranger to only look up paths that
have a direct repercussion in the conditional being threaded, thus
reducing the search space.  This enhanced path discovery, plus an engine
to resolve conditionals based on knowledge from a CFG path, is all that
is needed to register new paths.  There is no limit to how far back we
look, though in practice, we stop looking once a path is too expensive
to continue the search in a given direction.

The solver API is simple:

// This class is a thread path solver.  Given a set of BBs indicating
// a path through the CFG, range_in_path() will return the range
// of an SSA as if the BBs in the path would have been executed in
// order.
//
// Note that the blocks are in reverse order, thus the exit block is path[0].

replacing the backwards threader and more

2021-06-09 Thread Aldy Hernandez via Gcc

Hi Jeff.  Hi folks.

What started as a foray into severing the old (forward) threader's 
dependency on evrp, turned into a rewrite of the backwards threader 
code.  I'd like to discuss the possibility of replacing the current 
backwards threader with a new one that gets far more threads and can 
potentially subsume all threaders in the future.


I won't include code here, as it will just detract from the high level 
discussion.  But if it helps, I could post what I have, which just needs 
some cleanups and porting to the latest trunk changes Andrew has made.


Currently the backwards threader works by traversing DEF chains through 
PHIs leading to possible paths that start in a constant.  When such a 
path is found, it is checked to see if it is profitable, and if so, the 
constant path is threaded.  The current implementation is rather limited 
since backwards paths must end in a constant.  For example, the 
backwards threader can't get any of the tests in 
gcc.dg/tree-ssa/ssa-thread-14.c:


  if (a && b)
foo ();
  if (!b && c)
bar ();

etc.
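
For contrast, here is a sketch of the kind of path the current
implementation does get, since following the DEF chain of b backwards
through its PHI bottoms out in a constant:

int g (void);

int
f (int a)
{
  int b;
  if (a)
    b = 0;      /* constant definition feeding the PHI for b ...  */
  else
    b = g ();
  if (b == 0)   /* ... so on the a != 0 path this is known true.  */
    return 1;
  return 2;
}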

After my refactoring patches to the threading code, it is now possible 
to drop in an alternate implementation that shares the profitability 
code (is this path profitable?), the jump registry, and the actual jump 
threading code.  I have leveraged this to write a ranger-based threader 
that gets every single thread the current code gets, plus 90-130% more.


Here are the details from the branch, which should be very similar to 
trunk.  I'm presenting the branch numbers because they contain Andrew's 
upcoming relational query which significantly juices up the results.


New threader:
 ethread:65043(+3.06%)
 dom:32450  (-13.3%)
 backwards threader:72482   (+89.6%)
 vrp:40532  (-30.7%)
  Total threaded:  210507 (+6.70%)

This means that the new code gets 89.6% more jump threading 
opportunities than the code I want to replace.  In doing so, it reduces 
the amount of DOM threading opportunities by 13.3% and by 30.7% from the 
VRP jump threader.  The total  improvement across the jump threading 
opportunities in the compiler is 6.70%.


However, these are pessimistic numbers...

I have noticed that some of the threading opportunities that DOM and VRP 
now get are not because they're smarter, but because they're picking up 
opportunities that the new code exposes.  I experimented with running an 
iterative threader, and then seeing what VRP and DOM could actually get. 
 This is too expensive to do in real life, but it at least shows what 
the effect of the new code is on DOM/VRP's abilities:


  Iterative threader:
ethread:65043(+3.06%)
dom:31170(-16.7%)
thread:86717(+127%)
vrp:33851(-42.2%)
  Total threaded:  216781 (+9.90%)

This means that the new code not only gets 127% more cases, but it 
reduces the DOM and VRP opportunities considerably (16.7% and 42.2% 
respectively).   The end result is that we have the possibility of 
getting almost 10% more jump threading opportunities in the entire 
compilation run.


(Note that the new code gets even more opportunities, but I'm only 
reporting the profitable ones that made it all the way through to the 
threader backend, and actually eliminated a branch.)


The overall compilation hit from this work is currently 1.38% as 
measured by callgrind.  We should be able to reduce this a bit, plus we 
could get some of that back if we can replace the DOM and VRP threaders 
(future work).


My proposed implementation should be able to get any threading 
opportunity, and will get more as range-ops and ranger improve.


I can go into the details if necessary, but the gist of it is that we 
leverage the import facility in the ranger to only look up paths that 
have a direct repercussion in the conditional being threaded, thus 
reducing the search space.  This enhanced path discovery, plus an engine 
to resolve conditionals based on knowledge from a CFG path, is all that 
is needed to register new paths.  There is no limit to how far back we 
look, though in practice, we stop looking once a path is too expensive 
to continue the search in a given direction.


The solver API is simple:

// This class is a thread path solver.  Given a set of BBs indicating
// a path through the CFG, range_in_path() will return the range
// of an SSA as if the BBs in the path would have been executed in
// order.
//
// Note that the blocks are in reverse order, thus the exit block is 
path[0].


class thread_solver : gori_compute
{

public:
  thread_solver (gimple_ranger &ranger);
  virtual ~thread_solver ();
  void set_path (const vec<basic_block> *, const bitmap_head *imports);
  void range_in_path (irange &, tree name);
  void range_in_path (irange &, gimple *);
...
};

Basically, as we're discovering paths, we ask the solver what the value 
of the final conditional in a BB is in a given path.  If it resolves, we 
register the path.
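
A sketch of how the discovery code might drive this (PATH and IMPORTS
come from the path search; register_path () is a stand-in name for the
actual registry call):

thread_solver solver (*ranger);
solver.set_path (&path, imports);

int_range_max r;
solver.range_in_path (r, final_conditional);  // the gimple * overload

tree val;
if (r.singleton_p (&val))  // the conditional resolves on this path
  register_path (path);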


A follow-up project would be to analyze wh

Re: irange best practices document

2020-09-10 Thread Aldy Hernandez via Gcc

On 9/9/20 7:03 PM, Martin Sebor wrote:

On 9/3/20 1:14 AM, Aldy Hernandez via Gcc wrote:
Below is a document we have drafted to provide guidance on using 
irange's and converting passes to it.  This will future-proof any such 
passes so that they will work with the ranger, or any other mechanism 
using multiple sub-ranges (as opposed to the more limited value_range).


The official document will live here, but is included below for 
discussion:


 https://gcc.gnu.org/wiki/irange-best-practices

Feel free to respond to this post with any questions or comments.


Thanks for the writeup!

The biggest question on my mind at the moment is probably about
how to go about rewriting classes that maintain their own ranges
(most often as an array of two offset_int) to use one of the Ranger
classes.  Two examples are the access_ref class in builtins.h and
the builtin_memref class in gimple-ssa-warn-restrict.c.

Both of these compute the size of an object (as a simple range)
and the offset into it (also as a range).  In the future, they will
track of sizes of multiple objects (from PHI nodes).

My thinking is to do this in two steps: In 1) replace each
offset_int[2] member array with an instance of int_range<1> and
then rewrite all the code that manipulates the array with calls
to the ranger API. That should be fairly straightforward. In 2)


This sounds like a sound approach.  Instead of passing pairs of 
integers, pass an irange and limit yourself to the non-deprecated part 
of the API.
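
Step 1) could then look something like this (field names are
illustrative, not the real access_ref layout):

struct access_ref
{
  // was: offset_int offrng[2]; offset_int sizrng[2];
  int_range<1> offrng;   // range of offsets into the object
  int_range<1> sizrng;   // range of object sizes
};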



replace the simple int_range<1> with something more interesting
(int_range_max?) and rewrite the final code that processes


The ranger will use whatever sized range you pass it.  So if you want 
fine granularity, pass it an int_range_max.


However, if your final destination in memory is constrained (you're 
storing a gazillion of these), you can use the irange_pool to allocate 
the minimum amount of storage needed.  Note, that the irange_pool has 
not been contributed yet.  It'll come with the ranger.


Taking the strlen code as an example, you could transform the following:

  vr_values *v = CONST_CAST (vr_values *, rvals);
  const value_range_equiv *vr = v->get_value_range (si->nonzero_chars);
  if (vr->kind () != VR_RANGE || !range_int_cst_p (vr))
return false;

...into:

int_range_max r;
if (!ranger->range_of_expr (r, si->nonzero_chars, si->stmt))
  return false;

Where ranger is declared somewhere at the top of your pass.  Perhaps in 
printf_strlen_execute() and pass it around like you do rvals.
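
If you then need the endpoints the old code read via
vr->min ()/vr->max (), something like the following works (a sketch;
I'm assuming the irange lower_bound ()/upper_bound () accessors):

int_range_max r;
if (ranger->range_of_expr (r, si->nonzero_chars, si->stmt)
    && !r.undefined_p ()
    && !r.varying_p ())
  {
    wide_int lo = r.lower_bound ();  // smallest value in any sub-range
    wide_int hi = r.upper_bound ();  // largest value in any sub-range
    /* ... use LO and HI as the old min/max pair ...  */
  }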


Note that for better precision, you should pass the gimple context. 
That is, the statement from which si->nonzero_chars came from.  In the 
above case, I think you want si->stmt.  Currently get_value_range() has 
a STMT parameter which is unused.  This was the reason for providing 
such parameter.


Finally... you can store R into an int_range depending on how much 
space you want to use in your ultimate storage location.  The assignment 
operator will squish R into whatever space you give it.  So the 
following will truncate things appropriately:


int_range<3> final_location = r;

However, if you're using irange_pool to allocate just the amount of 
space needed, you could do:


// Presumably declared wherever you declared the ranger.
irange_pool pool;
...
...
irange *p = pool.allocate (r);

p will hold a pointer to a range that contains R with no extra space. 
This is ideal for longer term storage.  For intermediate results, use 
int_range_max.


Note that I am already converting sprintf/strlen to use the ranger 
instead of evrp/vr_values (aldyh/ranger-staging git branch).  So the 
work of converting to range_of_expr above, and passing around gimple 
context, will already be done with the contribution of ranger.  What I 
will not provide for now, is the wholesale conversion of other passes to 
the "clean" irange API.  So if your pass is still using kind() and 
stashing away pairs of offset_int's, that will have to be done 
separately.  I can offer guidance though, and help as time permits.



the results to scan all the subranges for overflow or overlap,
as well as the code that presents them in warnings/notes.  It
would be nice to have support for the formatting of ranges in
the pretty-printer to cut down on the repetitive tests that
determine how to format a constant (%E, vs a range [%E, %E],
vs a multi-range [%E, %E][%E, %E], ...[%E, %E]).


I'm not a big fan of including ranges in warnings/errors.  The ranger 
and the vrp twins are bound to change over releases with finer and 
finer ranges.  Having our error messages depend on ranges that may or 
may not change can be confusing for users, and it means we must change 
regression tests to match a changing compiler at each release.


I would much rather see a generic warning of "P may be out of bounds" 
than "P may be [X,Y][Z,A][C,D]"

irange best practices document

2020-09-03 Thread Aldy Hernandez via Gcc
Below is a document we have drafted to provide guidance on using 
irange's and converting passes to it.  This will future-proof any such 
passes so that they will work with the ranger, or any other mechanism 
using multiple sub-ranges (as opposed to the more limited value_range).


The official document will live here, but is included below for discussion:

https://gcc.gnu.org/wiki/irange-best-practices

Feel free to respond to this post with any questions or comments.

Aldy & Andrew


INTRODUCTION


irange is a class for storing and manipulating multi-ranges of
integers.  It is meant to be a replacement for value_range's, and can
currently inter-operate seamlessly with them.

Original value_range's can contain a range of integers, say from 10 to
20 inclusive, represented as [10, 20].  It also has a way of
representing an anti-range, which is the inverse of a range.  For
example, everything except 10 to 20 is represented as an anti-range of
~[10, 20].  This is really shorthand for the union of
[MIN_INT, 9] U [21, MAX_INT].

The value_range representation has the limitation that higher
granularity is not representable without losing precision.  For
example, you cannot specify the range of [10, 15] U [20, 30]; instead
it is represented ambiguously with a range of [10, 30].

On the other hand, multi-ranges with the irange class can represent
an arbitrary number of sub-ranges.  More formally, multi-ranges have
0 or more non-intersecting sub-ranges with integral bounds.  For
example, you can specify a range containing the numbers of [10, 15]
and [20, 30] with an irange of [10, 15][20, 30].  With irange, instead
of using an anti-range for ~[10, 20], the underlying set of numbers can
be represented accurately with [MIN_INT, 9][21, MAX_INT].
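
For instance, using only the constructors shown in the examples
further down, ~[10, 20] for int can be spelled out as its two
sub-ranges:

tree t = integer_type_node;
int_range<2> r (TYPE_MIN_VALUE (t), build_int_cst (t, 9));
int_range<2> top (build_int_cst (t, 21), TYPE_MAX_VALUE (t));
r.union_ (top);
// r now contains [MIN_INT, 9][21, MAX_INT].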

Multi-ranges are not limited to 1 or 2 sub-ranges.  Instead, you can
specify any number of sub-ranges.  For example:

 int_range<5> five_pairs;
 int_range<2> two_pairs;
 int_range<1> legacy_value_range;
 widest_irange huge_irange;  // currently 255 sub-ranges.

The special case of int_range<1> provides legacy support for 
value_range.  Currently, value_range is just a typedef for int_range<1>.


The specially named widest_irange is used for "unlimited" sub-ranges,
and is meant to be used when calculating intermediate results.
Currently the limit is a large number (255), but could be changed without
prior notice.  Note that "widest" does not have anything to do with
the range of the underlying integers, but the maximum amount of
sub-range pairs available for calculation.

Here are some examples of calculations with different sub-range
granularity:

// Assume:
tree n10 = build_int_cst (integer_type_node, 10);
tree n20 = build_int_cst (integer_type_node, 20);
tree n30 = build_int_cst (integer_type_node, 30);
tree n40 = build_int_cst (integer_type_node, 40);
tree n50 = build_int_cst (integer_type_node, 50);
tree n60 = build_int_cst (integer_type_node, 60);

int_range<1> (aka value_range):

int_range<1> r1 (n10, n20);
int_range<1> r2 (n30, n40);
r1.union_ (r2);
r1.dump ();

// Outputs: [10, 40]

// Note: Since result doesn't fit in an int_range<1>, it is
// conservatively truncated.

int_range<2>:

int_range<2> r1 (n10, n20);
int_range<2> r2 (n30, n40);
r1.union_ (r2);
r1.dump ();

// Outputs: [10, 20][30, 40]

int_range<2> r3 (n50, n60);
r1.union_ (r3);
r1.dump ();

// Outputs: [10, 20][30, 60]

// Note: Unioning [10, 20][30, 40] and [50, 60] doesn't fit
// into an int_range<2>, so it is truncated.

int_range<1> small;
small = r1;  // r1 is [10,20][30,60] as above.
small.dump ();

// Outputs: [10, 60]

// Note: Copying a larger range into a smaller range will
// result in truncating the result.

widest_irange:

widest_irange r;
for (int i = 0; i <= 40; i += 10) {
  tree low = build_int_cst (integer_type_node, i);
  tree high = build_int_cst (integer_type_node, i + 1);
  int_range<2> tmp (low, high);
  r.union_ (tmp);
}
r.dump ();

// Outputs: [0, 1][10, 11][20, 21][30, 31][40, 41]

WRITE YOUR CODE FOR INFINITE SUB-RANGES
---

Your code should be agnostic to the sub-ranges there-in.  For example,
if you wanted to check if the numbers 5 through 10 are in a range R,
avoid looking at min/max/kind and having to special case VR_RANGE,
VR_ANTI_RANGE, and VR_VARYING.  Instead do something like this:

widest_irange my_range (build_int_cst (integer_type_node, 5),
                        build_int_cst (integer_type_node, 10));
my_range.intersect (R);
if (R == my_range)
  return goodness;

If, on the other hand, you want to check if there is ANY o