Hi Jeff. Hi folks.
What started as a foray into severing the old (forward) threader's
dependency on evrp has turned into a rewrite of the backwards threader
code. I'd like to discuss the possibility of replacing the current
backwards threader with a new one that gets far more threads and can
potentially subsume all threaders in the future.
I won't include code here, as it will just detract from the high level
discussion. But if it helps, I could post what I have, which just needs
some cleanups and porting to the latest trunk changes Andrew has made.
Currently the backwards threader works by traversing DEF chains through
PHIs leading to possible paths that start in a constant. When such a
path is found, it is checked to see if it is profitable, and if so, the
constant path is threaded. The current implementation is rather limited
since backwards paths must end in a constant. For example, the
backwards threader can't get any of the tests in
gcc.dg/tree-ssa/ssa-thread-14.c:
  if (a && b)
    foo ();

  if (!b && c)
    bar ();

  etc.
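For contrast, here is the kind of shape the current implementation does
handle. This is a hypothetical reduction (not from the testsuite): the
conditional's DEF chain reaches a constant through a PHI, so the path
coming from the constant arm can be threaded:

extern void foo (void);

void
f (int n)
{
  int x;

  if (n > 10)
    x = 0;	/* Constant feeding the PHI at the join point...  */
  else
    x = n;

  if (x == 0)	/* ...so on the path through the THEN arm above, this
		   conditional is known to be true and the branch can
		   be threaded.  */
    foo ();
}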
After my refactoring patches to the threading code, it is now possible
to drop in an alternate implementation that shares the profitability
code (is this path profitable?), the jump registry, and the actual jump
threading code. I have leveraged this to write a ranger-based threader
that gets every single thread the current code gets, plus 90-130% more.
Here are the details from the branch, which should be very similar to
trunk. I'm presenting the branch numbers because they contain Andrew's
upcoming relational query which significantly juices up the results.
New threader:
  ethread:             65043  (+3.06%)
  dom:                 32450  (-13.3%)
  backwards threader:  72482  (+89.6%)
  vrp:                 40532  (-30.7%)
  Total threaded:     210507  (+6.70%)
This means that the new code gets 89.6% more jump threading
opportunities than the code I want to replace. In doing so, it reduces
the number of DOM threading opportunities by 13.3%, and the number of
VRP threading opportunities by 30.7%. The total improvement across the jump threading
opportunities in the compiler is 6.70%.
However, these are pessimistic numbers...
I have noticed that some of the threading opportunities that DOM and VRP
now get are there not because those passes are smarter, but because they're
picking up opportunities that the new code exposes. I experimented with running an
iterative threader, and then seeing what VRP and DOM could actually get.
This is too expensive to do in real life, but it at least shows what
the effect of the new code is on DOM/VRP's abilities:
Iterative threader:
  ethread:          65043  (+3.06%)
  dom:              31170  (-16.7%)
  thread:           86717  (+127%)
  vrp:              33851  (-42.2%)
  Total threaded:  216781  (+9.90%)
This means that the new code not only gets 127% more cases, but it
reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
respectively). The end result is that we have the possibility of
getting almost 10% more jump threading opportunities in the entire
compilation run.
(Note that the new code gets even more opportunities, but I'm only
reporting the profitable ones that made it all the way through to the
threader backend, and actually eliminated a branch.)
The overall compilation hit from this work is currently 1.38% as
measured by callgrind. We should be able to reduce this a bit, plus we
could get some of that back if we can replace the DOM and VRP threaders
(future work).
My proposed implementation should be able to get any threading
opportunity, and will get more as range-ops and ranger improve.
I can go into the details if necessary, but the gist of it is that we
leverage the ranger's import facility to only examine paths that have a
direct effect on the conditional being threaded, thus reducing the
search space. This enhanced path discovery, plus an engine to resolve
conditionals based on knowledge from a CFG path, is all that is needed
to register new paths. There is no limit to how far back we look,
though in practice we stop once continuing the search in a given
direction becomes too expensive.
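As a rough sketch of the idea (discover_paths and the surrounding
details are made up for illustration, not the actual patch), the path
discovery only keeps walking backwards through operands that are
imports of the exit conditional; anything else cannot change its
outcome, so that direction of the search is pruned immediately:

/* Hypothetical sketch: extend a path backwards through a PHI, but
   only recurse into arguments that are in the import set of the
   conditional we are trying to resolve.  */

static void
discover_paths (gphi *phi, const bitmap_head *imports,
		vec<basic_block> &path)
{
  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
    {
      tree arg = gimple_phi_arg_def (phi, i);

      /* Prune: an SSA name that is not an import of the exit
	 conditional cannot affect its outcome.  */
      if (TREE_CODE (arg) == SSA_NAME
	  && !bitmap_bit_p (imports, SSA_NAME_VERSION (arg)))
	continue;

      edge e = gimple_phi_arg_edge (phi, i);
      path.safe_push (e->src);
      /* ...recurse further back, or try to resolve the path with
	 the solver below...  */
      path.pop ();
    }
}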
The solver API is simple:
// This class is a thread path solver. Given a set of BBs indicating
// a path through the CFG, range_in_path() will return the range
// of an SSA as if the BBs in the path would have been executed in
// order.
//
// Note that the blocks are in reverse order, thus the exit block is
// path[0].
class thread_solver : gori_compute
{
public:
thread_solver (gimple_ranger &ranger);
virtual ~thread_solver ();
void set_path (const vec<basic_block> *, const bitmap_head *imports);
void range_in_path (irange &, tree name);
void range_in_path (irange &, gimple *);
...
};
Basically, as we're discovering paths, we ask the solver what the value
of a block's final conditional is for a given path. If it resolves to a
known value, we register the path.
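To make that concrete, here is a hedged sketch of how the solver could
be consulted for a candidate path (maybe_register_path is a placeholder
name for illustration, not the actual patch):

/* Hypothetical usage sketch.  PATH is in reverse order, so path[0] is
   the exit block whose final conditional we want to resolve.  */

static bool
maybe_register_path (thread_solver &solver,
		     const vec<basic_block> &path,
		     const bitmap_head *imports)
{
  basic_block exit_bb = path[0];
  gimple *cond = last_stmt (exit_bb);

  solver.set_path (&path, imports);

  int_range_max r;
  solver.range_in_path (r, cond);

  /* If the conditional resolves to a single value for this path, the
     branch is statically known.  */
  if (r.singleton_p ())
    {
      /* ...hand PATH off to the jump thread registry here...  */
      return true;
    }
  return false;
}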
A follow-up project would be to analyze what DOM/VRP are actually
getting that we don't, because in theory with an enhanced ranger, we
should be able to get everything they do (minus some float stuff, and
some CSE things DOM does). However, IMO, this is good enough to at
least replace the current backwards threading code.
My suggestion would be to keep both implementations, defaulting to the
ranger-based one, and to run the old code immediately afterwards, trapping
if the old code finds any threading opportunities the new one missed.
After a few weeks, we could
kill the old code.
Thoughts?
Aldy
p.s. BTW, "ranger-based" is technically a misnomer; it's GORI-based. We
don't need the ranger's entire caching ability here. I'm only using the
ranger to get the imports for the interesting conditionals, since those
are static.