In order to replace VRP, we need an alternative for the jump threader
embedded within it. Currently, this threader is a forward threader
client that uses ASSERT_EXPRs and the avail/const framework to resolve
statements along a path.
As I have mentioned in the past weeks, I am proposing a hybrid
replacement, where we still use the forward threader, but with the path
solver used by the backward threader. This provides an incremental path
to remove the forward threader with minimal disruption.
As a reminder, the reason we can't just use the backward threader for a
VRP-threader replacement right now is because the forward and backward
threaders have different cost models and block copiers that have not
been merged. I'd like to address these differences for the next
release. This, coupled with support for floats in ranger, would allow us
to nuke the forward threader entirely. But for now, a more gradual
approach is needed.
I have patches to the path solver that address all the shortcomings it
had in my initial rewrite. That is, the ability to use the ranger and
the relation oracle to fully resolve ranges and relations not known
within the path. With this enhanced solver, we are able to get
everything the forward threader can get (both VRP and DOM clients, with
the exception of floats for DOM). This path solver can then be used
either with the backward threader (disabled by default), or with with
the forward threader in the hybrid approach I propose for VRP.
I'd like to discuss how I tested this, what is missing, and alternatives
on going forward.
SUMMARY
=======
The hybrid threader gets 14.5% more jump threads than the legacy code,
but most of these threads are at the expense of other threading passes.
The more it gets, the less DOM and the backward threader get. That
being said, there is a total improvement of 0.87% jump threads in the
total compilation.
This comes with no runtime penalty. In our callgrind testing harness
derived from the .ii files from a bootstrap, I measured a 0.62%
performance drop, well within the noise of the tool.
However, there is a 1.5% performance penalty for splitting the VRP
threader from outside of VRP (for either hybrid or legacy). I would
prefer divorcing embedded jump threading passes from their carriers, but
if others disagree, we could piggy back on the ranger used in the
upcoming VRP replacement (evrp has a ranger we could share).
TESTING
=======
The presence of ASSERT_EXPRs made it difficult to do a clean comparison,
as ranger obviously doesn't use these. What I did was move the VRP
threader into its own pass (regenerating ASSSERT_EXPRs and its
environment), and then run my hybrid threader iteratively until no more
changes (excluding ping-pongs). Then I ran the legacy code, and
anything it could find, was something worth investigating.
BTW, the reason for the iterative approach, is because any threading
pass creates opportunities for subsequent threading passes.
MISSING CASES
=============
There were a few things we missed, which can be summarized in broad
categories:
a) A few missing range-op entries for things like RROTATE_EXPR. These
are trivial to address.
b) Some paths which the solver could clearly solve, but it never got the
chance, because of limitations in the path discovery code in the forward
threader.
I fixed a few of these (upcoming patches), but I mostly avoided fixing
the core forward threader code, as it would incur too much work for very
little return. Besides, my plan is to eventually nuke it.
One example are paths whose first block ends in a conditional, but whose
second is empty. These are usually empty loop pre-headers, which are
not empty in legacy mode because the ASSERT_EXPR mechanism has put
asserts in there. However, since the ranger uses the pristine IL, these
blocks are empty, which throws off the threader. In practice, this
didn't matter, because since the CFG had been cleaned up, these empty
blocks were predominantly loop pre-headers which the threader was trying
to thread through, crossing loop boundaries in the process. As has been
discussed earlier, we shouldn't be threading through this.
c) The path solver returns UNDEFINED for unreachable paths, so we avoid
threading them altogether. However, the legacy code is perfectly happy
to give answers and thread through the paths.
d) I saw an odd missed propagation/recalculation here and there, where
ranger returned varying, when it could've done better. This was very rare.
e) Finally, conditionals when either operand is a MEM_REF are not
handled. Interestingly, this was uncommon as most everything had been
simplified to an SSA.
As the numbers show, the above is probably noise. The only thing worth
addressing may be the MEM_REF business. If this becomes a problem, it
is precisely what we designed the pointer equivalence tracking we use in
evrp. It could be easily adapted.
PATH FORWARD
============
I have various patchsets to the path solver, the forward threader, and
VRP to make all this happen, but would like to get feedback on how to
proceed.
I suggest we replace the VRP threader before replacing VRP[12] with
evrp[23] to minimize the disruption. The hybrid threader can easily run
after current VRP[12] as it stands.
I lean towards putting this post-VRP threader in its own pass, instead
of keeping it embedded in VRP. After all, this was one of the original
motivations for the ranger work, to rip out all the sub-passes that had
accumulated inside VRP over the years. But I do understand if the 1.5%
penalty is hard to swallow. In which case, we could move the hybrid
threader to its own file, but sharing the ranger used by evrp[23] when
the time comes.
That being said, when the threaders' cost models are merged next year,
we could shuffle passes easier. There's no reason to thread before
*and* after VRP, when the backward threader should be able to handle
everything. We should be able to get back any lost time, and then some,
by reducing the amount of jump threading passes.
One more thing, would it be preferred that I include a
--param=vrp-threader=[legacy|hybrid] parameter along with the patch? I
have such thing in my tree, but last time around, the ability to run in
legacy was mostly a distraction that had to be removed immediately after.
I will be contributing/pushing various patches over the next few days
that are improvements orthogonal to this work (for example, improvements
to the forward threader and the path solver). But I'd like some
feedback on the VRP threader bits before I post those.
Thanks.
Aldy