Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-14 Thread Segher Boessenkool
On Mon, Sep 12, 2016 at 10:49:46AM -0600, Jeff Law wrote:
> On 09/09/2016 02:56 PM, Segher Boessenkool wrote:
> >On Fri, Sep 09, 2016 at 12:57:32PM -0600, Jeff Law wrote:
> >>I think the lack of test coverage is something we'll want to address.
> >
> >Building and running the compiler, the various target libraries, and the
> >testsuite is more than enough coverage for correctness in my opinion --
> >I cannot make up anything that is not already in there.  No doubt some
> >PR will show up later, and we can (and should) of course add that then.
> >
> >For checking whether it makes "smart" decisions, I'll make some really
> >simple testcases that will not fail for random reasons all the time.
> I'm moving more and more towards wanting tests for new features. 
> There's no reason why verification of basic behavior shouldn't be part 
> of the feature patch itself.

Yes, but as I said, this is used millions of times during bootstrap
already, and at least *I* cannot think of more complex structures that
would fail.  If any show up we can add them as regression tests.

> I suspect the biggest problem with testing something like separate 
> shrink wrapping is making sure the test is stable over time -- and 
> that's a legitimate concern.

Yes exactly.

> But it ought to be do-able in one form or another,

Yes, as I promised, I'll make up some trivial tests.

> particularly since we have control over the debug dumps. 

I'm not a big fan of using the debug dumps in the testsuite -- very
often tests break because you changed something in the dump.  Debug
dumps are for debug, not for testing.

If we get to the point where we generically can use some piece of RTL
as input and just run one pass on it, that would be nice.  But we're
not there yet, and the kind of "unit tests" where you just write all
code twice and essentially only test if the test is correct: no thanks.


Segher


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-12 Thread Jeff Law

On 09/09/2016 02:56 PM, Segher Boessenkool wrote:

On Fri, Sep 09, 2016 at 12:57:32PM -0600, Jeff Law wrote:

I think the lack of test coverage is something we'll want to address.


Building and running the compiler, the various target libraries, and the
testsuite is more than enough coverage for correctness in my opinion --
I cannot make up anything that is not already in there.  No doubt some
PR will show up later, and we can (and should) of course add that then.

For checking whether it makes "smart" decisions, I'll make some really
simple testcases that will not fail for random reasons all the time.
I'm moving more and more towards wanting tests for new features. 
There's no reason why verification of basic behavior shouldn't be part 
of the feature patch itself.


I suspect the biggest problem with testing something like separate 
shrink wrapping is making sure the test is stable over time -- and 
that's a legitimate concern.  But it ought to be do-able in one form or 
another, particularly since we have control over the debug dumps. 
Alternately building something on top of Bernd's work is an option too.


Building the compiler & libraries, regression testing, etc are all 
important as well.


jeff


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-12 Thread Bernd Schmidt

On 09/09/2016 10:56 PM, Segher Boessenkool wrote:

On Fri, Sep 09, 2016 at 12:57:32PM -0600, Jeff Law wrote:

I think the lack of test coverage is something we'll want to address.


Building and running the compiler, the various target libraries, and the
testsuite is more than enough coverage for correctness in my opinion --


I'd argue against that:
 - a lot of compiler bugs for passes like this show up only for unusual
   CFGs that you don't really get with normal input code.
 - separate shrink wrapping explicitly tries to modify infrequent
   paths, which may not even get exercised.

Also, unless I've missed it you've not provided any numbers about how 
often this triggers, let's say on a cc1 build.


For these reasons I thought having self-tests would be a good idea, but 
I was informed it's a waste of time, so I guess not.



Bernd


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-09 Thread Segher Boessenkool
On Fri, Sep 09, 2016 at 12:57:32PM -0600, Jeff Law wrote:
> I think the lack of test coverage is something we'll want to address.

Building and running the compiler, the various target libraries, and the
testsuite is more than enough coverage for correctness in my opinion --
I cannot make up anything that is not already in there.  No doubt some
PR will show up later, and we can (and should) of course add that then.

For checking whether it makes "smart" decisions, I'll make some really
simple testcases that will not fail for random reasons all the time.


Segher


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-09 Thread Segher Boessenkool
On Fri, Sep 09, 2016 at 12:19:03PM -0600, Jeff Law wrote:
> >>Does this impact the compile time computation complexity issue that was
> >>raised elsewhere?
> >
> >I'm not sure what you mean here either, sorry.  It is all O(NM) with N
> >the number of BBs and M the number of components (and assuming dom
> >lookups are constant time, as usual).
> You said in a different message that computing optimal placement points 
> for prologue/epilogue components was not computationally feasible.   I'm 
> just trying to figure out if the switch to utilizing block frequencies 
> is a part of what makes solving that problem infeasible.

Ah, I see.  Allowing multiple prologues (and epilogues) is what makes
it infeasible.  If there is just (at most) one prologue (per component),
calculating the optimal placement is of course linear in # BBs, given
that the cost function is O(1) (or sort of kind of; linear in # edges
really, if you have to insist :-) )

If you allow multiple prologues, i.e. allow any combination of blocks
to run with or without some component "active", you get an exponential
number of possible way to place things, and the cost for those combinations
is *not* an ordered function: if all predecessors of a block have some
component active, then this block itself does not need a prologue.

I get around that by making the cost function ordered, that is, possibly
overestimating the cost of nodes that are the dest of a cross-jump; in
the first version of the patch, by always using the execution frequency
of a node (so, not considering that a prologue there does not cost
anything for edges where the predecessor already has that component
active); and in the second version of the patch, that, but do subtract
the cost from backedges (which makes it clearer that loops are handled
correctly, because the loop header generally has lower cost than the
nodes in the loop body).


Segher


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-09 Thread Jeff Law

On 08/26/2016 07:03 AM, Bernd Schmidt wrote:

On 08/01/2016 03:42 AM, Segher Boessenkool wrote:

This is the second version.  Concern was renamed to component, and all
other comments were addressed (I hope).


Not really, I'm afraid. There still seems to be no detailed explanation
of the algorithm. Instead, there is a vague outline (which should be
expanded to at least the level of detail found e.g. in tree-ssa-pre.c),
and inside the code there are still meaningless comments of the form
I think Segher and I can work towards this. The placement algorithm is 
where I think the main focus will need to be.  Segher has some text for 
that, but we may need to supplement with a CFG or two to help clarify.




/* Deselect those epilogue components that should not be inserted
   for this edge.  */

which don't tell me anything about what the intention is (why should
they not be inserted?). The result is that as a reader, I still find
myself unable to determine whether the algorithm is correct or not.
Deselection is more about pruning components that we initially thought 
were separate shrink wrap candidates, but upon deeper inspection we can 
not handle.  And it's all target driven.


Deselection has to run after placement computation because deselection 
is a property of the desired insertion site (which is why this hook 
passes in the desired edge where we want to insert code).  Deselection 
runs prior to actual generation & insertion of the sequences. 
Deselection affects the ability to separately shrink wrap that component 
globally.


So it does affect correctness, but it's pretty easy to see that its safe.





Worse, I'm still not entirely certain what it is intended to achieve: I
asked for a motivating example or two, but haven't seen one in the
comments anywhere. My most general question would be why the algorithm
for placing individual prologue components would be different from the
regular shrink-wrapping algorithm for full prologues. Examples and/or
arguments relating to how this new code acts with regard to loops are
also particularly needed.
I think Segher and Michael covered this in the various follow-ups.  I'm 
now comfortable with the overall goals.


Right now we have a single copy of the prologue/epilogue.  That single 
copy might get shrink-wrapped, but at the end of the day, it's still a 
single copy that gets executed at most once.


If we relax the single copy and single execution traits, then we get 
more freedom to place the prologues/epilogues and less frequently 
executed points.


We get even more flexibility by handling components of the 
prologue/epilogue separately.


Some comments in these areas will be appropriate, but again, I'm 
comfortable with the overall direction this work has gone.





So, I think improvement is necessary in these points, but I also think
that there's room for experimental verification by way of self-tests. If
done thoroughly (testing the transformation on several sufficiently
random CFGs and maybe some manually constructed ones) it would go a long
way towards showing that at least we don't have to worry too much about
miscompilations. That's what I've been working on, and an initial patch
is below. This is incomplete and posted more as an RFD since we're
getting towards the end of the week: there are gaps in the test
coverage, and it currently fails the selftest. I assume the latter is a
problem with my code, but it wouldn't hurt if you could take a look;
maybe I misunderstood something entirely about what the separate
shrink-wrapping is supposed to achieve, or maybe I messed up the
algorithm with my changes.

I think the lack of test coverage is something we'll want to address.

jeff


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-09 Thread Jeff Law

On 09/09/2016 09:40 AM, Segher Boessenkool wrote:


So I think sticking with this as a design decision makes sense -- does
it impact the issue around running a particular component's prologue
more than once?


I don't follow, sorry; could you rephrase?
Nevermind -- my question has been resolved.  I was fumbling around 
trying to figure out the circumstances where a prologue/epilogue might 
be run more than once.  With your recent message, that's been cleared up.



Right and the cases where it's placing them into loops it's going to be
doing so on paths that are unlikely executed within the loop.  ISTM that
using frequencies is a significant step forward over the older placement
algorithms in this space (which IIRC were structured as simple dataflow
solvers) -- with the added advantage that if we have profiling data we
can make better decisions.


Using the known/guessed execution frequencies is not really new, just not
forty years old :-)

Referring to its use in GCC (or lack thereof).




Does this impact the compile time computation complexity issue that was
raised elsewhere?


I'm not sure what you mean here either, sorry.  It is all O(NM) with N
the number of BBs and M the number of components (and assuming dom
lookups are constant time, as usual).
You said in a different message that computing optimal placement points 
for prologue/epilogue components was not computationally feasible.   I'm 
just trying to figure out if the switch to utilizing block frequencies 
is a part of what makes solving that problem infeasible.


jeff


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-09 Thread Jeff Law

On 09/09/2016 10:57 AM, Segher Boessenkool wrote:

On Fri, Sep 09, 2016 at 10:48:30AM -0600, Jeff Law wrote:

and even allows them to be executed more than once, if that is

cheaper.

This is the part that I'm still struggling with.


The usual example:

1
|\
| \
|  2
| /
|/
3
|\
| \
|  4
| /
|/
5

where 2 and 4 need a certain prologue component (and the rest doesn't).

Perfect.

So this is consistent with one of the ideas I was starting to form.  I'm 
going to stay in the PRE world because it's model is so damn close to 
what you're doing.


PRE minimizes expression evaluations on paths *without* introducing 
evaluations on paths that didn't already have one.


So if we pretend we had some expression evaluated in 2 & 4.  The path 
1->2->3->4 has two evaluations of the expression, but PRE can't really 
do anything here.  We can't hoist the evaluation into a better spot as 
doing so would introduce evaluations on paths that didn't have one 
before.  2 & 4 are the proper locations for that expression evaluation.


And the same applies to separate shrink wrapping.  2 & 4 are the right 
place.  It's all so clear now.


I'll note that duplicating 3 into 3' and redirecting the edge 2->3 to 
2->3' allows us to do better PRE and prologue insertion.  But I don't 
think that's a requirement for you to go forward :-)


Anyway, it's all clear now.  Thanks so much for that example.


Jeff


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-09 Thread Jeff Law

On 09/09/2016 10:49 AM, Segher Boessenkool wrote:

On Fri, Sep 09, 2016 at 09:59:03AM -0600, Jeff Law wrote:

On 09/09/2016 09:17 AM, Segher Boessenkool wrote:

On Thu, Sep 08, 2016 at 10:41:37AM -0600, Jeff Law wrote:

So can you expand on the malloc example a bit -- I'm pretty sure I
understand what you're trying to do, but a concrete example may help
Bernd and be useful for archival purposes.


Sure, but it's big (which is the problem :-) )

Yea :(  But it's likely a very compelling example of real world code.
It's almost certainly too big to turn into a testcase of any kind, but
just some before/after annotated code would be helpful.


I'll work on it, but it won't be before tonight, probably quite a bit
later.  Seeing the difference in generated machine code is probably
most useful?  Better than RTL ;-)

Yea, machine code is probably most useful.




Ideally we'd have some smaller testcases we could put in the testsuite
to ensure that the feature works over-time in the way intended would be
helpful as well.


I might be able to make some really tiny testcases that will not need
updates all of the time.  We'll see.
Understood, particularly on the maintenance burden for these kind of 
tests.  The alternate approach might be to pick up Bernd's work and see 
if we can use that test the various components.




But that is not because it is good to have only one!  GCC expects there
to be only one, instead.  Some ports might even use prologues that cannot
be duplicated at all.

Agreed on all points.




And in separate shrink-wrapping world, we're leaving that model behind
and I think that's one of the big things I'm struggling with -- we may
execute a prologue component more than once if I've read everything
correctly.


Yes, and that is perhaps radically new in the GCC world, but not anywhere
else.
And just to be clear I think I'm mostly looking for why multiple 
execution of a particular component is useful.  My first thought in that 
case is that we've got a redundancy and that the redundancy should be 
identified and eliminated :-)


I've got a few not-well-formed ideas in my head about when that might 
happen and why we might not want to squash out the redundancy.  But it'd 
be a hell of a step forward to see that case in action.


Jeff


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-09 Thread Segher Boessenkool
On Fri, Sep 09, 2016 at 10:48:30AM -0600, Jeff Law wrote:
> and even allows them to be executed more than once, if that is
> >cheaper.
> This is the part that I'm still struggling with.

The usual example:

1
|\
| \
|  2
| /
|/
3
|\
| \
|  4
| /
|/
5

where 2 and 4 need a certain prologue component (and the rest doesn't).

If the frequency of 2 plus that of 4 is less than the frequency of 1,
it is better to place a prologue (and epilogue) around both 2 and 4
than to place a prologue at 1 and an epilogue at 5.

> >You can also put this in a loop.  I'll try to make a simple example
> >with that.
> That would be great.  And a case where they execute more than once too.

Okido.


Segher


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-09 Thread Segher Boessenkool
On Fri, Sep 09, 2016 at 09:59:03AM -0600, Jeff Law wrote:
> On 09/09/2016 09:17 AM, Segher Boessenkool wrote:
> >On Thu, Sep 08, 2016 at 10:41:37AM -0600, Jeff Law wrote:
> >>So can you expand on the malloc example a bit -- I'm pretty sure I
> >>understand what you're trying to do, but a concrete example may help
> >>Bernd and be useful for archival purposes.
> >
> >Sure, but it's big (which is the problem :-) )
> Yea :(  But it's likely a very compelling example of real world code. 
> It's almost certainly too big to turn into a testcase of any kind, but 
> just some before/after annotated code would be helpful.

I'll work on it, but it won't be before tonight, probably quite a bit
later.  Seeing the difference in generated machine code is probably
most useful?  Better than RTL ;-)

> Ideally we'd have some smaller testcases we could put in the testsuite 
> to ensure that the feature works over-time in the way intended would be 
> helpful as well.

I might be able to make some really tiny testcases that will not need
updates all of the time.  We'll see.

> Right.  That's always been my understanding of the key driver for 
> placement.  There's exactly one and will be executed one time or none 
> across all paths in the CFG.

But that is not because it is good to have only one!  GCC expects there
to be only one, instead.  Some ports might even use prologues that cannot
be duplicated at all.

> And in separate shrink-wrapping world, we're leaving that model behind 
> and I think that's one of the big things I'm struggling with -- we may 
> execute a prologue component more than once if I've read everything 
> correctly.

Yes, and that is perhaps radically new in the GCC world, but not anywhere
else.

[ ... ]

> This really feels comparable to block duplication for the purposes of 
> isolating a particular path through the CFG so that path can be modified 
> without affecting the behavior of other paths through the CFG.

Yes, very much.

> It's also directly comparable to block duplication to allow more 
> aggressive code motion in PRE-like algorithms.

Yeah.


Segher


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-09 Thread Jeff Law

On 09/09/2016 09:28 AM, Segher Boessenkool wrote:

Segher's code essentially allows individual components of the prologue
to sink to different points within the function rather than forcing the
prologue to be sunk as an atomic unit.


It also allows prologue an epilogue components to be placed in multiple
places,

Right.  I need to keep that in the forefront of my mind.



and even allows them to be executed more than once, if that is

cheaper.

This is the part that I'm still struggling with.




The full-prologue algorithm makes as many blocks run without prologue as
possible, by duplicating blocks where that helps.  If you do this for
every component you can and up with 2**40 blocks for just 40 components,


Ok, so why wouldn't we use the existing code with the duplication part
disabled? That's a later addition anyway and isn't necessary to do
shrink-wrapping in the first place.

I think the concern here is the balance between code explosion and the
runtime gains.


The code explosion can be terrible if you have only a few components,
already.  The runtime gains are not so great either.  Here's a simple
example, showing part of a cfg, the exits to the right are calls to
abort and need some prologue component:

|
1
|\
| \
2  X1
|\
| \
|  X2
|

  3


With the "old" algorithm, you need to place the prologue at 1 (because
you can only have one), and then the fall-through path also necessarily
gets that component's prologue.  With the "separate" algorithm, the
component prologues can be placed at X1 and X2 instead (and that will
most likely be cheapest according to the cost model, too).
Thanks.  That really helps highlight some of the key differences between 
the old and new model.





You can also put this in a loop.  I'll try to make a simple example
with that.

That would be great.  And a case where they execute more than once too.

Jeff


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-09 Thread Jeff Law

On 09/09/2016 09:17 AM, Segher Boessenkool wrote:

On Thu, Sep 08, 2016 at 10:41:37AM -0600, Jeff Law wrote:

So can you expand on the malloc example a bit -- I'm pretty sure I
understand what you're trying to do, but a concrete example may help
Bernd and be useful for archival purposes.


Sure, but it's big (which is the problem :-) )
Yea :(  But it's likely a very compelling example of real world code. 
It's almost certainly too big to turn into a testcase of any kind, but 
just some before/after annotated code would be helpful.


Ideally we'd have some smaller testcases we could put in the testsuite 
to ensure that the feature works over-time in the way intended would be 
helpful as well.





That's a later addition anyway and isn't necessary to do
shrink-wrapping in the first place.


No, it always did that, just not as often (it only duplicated straight-line
code before).

Presumably (I haven't looked yet), the duplication is so that we can
isolate one or more paths which in turn allows sinking the prologue
further on some of those paths.


It duplicates as many blocks as it needs to dup, to make as many exits
as possible reachable without *any* prologue/epilogue.

As the header comment before the older code says:

/* Try to perform a kind of shrink-wrapping, making sure the
   prologue/epilogue is emitted only around those parts of the
   function that require it.

   There will be exactly one prologue, and it will be executed either
   zero or one time, on any path.
Right.  That's always been my understanding of the key driver for 
placement.  There's exactly one and will be executed one time or none 
across all paths in the CFG.


Essentially this is comparable to PRE-like algorithms for placement of 
expression evaluations.


And in separate shrink-wrapping world, we're leaving that model behind 
and I think that's one of the big things I'm struggling with -- we may 
execute a prologue component more than once if I've read everything 
correctly.



 Depending on where the prologue is

   placed, some of the basic blocks can be reached via both paths with
   and without a prologue.  Such blocks will be duplicated here, and the
   edges changed to match.

Understood.

This really feels comparable to block duplication for the purposes of 
isolating a particular path through the CFG so that path can be modified 
without affecting the behavior of other paths through the CFG.


It's also directly comparable to block duplication to allow more 
aggressive code motion in PRE-like algorithms.


Jeff



Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-09 Thread Segher Boessenkool
On Fri, Sep 09, 2016 at 09:26:39AM -0600, Jeff Law wrote:
> >>I think one of the questions (and I haven't looked through the whole
> >>thread yet to see if it's answered) is why the basic shrink-wrapping
> >>algorithm can't be applied to each of the prologue components -- though
> >>you may have answered it with your loop example below.
> >
> >You get code size explosion with the "basic" algorithm.  And a lot of
> >that isn't really worth it: avoiding the whole prologue/epilogue is
> >usually worth copying some blocks for, but avoiding just a single register
> >save/restore pair?  More doubtful.
> Understood.   We have similar balancing problems elsewhere.  How much 
> duplication should we allow to expose a CSE or eliminate a conditional 
> jump on a path through the CFG.
> 
> So I think sticking with this as a design decision makes sense -- does 
> it impact the issue around running a particular component's prologue 
> more than once?

I don't follow, sorry; could you rephrase?

> >>Thanks.  I'd been wondering about when it'd be useful to push prologue
> >>code into a loop nest when I saw the patches fly by, but didn't think
> >>about it too much.  I haven't looked at the shrink-wrapping literature
> >>in years, but I don't recall it having any concept that there were cases
> >>where sinking into a loop nest was profitable.
> >
> >It isn't common, but it does happen.  If you use a proper cost metric
> >based on executiuon frequency with some algorithm then that algorithm will
> >naturally avoid placing *logues into loops.
> Right and the cases where it's placing them into loops it's going to be 
> doing so on paths that are unlikely executed within the loop.  ISTM that 
> using frequencies is a significant step forward over the older placement 
> algorithms in this space (which IIRC were structured as simple dataflow 
> solvers) -- with the added advantage that if we have profiling data we 
> can make better decisions.

Using the known/guessed execution frequencies is not really new, just not
forty years old :-)

> Does this impact the compile time computation complexity issue that was 
> raised elsewhere?

I'm not sure what you mean here either, sorry.  It is all O(NM) with N
the number of BBs and M the number of components (and assuming dom
lookups are constant time, as usual).


Segher


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-09 Thread Segher Boessenkool
On Thu, Sep 08, 2016 at 10:58:13AM -0600, Jeff Law wrote:
> >And that comment puzzles me. Surely prologue and epilogue are executed
> >only once currently, so how does frequency come into it? Again - please
> >provide an example.
> Right, they're executed once currently.  But the prologue could be sunk 
> into locations where they are not executed every time the function is 
> called.  That's the basis behind shrink wrapping.

Right.

> Segher's code essentially allows individual components of the prologue 
> to sink to different points within the function rather than forcing the 
> prologue to be sunk as an atomic unit.

It also allows prologue an epilogue components to be placed in multiple
places, and even allows them to be executed more than once, if that is
cheaper.

> >>The full-prologue algorithm makes as many blocks run without prologue as
> >>possible, by duplicating blocks where that helps.  If you do this for
> >>every component you can and up with 2**40 blocks for just 40 components,
> >
> >Ok, so why wouldn't we use the existing code with the duplication part
> >disabled? That's a later addition anyway and isn't necessary to do
> >shrink-wrapping in the first place.
> I think the concern here is the balance between code explosion and the 
> runtime gains.

The code explosion can be terrible if you have only a few components,
already.  The runtime gains are not so great either.  Here's a simple
example, showing part of a cfg, the exits to the right are calls to
abort and need some prologue component:

|
1
|\
| \
2  X1
|\
| \
|  X2
|

With the "old" algorithm, you need to place the prologue at 1 (because
you can only have one), and then the fall-through path also necessarily
gets that component's prologue.  With the "separate" algorithm, the
component prologues can be placed at X1 and X2 instead (and that will
most likely be cheapest according to the cost model, too).

You can also put this in a loop.  I'll try to make a simple example
with that.


Segher


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-09 Thread Jeff Law

On 09/09/2016 12:19 AM, Segher Boessenkool wrote:

Thanks for looking at the patches Jeff.

On Thu, Sep 08, 2016 at 10:28:59AM -0600, Jeff Law wrote:

Right.  Essentially Segher's patch introduces the concept of prologue
components that are independent of each other and which can be
shrink-wrapped separately.  The degree of independence is highly target
specific, of course.


There can be dependencies as well (e.g. a save to the frame requires that
frame to be set up already), but the current patches have no way to
describe such dependencies.  I haven't found a good way to describe the
dependencies yet.  Finding a good balance between general and useful isn't
easy, as usual.
Right.  I over-simplified a bit here.  Some dependencies can be handled 
via the hooks in the target files.  I think what we've got is sufficient 
for now -- we'll have a much clearer picture of weak spots in the design 
if/when someone enables separate shrink wrapping for another target. 
Until then, I think we're OK.





I think one of the questions (and I haven't looked through the whole
thread yet to see if it's answered) is why the basic shrink-wrapping
algorithm can't be applied to each of the prologue components -- though
you may have answered it with your loop example below.


You get code size explosion with the "basic" algorithm.  And a lot of
that isn't really worth it: avoiding the whole prologue/epilogue is
usually worth copying some blocks for, but avoiding just a single register
save/restore pair?  More doubtful.
Understood.   We have similar balancing problems elsewhere.  How much 
duplication should we allow to expose a CSE or eliminate a conditional 
jump on a path through the CFG.


So I think sticking with this as a design decision makes sense -- does 
it impact the issue around running a particular component's prologue 
more than once?




Thanks.  I'd been wondering about when it'd be useful to push prologue
code into a loop nest when I saw the patches fly by, but didn't think
about it too much.  I haven't looked at the shrink-wrapping literature
in years, but I don't recall it having any concept that there were cases
where sinking into a loop nest was profitable.


It isn't common, but it does happen.  If you use a proper cost metric
based on executiuon frequency with some algorithm then that algorithm will
naturally avoid placing *logues into loops.
Right and the cases where it's placing them into loops it's going to be 
doing so on paths that are unlikely executed within the loop.  ISTM that 
using frequencies is a significant step forward over the older placement 
algorithms in this space (which IIRC were structured as simple dataflow 
solvers) -- with the added advantage that if we have profiling data we 
can make better decisions.


Does this impact the compile time computation complexity issue that was 
raised elsewhere?


jeff


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-09 Thread Segher Boessenkool
On Thu, Sep 08, 2016 at 10:41:37AM -0600, Jeff Law wrote:
> So can you expand on the malloc example a bit -- I'm pretty sure I 
> understand what you're trying to do, but a concrete example may help 
> Bernd and be useful for archival purposes.

Sure, but it's big (which is the problem :-) )

> I also know that Carlos is interested in the malloc example -- so I'd 
> like to be able to pass that along to him.
> 
> Given the multiple early exit and fast paths through the allocator, I'm 
> not at all surprised that sinking different components of the prologue 
> to different locations is useful.
> 
> Also if there's a case where sinking into a loop occurs, definitely 
> point that out.

Not sure that happens in there, I'll find out.

> >>That's a later addition anyway and isn't necessary to do
> >>shrink-wrapping in the first place.
> >
> >No, it always did that, just not as often (it only duplicated straight-line
> >code before).
> Presumably (I haven't looked yet), the duplication is so that we can 
> isolate one or more paths which in turn allows sinking the prologue 
> further on some of those paths.

It duplicates as many blocks as it needs to dup, to make as many exits
as possible reachable without *any* prologue/epilogue.

As the header comment before the older code says:

/* Try to perform a kind of shrink-wrapping, making sure the
   prologue/epilogue is emitted only around those parts of the
   function that require it.

   There will be exactly one prologue, and it will be executed either
   zero or one time, on any path.  Depending on where the prologue is
   placed, some of the basic blocks can be reached via both paths with
   and without a prologue.  Such blocks will be duplicated here, and the
   edges changed to match.

   Paths that go to the exit without going through the prologue will use
   a simple_return instead of the epilogue.  We maximize the number of
   those, making sure to only duplicate blocks that can be duplicated.
   If the prologue can then still be placed in multiple locations, we
   place it as early as possible.


> This is something I'll definitely want to look at -- block duplication 
> to facilitate code elimination (or in this case avoid code insertion) 
> hits several areas of interest to me -- and how we balance duplication 
> vs runtime savings is always interesting.

Yeah.  If you use separate shrink-wrapping you still *also* get the
"old" algorithm, if that wasn't clear.


Segher


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-09 Thread Segher Boessenkool
Thanks for looking at the patches Jeff.

On Thu, Sep 08, 2016 at 10:28:59AM -0600, Jeff Law wrote:
> Right.  Essentially Segher's patch introduces the concept of prologue 
> components that are independent of each other and which can be 
> shrink-wrapped separately.  The degree of independence is highly target 
> specific, of course.

There can be dependencies as well (e.g. a save to the frame requires that
frame to be set up already), but the current patches have no way to
describe such dependencies.  I haven't found a good way to describe the
dependencies yet.  Finding a good balance between general and useful isn't
easy, as usual.

> I think one of the questions (and I haven't looked through the whole 
> thread yet to see if it's answered) is why the basic shrink-wrapping 
> algorithm can't be applied to each of the prologue components -- though 
> you may have answered it with your loop example below.

You get code size explosion with the "basic" algorithm.  And a lot of
that isn't really worth it: avoiding the whole prologue/epilogue is
usually worth copying some blocks for, but avoiding just a single register
save/restore pair?  More doubtful.

> >That includes moving parts of the prologue into loops:
> >
> >int g() {
> >  int sum = 0;
> >  for (int i = 0; i < NUM; i++) {
> >sum += i;
> >if (sum >= CUTOFF) {
> >  some_long_winded_expression();
> >  that_eventually_calls_abort();
> >}
> >  }
> >  return sum;
> >}
> >
> >Here all parts of the prologue that somehow make it possible to call other
> >functions are necessary only when the program will abort eventually: hence
> >is necessary only at one call of g() at most.  Again it's sensible to move
> >those parts inside the unlikely condition, even though it's inside a loop.
> Thanks.  I'd been wondering about when it'd be useful to push prologue 
> code into a loop nest when I saw the patches fly by, but didn't think 
> about it too much.  I haven't looked at the shrink-wrapping literature 
> in years, but I don't recall it having any concept that there were cases 
> where sinking into a loop nest was profitable.

It isn't common, but it does happen.  If you use a proper cost metric
based on executiuon frequency with some algorithm then that algorithm will
naturally avoid placing *logues into loops.


Segher


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-08 Thread Jeff Law

On 08/26/2016 09:03 AM, Bernd Schmidt wrote:

On 08/26/2016 04:50 PM, Segher Boessenkool wrote:

The head comment starts with

+/* Separate shrink-wrapping
+
+   Instead of putting all of the prologue and epilogue in one spot, we
+   can put parts of it in places where those components are executed
less
+   frequently.

and that is the long and short of it.


And that comment puzzles me. Surely prologue and epilogue are executed
only once currently, so how does frequency come into it? Again - please
provide an example.
Right, they're executed once currently.  But the prologue could be sunk 
into locations where they are not executed every time the function is 
called.  That's the basis behind shrink wrapping.


Segher's code essentially allows individual components of the prologue 
to sink to different points within the function rather than forcing the 
prologue to be sunk as an atomic unit.


Conceptually you could run the standard algorithm on each independent 
component.






The full-prologue algorithm makes as many blocks run without prologue as
possible, by duplicating blocks where that helps.  If you do this for
every component you can and up with 2**40 blocks for just 40 components,


Ok, so why wouldn't we use the existing code with the duplication part
disabled? That's a later addition anyway and isn't necessary to do
shrink-wrapping in the first place.
I think the concern here is the balance between code explosion and the 
runtime gains.


jeff


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-08 Thread Jeff Law

On 08/26/2016 10:27 AM, Segher Boessenkool wrote:

On Fri, Aug 26, 2016 at 05:03:34PM +0200, Bernd Schmidt wrote:

On 08/26/2016 04:50 PM, Segher Boessenkool wrote:

The head comment starts with

+/* Separate shrink-wrapping
+
+   Instead of putting all of the prologue and epilogue in one spot, we
+   can put parts of it in places where those components are executed less
+   frequently.

and that is the long and short of it.


And that comment puzzles me. Surely prologue and epilogue are executed
only once currently, so how does frequency come into it? Again - please
provide an example.


If some component is only needed for 0.01% of executions of a function,
running it once for every execution is 1 times too much.

The trivial example is a function that does an early exit, but uses one
or a few non-volatile registers before that exit.  This happens in e.g.
glibc's malloc, if you want an easily accessed example.  With the current
code, *all* components will be saved and then restored shortly afterwards.
So can you expand on the malloc example a bit -- I'm pretty sure I 
understand what you're trying to do, but a concrete example may help 
Bernd and be useful for archival purposes.


I also know that Carlos is interested in the malloc example -- so I'd 
like to be able to pass that along to him.


Given the multiple early exit and fast paths through the allocator, I'm 
not at all surprised that sinking different components of the prologue 
to different locations is useful.


Also if there's a case where sinking into a loop occurs, definitely 
point that out.





The full-prologue algorithm makes as many blocks run without prologue as
possible, by duplicating blocks where that helps.  If you do this for
every component you can and up with 2**40 blocks for just 40 components,


Ok, so why wouldn't we use the existing code with the duplication part
disabled?


That would not perform nearly as well.


That's a later addition anyway and isn't necessary to do
shrink-wrapping in the first place.


No, it always did that, just not as often (it only duplicated straight-line
code before).
Presumably (I haven't looked yet), the duplication is so that we can 
isolate one or more paths which in turn allows sinking the prologue 
further on some of those paths.


This is something I'll definitely want to look at -- block duplication 
to facilitate code elimination (or in this case avoid code insertion) 
hits several areas of interest to me -- and how we balance duplication 
vs runtime savings is always interesting.


Jeff



Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-09-08 Thread Jeff Law

On 08/30/2016 06:31 AM, Michael Matz wrote:

Hi,

On Fri, 26 Aug 2016, Bernd Schmidt wrote:


And that comment puzzles me. Surely prologue and epilogue are executed only
once currently, so how does frequency come into it? Again - please provide an
example.


int some_global;
int foo (void) {
  if (!some_global) {
call_this();
call_that();
x = some + large / computation;
while (x--) { much_more_code(); }
some_global = 1;
  }
  return some_global;
}

Now you need the full pro/epilogue (saving/restoring all call clobbered
regs presumably used by the large computation and loop) for only one call
of that function (the first), and then never again.  But you do need a
part of it always assuming the architecture is right, and this is a shared
library, namely the setup of the PIC pointer for the access to
some_global.




A prologue does many things, and in some cases many of them aren't
necessary for all calls (indeed that's often the case for functions
containing early-outs), so much so that the full prologue including
unnecessary parts needs more time than the functional body of a functions
particular call.  It's obvious that it would be better to move those parts
of the prologue to places where they actually are needed:

[ ... ]
Right.  Essentially Segher's patch introduces the concept of prologue 
components that are independent of each other and which can be 
shrink-wrapped separately.  The degree of independence is highly target 
specific, of course.


I think one of the questions (and I haven't looked through the whole 
thread yet to see if it's answered) is why the basic shrink-wrapping 
algorithm can't be applied to each of the prologue components -- though 
you may have answered it with your loop example below.





int f2 (void) {
  setup_pic_reg();
  if (!some_global) {
save_call_clobbers();
code_from_above();
restore_call_clobbers();
  }
  retreg = some_global;
  restore_pic_reg();
}

That includes moving parts of the prologue into loops:

int g() {
  int sum = 0;
  for (int i = 0; i < NUM; i++) {
sum += i;
if (sum >= CUTOFF) {
  some_long_winded_expression();
  that_eventually_calls_abort();
}
  }
  return sum;
}

Here all parts of the prologue that somehow make it possible to call other
functions are necessary only when the program will abort eventually: hence
is necessary only at one call of g() at most.  Again it's sensible to move
those parts inside the unlikely condition, even though it's inside a loop.
Thanks.  I'd been wondering about when it'd be useful to push prologue 
code into a loop nest when I saw the patches fly by, but didn't think 
about it too much.  I haven't looked at the shrink-wrapping literature 
in years, but I don't recall it having any concept that there were cases 
where sinking into a loop nest was profitable.


Jeff


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-08-30 Thread Michael Matz
Hi,

On Fri, 26 Aug 2016, Bernd Schmidt wrote:

> And that comment puzzles me. Surely prologue and epilogue are executed only
> once currently, so how does frequency come into it? Again - please provide an
> example.

int some_global;
int foo (void) {
  if (!some_global) {
call_this();
call_that();
x = some + large / computation;
while (x--) { much_more_code(); }
some_global = 1;
  }
  return some_global;
}

Now you need the full pro/epilogue (saving/restoring all call clobbered 
regs presumably used by the large computation and loop) for only one call 
of that function (the first), and then never again.  But you do need a 
part of it always assuming the architecture is right, and this is a shared 
library, namely the setup of the PIC pointer for the access to 
some_global.

A prologue does many things, and in some cases many of them aren't 
necessary for all calls (indeed that's often the case for functions 
containing early-outs), so much so that the full prologue including 
unnecessary parts needs more time than the functional body of a functions 
particular call.  It's obvious that it would be better to move those parts 
of the prologue to places where they actually are needed:

int f2 (void) {
  setup_pic_reg();
  if (!some_global) {
save_call_clobbers();
code_from_above();
restore_call_clobbers();
  }
  retreg = some_global;
  restore_pic_reg();
}

That includes moving parts of the prologue into loops:

int g() {
  int sum = 0;
  for (int i = 0; i < NUM; i++) {
sum += i;
if (sum >= CUTOFF) {
  some_long_winded_expression();
  that_eventually_calls_abort();
}
  }
  return sum;
}

Here all parts of the prologue that somehow make it possible to call other 
functions are necessary only when the program will abort eventually: hence 
is necessary only at one call of g() at most.  Again it's sensible to move 
those parts inside the unlikely condition, even though it's inside a loop.


Ciao,
Michael.


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-08-26 Thread Segher Boessenkool
On Fri, Aug 26, 2016 at 05:03:34PM +0200, Bernd Schmidt wrote:
> On 08/26/2016 04:50 PM, Segher Boessenkool wrote:
> >The head comment starts with
> >
> >+/* Separate shrink-wrapping
> >+
> >+   Instead of putting all of the prologue and epilogue in one spot, we
> >+   can put parts of it in places where those components are executed less
> >+   frequently.
> >
> >and that is the long and short of it.
> 
> And that comment puzzles me. Surely prologue and epilogue are executed 
> only once currently, so how does frequency come into it? Again - please 
> provide an example.

If some component is only needed for 0.01% of executions of a function,
running it once for every execution is 1 times too much.

The trivial example is a function that does an early exit, but uses one
or a few non-volatile registers before that exit.  This happens in e.g.
glibc's malloc, if you want an easily accessed example.  With the current
code, *all* components will be saved and then restored shortly afterwards.

> >The full-prologue algorithm makes as many blocks run without prologue as
> >possible, by duplicating blocks where that helps.  If you do this for
> >every component you can and up with 2**40 blocks for just 40 components,
> 
> Ok, so why wouldn't we use the existing code with the duplication part 
> disabled?

That would not perform nearly as well.

> That's a later addition anyway and isn't necessary to do 
> shrink-wrapping in the first place.

No, it always did that, just not as often (it only duplicated straight-line
code before).


Segher


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-08-26 Thread Bernd Schmidt

On 08/26/2016 04:50 PM, Segher Boessenkool wrote:

The head comment starts with

+/* Separate shrink-wrapping
+
+   Instead of putting all of the prologue and epilogue in one spot, we
+   can put parts of it in places where those components are executed less
+   frequently.

and that is the long and short of it.


And that comment puzzles me. Surely prologue and epilogue are executed 
only once currently, so how does frequency come into it? Again - please 
provide an example.



The full-prologue algorithm makes as many blocks run without prologue as
possible, by duplicating blocks where that helps.  If you do this for
every component you can and up with 2**40 blocks for just 40 components,


Ok, so why wouldn't we use the existing code with the duplication part 
disabled? That's a later addition anyway and isn't necessary to do 
shrink-wrapping in the first place.



Bernd


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-08-26 Thread Segher Boessenkool
Hi Bernd,

On Fri, Aug 26, 2016 at 03:03:43PM +0200, Bernd Schmidt wrote:
> Not really, I'm afraid. There still seems to be no detailed explanation 
> of the algorithm. Instead, there is a vague outline

Did you read the description of 8/9?  If you think any of that needs to
be in the code, please just say so.

> and inside the code there are still meaningless comments of the form
> 
> /* Deselect those epilogue components that should not be inserted
>for this edge.  */

You asked for a comment here, see
https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00932.html
("what's the purpose of this", etc.)

> Worse, I'm still not entirely certain what it is intended to achieve: I 
> asked for a motivating example or two, but haven't seen one in the 
> comments anywhere.

"Make faster code", like all other optimisation passes do as well.

I commented repeatedly about (micro-)benchmark results.

The head comment starts with

+/* Separate shrink-wrapping
+
+   Instead of putting all of the prologue and epilogue in one spot, we
+   can put parts of it in places where those components are executed less
+   frequently.

and that is the long and short of it.

> My most general question would be why the algorithm 
> for placing individual prologue components would be different from the 
> regular shrink-wrapping algorithm for full prologues. Examples and/or 
> arguments relating to how this new code acts with regard to loops are 
> also particularly needed.

The full-prologue algorithm makes as many blocks run without prologue as
possible, by duplicating blocks where that helps.  If you do this for
every component you can and up with 2**40 blocks for just 40 components,
not such a hot plan.  The "separate" algo does not duplicate blocks at all
(its only code growth is for all the prologue/epilogue components inserted,
and for some forwarder blocks sometimes).

The full-prologue algorithm only ever inserts exactly one prologue, far
from optimal.  But it *cannot* duplicate it with the semantics we have.
The separate components can of course be duplicated, it's a new abstraction
and part of the requirements for it.

> So, I think improvement is necessary in these points, but I also think 
> that there's room for experimental verification by way of self-tests.

Since it is used everywhere I think that is a big waste of time (it is
tested a lot already).  Testing specific problem cases can of course help;
but we have "make check" for that already anwyay. Also, I think "self-tests"
looking at the internals are just wrong (the only sane way to update such
tests when changing the code is to delete the tests, since they should be
written independently; otherwise you just have two copies of the same).  And
the useless abstractions are just useless.

The algorithm is procedural; writing it in procedural style is much clearer
than hiding everything.

> If 
> done thoroughly (testing the transformation on several sufficiently 
> random CFGs and maybe some manually constructed ones) it would go a long 
> way towards showing that at least we don't have to worry too much about 
> miscompilations.

If you hadn't noticed, the algorithm is constructed in such a way as to
minimise possible miscompilations: it first determines what blocks need
what components "active", and then makes it so, in a separate (much more
trivial) phase.

Wrt your patch...  making each component needed half of the time is not
very realistic, you'll end up with all components active in most blocks.

bools as parameters where they do not mean "yes/no" is confusing.

It seems you do no longer do the "insert at head" and "insert at tail"
before the "insert on edge"?  This cannot work afais.

Some utility funcs print dump info; the caller should, instead.

You make "components" global (as "m_components").


Segher


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-08-26 Thread Bernd Schmidt

On 08/26/2016 03:48 PM, David Malcolm wrote:

I'm nervous about the build_random_cfg function: randomness in
selftests seems like a double-edged sword.  On the one hand, we can use
it to fuzz-test an optimization to rapidly gain a lot of coverage.  On
the other hand, does every host generate the same sequence of values?
Presumably we want everyone to be effectively running the same
selftests.


I intended to put a srandom call in there at the start but forgot. One 
could argue that there's value in having different tests on different 
hosts, as long as they are stable between runs, but...



Maybe there's a need for some kind of selftest::rng class here?


That might be a good idea too.


On a unrelated note, should the various vfunc implementations be marked
with "FINAL OVERRIDE" (from coretypes.h), so that the compiler has a
better chance of devirtualizing them in a release build?


Wasn't aware of that - will have a look.

Maybe we could also use a long-running "-fselftest=extensive" for 
stress-testing algorithms such as this (increasing the number of random 
cfgs it tries)?



Bernd


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-08-26 Thread David Malcolm
On Fri, 2016-08-26 at 15:03 +0200, Bernd Schmidt wrote:
> On 08/01/2016 03:42 AM, Segher Boessenkool wrote:
> > This is the second version.  Concern was renamed to component, and
> > all
> > other comments were addressed (I hope).
> 
> Not really, I'm afraid. There still seems to be no detailed
> explanation 
> of the algorithm. Instead, there is a vague outline (which should be 
> expanded to at least the level of detail found e.g. in tree-ssa
> -pre.c), 
> and inside the code there are still meaningless comments of the form
> 
> /* Deselect those epilogue components that should not be inserted
> for this edge.  */
> 
> which don't tell me anything about what the intention is (why should 
> they not be inserted?). The result is that as a reader, I still find 
> myself unable to determine whether the algorithm is correct or not.
> 
> Worse, I'm still not entirely certain what it is intended to achieve:
> I 
> asked for a motivating example or two, but haven't seen one in the 
> comments anywhere. My most general question would be why the
> algorithm 
> for placing individual prologue components would be different from
> the 
> regular shrink-wrapping algorithm for full prologues. Examples and/or
> arguments relating to how this new code acts with regard to loops are
> also particularly needed.
> 
> So, I think improvement is necessary in these points, but I also
> think 
> that there's room for experimental verification by way of self-tests.
> If 
> done thoroughly (testing the transformation on several sufficiently 
> random CFGs and maybe some manually constructed ones) it would go a
> long 
> way towards showing that at least we don't have to worry too much
> about 
> miscompilations. That's what I've been working on, and an initial
> patch 
> is below. This is incomplete and posted more as an RFD since we're 
> getting towards the end of the week: there are gaps in the test 
> coverage, and it currently fails the selftest. I assume the latter is
> a 
> problem with my code, but it wouldn't hurt if you could take a look; 
> maybe I misunderstood something entirely about what the separate 
> shrink-wrapping is supposed to achieve, or maybe I messed up the 
> algorithm with my changes.
> 

Bernd: I'm very happy to see someone else using the selftest framework.

(My mailer isn't letting me quote the patch body, sorry).

I'm nervous about the build_random_cfg function: randomness in
selftests seems like a double-edged sword.  On the one hand, we can use
it to fuzz-test an optimization to rapidly gain a lot of coverage.  On
the other hand, does every host generate the same sequence of values?
Presumably we want everyone to be effectively running the same
selftests.

Is this using libiberty's implementation of "random", or can it use the
underlying host libc implementation?  Are there any cross-platform
guarantees on the sequence of values that are returned for a particular
seed?  I don't want us to have host-specific failures that turn out to
be due to host differences in the RNG.

Do we need to re-seed the RNG before each test?  (or to rework
libiberty's random to bundle up the RNG state in a class, and use
that).  Otherwise, the meaning of the test could change if someone adds
another random-using selftest before this one.

Maybe there's a need for some kind of selftest::rng class here?

On a unrelated note, should the various vfunc implementations be marked
with "FINAL OVERRIDE" (from coretypes.h), so that the compiler has a
better chance of devirtualizing them in a release build?

Hope this is constructive
Dave


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-08-26 Thread Bernd Schmidt

On 08/01/2016 03:42 AM, Segher Boessenkool wrote:

This is the second version.  Concern was renamed to component, and all
other comments were addressed (I hope).


Not really, I'm afraid. There still seems to be no detailed explanation 
of the algorithm. Instead, there is a vague outline (which should be 
expanded to at least the level of detail found e.g. in tree-ssa-pre.c), 
and inside the code there are still meaningless comments of the form


/* Deselect those epilogue components that should not be inserted
   for this edge.  */

which don't tell me anything about what the intention is (why should 
they not be inserted?). The result is that as a reader, I still find 
myself unable to determine whether the algorithm is correct or not.


Worse, I'm still not entirely certain what it is intended to achieve: I 
asked for a motivating example or two, but haven't seen one in the 
comments anywhere. My most general question would be why the algorithm 
for placing individual prologue components would be different from the 
regular shrink-wrapping algorithm for full prologues. Examples and/or 
arguments relating to how this new code acts with regard to loops are 
also particularly needed.


So, I think improvement is necessary in these points, but I also think 
that there's room for experimental verification by way of self-tests. If 
done thoroughly (testing the transformation on several sufficiently 
random CFGs and maybe some manually constructed ones) it would go a long 
way towards showing that at least we don't have to worry too much about 
miscompilations. That's what I've been working on, and an initial patch 
is below. This is incomplete and posted more as an RFD since we're 
getting towards the end of the week: there are gaps in the test 
coverage, and it currently fails the selftest. I assume the latter is a 
problem with my code, but it wouldn't hurt if you could take a look; 
maybe I misunderstood something entirely about what the separate 
shrink-wrapping is supposed to achieve, or maybe I messed up the 
algorithm with my changes.



Bernd
diff '--width=118' -x .svn -dru sw-selftest-base/gcc/sbitmap.c with-selftest/gcc/sbitmap.c
--- sw-selftest-base/gcc/sbitmap.c	2016-04-29 10:46:49.189700651 +0200
+++ with-selftest/gcc/sbitmap.c	2016-08-26 11:40:05.043374153 +0200
@@ -319,7 +319,7 @@
   *dstp++ = *ap++;
 }
 
-/* Return true if there are any bits set in A are also set in B.
+/* Return true if there are any bits set in A which are also set in B.
Return false otherwise.  */
 
 bool
@@ -335,6 +335,24 @@
   return true;
 
   return false;
+}
+
+/* Return true if there are any bits set in A which are not set in B.
+   Return false otherwise.  */
+
+bool
+bitmap_intersect_compl_p (const_sbitmap a, const_sbitmap b)
+{
+  const_sbitmap_ptr ap = a->elms;
+  const_sbitmap_ptr bp = b->elms;
+  unsigned int i, n;
+
+  n = MIN (a->size, b->size);
+  for (i = 0; i < n; i++)
+if ((*ap++ & ~*bp++) != 0)
+  return true;
+
+  return false;
 }
 
 /* Set DST to be (A and B).
diff '--width=118' -x .svn -dru sw-selftest-base/gcc/sbitmap.h with-selftest/gcc/sbitmap.h
--- sw-selftest-base/gcc/sbitmap.h	2016-08-01 12:42:29.678435973 +0200
+++ with-selftest/gcc/sbitmap.h	2016-08-26 11:40:14.520040887 +0200
@@ -246,6 +246,7 @@
 extern bool bitmap_and_or (sbitmap, const_sbitmap,
  const_sbitmap, const_sbitmap);
 extern bool bitmap_intersect_p (const_sbitmap, const_sbitmap);
+extern bool bitmap_intersect_compl_p (const_sbitmap, const_sbitmap);
 extern bool bitmap_and (sbitmap, const_sbitmap, const_sbitmap);
 extern bool bitmap_ior (sbitmap, const_sbitmap, const_sbitmap);
 extern bool bitmap_xor (sbitmap, const_sbitmap, const_sbitmap);
diff '--width=118' -x .svn -dru sw-selftest-base/gcc/selftest.h with-selftest/gcc/selftest.h
--- sw-selftest-base/gcc/selftest.h	2016-08-01 12:42:57.738436176 +0200
+++ with-selftest/gcc/selftest.h	2016-08-26 11:49:50.656711729 +0200
@@ -92,6 +92,11 @@
 extern void tree_cfg_c_tests ();
 extern void vec_c_tests ();
 extern void wide_int_cc_tests ();
+extern void shrink_wrapping_separate_tests ();
+
+/* Helper functions to set up certain tests.  */
+extern basic_block build_random_cfg (int);
+extern void free_random_cfg ();
 
 extern int num_passes;
 
diff '--width=118' -x .svn -dru sw-selftest-base/gcc/selftest-run-tests.c with-selftest/gcc/selftest-run-tests.c
--- sw-selftest-base/gcc/selftest-run-tests.c	2016-08-01 12:43:07.411769580 +0200
+++ with-selftest/gcc/selftest-run-tests.c	2016-08-26 11:50:34.160045378 +0200
@@ -67,8 +67,9 @@
   spellcheck_tree_c_tests ();
   tree_cfg_c_tests ();
 
-  /* This one relies on most of the above.  */
+  /* These ones relies on most of the above.  */
   function_tests_c_tests ();
+  shrink_wrapping_separate_tests ();
 
   /* Finished running tests.  */
   long finish_time = get_run_time ();
diff '--width=118' -x .svn -dru sw-selftest-base/gcc/shrink-wrap.c with-selftest/gcc/shrink-wrap.c
--- sw-selftest-base/gcc/shrink-wrap.c	

Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-08-24 Thread Segher Boessenkool
Ping x2.

On Wed, Aug 03, 2016 at 07:05:34PM -0500, Segher Boessenkool wrote:
> Ping.
> 
> 
> Segher
> 
> 
> On Mon, Aug 01, 2016 at 01:42:37AM +, Segher Boessenkool wrote:
> > This is the second version.  Concern was renamed to component, and all
> > other comments were addressed (I hope).  It still uses only two bitmaps
> > for the component placement, but now they are called needs_components
> > and has_components, which hopefully is easier to follow.  The "can this
> > prologue be moved earlier for no cost" test is generalised a bit, and
> > the cost estimate explicitly excludes backedges, which makes it easier
> > to see that prologues will not be placed inside a loop if there is no
> > benefit to that (I didn't see any different generated code because of
> > this change).  And new and updated comments all over, of course.
> > 
> > Is this okay for trunk?
> > 
> > 
> > Segher
> > 
> > 
> >  gcc/cfgcleanup.c   |   5 +
> >  gcc/common.opt |   4 +
> >  gcc/config/rs6000/rs6000.c | 257 -
> >  gcc/dce.c  |   9 +
> >  gcc/doc/invoke.texi|  11 +-
> >  gcc/doc/tm.texi|  54 
> >  gcc/doc/tm.texi.in |  29 ++
> >  gcc/emit-rtl.h |   4 +
> >  gcc/function.c |  15 +-
> >  gcc/regcprop.c |   4 +
> >  gcc/regrename.c|  13 +-
> >  gcc/sel-sched-ir.c |   1 +
> >  gcc/shrink-wrap.c  | 705 
> > +
> >  gcc/shrink-wrap.h  |   1 +
> >  gcc/target.def |  57 
> >  15 files changed, 1150 insertions(+), 19 deletions(-)
> > 
> > -- 
> > 1.9.3


Re: [PATCH v2 0/9] Separate shrink-wrapping

2016-08-03 Thread Segher Boessenkool
Ping.


Segher


On Mon, Aug 01, 2016 at 01:42:37AM +, Segher Boessenkool wrote:
> This is the second version.  Concern was renamed to component, and all
> other comments were addressed (I hope).  It still uses only two bitmaps
> for the component placement, but now they are called needs_components
> and has_components, which hopefully is easier to follow.  The "can this
> prologue be moved earlier for no cost" test is generalised a bit, and
> the cost estimate explicitly excludes backedges, which makes it easier
> to see that prologues will not be placed inside a loop if there is no
> benefit to that (I didn't see any different generated code because of
> this change).  And new and updated comments all over, of course.
> 
> Is this okay for trunk?
> 
> 
> Segher
> 
> 
>  gcc/cfgcleanup.c   |   5 +
>  gcc/common.opt |   4 +
>  gcc/config/rs6000/rs6000.c | 257 -
>  gcc/dce.c  |   9 +
>  gcc/doc/invoke.texi|  11 +-
>  gcc/doc/tm.texi|  54 
>  gcc/doc/tm.texi.in |  29 ++
>  gcc/emit-rtl.h |   4 +
>  gcc/function.c |  15 +-
>  gcc/regcprop.c |   4 +
>  gcc/regrename.c|  13 +-
>  gcc/sel-sched-ir.c |   1 +
>  gcc/shrink-wrap.c  | 705 
> +
>  gcc/shrink-wrap.h  |   1 +
>  gcc/target.def |  57 
>  15 files changed, 1150 insertions(+), 19 deletions(-)
> 
> -- 
> 1.9.3


[PATCH v2 0/9] Separate shrink-wrapping

2016-07-31 Thread Segher Boessenkool
This is the second version.  Concern was renamed to component, and all
other comments were addressed (I hope).  It still uses only two bitmaps
for the component placement, but now they are called needs_components
and has_components, which hopefully is easier to follow.  The "can this
prologue be moved earlier for no cost" test is generalised a bit, and
the cost estimate explicitly excludes backedges, which makes it easier
to see that prologues will not be placed inside a loop if there is no
benefit to that (I didn't see any different generated code because of
this change).  And new and updated comments all over, of course.

Is this okay for trunk?


Segher


 gcc/cfgcleanup.c   |   5 +
 gcc/common.opt |   4 +
 gcc/config/rs6000/rs6000.c | 257 -
 gcc/dce.c  |   9 +
 gcc/doc/invoke.texi|  11 +-
 gcc/doc/tm.texi|  54 
 gcc/doc/tm.texi.in |  29 ++
 gcc/emit-rtl.h |   4 +
 gcc/function.c |  15 +-
 gcc/regcprop.c |   4 +
 gcc/regrename.c|  13 +-
 gcc/sel-sched-ir.c |   1 +
 gcc/shrink-wrap.c  | 705 +
 gcc/shrink-wrap.h  |   1 +
 gcc/target.def |  57 
 15 files changed, 1150 insertions(+), 19 deletions(-)

-- 
1.9.3