On 07/12/2017 04:44 PM, Segher Boessenkool wrote:
> On Tue, Jul 11, 2017 at 03:19:36PM -0600, Jeff Law wrote:
>> Examples of implicit probes include
> 
>>   2. ABI mandates that *sp always contain a backchain pointer (ppc)
> 
> In the ELFv2 ABI a backchain is not required.  GCC still always has
> one afaik.  I'll find out more.
Please do.  I was under the impression it was mandated by the earlier
ABIs as well.  If it isn't, then I don't think we can depend on it for
the older ABIs.

That wouldn't be the end of the world -- it's pretty clear that ppc64le
is the future and we'd get good code there.  I wouldn't lose much sleep
if ppc32 and ppc64 big endian had a less efficient probing scheme.

We'd set up a last_probe_offset tracker like we do for aarch & s390.
For ppc64le its initial state would be zero.  For ppc32 and ppc64 big
endian the initial state would be PROBE_OFFSET - STACK_BOUNDARY /
UNITS_PER_WORD.  Depending on cost/benefit analysis we could try to
optimize those ports, but given overall directions it just might not be
worth the effort.
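
As a rough sketch of how such a tracker could behave (a toy model, not
the code in any GCC port): the tracker records the distance between the
stack pointer and the last address known to have been touched, and an
explicit probe is only needed when that distance would reach the probe
interval.  PROBE_INTERVAL, struct probe_tracker and
allocate_with_probes are hypothetical names invented for illustration,
and the 4096-byte interval is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed probe interval in bytes; illustrative only, not taken from
   any GCC port.  */
enum { PROBE_INTERVAL = 4096 };

/* Hypothetical tracker: distance in bytes between the stack pointer
   and the most recent address known to have been touched, whether by
   an implicit or an explicit probe.  */
struct probe_tracker {
    size_t last_probe_offset;
};

/* Model allocating SIZE bytes of stack.  Returns the number of explicit
   probes that would have to be emitted and updates the tracker.  */
static size_t
allocate_with_probes(struct probe_tracker *t, size_t size)
{
    /* Invariant: we never sit a full interval away from a probe.  */
    assert(t->last_probe_offset < PROBE_INTERVAL);

    size_t probes = 0;
    while (size > 0) {
        /* Bytes we may still allocate before a probe is required.  */
        size_t room = PROBE_INTERVAL - t->last_probe_offset;
        size_t step = size < room ? size : room;

        t->last_probe_offset += step;
        size -= step;

        if (t->last_probe_offset == PROBE_INTERVAL) {
            /* Emit an explicit probe at the new stack pointer.  */
            probes++;
            t->last_probe_offset = 0;
        }
    }
    return probes;
}
```

In this model a target whose ABI guarantees a store at *sp would start
the tracker at zero, while a target without that guarantee would start
it at some pessimistic nonzero value and therefore hit its first
explicit probe sooner for the same allocation.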

> 
>> To get a sense of overhead, just 1.5% of routines in glibc need probing
>> in their prologues (x86) in the testing I performed.  IIRC each and
>> every one of those routines needed just 1-4 inlined probes.
>>
>> Significantly more functions need alloca space probed (IIRC ~5%), but
>> given the amazingly inefficient alloca code, I can't believe anyone will
>> ever notice the probing overhead.
> 
> That is quite a lot of functions IMO, but it's just one store per page
> (or per alloca), and supposedly you'll store to that stack anyway (or
> it is stupid slow code in the first place).  Did you measure any real
> timings?
Haven't measured any real timings.  We hit so few functions with the
prologue probes it's hard to see how they could end up being measurable.

The code we generate for alloca was so awful it's hard to see how
hitting each page once would matter either.  *However* I was looking at
x86 in this case and due to potential stack realignments x86's alloca
code might be notably worse than others for constant sizes.

There are further improvements that could be made as well.  It ought to
be possible to write an optimizer pass that uses some of the ideas from
DSE and SLSR to identify explicit probes that are made redundant by
nearby implicit probes -- this would seem most useful for the dynamic space.

The problem is we'd want to do that in gimple, but probing of the
dynamic space happens at the gimple/rtl border.  So we'd probably want
to make probing happen earlier to expose stuff at the gimple level.


Jeff
