John Tobey <[EMAIL PROTECTED]> writes:
>> Maybe not for void functions with no args, tail-called and with
>> no prefix, but in more typical cases yes it can be different:
>> the "function-ness" of i_foo applies constraints on where args
>> and "result" are, which the optimizer _may_ not be able to unravel.
>
>May not be able because of what the Standard says, or because of
>suboptimal optimization?

Suboptimal optimization (i.e. lack of knowledge about the rest of the
program at the time of expansion) - note that a suitably "optimal"
optimizer _could_ turn 100,000 #define-d lines back into "local real
functions".

But it is usually much easier to add entropy than to remove it - so
start with it as the same real function - call it, and let the compiler
decide which ones to expand.

>
>GCC won't unless you go -O3 or above.  This is why many people (me
>included) stop at -O2 for most programs.

Me too - because I _fundamentally_ believe inlining is nearly always
sub-optimal for real programs.

But -O3 (or -finline-functions) is there for the folk who want
to believe the opposite.

And there is -Dinline -D__inline__ for the inline case.
What there isn't, though, is -fhash_define-as-inline or -fno-macros,
so at the very least let's avoid _that_ path.

>
>> >Non-inline functions have their place in reducing code size
>> >and easing debugging.  I just want an i_foo for every foo that callers
>> >will have the option of using.
>> 
>> Before we make any promises to do all that extra work can we 
>> measure (for various architectures) the cost of a real call vs inline.
>> 
>> I want proof that inline makes X% difference.
>
>I'm not going to prove that.  A normal C function call involves
>several instructions and a jump most likely across page boundaries.

I have said this before but the gist of the Nick-theory is:

Page boundaries are a don't-care unless there is a page miss.
Page misses are so costly that everything else can be ignored,
but for sane programs they should only be incurred at "startup".
(Reducing code size, e.g. no inline, only helps here - fewer pages to load.)

It is cache that matters.

Modern processors (can) execute several instructions per cycle.
In contrast, a cache miss to 100MHz SDRAM costs a 500MHz processor
more than 5 cycles (say up to 10 instructions for a 2-way super-scalar)
per word missed.

I used to think that this was a "RISC processor only" argument.
But it seems (no hard numbers yet) that the Pentium at least follows
the same pattern.

>If someone else wants to prove this, great.  I just don't think it's
>that much trouble.  (mostly psychological - what will people think if
>they see that all our code is in headers and all our C files are
>autogenerated?)

We can unlink the .c files once we have compiled them ;-)

-- 
Nick Ing-Simmons
