Link-time optimization (LTO) can be turned on by adding the
-flto flag to the proton library build, in both the compilation
and linking steps.  It allows optimizations that use deeper
knowledge of the whole program than is available before
linking.

I have also been trying to get some extra performance by hand-
inlining functions that I select based on valgrind/callgrind
profiling data.

My test procedure has been to run 50 trials, where each trial 
is a run of two programs: my psend and precv proton-C clients
written at the Engine level.  Each trial involves sending and 
receiving 5 million small messages.  

The result from each trial is a single high-resolution timing 
number.  (From just before the sender sends the first message, 
to just after the receiver receives the last message.)
The result of each test is a list of 50 of those numbers.
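
As a rough sketch of how a list of per-trial timings turns into
the mean and sigma figures quoted in this message: the function
name summarize and the sample timings are hypothetical, and the
use of the sample standard deviation (n - 1 divisor) is an
assumption, since the post does not say which form of sigma was
used.

```python
import statistics


def summarize(trial_seconds):
    """Return (mean, sigma) for a list of per-trial timings in seconds."""
    mean = statistics.mean(trial_seconds)
    # Assumption: sigma is the sample standard deviation (n - 1 divisor).
    sigma = statistics.stdev(trial_seconds)
    return mean, sigma


# Hypothetical timings for illustration only, not measured data.
timings = [41.0, 42.0, 40.0, 41.5, 40.5]
mean, sigma = summarize(timings)
print(f"mean {mean:.6f}   sigma {sigma:.6f}")
```

A real test would pass in all 50 timing numbers from one build.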

I compare tests using an online Student's t-test calculator.
("Student" was the pen name of the guy who invented it.
His real name was Gosset, and he was working at the Guinness
Brewery in Dublin when he invented it. I am not making this up.)
The t-test gives a number (the p-value) that indicates the
likelihood that the difference between two tests could have
happened randomly.

A small t-test result indicates that the difference between
two tests is unlikely to have happened randomly.  For example,
a t-test result of 0.01 means that a difference as large as the
one between your two tests should only happen 1 time out of 100
due to random chance.  Smaller results are better.

With 50 sample-points in each test, you can get nice high 
certainty as to whether you are seeing real or random results.
All of the results below are hyper-significant.  The *worst*
t-test result was 2.9e-8, i.e. about 3 chances in 100 million
that the difference between the two tests could happen randomly.
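
For anyone who wants to check the comparison without an online
calculator, here is a minimal sketch that computes a t statistic
and degrees of freedom from summary figures alone.  It assumes
Welch's unequal-variance form of the test, which may not be what
the calculator used; with equal sample sizes the t statistic is
the same as the pooled Student's form, and only the degrees of
freedom differ.  The function name welch_t is hypothetical.  The
example numbers are the baseline and LTO means and sigmas from
the results in this message.  Turning t into the final
probability needs the t-distribution, which this sketch leaves
to the calculator.

```python
import math


def welch_t(mean_a, sigma_a, n_a, mean_b, sigma_b, n_b):
    """Return (t, degrees_of_freedom) for two tests given summary stats."""
    # Squared standard error of each mean.
    se2_a = sigma_a ** 2 / n_a
    se2_b = sigma_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Baseline vs. LTO build, 50 trials each.
t, df = welch_t(41.267825, 0.834826, 50, 40.073661, 1.108513, 50)
print(f"t = {t:.3f}, df = {df:.1f}")
```

A larger absolute t corresponds to a smaller t-test result.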

So, here are the results (in seconds).

(Builds used throughout are normal release-with-debug-info,
with -O2 optimization.)

1. Proton code as of 0800 EDT yesterday, with no changes. 
   mean 41.267825   sigma 0.834826

2. LTO build
   mean 40.073661   sigma 1.108513    improvement: 2.9%

3. manual inlining changes
   mean 39.011794   sigma 1.056831    improvement: 5.5%

4. LTO build plus my changes
   mean 39.211283   sigma 1.041303    improvement: 5.0%
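
The improvement figures above can be reproduced from the means.
This sketch assumes "improvement" means the relative reduction
in mean run time compared with the unchanged build (test 1),
which is the definition that matches the reported percentages;
the function name improvement_percent is hypothetical.

```python
def improvement_percent(baseline_mean, test_mean):
    """Percent reduction in mean run time relative to the baseline."""
    return 100.0 * (baseline_mean - test_mean) / baseline_mean


baseline = 41.267825  # test 1: unchanged Proton code

means = {
    "LTO build": 40.073661,
    "manual inlining changes": 39.011794,
    "LTO build plus my changes": 39.211283,
}

for name, mean in means.items():
    print(f"{name}: {improvement_percent(baseline, mean):.1f}%")
# LTO build: 2.9%
# manual inlining changes: 5.5%
# LTO build plus my changes: 5.0%
```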

So!  LTO really works, but it's not as good as manual
inlining based on profiling.  In fact, adding LTO on top of
the manual inlining slows things down a little (mean 39.21
vs. 39.01 seconds), probably because it is choosing some
inlining candidates that don't help enough to offset the
cache thrash caused by the increase in code size.

so there you go.
