Inline and XS benchmarks

Brian Ingerson Sat, 30 Jun 2001 14:00:51 -0700
Greetings,

I've modified Inline to allow unaltered XS to be used. This will be in version 0.43 
of Inline, which should be released in the next few days. Until then, I'll make the 
current version available here: 

    http://ttul.org/~ingy/Inline-0.43.tar.gz

I've always wanted to run some benchmarks to see whether the wrapper approach used by 
Inline is significantly slower than what you can do with plain XS. The only real 
difference is that each "subroutine" is divided into two functions: one to do the 
typemapping (the wrapper) and one to do the work (the worker :). With Inline you never 
code/see the wrapper, just the worker. (Of course you can always do some of the 
wrapping in the worker.) With XS you can "inline" the worker code into your wrapper, 
and I wanted to know whether this has any performance benefit.
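
To make the wrapper/worker split concrete without pulling in the Perl headers, here is 
a toy model in plain C. Everything in it (toy_scalar, toy_to_int, add_worker, 
add_wrapped, add_inlined) is hypothetical and only mimics the shape of the two styles; 
real XS wrappers operate on SV* values through the macros in perl.h and XSUB.h.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a Perl scalar. Real XS code receives SV* values. */
typedef struct {
    const char *text;   /* the argument as it arrives from the script side */
} toy_scalar;

/* Toy "typemap": convert an incoming scalar to a C int. */
int toy_to_int(const toy_scalar *s) {
    assert(s != NULL && s->text != NULL);
    return (int)strtol(s->text, NULL, 10);
}

/* The worker: plain C, knows nothing about scalars. */
int add_worker(int x, int y) {
    return x + y;
}

/* Wrapper style: do the typemapping, then call the separate worker. */
int add_wrapped(const toy_scalar *args) {
    int x = toy_to_int(&args[0]);
    int y = toy_to_int(&args[1]);
    return add_worker(x, y);   /* one extra C function call */
}

/* Inlined style: the worker body is written directly in the wrapper. */
int add_inlined(const toy_scalar *args) {
    int x = toy_to_int(&args[0]);
    int y = toy_to_int(&args[1]);
    return x + y;              /* no separate worker call */
}
```

In this model the only difference between the two styles is a single extra C function 
call, which is the cost the benchmark is trying to expose.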

So I selected the worst case: the worker code is only an integer add. This places all 
the performance burden on the linkage/typemapping code. I wrote the add functionality 
five different ways and used Benchmark.pm to test them. I ran the test program four 
times. Here is the output:

Benchmark: timing 2000000 iterations of 1-Perl, 2-XS Wrap, 3-XS CODE, 4-XS PP, 5-Inline...
    1-Perl: 36 wallclock secs (16.05 usr +  0.03 sys = 16.08 CPU)
 2-XS Wrap: 31 wallclock secs (13.87 usr +  0.02 sys = 13.89 CPU)
 3-XS CODE: 28 wallclock secs (13.36 usr +  0.03 sys = 13.40 CPU)
   4-XS PP: 27 wallclock secs (11.78 usr +  0.02 sys = 11.79 CPU)
  5-Inline: 27 wallclock secs ( 9.64 usr +  0.02 sys =  9.66 CPU)

Benchmark: timing 2000000 iterations of 1-Perl, 2-XS Wrap, 3-XS CODE, 4-XS PP, 5-Inline...
    1-Perl: 48 wallclock secs (17.90 usr +  0.02 sys = 17.92 CPU)
 2-XS Wrap: 39 wallclock secs (12.93 usr +  0.01 sys = 12.94 CPU)
 3-XS CODE: 29 wallclock secs (12.25 usr +  0.02 sys = 12.26 CPU)
   4-XS PP: 20 wallclock secs (10.18 usr +  0.01 sys = 10.20 CPU)
  5-Inline: 25 wallclock secs (10.50 usr +  0.01 sys = 10.51 CPU)

Benchmark: timing 2000000 iterations of 1-Perl, 2-XS Wrap, 3-XS CODE, 4-XS PP, 5-Inline...
    1-Perl: 43 wallclock secs (15.90 usr +  0.00 sys = 15.90 CPU)
 2-XS Wrap: 18 wallclock secs ( 9.21 usr +  0.00 sys =  9.21 CPU)
 3-XS CODE: 24 wallclock secs ( 8.73 usr +  0.01 sys =  8.74 CPU)
   4-XS PP: 29 wallclock secs ( 9.10 usr +  0.01 sys =  9.10 CPU)
  5-Inline: 18 wallclock secs ( 9.98 usr +  0.01 sys =  9.98 CPU)

Benchmark: timing 2000000 iterations of 1-Perl, 2-XS Wrap, 3-XS CODE, 4-XS PP, 5-Inline...
    1-Perl: 14 wallclock secs (14.11 usr + -0.00 sys = 14.11 CPU)
 2-XS Wrap: 10 wallclock secs ( 9.20 usr +  0.00 sys =  9.20 CPU)
 3-XS CODE: 10 wallclock secs ( 8.69 usr +  0.01 sys =  8.70 CPU)
   4-XS PP: 10 wallclock secs ( 9.52 usr + -0.00 sys =  9.51 CPU)
  5-Inline:  9 wallclock secs ( 9.20 usr +  0.00 sys =  9.21 CPU)

---

Here's an explanation of each test. (The actual code is below.)

1-Perl) This test is pure Perl. It always takes the longest, but not a lot longer, 
because the operation is so simple.

2-XS Wrap) This function uses the same wrapping style as Inline. In fact, this is what 
most Inline functions look like under the hood.

3-XS CODE) This XSUB uses the CODE directive to put the worker code inline.

4-XS PP) This one is hand-tuned to be as fast as possible. It throws away the 
parameter checking that usually accompanies XSUBs. It also makes a substantial 
sacrifice in maintainability.

5-Inline) This is the normal Inline way of doing things. It is obviously the most 
readable/maintainable of the C functions.

The results of these tests are not 100% consistent, but they do suggest that all of 
the C ways of doing things are on an equal footing, and that there is little, if 
anything, to gain by tuning the linkage code. This would indicate that, as far as 
linkage is concerned, Perl itself accounts for most of the execution time when using 
XS, and that this therefore cannot be improved upon by the application code, whether 
it uses Inline or XS.
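
As a rough check of that reading, the CPU figures from the four runs above can be 
averaged per variant. The sketch below is not part of the benchmark script; the names 
(mean_cpu, usec_per_call, and the cpu_* arrays) are made up for illustration, and the 
numbers are copied from the "CPU" column of the Benchmark output.

```c
#include <assert.h>
#include <stddef.h>

#define RUNS 4
#define ITERATIONS 2000000.0   /* iterations per timing, from the Benchmark header */

/* CPU seconds per run, copied from the four Benchmark outputs. */
const double cpu_perl[RUNS]    = {16.08, 17.92, 15.90, 14.11};
const double cpu_xs_wrap[RUNS] = {13.89, 12.94,  9.21,  9.20};
const double cpu_xs_code[RUNS] = {13.40, 12.26,  8.74,  8.70};
const double cpu_xs_pp[RUNS]   = {11.79, 10.20,  9.10,  9.51};
const double cpu_inline[RUNS]  = { 9.66, 10.51,  9.98,  9.21};

/* Arithmetic mean of n timings. */
double mean_cpu(const double *t, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += t[i];
    }
    return sum / (double)n;
}

/* Convert total CPU seconds for one timing into microseconds per call. */
double usec_per_call(double cpu_seconds) {
    return cpu_seconds / ITERATIONS * 1e6;
}
```

With these figures, mean_cpu(cpu_perl, RUNS) comes to about 16.0 CPU seconds, while 
the four C variants average between about 9.8 and 11.3; for Inline that is roughly 4.9 
microseconds per call. The same variant differs by several seconds from run to run 
(2-XS Wrap ranges from 9.20 to 13.89), so the averages should be read as rough.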

This speaks well for Inline. Although you are giving up a level of control to Inline's 
abstraction, that level seems not to be critical to performance.

Here is the script that I used. (Remember, it requires Inline 0.43.)

---8<---
# First Section - uses just XS
use Inline C => DATA =>
           ENABLE => 'XSMODE',
           NAME => 'foo';

# Second Section - normal Inline
use Inline C;

use Benchmark;
timethese( 2000000,
           {
            '1-Perl'    => sub { add1(3, 5) },
            '2-XS Wrap' => sub { add2(3, 5) },
            '3-XS CODE' => sub { add3(3, 5) },
            '4-XS PP'   => sub { add4(3, 5) },
            '5-Inline'  => sub { add5(3, 5) },
           }
         );

sub add1 {
    return $_[0] + $_[1];
}

__END__
__C__
#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"

int add2 (int x, int y) {
    return x + y;
}

MODULE = foo       PACKAGE = main

int
add2 (x, y)
        int     x
        int     y

int
add3 (x, y)
        int     x
        int     y
    CODE:
        RETVAL = x + y;
    OUTPUT:
        RETVAL

void
add4 (...)
    PPCODE:
        {
            SV* sum = sv_newmortal();
            sv_setiv(sum, (IV)((int)SvIV(ST(0)) + (int)SvIV(ST(1))));
            ST(0) = sum;
            XSRETURN(1);
        }

__C__
int add5 (int x, int y) {
    return x + y;
}
---8<---

Cheers, Brian