Greetings,
I've modified Inline to allow unaltered XS to be used. This will be in version 0.43
of Inline, which should be released in the next few days. In the meantime, I'll make
the current version available here:
http://ttul.org/~ingy/Inline-0.43.tar.gz
I've always wanted to run some benchmarks to see whether the wrapper approach used by
Inline is significantly slower than what you can do with plain XS. The only real
difference is that each "subroutine" is divided into two functions: one to do the
typemapping (the wrapper) and one to do the work (the worker :). With Inline you never
write or even see the wrapper, just the worker. (Of course you can always do some of
the wrapping in the worker.) With XS you can "inline" the worker code into your
wrapper. I wanted to know whether that has any performance benefit.
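As a rough illustration of the wrapper/worker split described above, here is a sketch in plain C that does not use the Perl API at all. FakeScalar and the add_* names are hypothetical stand-ins invented for this example; real XS code receives SV* values and converts them with macros such as SvIV().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a Perl scalar. Real XS code works with SV*
   and the conversion macros from perl.h; this struct is only for illustration. */
typedef struct {
    const char *text;
} FakeScalar;

/* The worker: ordinary C that knows nothing about the calling language.
   This is the only part you write when using Inline. */
int add_worker(int x, int y) {
    return x + y;
}

/* The wrapper: does the typemapping (scalar -> int) and then calls the worker.
   Inline generates the equivalent of this for you. */
int add_wrapper(const FakeScalar *args, int nargs) {
    assert(nargs == 2);
    int x = (int)strtol(args[0].text, NULL, 10);
    int y = (int)strtol(args[1].text, NULL, 10);
    return add_worker(x, y);
}

/* The "inlined" style possible in hand-written XS: typemapping and work
   live in one function, saving one C function call. */
int add_inlined(const FakeScalar *args, int nargs) {
    assert(nargs == 2);
    return (int)strtol(args[0].text, NULL, 10)
         + (int)strtol(args[1].text, NULL, 10);
}
```

Calling add_wrapper or add_inlined on the same two arguments gives the same sum. The only structural difference is the extra call to add_worker, and that call is the cost the benchmark tries to measure.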
So I selected the worst case: the worker code is only an integer add. This places all
the performance burden on the linkage/typemapping code. I wrote the add function five
different ways, used Benchmark.pm to time them, and ran the test program four times.
Here is the output:
Benchmark: timing 2000000 iterations of 1-Perl, 2-XS Wrap, 3-XS CODE, 4-XS PP,
5-Inline...
1-Perl: 36 wallclock secs (16.05 usr + 0.03 sys = 16.08 CPU)
2-XS Wrap: 31 wallclock secs (13.87 usr + 0.02 sys = 13.89 CPU)
3-XS CODE: 28 wallclock secs (13.36 usr + 0.03 sys = 13.40 CPU)
4-XS PP: 27 wallclock secs (11.78 usr + 0.02 sys = 11.79 CPU)
5-Inline: 27 wallclock secs ( 9.64 usr + 0.02 sys = 9.66 CPU)
Benchmark: timing 2000000 iterations of 1-Perl, 2-XS Wrap, 3-XS CODE, 4-XS PP,
5-Inline...
1-Perl: 48 wallclock secs (17.90 usr + 0.02 sys = 17.92 CPU)
2-XS Wrap: 39 wallclock secs (12.93 usr + 0.01 sys = 12.94 CPU)
3-XS CODE: 29 wallclock secs (12.25 usr + 0.02 sys = 12.26 CPU)
4-XS PP: 20 wallclock secs (10.18 usr + 0.01 sys = 10.20 CPU)
5-Inline: 25 wallclock secs (10.50 usr + 0.01 sys = 10.51 CPU)
Benchmark: timing 2000000 iterations of 1-Perl, 2-XS Wrap, 3-XS CODE, 4-XS PP,
5-Inline...
1-Perl: 43 wallclock secs (15.90 usr + 0.00 sys = 15.90 CPU)
2-XS Wrap: 18 wallclock secs ( 9.21 usr + 0.00 sys = 9.21 CPU)
3-XS CODE: 24 wallclock secs ( 8.73 usr + 0.01 sys = 8.74 CPU)
4-XS PP: 29 wallclock secs ( 9.10 usr + 0.01 sys = 9.10 CPU)
5-Inline: 18 wallclock secs ( 9.98 usr + 0.01 sys = 9.98 CPU)
Benchmark: timing 2000000 iterations of 1-Perl, 2-XS Wrap, 3-XS CODE, 4-XS PP,
5-Inline...
1-Perl: 14 wallclock secs (14.11 usr + -0.00 sys = 14.11 CPU)
2-XS Wrap: 10 wallclock secs ( 9.20 usr + 0.00 sys = 9.20 CPU)
3-XS CODE: 10 wallclock secs ( 8.69 usr + 0.01 sys = 8.70 CPU)
4-XS PP: 10 wallclock secs ( 9.52 usr + -0.00 sys = 9.51 CPU)
5-Inline: 9 wallclock secs ( 9.20 usr + 0.00 sys = 9.21 CPU)
---
Here's an explanation of each test. (The actual code is below.)
1-Perl) This test is pure Perl. It always takes the longest, but not a lot longer,
because the operation is so simple.
2-XS Wrap) This function uses the same wrapping style as Inline. In fact, this is what
most Inline functions look like under the hood.
3-XS CODE) This XSUB uses the CODE directive to put the worker code inline.
4-XS PP) This one is hand-tuned to be as fast as possible. It throws away the
parameter checking that usually accompanies XSUBs. It also makes a substantial
sacrifice in maintainability.
5-Inline) This is the normal Inline way of doing things. It is obviously the most
readable/maintainable of the C functions.
The results of these tests are not 100% consistent, but they do seem to suggest that
all of the C ways of doing things are on equal footing, and that there is little, if
anything, to gain by tuning the linkage code. This would seem to indicate that, as far
as linkage is concerned, most of the execution time is spent inside Perl itself, and
therefore cannot be improved upon by the application code, whether it uses Inline or
XS.
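To make the comparison a little less dependent on any single run, the CPU totals from the four runs above can be averaged. The sketch below is an illustration only: the figures are copied from the benchmark output, while mean_cpu and usec_per_call are hypothetical helper names made up for this example. The averages come out at roughly 16.0 CPU seconds for pure Perl against about 9.8 to 11.3 for the four C variants, which is consistent with the reading that the C variants sit close together.

```c
#include <assert.h>
#include <stddef.h>

#define VARIANTS 5
#define RUNS 4
#define ITERATIONS 2000000.0   /* iterations per timethese() run */

/* Labels in the same order as the rows of cpu_secs. */
const char *const labels[VARIANTS] = {
    "1-Perl", "2-XS Wrap", "3-XS CODE", "4-XS PP", "5-Inline"
};

/* Total CPU seconds (usr + sys) per variant, one column per run,
   copied from the Benchmark output in this post. */
const double cpu_secs[VARIANTS][RUNS] = {
    { 16.08, 17.92, 15.90, 14.11 },   /* 1-Perl    */
    { 13.89, 12.94,  9.21,  9.20 },   /* 2-XS Wrap */
    { 13.40, 12.26,  8.74,  8.70 },   /* 3-XS CODE */
    { 11.79, 10.20,  9.10,  9.51 },   /* 4-XS PP   */
    {  9.66, 10.51,  9.98,  9.21 },   /* 5-Inline  */
};

/* Arithmetic mean of n samples. */
double mean_cpu(const double *samples, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += samples[i];
    }
    return sum / (double)n;
}

/* Convert the CPU seconds of one whole run into microseconds per call. */
double usec_per_call(double cpu_seconds) {
    return cpu_seconds / ITERATIONS * 1e6;
}
```

For example, usec_per_call(mean_cpu(cpu_secs[4], RUNS)) gives the average cost of one add5 call in microseconds. Four runs with this much variation between them are a small sample, so the averages are best read as a rough guide rather than a ranking.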
This speaks well for Inline. Although you are giving up a level of control to Inline's
abstraction, that level seems not to be critical to performance.
Here is the script that I used. (Remember, it requires Inline 0.43)
---8<---
# First Section - uses just XS
use Inline C => DATA =>
           ENABLE => 'XSMODE',
           NAME => 'foo';

# Second Section - normal Inline
use Inline C;

use Benchmark;

timethese( 2000000,
    {
        '1-Perl'    => sub { add1(3, 5) },
        '2-XS Wrap' => sub { add2(3, 5) },
        '3-XS CODE' => sub { add3(3, 5) },
        '4-XS PP'   => sub { add4(3, 5) },
        '5-Inline'  => sub { add5(3, 5) },
    }
);

sub add1 {
    return $_[0] + $_[1];
}

__END__

__C__

#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"

int add2 (int x, int y) {
    return x + y;
}

MODULE = foo    PACKAGE = main

int
add2 (x, y)
        int     x
        int     y

int
add3 (x, y)
        int     x
        int     y
    CODE:
        RETVAL = x + y;
    OUTPUT:
        RETVAL

void
add4 (...)
    PPCODE:
    {
        SV* sum = sv_newmortal();
        sv_setiv(sum, (IV)((int)SvIV(ST(0)) + (int)SvIV(ST(1))));
        ST(0) = sum;
        XSRETURN(1);
    }

__C__

int add5 (int x, int y) {
    return x + y;
}
---8<---
Cheers, Brian