I have a fine book recommendation: "Programming Pearls" by Jon Bentley. http://www.cs.bell-labs.com/cm/cs/pearls/
He advises using simple, special-purpose tools for this kind of problem. Essentially that would be some combination of awk and sort, instead of a customized Perl XS. :)
OK, below is my hint for your problem:
PerlDiscuss - Perl Newsgroups and mailing lists wrote:
Background. We have about a billion records that we need to merge from 100
files. We have devised a customised algorithm that reduces the runtime
from weeks (using an RDBMS) to about 12 hours. The main bottleneck
currently is the need to sort an array (of arrays) composed of 100
items.
[ [ 'a_str', val ], [ 'b_str', val ], ... [ '100th_str', val ], ]
The sort occurs on the 'X_str' string ie $ary->[N]->[0]
Currently we are using Perl's native sort, which of course works fine. However, in 99% of the cases only the $ary->[0]->[0] element will be out of place; the remainder of the list will have remained sorted. Perl's native sort cannot take advantage of this, but we would like to, since on average only 50 comparisons (somewhere between 1 and 99) are then required to correctly position the element. Sort requires at least 99 comparisons every time, so this should be much faster.
Here is an implementation in perl to show what we want to achieve in case that was not clear enough
# moveFirst n=200000 Took 56 seconds 3571.43/sec
# sort      n=200000 Took 85 seconds 2352.94/sec
sub moveFirstElementIntoPosition {
    my $ref = shift;
    # if we are already in place, return
    return if $ref->[0]->[0] le $ref->[1]->[0];
    # bubble it into position
    for (1 .. $#{$ref}) {
        if ( $ref->[$_-1]->[0] le $ref->[$_]->[0] ) {
            return;
        }
        else {
            ( $ref->[$_-1], $ref->[$_] ) = ( $ref->[$_], $ref->[$_-1] );
        }
    }
}
my $s = time();
my $n = 200_000;
my @test = map { [ $_, $a++ ] } ( 'ca', 'aa' .. 'bz', 'cb' .. 'dz' );
for ( 1 .. $n ) {
    my @ary = @test;
    moveFirstElementIntoPosition( \@ary );
}
$s = time() - $s;
printf "moveFirst n=$n Took %d seconds %.2f/sec\n", $s, $n/$s;
$s = time();
for ( 1 .. $n ) {
    my @ary = @test;
    @ary = sort { $a->[0] cmp $b->[0] } @ary;
}
$s = time() - $s;
printf "sort n=$n Took %d seconds %.2f/sec\n", $s, $n/$s;
As you can see, even implemented in Perl the move algorithm beats sort by some 30%. I would expect to beat sort by 50%+ if we implemented something similar in XS. With a billion sorts even a small increase in throughput is significant; hence the desire.
If the data existed in a C struct I would have no problem implementing this, but of course it does not. Although I have used some XS to implement a number of algorithms where speed was of the essence, I have not tackled anything similar before; in other words, my XS is not currently up to it. I have had a look at sort.c in the Perl source and at the cmp function(s). I would welcome any suggestions as to the most efficient way to implement this moveFirstElementIntoPosition() function. I can see a way to do it with my limited XS by returning a new array ref, but ideally I would just like to modify the array passed to the function in place, with a void return, to avoid the overhead of copying data.
Can anyone point me in the right direction, perhaps to some XS code that modifies an array in place, or anything else that might be useful?
I also tried sort-in-place in my Tie::CArray implementation,
which is especially suited to such bulk external data.
I needed it for computational geometry problems: several hundred thousand 2d or 3d points.
http://search.cpan.org/src/RURBAN/Tie-CArray-0.12/CArray.xs
http://search.cpan.org/~rurban/Tie-CArray-0.12/CArray.pm#CLASS_METHODS
=> see isort
But my isort was never implemented, because of the funarg problem (calling back into Perl for the comparison) in XS. You can, however, take that nreverse code and do the sort in XS; no major problem.
Note: the ARBITRARY STRUCTURED ARRAYS / PACK-STYLE TEMPLATES code is also not enabled by default.
-- Reini Urban http://xarch.tu-graz.ac.at/home/rurban/