Part of our parsing slowness comes from inappropriate abstractions in our
encoding and character set code.  Consider Parrot_str_find_cclass(), which
searches a string for the first occurrence of a member of a specific character
class.

It first calls the specific character set's find_cclass() function.  Look at
the version in src/string/charset/ascii.c.  It loops over each character in
the requested range, calling the STRING's encoding's get_codepoint() function
for every character, then compares the returned codepoint against the
character class table to find a member of the class.
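
Schematically, the current shape looks something like the toy model below.
This is not Parrot's actual source; the types, names, and signatures are
hypothetical stand-ins for the real STRING, encoding vtable, and character
class table, kept only to show where the per-codepoint indirection happens.

    /* Toy model of the current per-codepoint dispatch; all names here are
     * hypothetical stand-ins, not Parrot's real declarations. */
    #include <stddef.h>

    typedef struct toy_string   toy_string;
    typedef struct toy_encoding toy_encoding;

    struct toy_encoding {
        /* Indirect call made once per codepoint by the current code. */
        unsigned int (*get_codepoint)(const toy_string *s, size_t offset);
    };

    struct toy_string {
        const unsigned char *buf;
        size_t               len;
        const toy_encoding  *encoding;
    };

    /* typetable is indexed by codepoint; flags selects the class. */
    static size_t
    find_cclass_current(const toy_string *s, const unsigned short *typetable,
                        unsigned short flags, size_t offset, size_t count)
    {
        const size_t end = offset + count;
        size_t       pos;

        for (pos = offset; pos < end; pos++) {
            /* One function-pointer call per codepoint examined ... */
            const unsigned int cp = s->encoding->get_codepoint(s, pos);

            /* ... followed by the actual character class test. */
            if (cp < 256 && (typetable[cp] & flags))
                return pos;
        }

        return end;
    }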

If, as Rakudo does when building its actions.pir, you call
Parrot_str_find_cclass() some 11 million times, with an average of 32.9
codepoints searched per call, you've performed some 318 million C function
calls through C function pointers to inch through a STRING, codepoint by
codepoint.

I suspect that we could reclaim a fair amount of performance (shaving some
25% off of parsing time in general) if we inverted the lookup: pass the
character set's class table to an encoding function which iterates over the
string itself (no per-codepoint function call overhead) and performs the
lookup directly.
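
Here's a minimal sketch of the inverted shape, reusing the toy types from the
model above (names still hypothetical), for a fixed-width eight-bit encoding:
the class table travels down into the encoding, which walks its own buffer in
a tight loop.

    /* Inverted lookup for a fixed-width eight-bit encoding, reusing the
     * toy types from the earlier sketch; names are hypothetical. */
    static size_t
    fixed8_find_cclass(const toy_string *s, const unsigned short *typetable,
                       unsigned short flags, size_t offset, size_t count)
    {
        const unsigned char *p   = s->buf + offset;
        const size_t         end = offset + count;
        size_t               pos;

        /* The encoding owns the iteration: a table lookup per byte, with
         * no per-codepoint indirect call. */
        for (pos = offset; pos < end; pos++, p++) {
            if (typetable[*p] & flags)
                return pos;
        }

        return end;
    }

A variable-width encoding would still have to decode each codepoint in its
version of this loop, but it would do so inline rather than through a
function pointer per character.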

We'd have to add a function to the encoding API and change
Parrot_str_find_cclass() to use it, but that should be an hour of work for an
enterprising hacker, in exchange for a surprising performance improvement.
It's relatively low-risk as well, so if someone can do it in the next 12
hours, it may be worth including in Tuesday's release.
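
For concreteness, in terms of the same toy model, the new encoding API entry
and the rewritten top-level search might look roughly like this (hypothetical
names throughout): one indirect call per string searched instead of one per
codepoint examined.

    /* Hypothetical extension of the toy encoding vtable: find_cclass sits
     * next to get_codepoint, so each encoding supplies its own tight loop. */
    struct toy_encoding_v2 {
        unsigned int (*get_codepoint)(const toy_string *s, size_t offset);
        size_t       (*find_cclass)(const toy_string *s,
                                    const unsigned short *typetable,
                                    unsigned short flags,
                                    size_t offset, size_t count);
    };

    /* The top-level search collapses to a single indirect call. */
    static size_t
    str_find_cclass_inverted(const toy_string *s, const toy_encoding_v2 *enc,
                             const unsigned short *typetable,
                             unsigned short flags, size_t offset, size_t count)
    {
        return enc->find_cclass(s, typetable, flags, offset, count);
    }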

-- c