Part of our slowness in parsing comes from inappropriate abstractions in our encoding and character set code. Consider Parrot_str_find_cclass(), which searches a string for the first occurrence of a member of a given character class.
It first calls the specific character set's find_cclass() function. Look at the version in src/string/charset/ascii.c: it loops over every character in the requested range, calling the STRING's encoding's get_codepoint() function for each one, then compares the returned codepoint against the character class table to find a match.

If, as Rakudo does while building its actions.pir, you call Parrot_str_find_cclass() some 11 million times, with an average of 32.9 codepoints searched per call, you've performed some 318 million C function calls through function pointers just to inch through a STRING, codepoint by codepoint.

I suspect we could reclaim a fair amount of performance (shaving some 25% off of parsing time in general) by inverting the lookup: pass the character class table to an encoding function which iterates the string itself (no per-codepoint function call overhead) and performs the table lookup directly. We'd have to add a function to the encoding API and change Parrot_str_find_cclass() to use it, but that should be about an hour of work for an enterprising hacker, for a surprising performance improvement. It's relatively low-risk as well, so if someone can do it in the next 12 hours, it may be worth including in Tuesday's release. Rough sketches of both shapes follow.
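To make the current shape concrete, here is a minimal sketch. The names and types (Str, a get_codepoint pointer stored on the string, an unsigned short class table, charset_find_cclass) are simplified stand-ins I made up for illustration, not the real STRING, charset, or encoding structures:

    /* Sketch of the current shape -- NOT the actual Parrot source.
     * "Str", the inline get_codepoint pointer, and the unsigned short
     * table stand in for the real STRING/encoding/charset structures. */
    #include <stddef.h>

    typedef struct Str Str;

    struct Str {
        const unsigned char *buf;                            /* string payload  */
        size_t               len;
        unsigned int (*get_codepoint)(const Str *, size_t);  /* encoding hook   */
    };

    /* a trivial fixed-width 8-bit "encoding" */
    static unsigned int
    fixed8_get_codepoint(const Str *s, size_t idx)
    {
        return s->buf[idx];
    }

    /* charset-level search: one indirect call per codepoint */
    static size_t
    charset_find_cclass(const Str *s, const unsigned short *cclass_table,
                        unsigned short flags, size_t offset, size_t count)
    {
        const size_t end = offset + count;
        size_t       i;

        for (i = offset; i < end; i++) {
            /* call through a function pointer for every single character */
            const unsigned int cp = s->get_codepoint(s, i);
            if (cp < 256 && (cclass_table[cp] & flags))
                return i;
        }
        return end;    /* not found */
    }

    /* a caller would set things up roughly like:
     *     Str s = { (const unsigned char *)"hello world", 11,
     *               fixed8_get_codepoint };
     *     charset_find_cclass(&s, table, some_flag, 0, s.len);
     */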
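The inversion I have in mind looks more like this, using the same stand-in types as above; fixed8_find_cclass is a hypothetical new encoding API entry, not an existing function:

    /* Sketch of the inverted shape: the class table is handed to an
     * encoding-level routine that walks its own buffer directly, so the
     * hot loop contains no function-pointer calls at all. */
    static size_t
    fixed8_find_cclass(const Str *s, const unsigned short *cclass_table,
                       unsigned short flags, size_t offset, size_t count)
    {
        const size_t end = offset + count;
        size_t       i;

        for (i = offset; i < end; i++) {
            if (cclass_table[s->buf[i]] & flags)   /* direct buffer access */
                return i;
        }
        return end;
    }

Each encoding would supply its own version that decodes inline, which is why the new entry belongs in the encoding API rather than the charset, and Parrot_str_find_cclass() would dispatch to it once per call instead of once per codepoint.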
-- c