Request for testing regexp engine optimisation from spamassassin developers using bleadperl

demerphq 11 Feb 2005 14:06:20 -0000

Hi,

Ive recently posted a patch to the bleadperl regex engine that should
speed up certain parts of the spamassissin codebase, possibly
considerably. Unfortunately im not really in a position to test using
spamassissin and I was wondering if anybody would be interested in
assisting me in stress testing it.


The optimization works by making matches of
(lists|of|words|in|an|alternation) much more efficient by doing what
Regexp::PreSuf does and what some of the byhand regex optimisations I
see in the Spamassassin code base do in pattern form internally in the
regexp engine (meaning the benefit is multiplied by the number of
distinct prefixes involved).

What this means is that the number of characters read when matching a
list of words is _at_most_ the number of characters in the longest
matching prefix.  For long matches this is a massive speedup. Let me
give you an example, in Spamassassin::Bayes there is this regexp:

($token =~ /^(?:a(?:nd|ny|ble|ll|re)|
              m(?:uch|ost|ade|ore|ail|ake|ailing|any|ailto)|
               t(?:his|he|ime|hrough|hat)|
               w(?:hy|here|ork|orld|ith|ithout|eb)|
               f(?:rom|or|ew)| e(?:ach|ven|mail)|
               o(?:ne|ff|nly|wn|ut)| n(?:ow|ot|eed)|
               s(?:uch|ame)| l(?:ook|ike|ong)|
               y(?:ou|our|ou're)|
               The|has|have|into|using|http|see|It's|it's|
               number|just|both|come|years|right|know|already|
               people|place|first|because|
               And|give|year|information|can)$/x);

This regexp will in worst case have to try the first character of
every alternation in the list.  With the optimisation a single lookup
would occur. In fact this regexp would run much faster with the
optimisation were all of the regexp prefix compression removed, as
each sub () would have to be converted into its own TRIE regop.

An example, bleadperl searching a medium size perl source code file
for all of the uses of perl keywords in the file using bleadperl and
trie patched blead perl were that for every 10 times bleadperl would
parse the file the patched version could do it 1000 times using
identical code. Since this type of thing is similar to the token regex
and certain other regexes I see in the SA codebase I reckon you guys
would be a good test to see how much benefit real world programs would
experience.

Anyway, im sorry to intrude like this, but i thought some of you guys
would be interested.

Im not on list so please include me in any follow ups to this. 

The patch for blead perl and my posting to perl5-porters is here:

  http://www.mail-archive.com/[email protected]/msg84782.html

If you have any questions about applying the patch or whatever let me know.

Cheers,
yves




-- 
First they ignore you, then they laugh at you, then they fight you,
then you win.
  +Gandhi

Request for testing regexp engine optimisation from spamassassin developers using bleadperl

Reply via email to