Re: Trie optimisation of simple alternations for blead perl. (Updated patch, supports /i modifier)

demerphq 16 Feb 2005 04:56:26 -0000

On Tue, 15 Feb 2005 13:11:21 -0800, Justin Mason <[EMAIL PROTECTED]> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Yitzchak Scott-Thoennes writes:
> > On Tue, Feb 15, 2005 at 12:40:50PM -0800, Justin Mason wrote:
> > > FWIW, this looks like it'd be excellent for SpamAssassin ;)
> > Maybe somebody with SpamAssassin could do some benchmarks?
> 
> I'm a little concerned it might be misleading -- we've optimised our code
> quite a lot around the *existing*, sans-trie, re engine, like so:
> 
>    foreach $line (@body) {
>      foreach $rule (@ruleset) {
>        next unless ($scores{$rule} != 0);  # skip zero-scoring rules
>        /$rule/ and $score+=$scores{$rule};
>      }
>    }
> 
> If this patch were included in perl, we'd rewrite our code to optimise
> it's behaviour for *this* engine, like so:
> 
>    my $allrules = eval 'qr/(?:' . join("|", @ruleset) . ')/';
>    # probably a bit more sophisticated than that, but you
>    # get the idea
> 
>    foreach $rule (@ruleset) {
>    foreach $line (@body) {
>      next unless $line =~ $allrules;
>      foreach $rule (@ruleset) { .. }   # as above
>    }
> 
> So a simple run of SpamAssassin may not illustrate the speedups
> without more changes on our side.
> 
> Plus (as far as I know) none of us have ever built bleadperl ;)
> I can try it out if there's a tarball though, and give some
> rough ideas of the speedup without our code changes?


Without support for the /i modifier it took a consistant 30 seconds
off of the process time of the full public corpus using mass-check.

This was lower than I expected so I did some analysis on the code
which made me realize that most of the SA regexes use the /i switch. I
thus decided I would implement case insensitive matching as well, and
have done so. The problem is that it seems there is a subtle flaw in
the latter as it causes mass-check to segfault about 25% through the
public corpus. Its actually been causing me to tear my hair out as
when I modify the SA code (in one of the code generators) to include
information on what regex has failed the segfault goes away. Minor
modification of the source code generator removes the bug when
mass-check runs in debug but it returns when it runs in normal mode.
(And the perl test suite is no help, all tests pass fine).

Anyway, i havent had access to the tools to properly debug this so I
have been waiting on publishing the up to date patch, but in the
interests of science ;-) im attaching it here.

I did however do some benchmarks on the perfomance of one particular
regex in the SA suite. Its from SA::Bayes and looks like this:

        ($token =~ /^(?:a(?:nd|ny|ble|ll|re)|
                m(?:uch|ost|ade|ore|ail|ake|ailing|any|ailto)|
                t(?:his|he|ime|hrough|hat)|
                w(?:hy|here|ork|orld|ith|ithout|eb)|
                f(?:rom|or|ew)| e(?:ach|ven|mail)|
                o(?:ne|ff|nly|wn|ut)| n(?:ow|ot|eed)|
                s(?:uch|ame)| l(?:ook|ike|ong)|
                y(?:ou|our|ou're)|
                The|has|have|into|using|http|see|It's|it's|
                number|just|both|come|years|right|know|already|
                people|place|first|because|
                And|give|year|information|can)$/x);

As you can see this regex has been PreSuf'ed for a minor speedup. Run
times of this regex being used to search regcomp.c (a large file) in a
while /\b(RE)\b/ loop were prepatch: 21/s postpatch: 51/s. When the
pre-suf in the pattern is unrolled the times were prepatched:10/s 
postpatched:100/s.  (Note this is the number of times to extract all
of the matches from the file.)

Incidentally I did notice that many of the patterns used could be
reworked to be more efficient assuming the patch is applied. Also I
was kinda wondering about the code generated. It seems odd that each
regex rule gets its own subroutine, wouldnt it better to precompile
the regexes into a single routine? There appears to be low hanging
optimization fruit in the SA code generator stuff.  Ie:

      if ($self->{conf}->{scores}->{q{__NIGERIAN_BODY_18}}) {
        # call procedurally as it is faster.
        __NIGERIAN_BODY_18_body_test($self,@_);
      }

is an example of low hanging fruit. That should read:

my $lu=$self->{conf}->{scores};
if ($lu->{q{__NIGERIAN_BODY_18}}) {
        # call procedurally as it is faster.
        __NIGERIAN_BODY_18_body_test($self,@_);
}

which would save two redundant hash lookups per rule. A further
optimization would be to outright eliminate the subroutine call so
that this would look like:

my $lu=$self->{conf}->{scores};
if ($lu->{q{__NIGERIAN_BODY_18}}) {
  #dont call a sub at all, as its faster
  foreach (@_) {
     if (/\bSEVERAL ATTEMPTS HAVE BEEN MADE WITH OUT SUCCESS\b/i) { 
                $self->got_pattern_hit (q{__NIGERIAN_BODY_18}, "BODY: "); 
                dbg ("Ran body-text regex rule __NIGERIAN_BODY_18
======> got hit: match='$&'", "rulesrun", 2);
                # Ok, we hit, stop now.
        last;
     }
  }
}

Since each subroutine gets called once per line per mail the reduction
in call stack overhead should represent a pretty clear run time
improvement. I assume this logic is duplicated in the other code
generators and not just in the one I was trying to debug.

But i digress.... :-)

Cheers
Yves
ps: Attached patch includes additional tests, and fixes a bug with
re=debug mode when TRIE_DEBUG was defined. Please dont forget to
regen.pl before compiling.


-- 
First they ignore you, then they laugh at you, then they fight you,
then you win.
  +Gandhi

trie_5.patch.gz
Description: application/gzip-compressed

Re: Trie optimisation of simple alternations for blead perl. (Updated patch, supports /i modifier)

Reply via email to