Re: [OT] "Boosting" and other potential research topics

Henry Stern Tue, 24 May 2005 11:06:04 -0700

I was thinking of you when I wrote that.  The open research question is:
 Can we find all the matches for n regexes in o(n^2+m)?  Can we tell
which of the component regexes have matched?


Henry

Scott A Crosby wrote:

On Tue, 17 May 2005 14:01:09 +0100, Henry Stern <[EMAIL PROTECTED]> writes:

3.  Faster body scanning.

Every body rule in SpamAssassin requires a separate pass through the
message.  The time complexity of this is O(n).  If all of the rules
are combined into one using a trie structure, the rules can be
evaluated in O(log n) time.

However, using the latter method it still requires a (less
expensive) O(n) operation to find which rules have been satisfied.
A valuable area of research that will really help SpamAssassin would
be to develop a method of evaluating all of the regular expressions
in O(log n) time and finding exactly which rules have been satisfied
in O(log n) time.



I prototyped an implementation of this. My conclusion was that it
wouldn't be effective because too many of the regexp's SA uses can not
be represented as DFA's. Also, it wouldn't lead to a huge improvement
because perl's regexp engine doesn't dominate the runtime of SA, and,
would probably become less signifigant with time.

If you're interested in the best algorithms I found, if there are $k$
regexps, $m$ matches and the input is of length $n$. Note that $m$
also counts multiple matches by the same rule at different positions.

Worst-case time to match regexps is O(n+m) if you don't care about
overlapping matches and O(n^2+m)[1] if you do want them.[2] For all
practical purposes, it would take time linear in the input text.

If you only care about fixed strings (no regexps), then matching any
number in parallel can be done in O(n+m) in the worst case.[3]


Scott

[1] The n^2 is because a regexp may match up to O(n) text, and you
have to run the automata at all O(n) positions. In practice, this
shouldn't be any more of an issue than it is with the existing perl
regexp matcher.

[2] Straightforward modification of the algorithm given in the dragon
book (forget the chapter), where you record each match as you discover
it instead of just outputting the last-found/longest match.

[3] Straightforward modification of Aho-corasick, where you record
each match as you discover it instead of just outputting the
last-found/longest match.

signature.asc
Description: OpenPGP digital signature

Re: [OT] "Boosting" and other potential research topics

Reply via email to