I was thinking of you when I wrote that. The open research question is: Can we find all the matches for n regexes in o(n^2+m)? Can we tell which of the component regexes have matched?
Henry Scott A Crosby wrote:
On Tue, 17 May 2005 14:01:09 +0100, Henry Stern <[EMAIL PROTECTED]> writes:3. Faster body scanning. Every body rule in SpamAssassin requires a separate pass through the message. The time complexity of this is O(n). If all of the rules are combined into one using a trie structure, the rules can be evaluated in O(log n) time.However, using the latter method it still requires a (less expensive) O(n) operation to find which rules have been satisfied. A valuable area of research that will really help SpamAssassin would be to develop a method of evaluating all of the regular expressions in O(log n) time and finding exactly which rules have been satisfied in O(log n) time.I prototyped an implementation of this. My conclusion was that it wouldn't be effective because too many of the regexp's SA uses can not be represented as DFA's. Also, it wouldn't lead to a huge improvement because perl's regexp engine doesn't dominate the runtime of SA, and, would probably become less signifigant with time. If you're interested in the best algorithms I found, if there are $k$ regexps, $m$ matches and the input is of length $n$. Note that $m$ also counts multiple matches by the same rule at different positions. Worst-case time to match regexps is O(n+m) if you don't care about overlapping matches and O(n^2+m)[1] if you do want them.[2] For all practical purposes, it would take time linear in the input text. If you only care about fixed strings (no regexps), then matching any number in parallel can be done in O(n+m) in the worst case.[3] Scott [1] The n^2 is because a regexp may match up to O(n) text, and you have to run the automata at all O(n) positions. In practice, this shouldn't be any more of an issue than it is with the existing perl regexp matcher. [2] Straightforward modification of the algorithm given in the dragon book (forget the chapter), where you record each match as you discover it instead of just outputting the last-found/longest match. [3] Straightforward modification of Aho-corasick, where you record each match as you discover it instead of just outputting the last-found/longest match.
signature.asc
Description: OpenPGP digital signature
