Re: [OT] "Boosting" and other potential research topics

Scott A Crosby Tue, 24 May 2005 11:49:33 -0700

On Tue, 24 May 2005 19:05:49 +0100, Henry Stern <[EMAIL PROTECTED]> writes:


> I was thinking of you when I wrote that.  The open research question
> is:

It is not an open question. :)

> Can we find all the matches for n regexes in o(n^2+m)? 

This question isn't well-formed unless you use my definition for
$k,m,n$, if so, then yes, it can be done.

> Can we tell which of the component regexes have matched?

With both algorithms, you can identify exactly which regexp or strings
matched by building the automata, then when running the automata over
an input string, record all of the matches instead of only saving the
last/longest match.

>> If you're interested in the best algorithms I found, if there are $k$
>> regexps, $m$ matches and the input is of length $n$. Note that $m$
>> also counts multiple matches by the same rule at different positions.
>>
>> Worst-case time to match regexps is O(n+m) if you don't care about
>> overlapping matches and O(n^2+m)[1] if you do want them.[2] For all
>> practical purposes, it would take time linear in the input text.
>>
>> If you only care about fixed strings (no regexps), then matching any
>> number in parallel can be done in O(n+m) in the worst case.[3]
>>
>>
>> Scott
>>
>> [1] The n^2 is because a regexp may match up to O(n) text, and you
>> have to run the automata at all O(n) positions. In practice, this
>> shouldn't be any more of an issue than it is with the existing perl
>> regexp matcher.
>>
>> [2] Straightforward modification of the algorithm given in the dragon
>> book (forget the chapter), where you record each match as you discover
>> it instead of just outputting the last-found/longest match.
>>
>> [3] Straightforward modification of Aho-corasick, where you record
>> each match as you discover it instead of just outputting the
>> last-found/longest match.

Scott

Re: [OT] "Boosting" and other potential research topics

Reply via email to