Re: Is str ~ regex the root of all evil, or the leaf of all good?

Andrei Alexandrescu Thu, 19 Feb 2009 06:50:35 -0800

Don wrote:

Andrei Alexandrescu wrote:
I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired.
First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop.
In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it for all matches (if the "g" option was passed to the engine).
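The split described above, a cached engine that depends only on the pattern plus a per-input match state exposed as a range, can be illustrated with a minimal sketch. Everything named here (Engine, engineFor, MatchState, compilations) is hypothetical and not taken from the actual implementation, and a plain substring search stands in for real regex matching; the pattern is assumed to be non-empty.

```d
import std.string : indexOf;

// Hypothetical stand-in for the compiled engine: depends only on the pattern.
final class Engine
{
    string pattern;
    this(string pattern) { this.pattern = pattern; }
}

Engine[string] cache;   // one engine per distinct pattern string
size_t compilations;    // counts how often an engine was actually built

// Returns the cached engine for a pattern, building it only on first use.
Engine engineFor(string pattern)
{
    if (auto p = pattern in cache)
        return *p;
    ++compilations;
    return cache[pattern] = new Engine(pattern);
}

// Match state: engine + input + current position, usable as an input range.
struct MatchState
{
    Engine engine;
    string input;
    ptrdiff_t at;   // start of the current hit, or -1 when exhausted

    this(Engine engine, string input)
    {
        this.engine = engine;
        this.input = input;
        at = input.indexOf(engine.pattern);
    }

    bool empty() { return at < 0; }

    string front() { return input[at .. at + engine.pattern.length]; }

    void popFront()
    {
        // Resume the search just past the current hit.
        immutable next = cast(size_t) at + engine.pattern.length;
        immutable i = input[next .. $].indexOf(engine.pattern);
        at = (i < 0) ? -1 : cast(ptrdiff_t) next + i;
    }
}

void main()
{
    string[] hits;
    // Asking for the engine inside the loop still builds it only once.
    foreach (round; 0 .. 3)
        foreach (hit; MatchState(engineFor("ab"), "abracadabra"))
            hits ~= hit;
    assert(compilations == 1);
    assert(hits.length == 6);
}
```

Because the engine never changes after construction, any number of match states can share it, which is what makes the caching safe.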
The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte - handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating point numbers, structs, what have you. Any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that generalization without adding complexity and there is nothing in principle opposed to it.
The interface is very simple, mainly consisting of the functions regex(), match(), and sub(), e.g.
foreach (e; match("abracazoo", regex("a[b-e]", "g")))
    writeln(e.pre, e.hit, e.post);
auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");

Two other syntactic options are available:

"abracazoo".match(regex("a[b-e]", "g"))
"abracazoo".match("a[b-e]", "g")

I could have made match a member of regex:

regex("a[b-e]", "g").match("abracazoo")
but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.
Now, match() is likely to be called very often so I'm considering:

foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
    writeln(e);
In general I'm wary of unwitting operator overloading, but I think this case is more justified than others. Thoughts?
Andrei
I agree with the comments against ~.
I believe this Perl6 document is a must-read:

http://dev.perl.org/perl6/doc/design/apo/A05.html
There are some excellent observations there, especially near the beginning. By separating the engine from the state of the match, you open the possibility of subsequently providing cleaner regex syntax.


I'd read it a while ago, but a refresher is in order. Thanks!

I do wonder though, how you'd deal with a regex which includes a match to a literal string provided as a variable. Would this be passed to the engine, or to the match state?


At the moment these are not supported. It's a good question.

If the engine is using backtracking, there's no difference in the generated bytecode; but if it's creating an automaton, the compiled engine depends on the contents of the string variable.

The current engine is, to the best of my understanding, using backtracking. At least when there's an "or", it tries both matches as recursive calls and picks the longest.
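To make that behavior concrete, here is a toy sketch of a backtracking matcher that, on an "or", tries each alternative as a recursive call and keeps the longest overall match. It is not the engine's actual code: Node and longest are hypothetical names, and the pattern is a hand-built tree of literals and alternations rather than compiled bytecode.

```d
import std.algorithm.searching : startsWith;

// Hypothetical pattern node: a literal when alts is empty,
// otherwise an "or" over several sub-sequences.
struct Node
{
    string lit;
    Node[][] alts;
}

// Returns the end index of the longest match of seq starting at pos,
// or -1 if seq cannot match there.
ptrdiff_t longest(Node[] seq, string input, size_t pos)
{
    if (seq.length == 0)
        return pos;               // nothing left to match: success
    auto head = seq[0];
    auto rest = seq[1 .. $];
    if (head.alts.length == 0)
    {
        // Literal: consume it and continue with the rest of the sequence.
        if (input[pos .. $].startsWith(head.lit))
            return longest(rest, input, pos + head.lit.length);
        return -1;
    }
    // Alternation: try every branch followed by the rest, keep the longest.
    ptrdiff_t best = -1;
    foreach (alt; head.alts)
    {
        immutable end = longest(alt ~ rest, input, pos);
        if (end > best)
            best = end;
    }
    return best;
}

void main()
{
    // Roughly (a|ab)(c|bcd)
    auto pattern = [
        Node(null, [[Node("a")], [Node("ab")]]),
        Node(null, [[Node("c")], [Node("bcd")]]),
    ];
    assert(longest(pattern, "abcd", 0) == 4);   // "a" + "bcd" beats "ab" + "c"
    assert(longest(pattern, "abc", 0) == 3);    // only "ab" + "c" fits
    assert(longest(pattern, "xyz", 0) == -1);   // no branch matches
}
```

In this scheme a literal supplied through a variable would just be another literal node, so the matching logic itself would not change, which fits Don's remark about backtracking engines.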



Andrei
