Don wrote:
Andrei Alexandrescu wrote:
I'm almost done rewriting the regular expression engine, and some
pretty interesting things have transpired.
First, I separated the engine into two parts, one that is the actual
regular expression engine, and the other that is the state of the
match with some particular input. The previous code combined the two
into a huge class. The engine (written by Walter) translates the regex
string into a bytecode-compiled form. Given that there is a
deterministic correspondence between the regex string and the
bytecode, the Regex engine object is in fact invariant and cached by
the implementation. Caching makes for significant time savings even if
e.g. the user repeatedly creates a regular expression engine in a loop.
In contrast, the match state depends on the input string. I defined it
to implement the range interface, so you can either inspect it
directly or iterate it for all matches (if the "g" option was passed
to the engine).
The new codebase works with char, wchar, and dchar and any
random-access range as input (forward ranges to come, and at some
point in the future input ranges as well). In spite of the added
flexibility, the code size has shrunk from 3396 lines to 2912 lines. I
plan to add support for binary data (e.g. ubyte - handling binary file
formats can benefit a LOT from regexes) and also, probably
unprecedented, support for arbitrary types such as integers, floating
point numbers, structs, what have you. any type that supports
comparison and ranges is a good candidate for regular expression
matching. I'm not sure how regular expression matching can be
harnessed e.g. over arrays of int, but I suspect some pretty cool
applications are just around the corner. We can introduce that
generalization without adding complexity and there is nothing in
principle opposed to it.
The interface is very simple, mainly consisting of the functions
regex(), match(), and sub(), e.g.
foreach (e; match("abracazoo", regex("a[b-e]", "g")))
writeln(e.pre, e.hit, e.post);
auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
Two other syntactic options are available:
"abracazoo".match(regex("a[b-e]", "g")))
"abracazoo".match("a[b-e]", "g")
I could have made match a member of regex:
regex("a[b-e]", "g")).match("abracazoo")
but most regex code I've seen mentions the string first and the regex
second. So I dropped that idea.
Now, match() is likely to be called very often so I'm considering:
foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
writeln(e);
In general I'm weary of unwitting operator overloading, but I think
this case is more justified than others. Thoughts?
Andrei
I agree with the comments against ~.
I believe this Perl6 document is a must-read:
http://dev.perl.org/perl6/doc/design/apo/A05.html
There are some excellent observations there, especially near the
beginning. By separating the engine from the state of the match, you
open the possibilty of subsequently providing cleaner regex syntax.
I'd read it a while ago, but a refresher is in order. Thanks!
I do wonder though, how you'd deal with a regex which includes a match
to a literal string provided as a variable. Would this be passed to the
engine, or to the match state?
At the moment these are not supported. It's a good question.
If the engine is using backtracking, there's no difference in the
generated bytecode; but if it's creating an automata, the compiled
engine depends on the contents of the string variable.
The current engine is, to the best of my understanding, using
backtracking. At least when there's an "or", it tries both matches as
recursive calls and picks the longest.
Andrei