Some of the things I'd like to see in the regex implementation:

All functions accepting a compiled regex object/struct should also accept a string version of the pattern (and vice versa). Some implementations (Java) only accept the compiled version in some places and the string pattern in other places. That's annoying.

Just like with ordinary string-searching functions, you should be able to specify a start position (and maybe an end position) for the search. Even if the match exists somewhere in the string, it fails if not found within the target slice. Something like this:

   auto text = "ABCDEFG";
   auto pattern = regex("[ABCEFG]");

   // returns false, because the char at position 3 does not match
   auto result = match(text, 3);

   // this should be exactly equivalent (but the previous version
   // uses less memory, and ought to work with infinite ranges, whereas
   // the slice version wouldn't make any sense)
   auto equivalent = match(text[3..$]);

I've needed to use this technique in a few cases to implement a simple lexical scanner, and it's a godsend, if the regex engine supports it (though most don't).

Finally, it'd be extremely cool if the regex compiler automatically eliminated redundant nodes from its NFA, converting as much of it as possible to a DFA. I did some work on this a few years ago, and it's actually remarkably simple to implement using prefix trees.

   // These two expressions produce an identical set of matches,
   // but the first one is functionally an NFA, while the second
   // one is a DFA.
   auto a = regex("(cat|car|cry|dog|door|dry)");
   auto b = regex("(c(?:a[tr]|ry)|d(?:o(?:g|or)|ry)");

In cases where the expression can only be partially simplified, you can leave some NFA nodes deep within the tree, while still DFA-ifying the rest of it:

   auto a = regex("(attitude|attribute|att.+ion");
   auto b = regex("(att(?:itude|ribute|.+ion)");

It's a very simple transformation, increases speed (dramatically) for complex regular expressions (especially those produced dynamically at runtime by combining large sets of unrelated target expressions), and it reliably produces equivalent results with the inefficient version.

The only really tricky part is if the subexpressions have their own capturing groups, in which case the DFA transformation screws up the ordinal-numbering of the resultant captures.

Anyhoo...

I don't have any strong feelings about the function names (though I'd rather have functions that operators, like "~", for searching and matching).

And I don't have any strong feelings about whether the compiled regex is an object or a struct (though I prefer reference semantics over value semantics for regexen, and right now, I think that makes objects the (slightly) better choice).

Thanks for your hard work! I've implemented a small regex engine before, so I know it's no small chunk of effort. Regular expressions are my personal favorite "tiny language", and I'm glad to see them get some special attention in phobos2.

--benji

Reply via email to