Re: Is str ~ regex the root of all evil, or the leaf of all good?

Benji Smith Thu, 19 Feb 2009 20:15:20 -0800

Some of the things I'd like to see in the regex implementation:

All functions accepting a compiled regex object/struct should alsoaccept a string version of the pattern (and vice versa). Someimplementations (Java) only accept the compiled version in some placesand the string pattern in other places. That's annoying.

Just like with ordinary string-searching functions, you should be ableto specify a start position (and maybe an end position) for the search.Even if the match exists somewhere in the string, it fails if not foundwithin the target slice. Something like this:


   auto text = "ABCDEFG";
   auto pattern = regex("[ABCEFG]");

   // returns false, because the char at position 3 does not match
   auto result = match(text, 3);

   // this should be exactly equivalent (but the previous version
   // uses less memory, and ought to work with infinite ranges, whereas
   // the slice version wouldn't make any sense)
   auto equivalent = match(text[3..$]);

I've needed to use this technique in a few cases to implement a simplelexical scanner, and it's a godsend, if the regex engine supports it(though most don't).

Finally, it'd be extremely cool if the regex compiler automaticallyeliminated redundant nodes from its NFA, converting as much of it aspossible to a DFA. I did some work on this a few years ago, and it'sactually remarkably simple to implement using prefix trees.


   // These two expressions produce an identical set of matches,
   // but the first one is functionally an NFA, while the second
   // one is a DFA.
   auto a = regex("(cat|car|cry|dog|door|dry)");
   auto b = regex("(c(?:a[tr]|ry)|d(?:o(?:g|or)|ry)");

In cases where the expression can only be partially simplified, you canleave some NFA nodes deep within the tree, while still DFA-ifying therest of it:


   auto a = regex("(attitude|attribute|att.+ion");
   auto b = regex("(att(?:itude|ribute|.+ion)");

It's a very simple transformation, increases speed (dramatically) forcomplex regular expressions (especially those produced dynamically atruntime by combining large sets of unrelated target expressions), and itreliably produces equivalent results with the inefficient version.

The only really tricky part is if the subexpressions have their owncapturing groups, in which case the DFA transformation screws up theordinal-numbering of the resultant captures.


Anyhoo...

I don't have any strong feelings about the function names (though I'drather have functions that operators, like "~", for searching and matching).

And I don't have any strong feelings about whether the compiled regex isan object or a struct (though I prefer reference semantics over valuesemantics for regexen, and right now, I think that makes objects the(slightly) better choice).

Thanks for your hard work! I've implemented a small regex engine before,so I know it's no small chunk of effort. Regular expressions are mypersonal favorite "tiny language", and I'm glad to see them get somespecial attention in phobos2.


--benji

Re: Is str ~ regex the root of all evil, or the leaf of all good?

Reply via email to