Re: [GSOC idea] enhance regular expressions?

Dmitry Olshansky Thu, 31 Mar 2011 14:13:54 -0700

On 31.03.2011 22:16, [email protected] wrote:

Dmitry Olshansky<[email protected]>  writes:

--- somewhat informal draft ---

Hi Dmitry,

It's good to know that there is interest in improving regular
expressions in the D standard library. I've run into a number of
problems using std.regex myself, so I'm keen so see either fixes for
std.regex or an improved replacement.

Thanks for the interest.

Indeed there are problems, though the engine itself is quite capable andoptimized, I think it could be fixed.

I've been working on a regular expression library myself based around
Russ Cox article at http://swtch.com/~rsc/regexp/regexp2.html and the
std.regex range interface. It includes a backtracking and Thompson
engine. If you are interested it is on github:

https://github.com/PeteChadwick/pcregexeng

The ddoc documentation is here:
http://petechadwick.github.com/pcregexeng/

Wow, I'll definitely look into it.

Overall it seems solid even though incomplete. From glance it seems thatparsing a pattern is somewhat convulted, but I'll defer any critics ordecisions until I looked in closer detail.I see you also using the test from std.regex, sadly it's somewhat broken(the one with test vectors).

All it's checking proves that *there is a match*, not *what* was matched.

Anyway, I'm gathering what I have in a way of fixes as my first pullrequest, wish me luck ;)

Having regular expression (regex) engine in standard library is
totally expected for any modern language.
The solid and performant implementation of regex is one of the
greatest points of pride (if not of sale).

Couldn't agree more.

While concrete results and benchmarks got frequently biased, there are
some general conclusions:
regexes usually came in two flavors: Finite Automation (FA, be it
deterministic  or nondeterministic) and  backtracking, each of these
with their set of traits.
FA are more stable O(N) in time on input, however they are usually
more limited as implementing features such as backreferences becomes
problematic, also not to mention the time for constructing DFA from
pattern.

For an NFA, the complexity of the regular expression also affects the
execution time as there will probably me more states to process per
character. There is also more storage and copying required for
captures. I haven't looked closely at using a DFA, and I'm not sure how
captures would work with one.

Right, for NFA the execution time in worst case O(n*m) n - input, m -pattern length.As for DFA, it seems the process of construction full DFA eagerly fromNFA would grab a lot of ram & time O(2^^m) in worst case, which makes ituseful only on certain amounts of input, hence the idea to compute it atcompile time.

Backtracking allows easily implementing some interesting features like
backreferencing, lookahead/lookbehind and such, but has a dreadful
exponential time on input in worst case.
Not willing to go deeper into the performance/feature/usage topic,
fairly good picture of various regular expression engines can be found
here:  http://lh3lh3.users.sourceforge.net/reb.shtml. I though have
now idea why he claims RE2 is feature-rich.

Now speaking of D, what we have now is essentially an optimized
backtracking with some not implemented features it may very well have
had. Also consider important issue that need proper resolution
http://d.puremagic.com/issues/show_bug.cgi?id=941. This situation begs
for a change.

The proposal consists of three parts:

1) Enhance and bugfix the existing engine in std.regex. Sometime ago I
got accustomed with this engine internals and confident that I can get
it done. That's a short-term solution at best and something I plan to
do in spare time even if it's not considered a compelling GSOC
proposal ;)

I would really like to see this.

2)  Make another one FA-based from scratch, here things go somewhat
speculative.
I plan to start with NFA, leaving DFA on later. Probably NFA with
cached DFA states as optimization tuning aka memory vs time. Someone
brought  http://swtch.com/~rsc/regexp/regexp1.html, it could be a
starting point, I see some optimization opportunities I'd like to
try. Supporting unicode of all flavors is of no question. The API for
this engine may stay the same or it even  may be integrated into
existing one, making the type of engine a compile-time parameter with
default being Regex.Backtracking. Suppose:
auto dfa = regex!Regex.NFA("(D|N)?FA","g");

I think the existing range based API is very good. I also like the idea
of the regular expression engine being a compile time parameter.

3) When I saw Don's comments on fixing CTFE, I remembered something D
can do _better_  than most others mainstream compiled languages -
metaprogramming. So having a StaticRegex compiled at compile-time
(inevitable calambour) would give us an  edge over comparative
implementations. Also I can imagine quite a lot of applications not
requiring *any* regex from non-constant pattern. Major exceptions are
editors with find/replace and certain scripting/parsing toolkits.
It looks quite natural:
...
enum sregNum= StaticRegex!"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)";
...
assert(match("PI: 3.1415926",sregNum).hit == "3.1415926");

The best speed gain most likely would be obtained by generating the
whole DFA at compile time, I believe such possibility  was previously
discussed on the NG.

This would be a real nice feature.

This diversity in kinds of regexes may warrant making regex engine an
interface ( for homogeneous storage and parameter passing) with
concrete implementations being classes.

Another thing to sort out is style: are we sticking with ECMA or
switching to Perl / GNU? Shall we provide also different versions of
syntax?



--
Dmitry Olshansky

Re: [GSOC idea] enhance regular expressions?

Reply via email to