Hi!
I am a PhD student doing research on regular expressions. I have
implemented an algorithm for matching a subclass of regular
expressions with "numerical constraints", that is, expressions like
"a{10,100}". My starting point was the problems that can occur with
large numbers or nested constraints, as in, for example, the command:
grep -E "([0-9]{1,2}h([1-5]?[0-9]m([1-5]?[0-9]s){1,60}){1,60}){0,100}"
which consumes over 2 GB of memory on my computer.
From the grep man page:
"Large repetition counts in the {n,m} construct may cause grep to use
lots of memory."
An article describing the algorithm is available at
http://hdl.handle.net/1956/3628
A C-implementation is available at
http://www.ii.uib.no/~dagh/facgrep
The implementation is probably buggy, as I am not very good at C, but
it works for simple examples. On the example above, procps reports
that it uses around 2 MB of memory.
I wondered whether you would be interested in having this kind of
functionality in GNU grep. Note that for the majority of normal/simple
regexps, grep does better than my program. Also, my program only
handles a (well-defined) subclass of regexps, so we would need to find
a nice way to choose which algorithm to use in each case.
Best regards, Dag Hovland
