Hi!
I am a PhD student doing research on regular expressions. I have
implemented an algorithm for matching a subclass of regular
expressions with "numerical constraints", that is, expressions like
"a{10,100}". My starting point was the problems that can occur with
large numbers or nested constraints, as in, for example, the command:
grep -E "([0-9]{1,2}h([1-5]?[0-9]m([1-5]?[0-9]s){1,60}){1,60}){0,100}"
which consumes over 2 GB of memory on my computer.
From the grep man page:
"Large repetition counts in the {n,m} construct may cause grep to use
lots of memory."
An article describing the algorithm is available at
http://hdl.handle.net/1956/3628
A C-implementation is available at
http://www.ii.uib.no/~dagh/facgrep
The implementation is probably buggy, as I am not very good at C, but
it works for simple examples. On the example above, procps reports
that it uses around 2 MB of memory.
I wondered whether you would be interested in having this kind of
functionality in GNU grep. Note that for the majority of normal/simple
regexps, grep does better than my program. Also, my program only
handles a (well-defined) subclass of regexps, so we would need to find
a nice way to choose which algorithm to use in each case.
Best regards, Dag Hovland
