On 9/16/2011 9:57 AM, Nobody wrote:
I wonder if there are any
heuristics for detecting exponential time re's.
Exponential growth results from non-determinism, i.e. if there are
multiple transitions for a given character from a given state.
Common patterns include:
...(a...)?a...
...(a...)*a...
...(a...)+a...
with a choice between matching the "a" at the start of the bracketed
pattern or the "a" following it.
(xxxa...|xxxb...|xxxc...)
with a choice between branches which cannot be resolved until more data
has been read.
For re.search (as opposed to re.match):
axxxa...
When axxx has been read, a following "a" gives a choice between
continuing the existing match with the second "a", or aborting and
matching against the first "a". For patterns which contain many copies of
the initial character, each copy creates another branch.
Also, using back-references in a regexp [sic] can be a significant
performance killer, as it rules out the use of a DFA (which, IIRC, Python
doesn't do anyhow) and can require brute-forcing many combinations. A
particularly bad case is:
(a*)\1
matching against "aaaaaaaa...".
If the performance issue is with re.match/re.search rather than with
re.compile, one option is to use ctypes to access libc's regexp functions.
Those are likely to provide better searching throughput at the expense of
potentially increased compilation time.
Now, can you write that as a heuristic *algorithm* def
dangerous_re(some_re):?
--
Terry Jan Reedy
--
http://mail.python.org/mailman/listinfo/python-list