On 9/16/2011 9:57 AM, Nobody wrote:
I wonder if there are any heuristics for detecting exponential time re's.Exponential growth results from non-determinism, i.e. if there are multiple transitions for a given character from a given state. Common patterns include: ...(a...)?a... ...(a...)*a... ...(a...)+a... with a choice between matching the "a" at the start of the bracketed pattern or the "a" following it. (xxxa...|xxxb...|xxxc...) with a choice between branches which cannot be resolved until more data has been read. For re.search (as opposed to re.match): axxxa... When axxx has been read, a following "a" gives a choice between continuing the existing match with the second "a", or aborting and matching against the first "a". For patterns which contain many copies of the initial character, each copy creates another branch. Also, using back-references in a regexp [sic] can be a significant performance killer, as it rules out the use of a DFA (which, IIRC, Python doesn't do anyhow) and can require brute-forcing many combinations. A particularly bad case is: (a*)\1 matching against "aaaaaaaa...". If the performance issue is with re.match/re.search rather than with re.compile, one option is to use ctypes to access libc's regexp functions. Those are likely to provide better searching throughput at the expense of potentially increased compilation time.
Now, can you write that as a heuristic *algorithm* def dangerous_re(some_re):?
-- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
