Pygments Backtracking Regex

Adam Gomaa Wed, 14 May 2008 17:25:48 -0700

First of all, the "guest/guest" auth doesn't seem to be working on
Pygments' Trac - I'd have submitted this there if possible.


More importantly, pygments.util.tag_re, a regex that is notably used
in pygments.util.looks_like_xml - has the potential to be extremely
slow on moderate-to-large-sized files that 'look' like XML without
being valid. I noticed this today when browsing my code repo, when
looking at a Mako file which had an unrecognized extension - Apache
spun off to 100% CPU. After duplicating locally, I originally thought
that it was an infinite loop, but some debugging revealed that it was
actually stuck in the regex search. Here's the relevant code, both
snippets from pygments/util.py

tag_re = re.compile(r'<(.+?)(\s.*?)?>.*?</\1>(?uism)')
# ...
def looks_like_xml(text):
    """
    Check if a doctype exists or if we have some tags.
    """
    m = doctype_lookup_re.match(text)
    if m is not None:
        return True
    return tag_re.search(text) is not None

This is being reached, by the way, through each lexer's analyze_text
method - in particular, the GenshiLexer and similar - which end up
calling XmlLexer.analyze_text, which itself calls looks_like_xml.

Although I'm not sure, my best guess is that the regex ends up causing
backtracking due to the \1. While this is moderately slower for valid
XML, the true problem comes up for files that are not valid XML -
which, in the not unlikely event that what is being tested is not XML
(such as my case with Mako), is obviously going to happen:

$ time ~/regex_speed_test.py 500 safe
<_sre.SRE_Match object at 0xb7dc01d0>

real    0m0.013s
user    0m0.012s
sys     0m0.000s
$ time ~/regex_speed_test.py 500 backtracking
None

real    0m52.576s
user    0m52.451s
sys     0m0.012s

Also, note that the length of the last example isn't due to the
*nesting* of the tags - I admit 500 nested tags is unlikely in real
life. But the XML that is being tested isn't actually nested, it's a
bunch of opening tags and then non-matching closing tags (I use
<a0><a1>..</b1></b0> for the test). I think - although I can't pretend
to know enough about the re module's implementation to be sure - that
each failed closing tag is causing a backtrack. 500 total closing tags
in a file is much more plausible.

This can be fixed pretty easily by simply not using the group match at
the end of the regex, but a plain tag match instead. (Diff and HG
bundle attached.) While this does change looks_like_xml's behavior (it
now only looks for the existence of an opening and closing tag, not
for a closing tag that actually works), the test suite still passes,
and I think the severity of the problems it can cause warrant the
change.

Lastly, I owe thanks to Brodie Rao for helping to track down the
problem and writing the replacement regex.

Thanks.
Adam

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"pocoo-libs" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/pocoo-libs?hl=en
-~----------~----~----~----~------~----~------~--~---

#!/usr/bin/env python

import re
import sys

num_tags = int(sys.argv[1])

# '<invalid><a0><a1>...</b1></b0>'
INVALID_XML = "<invalid><%s%s>" % ('><a'.join(map(str, range(num_tags))),
                                   '></b'.join(map(str, range(num_tags))))



# Pygments' implementation
backtracking_re = re.compile(r'<(.+?)(\s.*?)?>.*?</\1>(?uism)')
# Proposed replacement
safe_re =         re.compile(r'<(.+?)(\s.*?)?>.*?</.+?>(?uism)')



if sys.argv[2] == "backtracking":
    tag_re = backtracking_re
elif sys.argv[2] == "safe":
    tag_re = safe_re

print repr(tag_re.search(INVALID_XML))

backtracking_regex.bundle
Description: Binary data

diff -r d74027423177 -r af1e76e4f9e4 pygments/util.py
--- a/pygments/util.py	Fri May 09 08:20:25 2008 +0200
+++ b/pygments/util.py	Wed May 14 15:09:43 2008 -0400
@@ -20,7 +20,7 @@
      "[^"]*")
      [^>]*>
 ''')
-tag_re = re.compile(r'<(.+?)(\s.*?)?>.*?</\1>(?uism)')
+tag_re = re.compile(r'<(.+?)(\s.*?)?>.*?</.+?>(?uism)')
 
 
 class ClassNotFound(ValueError):

Pygments Backtracking Regex

Reply via email to