Re: Pygments Backtracking Regex

Georg Brandl Fri, 23 May 2008 03:36:07 -0700

Adam Gomaa wrote:
> First of all, the "guest/guest" auth doesn't seem to be working on
> Pygments' Trac - I'd have submitted this there if possible.
> 
> More importantly, pygments.util.tag_re, a regex that is notably used
> in pygments.util.looks_like_xml, has the potential to be extremely
> slow on moderate-to-large files that 'look' like XML without
> being valid. I noticed this today when browsing my code repo, when
> looking at a Mako file which had an unrecognized extension - Apache
> spun off to 100% CPU. After duplicating locally, I originally thought
> that it was an infinite loop, but some debugging revealed that it was
> actually stuck in the regex search. Here's the relevant code, both
> snippets from pygments/util.py
> 
> tag_re = re.compile(r'<(.+?)(\s.*?)?>.*?</\1>(?uism)')
> # ...
> def looks_like_xml(text):
>     """
>     Check if a doctype exists or if we have some tags.
>     """
>     m = doctype_lookup_re.match(text)
>     if m is not None:
>         return True
>     return tag_re.search(text) is not None
> 
> This is being reached, by the way, through each lexer's analyze_text
> method - in particular, the GenshiLexer and similar - which end up
> calling XmlLexer.analyze_text, which itself calls looks_like_xml.
> 
> I'm not sure, but my best guess is that the \1 backreference causes
> heavy backtracking. This is only moderately slower for valid XML; the
> real problem comes up for files that are not valid XML, which is
> bound to happen whenever the input being tested is not XML at all
> (as in my case with Mako):
> 
> $ time ~/regex_speed_test.py 500 safe
> <_sre.SRE_Match object at 0xb7dc01d0>
> 
> real    0m0.013s
> user    0m0.012s
> sys     0m0.000s
> $ time ~/regex_speed_test.py 500 backtracking
> None
> 
> real    0m52.576s
> user    0m52.451s
> sys     0m0.012s
> 
> Also, note that the long run time of the last example isn't due to the
> *nesting* of the tags - I admit 500 nested tags is unlikely in real
> life. But the XML that is being tested isn't actually nested, it's a
> bunch of opening tags and then non-matching closing tags (I use
> <a0><a1>..</b1></b0> for the test). I think - although I can't pretend
> to know enough about the re module's implementation to be sure - that
> each failed closing tag is causing a backtrack. 500 total closing tags
> in a file is much more plausible.
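
The regex_speed_test.py script itself is not included in the message, so the following is only a rough sketch of how the comparison might be reproduced. The names build_sample, time_search, SAFE_RE and BACKTRACKING_RE are hypothetical, the "safe" pattern is an assumed variant that drops the \1 backreference, and the inline (?uism) flags are passed as arguments instead because recent Python versions reject global inline flags at the end of a pattern.

```python
import re
import time

# (?u) is the default for str patterns in Python 3, so only i, s and m remain.
FLAGS = re.IGNORECASE | re.DOTALL | re.MULTILINE

# The pattern quoted above, with its flags moved to an argument.
BACKTRACKING_RE = re.compile(r'<(.+?)(\s.*?)?>.*?</\1>', FLAGS)

# Assumed "safe" variant: any closing tag is accepted, no backreference.
SAFE_RE = re.compile(r'<(.+?)(\s.*?)?>.*?</.+?>', FLAGS)


def build_sample(n):
    """Return n opening tags <a0>..<a{n-1}> followed by n non-matching
    closing tags </b{n-1}>..</b0>, as described in the message."""
    opening = ''.join('<a%d>' % i for i in range(n))
    closing = ''.join('</b%d>' % i for i in reversed(range(n)))
    return opening + closing


def time_search(pattern, text):
    """Run pattern.search(text) once; return (match, elapsed seconds)."""
    start = time.perf_counter()
    match = pattern.search(text)
    return match, time.perf_counter() - start


if __name__ == '__main__':
    # Kept small on purpose: the backtracking case grows very quickly with n.
    text = build_sample(40)
    for name, pattern in (('safe', SAFE_RE), ('backtracking', BACKTRACKING_RE)):
        match, elapsed = time_search(pattern, text)
        print('%-12s matched=%-5s %.4fs' % (name, match is not None, elapsed))
```

Raising the tag count step by step should show the backtracking pattern's run time growing far faster than the input size, while the other pattern stays almost flat. Absolute timings will differ from the 2008 figures quoted above.
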
> 
> This can be fixed pretty easily by simply not using the group match at
> the end of the regex, but a plain tag match instead. (Diff and HG
> bundle attached.) While this does change looks_like_xml's behavior (it
> now only looks for the existence of an opening and closing tag, not
> for a closing tag that actually works), the test suite still passes,
> and I think the severity of the problems it can cause warrants the
> change.
> 
> Lastly, I owe thanks to Brodie Rao for helping to track down the
> problem and writing the replacement regex.
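
The attached diff is not reproduced in this message, so the snippet below is only a sketch of what the described change could look like: the closing-tag part of tag_re becomes a plain "</...>" match instead of "</\1>". The doctype_lookup_re shown here is a simplified stand-in (the real one is not quoted above), and the flags are passed as arguments rather than inline.

```python
import re

# Simplified stand-in for the doctype check; an assumption, since the real
# doctype_lookup_re is not shown in the message.
doctype_lookup_re = re.compile(
    r'\s*<!DOCTYPE\s+[a-zA-Z_][a-zA-Z0-9_]*[^>]*>',
    re.IGNORECASE | re.DOTALL,
)

# Assumed replacement: an opening tag followed later by any closing tag.
tag_re = re.compile(
    r'<(.+?)(\s.*?)?>.*?</.+?>',
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def looks_like_xml(text):
    """
    Check if a doctype exists or if we have some tags.
    """
    if doctype_lookup_re.match(text) is not None:
        return True
    return tag_re.search(text) is not None
```

With this variant, '<a>text</b>' is reported as XML-like even though the closing tag does not match the opening one, which is the behavior change mentioned above. Input without any closing tag, such as '<a> no close', is still rejected.
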


Thanks for the very detailed analysis -- I agree with you and have
applied your bundle to the repo.

cheers,
Georg

