Adam Gomaa schrieb: > First of all, the "guest/guest" auth doesn't seem to be working on > Pygments' Trac - I'd have submitted this there if possible. > > More importantly, pygments.util.tag_re, a regex that is notably used > in pygments.util.looks_like_xml - has the potential to be extremely > slow on moderate-to-large-sized files that 'look' like XML without > being valid. I noticed this today when browsing my code repo, when > looking at a Mako file which had an unrecognized extension - Apache > spun off to 100% CPU. After duplicating locally, I originally thought > that it was an infinite loop, but some debugging revealed that it was > actually stuck in the regex search. Here's the relevant code, both > snippets from pygments/util.py > > tag_re = re.compile(r'<(.+?)(\s.*?)?>.*?</\1>(?uism)') > # ... > def looks_like_xml(text): > """ > Check if a doctype exists or if we have some tags. > """ > m = doctype_lookup_re.match(text) > if m is not None: > return True > return tag_re.search(text) is not None > > This is being reached, by the way, through each lexer's analyze_text > method - in particular, the GenshiLexer and similar - which end up > calling XmlLexer.analyze_text, which itself calls looks_like_xml. > > Although I'm not sure, my best guess is that the regex ends up causing > backtracking due to the \1. While this is moderately slower for valid > XML, the true problem comes up for files that are not valid XML - > which, in the not unlikely event that what is being tested is not XML > (such as my case with Mako), is obviously going to happen: > > $ time ~/regex_speed_test.py 500 safe > <_sre.SRE_Match object at 0xb7dc01d0> > > real 0m0.013s > user 0m0.012s > sys 0m0.000s > $ time ~/regex_speed_test.py 500 backtracking > None > > real 0m52.576s > user 0m52.451s > sys 0m0.012s > > Also, note that the length of the last example isn't due to the > *nesting* of the tags - I admit 500 nested tags is unlikely in real > life. But the XML that is being tested isn't actually nested, it's a > bunch of opening tags and then non-matching closing tags (I use > <a0><a1>..</b1></b0> for the test). I think - although I can't pretend > to know enough about the re module's implementation to be sure - that > each failed closing tag is causing a backtrack. 500 total closing tags > in a file is much more plausible. > > This can be fixed pretty easily by simply not using the group match at > the end of the regex, but a plain tag match instead. (Diff and HG > bundle attached.) While this does change looks_like_xml's behavior (it > now only looks for the existence of an opening and closing tag, not > for a closing tag that actually works), the test suite still passes, > and I think the severity of the problems it can cause warrant the > change. > > Lastly, I owe thanks to Brodie Rao for helping to track down the > problem and writing the replacement regex.
Thanks for the very detailed analysis -- I agree with you and have applied your bundle to the repo. cheers, Georg --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "pocoo-libs" group. To post to this group, send email to [email protected] To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/pocoo-libs?hl=en -~----------~----~----~----~------~----~------~--~---
