Tim Peters added the comment:

There's actually enormous backtracking here.  Try this much shorter regexp and 
you'll see much the same behavior:

re_utf8 = r'^([\x00-\x7f]+)*$'

That's the original re_utf8 with all but the first alternative removed.

Passing s[0:34] "works" because the slice drops the trailing \x8d that 
prevents the regexp from matching the whole string.  When the whole string 
cannot match, the engine has to try every futile way of dividing the leading 
ASCII characters between the inner + and the outer * before it can give up, 
and the number of such divisions doubles with each added character.  As the 
much simpler re_utf8 above shows, it's not the alternatives in the regexp that 
matter here, it's the nested quantifiers.
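
To put a number on "futile combinations", here is a rough sketch, not taken 
from the original report.  It counts the ways a run of n characters can be 
cut into non-empty pieces, which is what ([\x00-\x7f]+)* allows for a run of 
ASCII characters.  The names NESTED, FLAT, count_splits and agree are made up 
for illustration, and FLAT is only one possible rewrite without nesting, 
assumed here to be an acceptable substitute for the simplified regexp.

```python
import re

# The simplified pattern from the comment: a quantified group whose body
# is itself quantified.
NESTED = re.compile(r'^([\x00-\x7f]+)*$')

# Hypothetical rewrite (an assumption, not from the report): the same
# character class with a single quantifier and no nesting.
FLAT = re.compile(r'^[\x00-\x7f]*$')


def count_splits(n):
    """Count the ways to cut a run of n characters into non-empty pieces.

    Each way is one assignment of characters to iterations of the outer *
    that a backtracking matcher may have to rule out before failing.
    """
    # ways[k] = number of ways to split the first k characters
    ways = [1] + [0] * n
    for end in range(1, n + 1):
        # The last piece can start at any earlier position.
        ways[end] = sum(ways[start] for start in range(end))
    return ways[n]


def agree(text):
    """True if both patterns accept or both reject the given text."""
    return bool(NESTED.match(text)) == bool(FLAT.match(text))


if __name__ == "__main__":
    # Keep n small: the count doubles with every extra character.
    for n in (1, 2, 4, 8, 16, 32):
        print(n, count_splits(n))
    # Short inputs only, so the nested pattern still finishes quickly.
    print(agree('abc'), agree('a' * 8 + '\x8d'))
```

The count is 2**(n-1) for n >= 1, so a few dozen ASCII characters in front of 
one bad byte are already enough to make a matcher that explores every split 
appear to hang.  With the single-quantifier form there is only one way to 
consume the run, so a failure is reported after one pass.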

nosy: +tim_one

Python tracker <rep...@bugs.python.org>
Python-bugs-list mailing list
