Bugs item #1737127, was opened at 2007-06-14 12:05
Message generated for change (Comment added) made by gbrandl
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1737127&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Regular Expressions
Group: None
>Status: Pending
Resolution: None
Priority: 5
Private: No
Submitted By: Arno Bakker (abakker)
Assigned to: Gustavo Niemeyer (niemeyer)
Summary: re.findall hangs python completely

Initial Comment:
Running a re.findall() on 40 KB of HTML appears to hang python completely. It hogs the CPU (perhaps not unexpected), but other python threads do not continue and pressing Ctrl-C does not trigger a KeyboardInterrupt. Only a SIGQUIT (Ctrl-\) can kill it.

Attached is a small script to illustrate the problem, and the data file that causes it to hang. Using 40 KB of random data does let it get past the first findall.

The script creates a Thread that should print out hashes continuously; however, as soon as the MainThread hits the findall, the printing stops.

Occurs on Python-2.4.4 (direct from www.python.org) and 2.5.1 (2.5.1-0ubuntu1 from Feisty)

----------------------------------------------------------------------

>Comment By: Georg Brandl (gbrandl)
Date: 2007-06-19 12:44

Message:
Logged In: YES
user_id=849994
Originator: NO

This is quite normal for regular expressions with a lot of backtracking permutations to try, and a big string to search in. You should try to optimize your REs. As for the threads, re doesn't release the GIL while searching; that's another bug report.

----------------------------------------------------------------------

Comment By: Gregory Smith (gregsmith)
Date: 2007-06-18 17:23

Message:
Logged In: YES
user_id=292741
Originator: NO

First off, don't expect other threads to run during re execution. Multi-threading in python is mainly to allow one thread to run while the others are waiting for I/O or doing a time.sleep() or something specific like that. Switching between runnable threads only occurs in the interpreter loop.
There may be exceptions to allow switching during some really long core operations (a mutex needs to be released and taken again), but it has to be done under certain conditions so that threads won't mess each other's data up.

So, on to the r.e.: first, try changing all the .*? to just .* -- the ? is redundant and may be increasing the runtime by expanding the number of permutations that are being tried.

But I think your real trouble is all of these:

img src=\"(.*?)\"

This allows the second " to match with anything at all between, including any number of quoted strings. Your combination of several of these may be causing the RE engine to spend a huge amount of time looking at many different combinations for the first few .*?, all of which fail by the time you get to the last one.

Try

img src=\"([^"]*)\"

instead; this will only match the pair of " with no " in between.

Likewise, in .*?> the .* will match any number of '>' chars if this is needed to make the whole thing match, which is probably not what you want.

You might get it to work just by turning off 'greedy' matching for '*'.

----------------------------------------------------------------------

Comment By: Arno Bakker (abakker)
Date: 2007-06-14 12:06

Message:
Logged In: YES
user_id=216477
Originator: YES

File Added: page.html

----------------------------------------------------------------------
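
A minimal sketch of the difference between img src=\"(.*?)\" and img src=\"([^"]*)\" described in Gregory Smith's comment. The HTML snippet and the trailing " width" in both patterns are hypothetical stand-ins chosen for illustration; the script and page.html attached to the report are not reproduced here, so this is not the submitter's actual expression.

```python
import re

# Hypothetical sample input (assumption, not the attached page.html).
html = '<img src="a.png" alt="x"><img src="b.png" width="5">'

# Lazy dot: (.*?) prefers the shortest match, but it may still run past a
# closing quote when that is the only way for the rest of the pattern
# (here the literal ' width') to succeed.
lazy = re.findall(r'img src="(.*?)" width', html)

# Negated class: ([^"]*) can never cross a double quote, so the first
# <img> is rejected quickly and only the second one matches.
strict = re.findall(r'img src="([^"]*)" width', html)

print(lazy)
print(strict)
```

With this input the lazy pattern captures a.png" alt="x"><img src="b.png in a single group, while the negated class captures only b.png. On a large page with several such groups chained together, each lazy group has many possible end points to try, which is the kind of backtracking both commenters point to.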