On Sun, 25 Feb 2007 05:21:49 -0300, Christian Sonne <[EMAIL PROTECTED]> wrote:

> Long story short, I'm trying to find all ISBN-10 numbers in a multiline
> string (approximately 10 pages of a normal book), and as far as I can
> tell, the *correct* thing to match would be this:
> ".*\D*(\d{10}|\d{9}X)\D*.*"

Why the .* at the start and end? You don't want to match those, and they
make your regexp slow.

You didn't say exactly what an ISBN-10 number looks like, but if you want
to match 10 digits, or 9 digits followed by an X:

    reISBN10 = re.compile(r"\d{10}|\d{9}X")

That is, just the () group in your expression. But perhaps this other one
is better (I think it should be faster, but you should measure it):

    reISBN10 = re.compile(r"\d{9}[\dX]")

("Nine digits followed by another digit or an X")

> if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
> losing results, but it runs in about 0.3 seconds

Using my suggested expressions you might match some garbage, but you won't
lose anything (except two ISBN numbers joined together without any
separator in between). That assumes you have stripped all the "-", as you
said.

> So what's the deal? - why would it take so long to run the correct one?
> - especially when a slight modification makes it run as fast as I'd
> expect from the beginning...

Those .* make the expression match a LOT of text at first, only to discard
it in the next step.

--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list
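
For reference, here is a minimal sketch of how the second suggested pattern could be used with findall to collect every candidate from a multiline string. The function name find_isbn10 and the sample text are made up for illustration; they are not from the original post. It assumes the hyphens have already been stripped, and it does no checksum validation, so it can return garbage such as the first ten digits of a longer number.

```python
import re

# Pattern suggested in the reply: nine digits followed by a digit or an X.
reISBN10 = re.compile(r"\d{9}[\dX]")

def find_isbn10(text):
    """Return every ISBN-10 candidate in text (hyphens assumed already stripped)."""
    # findall scans the whole string, newlines included, so no leading or
    # trailing .* is needed to "skip" the surrounding text.
    return reISBN10.findall(text)

# Hypothetical sample text, invented for this example.
sample = "First 0306406152, then\n080442957X on the next line, and 12345 is too short."
print(find_isbn10(sample))
```

Since the pattern is not anchored to word boundaries, a longer digit run such as "12345678901234" would yield "1234567890" as a match; that is the "some garbage" case mentioned above.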