On Jul 22, 7:56 am, Gilles Ganault <[EMAIL PROTECTED]> wrote:
> On Sat, 21 Jul 2007 22:18:56 -0400, Carsten Haese
> <[EMAIL PROTECTED]> wrote:
> >That's your problem right there. RE is not the right tool for that job.
> >Use an actual HTML parser such as BeautifulSoup
>
> Thanks a lot for the tip. I tried it, and it does look interesting,
> although I've been unsuccessful using a regex with BS to find all
> occurences of the pattern.
>
> Incidently, as far as using Re alone is concerned, it appears that
> re.MULTILINE isn't enough to get Re to include newlines: re.DOTLINE
> must be added.
>
> Problem is, when I add re.DOTLINE, the search takes less than a second
> for a 500KB file... and about 1mn30 for a file that's 1MB, with both
> files holding similar contents.
>
> Why such a huge difference in performance?
>
> pattern = "<span class=.?defaut.?>(\d+:\d+).*?</span>"

That .*? can really slow things down if the rest of the pattern can't be
found. With re.DOTALL (that is the flag's actual name; there is no
re.DOTLINE), the matcher may scan all the way to the end of the file
looking for </span>, fail, and then start over from the next candidate
position. Without DOTALL it would only look as far as the end of the
line, so performance would stay bearable.

For example, your 1MB file might contain something like

    '<span class=defaut>13:34< /span>'*10000

Because "< /span>" doesn't match "</span>", the matcher would scan to the
end of the file for </span> without finding it, then move on to the next
occurrence of '<span class=...' and try its luck from there. In that case
the total work grows roughly quadratically with the file size. I'd have
to see the 1MB file's contents to make a better guess about what goes
wrong.

If the span's contents don't have nested elements (like <i></i>), you
could use a negated character class:

    "<span class=.?defaut.?>(\d+:\d+)[^<]*</span>"

This pattern should be fast for all inputs because [^<]* can't keep
matching until the end of the file; it stops at the next "<".

Or, if you don't care about anything but those numbers, just match this:

    "<span class=.?defaut.?>(\d+:\d+)"
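
If you want to see the difference for yourself, here is a minimal sketch. It assumes input shaped like the malformed example above; the test strings, the repeat count of 1000 and the timed() helper are made up for illustration and are not taken from your file. The patterns are written as raw strings so the backslashes reach re untouched.

```python
import re
import time

# Hypothetical inputs: well-formed spans, and spans whose closing tag is
# malformed ("< /span>"), so "</span>" never occurs in bad_doc.
good_doc = "<span class=defaut>13:34</span>\n" * 3
bad_doc = "<span class=defaut>13:34< /span>\n" * 1000

# Original pattern: with DOTALL, the lazy .*? may run to the end of the input.
lazy = re.compile(r"<span class=.?defaut.?>(\d+:\d+).*?</span>", re.DOTALL)
# Negated character class: [^<]* stops at the next "<".
negated = re.compile(r"<span class=.?defaut.?>(\d+:\d+)[^<]*</span>")
# Only the part that is actually captured.
prefix_only = re.compile(r"<span class=.?defaut.?>(\d+:\d+)")


def timed(pattern, text):
    """Return (matches, elapsed seconds) for pattern.findall(text)."""
    start = time.perf_counter()
    matches = pattern.findall(text)
    return matches, time.perf_counter() - start


if __name__ == "__main__":
    for name, pattern in [("lazy", lazy), ("negated", negated),
                          ("prefix_only", prefix_only)]:
        matches, elapsed = timed(pattern, bad_doc)
        print(f"{name:12s} {len(matches):5d} matches in {elapsed:.4f}s")
```

On the malformed input, the lazy pattern has to rescan the rest of the text from every '<span class=...' it finds, while the other two give up (or succeed) within a few characters. Raising the repeat count should make the gap grow quickly, so increase it in small steps.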