In article <[EMAIL PROTECTED]>, Christian Sonne <[EMAIL PROTECTED]> wrote:
> Thanks to all of you for your replies - they have been most helpful, and
> my program is now running at a reasonable pace...
>
> I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
> turns out to misbehave in further testing, I'll know where to turn :-P

Anything with variable-length wildcard matching (*, +, ?) is going to drag
your performance down. There was an earlier thread on this very topic.

Another stupid question: how are you planning on handling ISBNs formatted
with hyphens for readability? (There's a short sketch of one option just
before the timing code below.)

In general I've found the following factors to be critical in getting good
performance from re:

1: Reducing the number of times you call re.match or re.search.
2: Reducing the number of bytes that re has to search through.
3: Minimizing the use of wildcards in the expression.

If you can pre-filter your input with string.find before running re.match,
you will improve performance quite a bit compared to running re expressions
over all 10 pages.

I played around a bit and attached some example code below that searches
21K of text for the ISBN number. testPrefilter() runs in about 1/5th the
execution time of line-by-line re calls, or of a single re call over the
whole 21K string. Interestingly, this ratio holds up even for something as
big as Moby Dick.

The searchLabels() function below beats all the other functions by
searching for "ISBN" or "International Book" and then running the re only
on the surrounding 500 bytes. You might also try searching for "Copyright"
or "Library of Congress", since most modern books will have it all on the
same page.

A caveat here is that this only works if you can find a reasonably unique
string at or near what you want to find with re. If you need to run
re.search on every byte of the file anyway, this isn't going to help.
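On the hyphen question: the cheapest fix I can think of is to strip the
hyphens with a plain string operation before running the same pattern,
rather than adding optional-hyphen groups to the regex (which would bring
back exactly the variable-length matching that hurts performance). This is
just a rough sketch, untested against real catalogue data, and the helper
name is mine:

#!/usr/bin/env python
import re

isbn10Re = re.compile(r"\b\d{9}[0-9X]\b")

def findHyphenatedIsbn(text):
    """Strip hyphens so '0-672-32897-6' style ISBNs match the plain pattern.
    Note this is a blunt instrument: it also joins any other hyphenated
    runs of digits, so it can in principle create false positives."""
    collapsed = text.replace("-", "")
    match = isbn10Re.search(collapsed)
    if match:
        return match.group(0)
    return None

print findHyphenatedIsbn("ISBN 0-672-32897-6 (paperback)")   # -> 0672328976

One catch: removing hyphens shifts character offsets, so if you need the
match position in the original text you would have to map it back yourself.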
--------------- timing test code ---------------

#!/usr/bin/env python
from timeit import Timer
import re

textString = """The text of a sample page using with an ISBN 10 number
ISBN 0672328976 and some more text to compare."""

# add the full text of Moby Dick to make the search functions
# work for their bread.
fileHandle = open("./mobey.txt")
junkText = fileHandle.readlines()
junkText.append(textString)
textString = ''.join(junkText)
#print textString

# compile the regex
isbn10Re = re.compile(r"\b\d{9}[0-9X]\b")

def testPrefilter():
    """Work through a pre-loaded array running re only on lines
    containing ISBN"""
    for line in junkText:
        # cheap pre-filter: only run re on lines containing 'ISBN'
        if line.find('ISBN') > -1:
            thisMatch = isbn10Re.search(line)
            if thisMatch:
                return thisMatch.group(0)

def testNofilter():
    """Run re.search on every line."""
    for line in junkText:
        # searching using RE on every line, no pre-filter
        thisMatch = isbn10Re.search(line)
        if thisMatch:
            return thisMatch.group(0)

def testFullre():
    """Run re.search on a long text string."""
    thisMatch = isbn10Re.search(textString)
    if thisMatch:
        return thisMatch.group(0)

def searchLabels():
    # identify some text that might be near an ISBN number.
    isbnLabels = ["ISBN", "International Book"]
    # use the fast string.find method to locate those
    # labels in the text
    isbnIndexes = [textString.find(x) for x in isbnLabels]
    # exclude labels not found in the text.
    isbnIndexes = [y for y in isbnIndexes if y > -1]
    # run re.search on a 500 character window around each label
    # (clamp the lower bound so a label near the start of the text
    # doesn't produce a negative slice index).
    for x in isbnIndexes:
        thisMatch = isbn10Re.search(textString[max(0, x - 250):x + 250])
        if thisMatch:
            return thisMatch.group(0)

#print searchLabels()
#print testPrefilter()
#print testNofilter()

t = Timer("testNofilter()", "from __main__ import testNofilter")
print t.timeit(100)
u = Timer("testPrefilter()", "from __main__ import testPrefilter")
print u.timeit(100)
v = Timer("testFullre()", "from __main__ import testFullre")
print v.timeit(100)
w = Timer("searchLabels()", "from __main__ import searchLabels")
print w.timeit(100)

--
http://mail.python.org/mailman/listinfo/python-list