Paul Rubin, 04.02.2010 02:51: > John Nagle writes: >> Analysis of each domain is >> performed in a separate process, but each process uses multiple >> threads to read process several web pages simultaneously. >> >> Some of the threads go compute-bound for a second or two at a time as >> they parse web pages. > > You're probably better off using separate processes for the different > pages. If I remember, you were using BeautifulSoup, which while very > cool, is pretty doggone slow for use on large volumes of pages. I don't > know if there's much that can be done about that without going off on a > fairly messy C or C++ coding adventure. Maybe someday someone will do > that.
Well, if multi-core performance is so important here, then there's a pretty simple thing the OP can do: switch to lxml. http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/ Stefan -- http://mail.python.org/mailman/listinfo/python-list