On 07/02/2008, Alexander Harrowell <[EMAIL PROTECTED]> wrote:
> To clarify, I use BeautifulSoup for a small project that parses
> frequently changing HTML on a number of websites (>1MB each), extracts
> the content of specific tags, filters out certain strings from the
> content, and serves it up in a consistent format. The input HTML comes
> from the wild, and often contains odd tags, funny characters, and other
> inconsistencies.
>
> It has so far worked near-perfectly for the last 9 months. Speed appears
> to be a conventional problem with BS, which is why I mentioned it, but
> when I analysed the code in an effort to speed it up I discovered that
> 90%+ of the time taken was accounted for by network latency in getting
> the data from the remote sites.
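For readers following along, that sort of "extract specific tags, filter
certain strings" job is only a few lines with BeautifulSoup. A minimal
sketch, using the pre-bs4 package of the day; the URL, tag name and
filter string are invented for illustration:

    import urllib2
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 package name

    # Fetch one page from the wild (hypothetical URL).
    html = urllib2.urlopen('http://example.com/page.html').read()

    # BeautifulSoup tolerates odd tags and broken nesting in the input.
    soup = BeautifulSoup(html)

    # Pull the content of specific tags, filtering out unwanted strings.
    for div in soup.findAll('div', {'class': 'content'}):
        text = div.renderContents()
        if 'advertisement' not in text:
            print text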
FWIW, we parse tens of thousands of pages every week to let people
republish content as nice PDFs. Beautiful Soup was the only thing that
made this sane, as many pages are not structured to be easy to parse.
Like you, we found the network was the limit, and simply kicking off
several scraping processes in parallel solved that (e.g. one run of a
script parses hotels from A-F, the next from G-M, and so on; there's a
sketch in the P.S. below). I can't imagine using anything else.

Best Regards,

--
Andy Robinson
CEO/Chief Architect
ReportLab Europe Ltd.
165 The Broadway, Wimbledon, London SW19 1NE, UK
Tel +44-20-8544-8049
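P.S. For the curious, here is a minimal sketch of that sharding
approach, assuming a hypothetical scrape_shard.py and an invented hotel
list (the real one obviously lives elsewhere):

    # scrape_shard.py -- each invocation handles one alphabetical slice,
    # so several can run side by side without shared state, e.g.:
    #   python scrape_shard.py A F &
    #   python scrape_shard.py G M &
    import sys
    import urllib2
    from BeautifulSoup import BeautifulSoup

    # Invented (name, url) pairs standing in for the real hotel list.
    HOTELS = [
        ('Alpha Hotel', 'http://example.com/alpha.html'),
        ('Grand Hotel', 'http://example.com/grand.html'),
        ('Zenith Hotel', 'http://example.com/zenith.html'),
    ]

    start, end = sys.argv[1].upper(), sys.argv[2].upper()
    for name, url in HOTELS:
        if start <= name[0].upper() <= end:
            soup = BeautifulSoup(urllib2.urlopen(url).read())
            title = soup.title.string if soup.title else '(no title)'
            print name, '->', title

Since the bottleneck is latency rather than CPU, even a few such
processes keep the pipe busy, and the alphabetical split means they
never step on each other's work.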