In article <37da38d2-09a8-4fd2-94b4-5feae9675...@k1g2000yqf.googlegroups.com>,
Filip <pink...@gmail.com> wrote:
>
>I tried to fix that with BeautifulSoup + regexp filtering of some
>particular cases I encountered. That was slow and after running my
>data scraper for some time a lot of new problems (exceptions from
>xpath parser) were showing up. Not to mention that BeautifulSoup
>stripped almost all of the content from some heavily broken pages
>(50+KiB page stripped down to some few hundred bytes). Character
>encoding conversion was a hell too - even UTF-8 pages had some
>non-standard characters causing issues.

Have you tried lxml?
--
Aahz (a...@pythoncraft.com)           <*>         http://www.pythoncraft.com/

"At Resolver we've found it useful to short-circuit any doubt and just
refer to comments in code as 'lies'. :-)"
--Michael Foord paraphrases Christian Muirhead on python-dev, 2009-03-22
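
For anyone who wants to try that suggestion, here is a minimal sketch of how lxml might be pointed at a broken page. It assumes lxml is installed; the sample markup, the `item` class, and the variable names are made up for illustration and are not from the original scraper. Passing raw bytes rather than a decoded string leaves encoding detection to the parser, which may or may not help with the UTF-8 problems described above.

```python
from lxml import html

# Hypothetical broken page: unclosed <p>, <b> and <div> tags, no </body> or </html>.
broken = (
    b"<html><body>"
    b"<div class='item'><p>First<p>Second <b>bold</div>"
    b"<div class='item'>Third"
)

# lxml.html uses libxml2's recovering HTML parser, so malformed markup
# is repaired into a tree instead of raising an exception.
root = html.fromstring(broken)

# XPath runs directly on the repaired tree; no separate cleanup pass.
texts = [div.text_content().strip() for div in root.xpath("//div[@class='item']")]

for text in texts:
    print(text)
```

Whether this copes with the heavily broken pages mentioned above would have to be checked against the real data; the sketch only shows that tag-soup input and XPath queries can go through the same library.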