John Nagle, 11.03.2012 21:30:
> "html5lib" is apparently not thread safe.
> (see "http://code.google.com/p/html5lib/issues/detail?id=189")
> Looking at the code, I've only found about three problems.
> They're all the usual "cached in a global without locking" bug.
> A few locks would fix that.
>
> But html5lib calls the XML SAX parser. Is that thread-safe?
> Or is there more trouble down at the bottom?
>
> (I run a multi-threaded web crawler, and currently use BeautifulSoup,
> which is thread safe, although dated. I'm looking at converting to
> html5lib.)

You may also consider moving to lxml. BeautifulSoup supports it as a parser
backend these days, so you wouldn't even have to rewrite your code to use
it. And performance-wise, well ...

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Stefan