Comment #3 on issue 189 by na...@animats.com: HTMLParser is not threadsafe
http://code.google.com/p/html5lib/issues/detail?id=189

This is clearly a defect. This is an object-oriented library in an object-oriented language. Two parsers should be completely independent of each other, with no shared global variables, and thus thread-safe. If that's not the case, this is a defect.

Do I have to scrap my plans to convert a parallel web crawler from BeautifulSoup to html5lib?

This looks fixable. The trouble spots include at least these global variables:

dom.py: moduleCache

That could be easily fixed with a lock in getDomModule. That's a once-per-parse event, so there's no performance issue. All that's needed is

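import threading
...
Lok = threading.Lock()   # one module-level lock, created once at import time

with Lok:                # use the lock object itself; a Lock is not callable
  ... critical section...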


etree.py: moduleCache

Same issue.
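
For both caches, the idea could be sketched roughly as follows. This is only an illustration under assumptions, not html5lib's actual code: the names _module_cache, _module_cache_lock and get_cached_module are hypothetical, and factory stands in for whatever builds the per-implementation module.

```python
import threading

# Hypothetical stand-ins for the module-level cache in dom.py / etree.py.
_module_cache = {}
_module_cache_lock = threading.Lock()


def get_cached_module(key, factory):
    """Return the cached object for key, building it at most once.

    The lookup and the insert both happen while holding the lock, so two
    threads cannot both see a miss and build the same entry twice.
    """
    with _module_cache_lock:
        if key not in _module_cache:
            _module_cache[key] = factory(key)
        return _module_cache[key]
```

Since the lock is only taken once per parse, the cost should be negligible next to the parse itself.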

etree.lxml: fullTree

This seems to be set only once, at load time. Is it changed elsewhere?

What have I missed? Some lower level library? Is Python's SAX parser unsafe?

This can and should be fixed.



