Comment #3 on issue 189 by na...@animats.com: HTMLParser is not threadsafe
http://code.google.com/p/html5lib/issues/detail?id=189
This is clearly a defect. This is an object-oriented library in an
object-oriented language. Two parsers should be completely independent of each
other, with no shared global variables, and thus thread-safe. If that's not
the case, this is a defect.
Do I have to scrap my plans to convert a parallel web crawler from
BeautifulSoup to html5lib?
This looks fixable. The trouble spots include at least these global
variables:
dom.py: moduleCache
That could easily be fixed with a lock in getDomModule. That's a once-per-
parse event, so there's no performance issue. All that's needed is
    import threading
    ...
    Lok = threading.Lock()
    with Lok:
        ... critical section ...
etree.py: moduleCache
Same issue.
etree.lxml: fullTree
This seems to be set only once, at load time. Is it changed elsewhere?
What have I missed? Some lower-level library? Is Python's SAX parser
unsafe?
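For the two moduleCache cases, the locking pattern might look roughly like the sketch below. It assumes the cache is a plain dict; module_cache, module_cache_lock, get_cached_module and build are hypothetical names for illustration, not html5lib's actual code.

```python
import threading

# Hypothetical stand-ins for a module-level cache and the lock guarding it.
module_cache = {}
module_cache_lock = threading.Lock()


def get_cached_module(key, build):
    """Return the cached value for key, calling build(key) at most once."""
    # The check and the insert happen under the same lock, so two threads
    # can never both see a missing key and both build an entry for it.
    with module_cache_lock:
        if key not in module_cache:
            module_cache[key] = build(key)
        return module_cache[key]
```

Holding the lock while building keeps the sketch simple. Since the lookup happens once per parse, the contention cost should be small.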
This can and should be fixed.