Hi,

First, sorry for posting this here: I closed my SourceForge account a few months ago and I can't get it reopened...

I'm using Python 2.2.1, but a diff against SVN showed no change at this level, so the following bug should still be there in current versions (I'll try with 2.4 at work tomorrow). On my machine, xml.sax.make_parser returns an xml.sax.expatreader.ExpatParser instance.

The problem is: I'm never given END_DOCUMENT events.

Code to reproduce:

    from xml.dom.pulldom import parseString
    reader = parseString('<element attribute="value">text</element>')
    # The following 2 lines will produce, in order:
    # START_DOCUMENT, START_ELEMENT, TEXT, END_ELEMENT
    # Note the lack of the END_DOCUMENT event.
    for event,node in reader:
        print event
    # The following line will get an END_DOCUMENT event
    print reader.getEvent()[0]
    # The following line will throw a SAXParseException,
    # because the SAX parser's close method has been
    # called twice
    print reader.getEvent()[0]

Cause:

The xml.dom.pulldom.DOMEventStream.getEvent method, when it has no more events in its internal stack, calls the SAX parser's close() method (which is OK) and then immediately returns None, ignoring any event that the call to close() may have generated. If you call getEvent again later, it hands you the remaining events until there are none left, and then calls the SAX parser's close() method a second time, causing a SAXParseException.

Because expat (and maybe other parsers too) has no way to know when the document ends, it generates the endDocument/END_DOCUMENT event only when explicitly told that the XML chunk is the final one (i.e. when the close() method is called).

Proposed fix:

Add a "parser_closed" attribute to the DOMEventStream class, initialized to False. After calling self.parser.close() in the xml.dom.pulldom.DOMEventStream.getEvent method, immediately set this "parser_closed" attribute to True and proceed. Finally, at the beginning of the "while" loop, immediately return None if "parser_closed" is True, to prevent a second call to self.parser.close().
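
The guard described above can be tried out in isolation. The following is only a minimal sketch, not pulldom code: ToyParser and GuardedStream are hypothetical stand-ins invented for illustration, and the only assumption is that the parser queues its final event inside close() and raises if close() is called twice.

```python
class ToyParser(object):
    """Hypothetical stand-in for an incremental SAX parser (not expat)."""

    def __init__(self, events):
        self.events = events      # shared list that receives the events
        self.closed = False

    def feed(self, data):
        # Pretend that every chunk fed to the parser yields one event.
        self.events.append("CHUNK:" + data)

    def close(self):
        # Assumption: only close() can signal the end of the document,
        # and closing twice is an error.
        if self.closed:
            raise RuntimeError("parser closed twice")
        self.closed = True
        self.events.append("END_DOCUMENT")


class GuardedStream(object):
    """Hypothetical event stream using a parser_closed guard."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        self.pending = []
        self.parser = ToyParser(self.pending)
        self.parser_closed = False

    def getEvent(self):
        while not self.pending:
            if self.parser_closed:
                return None             # nothing left; never close() again
            if self.chunks:
                self.parser.feed(self.chunks.pop(0))
            else:
                self.parser.close()     # may queue a final event
                self.parser_closed = True
        return self.pending.pop(0)


stream = GuardedStream(["a", "b"])
events = [stream.getEvent() for _ in range(5)]
```

Here events comes out as ['CHUNK:a', 'CHUNK:b', 'END_DOCUMENT', None, None]: the event queued by close() is delivered, and later calls return None without touching the parser again.
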
With this change, any call to getEvent when there are no events left will return None and never throw an exception, which I think is the expected behavior.

Proposed code:

The "parser_closed" attribute is initialized in the "__init__" method:

    def __init__(self, stream, parser, bufsize):
        self.stream = stream
        self.parser = parser
        self.parser_closed = False
        self.bufsize = bufsize
        if not hasattr(self.parser, 'feed'):
            self.getEvent = self._slurp
        self.reset()

The "getEvent" method becomes:

    def getEvent(self):
        # use IncrementalParser interface, so we get the desired
        # pull effect
        if not self.pulldom.firstEvent[1]:
            self.pulldom.lastEvent = self.pulldom.firstEvent
        while not self.pulldom.firstEvent[1]:
            if self.parser_closed:
                return None
            buf = self.stream.read(self.bufsize)
            if buf:
                self.parser.feed(buf)
            else:
                self.parser.close()
                self.parser_closed = True
        rc = self.pulldom.firstEvent[1][0]
        self.pulldom.firstEvent[1] = self.pulldom.firstEvent[1][1]
        return rc

The same problem seems to exist in the xml.dom.pulldom.DOMEventStream._slurp method, used when the SAX parser is not an IncrementalParser, as the parser's close() method is never called there. I suggest adding a call to the close() method in there. However, as I only have expat as an option, and it implements IncrementalParser, I can't test it...

The _slurp method would become:

    def _slurp(self):
        """ Fallback replacement for getEvent() using the
            standard SAX2 interface, which means we slurp the
            SAX events into memory (no performance gain, but
            we are compatible to all SAX parsers).
        """
        self.parser.parse(self.stream)
        self.parser.close()
        self.getEvent = self._emit
        return self._emit()

The _emit method raises an exception when there are no events left, so I propose changing it to:

    def _emit(self):
        """ Fallback replacement for getEvent() that emits
            the events that _slurp() read previously.
        """
        if not self.pulldom.firstEvent[1]:
            return None
        rc = self.pulldom.firstEvent[1][0]
        self.pulldom.firstEvent[1] = self.pulldom.firstEvent[1][1]
        return rc

Hope this helps.

-- 
Thomas Broyer
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com