Hm... Perhaps the xml-sig would be a better place to discuss this?

On 6/11/06, Thomas Broyer <[EMAIL PROTECTED]> wrote:
> Hi,
>
> First, sorry for posting this here, I closed my SourceForge account a
> few months ago and I can't get it reopened...
>
> I'm using python 2.2.1 but a diff on SVN showed that there was no
> change at this level, so the following bug should still be there in
> current versions (I'll try with a 2.4 at work tomorrow). On my
> machine, xml.sax.make_parser returns an
> xml.sax.expatreader.ExpatParser instance.
>
> The problem is: I'm never given END_DOCUMENT events.
>
> Code to reproduce:
>
> from xml.dom.pulldom import parseString
> reader = parseString('<element attribute="value">text</element>')
> # The following 2 lines will produce, in order:
> # START_DOCUMENT, START_ELEMENT, TEXT, END_ELEMENT
> # Note the lack of the END_DOCUMENT event.
> for event,node in reader:
>     print event
> # The following line will get an END_DOCUMENT event
> print reader.getEvent()[0]
> # The following line will throw a SAXParseException,
> # because the SAX parser's close method has been
> # called twice
> print reader.getEvent()[0]
>
>
> Cause:
>
> The xml.dom.pulldom.DOMEventStream.getEvent method, when it has no
> more events in its internal stack, calls the SAX parser's close()
> method (which is OK) then immediately returns 'None', ignoring any
> event that could have been generated by the call to the close()
> method. If you call getEvent later, it will send you the remaining
> events until there are no more left, and then will call the SAX
> parser's close() method again, causing a SAXParseException.
> Because expat (and maybe other parsers too) has no way to know when the
> document ends, it generates the endDocument/END_DOCUMENT event only
> when explicitly told that the XML chunk is the final one (i.e. when
> the close() method is called).
>
>
> Proposed fix:
>
> Add a "parser_closed" attribute to the DOMEventStream class,
> initialized to "False". After having called self.parser.close() in the
> xml.dom.pulldom.DOMEventStream.getEvent method, immediately set this
> "parser_closed" attribute to True and proceed. Finally, at the
> beginning of the "while" loop, immediately return "None" if
> "parser_closed" is "True" to prevent a second call to
> self.parser.close().
> With this change, any call to getEvent when there are no events left
> will return None and never throw an exception, which I think is the
> expected behavior.
>
>
> Proposed code:
>
> The "parser_closed" attribute is initialized in the "__init__" method:
>     def __init__(self, stream, parser, bufsize):
>         self.stream = stream
>         self.parser = parser
>         self.parser_closed = False
>         self.bufsize = bufsize
>         if not hasattr(self.parser, 'feed'):
>             self.getEvent = self._slurp
>         self.reset()
>
> The "getEvent" method becomes:
>     def getEvent(self):
>         # use IncrementalParser interface, so we get the desired
>         # pull effect
>         if not self.pulldom.firstEvent[1]:
>             self.pulldom.lastEvent = self.pulldom.firstEvent
>         while not self.pulldom.firstEvent[1]:
>             if self.parser_closed:
>                 return None
>             buf = self.stream.read(self.bufsize)
>             if buf:
>                 self.parser.feed(buf)
>             else:
>                 self.parser.close()
>                 self.parser_closed = True
>         rc = self.pulldom.firstEvent[1][0]
>         self.pulldom.firstEvent[1] = self.pulldom.firstEvent[1][1]
>         return rc
>
> The same problem seems to exist in the
> xml.dom.pulldom.DOMEventStream._slurp method, when the SAX parser is
> not an IncrementalParser, as the parser's close() method is never
> called. I suggest adding a call to the close() method in there.
> However, as I only have expat as an option, which implements
> IncrementalParser, I can't test it...
> The _slurp method would become:
>     def _slurp(self):
>         """ Fallback replacement for getEvent() using the
>             standard SAX2 interface, which means we slurp the
>             SAX events into memory (no performance gain, but
>             we are compatible to all SAX parsers).
>         """
>         self.parser.parse(self.stream)
>         self.parser.close()
>         self.getEvent = self._emit
>         return self._emit()
> The _emit method raises exceptions when there are no events left, so I
> propose changing it to:
>     def _emit(self):
>         """ Fallback replacement for getEvent() that emits
>             the events that _slurp() read previously.
>         """
>         if not self.pulldom.firstEvent[1]:
>             return None
>         rc = self.pulldom.firstEvent[1][0]
>         self.pulldom.firstEvent[1] = self.pulldom.firstEvent[1][1]
>         return rc
>
> Hope this helps.
>
> --
> Thomas Broyer
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> http://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> http://mail.python.org/mailman/options/python-dev/guido%40python.org
>
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
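
For readers who want to see the "parser_closed" guard from the quoted proposal in isolation, here is a minimal sketch in modern Python 3. It does not use or patch xml.dom.pulldom: ToyParser and ToyEventStream are hypothetical stand-ins invented for this illustration, and the only behavior borrowed from the report is that the final event is produced inside close() and that a second close() is an error.

```python
from collections import deque


class ToyParser:
    """Hypothetical stand-in for an incremental SAX parser (not expat)."""

    def __init__(self, events):
        self.events = events  # queue shared with the event stream
        self.closed = False

    def feed(self, chunk):
        # One event per chunk, purely for illustration.
        self.events.append(("CHUNK", chunk))

    def close(self):
        # Mirrors the reported failure mode: closing twice is an error.
        if self.closed:
            raise RuntimeError("close() called twice")
        self.closed = True
        # As described for expat, the end of the document is only
        # known (and reported) once close() is called.
        self.events.append(("END_DOCUMENT", None))


class ToyEventStream:
    """Hypothetical pull-style stream using the proposed guard flag."""

    def __init__(self, chunks):
        self.chunks = deque(chunks)
        self.events = deque()
        self.parser = ToyParser(self.events)
        self.parser_closed = False

    def getEvent(self):
        while not self.events:
            if self.parser_closed:
                # Nothing queued and nothing left to read.
                return None
            if self.chunks:
                self.parser.feed(self.chunks.popleft())
            else:
                # Do not return here: close() may have queued events,
                # so let the loop condition re-check the queue.
                self.parser.close()
                self.parser_closed = True
        return self.events.popleft()


stream = ToyEventStream(["a", "b"])
seen = []
event = stream.getEvent()
while event is not None:
    seen.append(event[0])
    event = stream.getEvent()

# Further calls after exhaustion return None instead of raising.
after_end = [stream.getEvent(), stream.getEvent()]
print(seen, after_end)
```

The point of the design is the ordering inside the loop: the flag is set after close(), and the emptiness check runs again before returning, so an event queued by close() is delivered first and only later calls return None.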