John Nagle wrote:
> Note what happens when a bad declaration is found.
> SGMLParser.parse_declaration raises SGMLParseError, and the exception
> handler just sucks up the rest of the input (note that "rawdata[i:]"),
> treats it as unparsed data, and advances the position to the end of input.
>
> That's too brutal. One bad declaration and the whole parse is messed up.
> Something needs to be done at the BeautifulSoup level to get the parser
> back on track. Maybe suck up input until the next ">", treat that as data,
> then continue parsing from that point. That will do the right thing most
> of the time, although bad declarations containing a ">" will still be
> misparsed.
>
> How about this patch?
>
>     except SGMLParseError:                # bad decl, must recover
>         k = self.rawdata.find('>', i)     # find next ">"
>         if k == -1 :                      # if no find
>             k = len(self.rawdata)         # use entire string
>         toHandle = self.rawdata[i:k]      # take up to ">" as data
>         self.handle_data(toHandle)        # treat as data
>         j = i + len(toHandle)             # pick up parsing after ">"
>

I've been testing this, and it's improved parsing considerably. Now,
common lines like

    <!This is an invalid comment>

don't stop parsing.

                John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list
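For anyone who wants to poke at the recovery logic without the parser around it, here is a minimal standalone sketch. The function name `recover_from_bad_declaration` and the sample input are made up for illustration; they are not part of BeautifulSoup or sgmllib. It only mirrors the slicing arithmetic of the patch, assuming `rawdata` is the full input string and `i` is the offset of the bad declaration.

```python
def recover_from_bad_declaration(rawdata, i):
    """Hypothetical helper mirroring the patch's slicing logic.

    Returns (to_handle, j): the text that would be passed to
    handle_data(), and the offset at which parsing would resume.
    """
    k = rawdata.find('>', i)        # find next ">"
    if k == -1:                     # no ">" anywhere after i
        k = len(rawdata)            # fall back to end of input
    to_handle = rawdata[i:k]        # text up to, but not including, ">"
    j = i + len(to_handle)          # same value as k
    return to_handle, j


# Sample input (made up): a bad declaration between two paragraphs.
raw = "<p>a</p><!This is an invalid comment><p>b</p>"
i = raw.find("<!")
data, j = recover_from_bad_declaration(raw, i)
print(repr(data))     # text treated as data
print(repr(raw[j:]))  # what is left for the parser
```

As far as I can tell from the slicing, `j` ends up pointing at the ">" itself rather than past it, so the ">" is left for the parser to pick up next (presumably as ordinary text). The important part is that the markup after the bad declaration is still there to be parsed instead of being swallowed whole.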