John Nagle wrote:
> Note what happens when a bad declaration is found.
> SGMLParser.parse_declaration raises SGMLParseError, and the exception
> handler just sucks up the rest of the input (note that "rawdata[i:]"),
> treats it as unparsed data, and advances the position to the end of input.
>
> That's too brutal. One bad declaration and the whole parse is messed up.
> Something needs to be done at the BeautifulSoup level
> to get the parser back on track. Maybe suck up input until the next ">",
> treat that as data, then continue parsing from that point. That will do
> the right thing most of the time, although bad declarations containing
> a ">" will still be misparsed.
>
> How about this patch?
>
> except SGMLParseError:                # bad decl, must recover
>     k = self.rawdata.find('>', i)     # find next ">"
>     if k == -1:                       # if not found
>         k = len(self.rawdata)         # use entire string
>     toHandle = self.rawdata[i:k]      # take up to ">" as data
>     self.handle_data(toHandle)        # treat as data
>     j = i + len(toHandle)             # resume parsing at the ">" (j == k)
>
I've been testing this, and it's improved parsing considerably. Now,
common lines like
<!This is an invalid comment>
don't stop parsing.
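
For anyone who wants to try the recovery rule without a parser at hand, here is a minimal standalone sketch of the same idea. The function name recover_from_bad_declaration is made up for illustration and is not part of BeautifulSoup or sgmllib; it only mirrors the index arithmetic of the patch, assuming i points at the "<" of the bad declaration.

```python
def recover_from_bad_declaration(rawdata, i):
    """Hypothetical helper mirroring the patch's index arithmetic.

    Returns (data, j): the text to hand to handle_data(), and the
    position at which parsing should resume.
    """
    k = rawdata.find('>', i)      # look for the next ">"
    if k == -1:                   # none left: fall back to the old behavior
        k = len(rawdata)          # and consume the rest of the input
    return rawdata[i:k], k        # data stops just before the ">"


# The ">" itself is not consumed; the parser sees it next as plain data.
rawdata = 'a<!This is an invalid comment><b>x</b>'
data, j = recover_from_bad_declaration(rawdata, 1)
```

Note that parsing resumes at the ">" rather than one past it, so the ">" ends up in the output as ordinary character data. When no ">" remains, the result is the same as the unpatched handler: everything to the end of input is treated as data.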
John Nagle
--
http://mail.python.org/mailman/listinfo/python-list