John Nagle wrote:

> Note what happens when a bad declaration is found.
> SGMLParser.parse_declaration raises SGMLParseError, and the exception
> handler just sucks up the rest of the input (note that "rawdata[i:]"),
> treats it as unparsed data, and advances the position to the end of input.
> 
> That's too brutal.  One bad declaration and the whole parse is messed up.
> Something needs to be done at the BeautifulSoup level
> to get the parser back on track.  Maybe suck up input until the next ">",
> treat that as data, then continue parsing from that point.  That will do
> the right thing most of the time, although bad declarations containing
> a ">" will still be misparsed.
> 
> How about this patch?
> 
>             except SGMLParseError:              # bad decl, must recover
>                 k = self.rawdata.find('>', i)   # find next ">"
>                 if k == -1:                     # no ">" found
>                     k = len(self.rawdata)       # use rest of input
>                 toHandle = self.rawdata[i:k]    # text up to, not including, ">"
>                 self.handle_data(toHandle)      # treat as data
>                 j = i + len(toHandle)           # resume parsing at ">" (j == k)
> 
    I've been testing this, and it's improved parsing considerably.  Now,
common lines like

        <!This is an invalid comment>

don't stop parsing.
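
For anyone who wants to try the recovery rule in isolation, here is a minimal
standalone sketch. It is not the BeautifulSoup or sgmllib code itself: the
function names recover_from_bad_declaration and swallow_rest are made up for
illustration, and the sketch only models the "how much input becomes data, and
where does parsing resume" decision described in the patch above.

```python
def swallow_rest(rawdata, i):
    """Model of the old behavior: everything from i onward becomes data."""
    return rawdata[i:], len(rawdata)


def recover_from_bad_declaration(rawdata, i):
    """Model of the patched behavior (hypothetical helper, not a library API).

    Returns (data, j): the text to hand to handle_data, and the index at
    which parsing should resume.
    """
    k = rawdata.find('>', i)      # find next ">"
    if k == -1:                   # no ">" at all
        k = len(rawdata)          # fall back to the rest of the input
    return rawdata[i:k], k        # data stops just before ">"


html = "<p>before</p><!This is an invalid comment><p>after</p>"
i = html.index("<!")              # where the bad declaration starts

data, j = recover_from_bad_declaration(html, i)
print(repr(data))                 # '<!This is an invalid comment'
print(repr(html[j:]))             # '><p>after</p>'  (still available to parse)

old_data, old_j = swallow_rest(html, i)
print(repr(html[old_j:]))         # ''  (nothing left to parse)

# The caveat from the quoted message: a bad declaration that itself
# contains ">" is cut at the first ">", so the remainder is misparsed.
tricky = "<!a > b>rest"
print(recover_from_bad_declaration(tricky, 0))   # ('<!a ', 4)
```

Note that parsing resumes at the ">" itself, so that character ends up being
handled as ordinary data on the next pass rather than being skipped.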

                                John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list
