Re: HTMLParser fragility

John J. Lee Mon, 10 Apr 2006 12:44:30 -0700

"Lawrence D'Oliveiro" <[EMAIL PROTECTED]> writes:

> I've been using HTMLParser to scrape Web sites. The trouble with this 
> is, there's a lot of malformed HTML out there. Real browsers have to be 
> written to cope gracefully with this, but HTMLParser does not. Not only 
> does it raise an exception, but the parser object then gets into a 
> confused state after that so you cannot continue using it.
[...]


sgmllib.SGMLParser (or htmllib.HTMLParser) is more tolerant than
HTMLParser.HTMLParser.

BeautifulSoup derives from sgmllib.SGMLParser and adds some extra
robustness of its own: it uses heuristics to cope with unclosed or
badly nested tags instead of raising an exception.
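
A minimal sketch of tolerant link scraping, for illustration only. It
assumes Python 3, where sgmllib and htmllib no longer exist and the
HTMLParser module has become html.parser; in current releases that parser
is lenient about malformed markup. LinkCollector, collect_links and
BROKEN_HTML are hypothetical names, not part of any library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Deliberately malformed: unclosed <p>, mismatched </b>, unquoted
# attribute value, stray </i>, missing </html>.
BROKEN_HTML = (
    '<html><body><p>unclosed paragraph'
    '<a href="/one">one</b><a href=/two>two</a></i></body>'
)


def collect_links(markup):
    """Feed markup to a fresh parser and return the hrefs it saw."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


print(collect_links(BROKEN_HTML))  # ['/one', '/two']
```

A fresh parser object is created per document here, so state left over
from one page cannot leak into the next.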


John

-- 
http://mail.python.org/mailman/listinfo/python-list
