BeautifulSoup vs. real-world HTML comments

John Nagle Wed, 04 Apr 2007 11:11:06 -0700

    The syntax that browsers understand as HTML comments is much less
restrictive than what BeautifulSoup understands.  I keep running into
sites with formally incorrect HTML comments which are parsed happily
by browsers.  Here's yet another example, this one from
"http://www.webdirectory.com".  The page starts like this:



        <!Hello there! Welcome to The Environment Directory!>
        <!Not too much exciting HTML code here but it does the job! >
        <!See ya, - JD >

        <HTML><HEAD>
        <TITLE>Environment Web Directory</TITLE>


Those are, of course, invalid HTML comments: a valid comment opens with
"<!--" and closes with "-->". But Firefox, IE, etc. handle them without
problems.

BeautifulSoup can't parse this page usefully at all.
It treats the entire page as a text chunk.  It's actually
HTMLParser that parses comments, so this is really an HTMLParser-level
problem.
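
One possible workaround is to rewrite the pseudo-comments into real ones before the page reaches the parser. The following is only a rough sketch using the standard "re" module. The names BOGUS_COMMENT and normalize_bogus_comments are made up for illustration and are not part of BeautifulSoup or HTMLParser, and the sketch assumes that no "<!...>" run contains a ">" or a "--" inside it.

```python
import re

# Matches "<!...>" runs that are NOT real comments ("<!--"), DOCTYPE
# declarations, or marked sections such as "<![CDATA[".
# Hypothetical helper pattern, not part of any library.
BOGUS_COMMENT = re.compile(r"<!(?!--|doctype|\[)([^>]*)>", re.IGNORECASE)


def normalize_bogus_comments(html):
    """Rewrite browser-tolerated "<!text>" pseudo-comments as "<!--text-->"."""
    # A function is used as the replacement so that backslashes in the
    # page text are never interpreted as regex escapes.
    return BOGUS_COMMENT.sub(lambda m: "<!--" + m.group(1) + "-->", html)


page = """<!Hello there! Welcome to The Environment Directory!>
<!See ya, - JD >
<HTML><HEAD>
<TITLE>Environment Web Directory</TITLE>
"""

cleaned = normalize_bogus_comments(page)
print(cleaned)
```

The cleaned string can then be handed to the parser in place of the raw page. Real comments and DOCTYPE declarations are left untouched by the negative lookahead.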


                                John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list