On 10/24/2010 11:44 PM, Stefan Behnel wrote:
josh logan, 25.10.2010 04:14:
I found the error. The HTML file I'm parsing has invalid HTML at line
193. It has something like:

<a href="mystuff "class = "stuff">

Note there is no space between the closing quote for the "href" attribute
and the class attribute. I guess I'll go through each file and correct
these issues as I parse them.
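
A rough sketch of how that per-file correction could be automated, assuming the only defect is a double-quoted attribute value running straight into the next attribute name. The helper name add_missing_attribute_spaces and the pattern are made up for illustration; a regular expression cannot tell tags from ordinary text, so this may also touch similar-looking text outside of tags.

```python
import re

# A double-quoted attribute value immediately followed by the first
# character of another attribute name, with no whitespace in between.
# (Hypothetical pattern for illustration; it does not understand HTML.)
_MISSING_SPACE = re.compile(r'(=\s*"[^"<>]*")(?=[A-Za-z_:])')


def add_missing_attribute_spaces(markup):
    """Insert a space after a quoted value that runs into the next attribute."""
    return _MISSING_SPACE.sub(r'\1 ', markup)


broken = '<a href="mystuff "class = "stuff">'
fixed = add_missing_attribute_spaces(broken)
print(fixed)  # <a href="mystuff " class = "stuff">
```

Running the cleaned text through the parser afterwards still needs checking by hand, since other kinds of invalid markup are left alone.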

HTMLParser is not made to deal with non-HTML input. You can take a look
at lxml.html or BeautifulSoup (up to 3.0), which handle these problems a
lot better.
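
For illustration, a minimal sketch of feeding the broken tag to lxml.html, assuming lxml is installed. The snippet string is a made-up stand-in for line 193 of the real file, with link text added so the element has content.

```python
import lxml.html

# Stand-in for the broken markup: no space between the closing quote
# of href and the class attribute.
snippet = '<a href="mystuff "class = "stuff">link</a>'

# document_fromstring() wraps the fragment in a full document tree,
# so the anchor is looked up rather than assumed to be the root.
root = lxml.html.document_fromstring(snippet)
anchor = root.find('.//a')

print(anchor.tag)
print(dict(anchor.attrib))
```

The parser recovers instead of raising, so both attributes should be available on the element.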

Stefan

   You might try HTML5lib:

        http://code.google.com/p/html5lib/

The HTML 5 spec formalizes the concept of "bad HTML".  Really. There's
a specified way to parse the most common HTML errors.  Most browsers
are far more tolerant of bad HTML than they should be, and not in a
consistent way.  The HTML 5 spec tries to fix that.
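
A minimal sketch of what that looks like with html5lib, assuming the package is installed. The snippet is a made-up stand-in for the broken tag discussed above. The parser recovers the way the spec prescribes and records each problem it had to work around in parser.errors.

```python
import html5lib

# Stand-in for the broken markup: no space between the closing quote
# of href and the class attribute.
snippet = '<a href="mystuff "class = "stuff">link</a>'

# namespaceHTMLElements=False keeps plain tag names such as "a",
# which makes ElementTree lookups simpler.
parser = html5lib.HTMLParser(namespaceHTMLElements=False)
document = parser.parse(snippet)

anchor = document.find('.//a')
print(dict(anchor.attrib))

# Each entry describes one recoverable parse error that was found.
for error in parser.errors:
    print(error)
```

Passing strict=True to HTMLParser makes it raise on the first such error instead of recovering, which can help when the goal is to find bad files rather than to read them.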

   I use BeautifulSoup, but it's being abandoned for the Python 3
transition.
http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

                                John Nagle
