Re: Python HTML parser chokes on UTF-8 input

John Nagle Fri, 17 Oct 2008 08:35:40 -0700

Johannes Bauer wrote:

Hello group,


I'm trying to use a class derived from htmllib.HTMLParser to parse a
website which I fetched via
httplib.HTTPConnection().request().getresponse().read(). Now the problem
is: As soon as I pass UTF-8 input to the htmllib.HTMLParser, it chokes.
The code is something like this:


   Try BeautifulSoup.  It knows how to detect the encoding of an HTML
file (there are three different ways that information can be expressed),
and will switch to the matching decoder accordingly.

   This is an ugly problem.  Sometimes, it's necessary to parse part of
the file, discover that the rest of the file has a non-ASCII encoding,
and restart the parse from the beginning.  BeautifulSoup has the
machinery for that.
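
   To make the "parse part, then restart" idea concrete, here is a
minimal sketch using only the standard library.  It is not how
BeautifulSoup does it internally; it only illustrates the two-pass
approach.  It assumes Python 3's html.parser (htmllib no longer exists
there), and the names sniff_encoding and decode_html are made up for
this example.  It checks only a UTF-8 byte-order mark and a <meta>
declaration, not the HTTP Content-Type header, which the caller would
have to consult separately.

```python
import codecs
from html.parser import HTMLParser


class _CharsetFound(Exception):
    """Raised to abandon the first pass once a charset declaration is seen."""


class _MetaCharsetSniffer(HTMLParser):
    """First-pass parser: looks only for a <meta> charset declaration."""

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Form 1: <meta charset="...">
        charset = attrs.get("charset")
        # Form 2: <meta http-equiv="Content-Type" content="...; charset=...">
        if charset is None and (attrs.get("http-equiv") or "").lower() == "content-type":
            content = attrs.get("content") or ""
            _, sep, tail = content.lower().partition("charset=")
            if sep:
                charset = tail.split(";")[0].strip(" \"'")
        if charset:
            raise _CharsetFound(charset)


def sniff_encoding(raw, default="utf-8"):
    """Guess the encoding of raw HTML bytes (hypothetical helper)."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    sniffer = _MetaCharsetSniffer()
    try:
        # latin-1 maps every byte to a code point, so this decode never
        # fails; it is good enough to read ASCII tag and attribute names.
        sniffer.feed(raw.decode("latin-1"))
        sniffer.close()
    except _CharsetFound as found:
        name = found.args[0]
        try:
            codecs.lookup(name)  # reject names Python has no codec for
            return name
        except LookupError:
            pass
    return default


def decode_html(raw):
    """Second pass: restart from the beginning with the detected encoding."""
    return raw.decode(sniff_encoding(raw), errors="replace")
```

   The text returned by decode_html(raw) can then be fed to whatever
parser does the real work, so that parser never sees undecoded bytes.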

                                John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
