Python HTML parser chokes on UTF-8 input

Johannes Bauer Thu, 09 Oct 2008 14:00:51 -0700

Hello group,

I'm trying to use a class derived from htmllib.HTMLParser to parse a website
which I fetched via httplib.HTTPConnection (request(), followed by
getresponse().read()). Now the problem is: as soon as I pass the
htmllib.HTMLParser UTF-8 input, it chokes. The code is something like this:


prs = self.parserclass(formatter.NullFormatter())
prs.init()
prs.feed(website)
self.__result = prs.get()
prs.close()

Now when I pass "website" to the parser exactly as it was fetched,
everything is fine. However, I want to do some modifications before I parse
it, namely Unicode replacements in the style:

website = website.replace(u"föö", u"bär")

Therefore, after fetching the website content, I have to decode it to
Unicode first, modify it, and encode it back:

website = website.decode("latin1")
website = website.replace(u"föö", u"bär")
website = website.encode("latin1")
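
For illustration, the same decode/replace/encode round trip can be wrapped
in a small helper. This is only a sketch: the name replace_in_encoded and
the default encoding are assumptions, not part of the original code, and it
is written so that it also runs on Python 3, where the fetched data is a
bytes object.

```python
# -*- coding: utf-8 -*-

def replace_in_encoded(data, old, new, encoding="latin1"):
    """Decode a byte string, replace text at the Unicode level, re-encode.

    'replace_in_encoded' is a hypothetical helper name; the default
    encoding is an assumption taken from the three lines above.
    """
    text = data.decode(encoding)      # bytes -> unicode text
    text = text.replace(old, new)     # replacement happens on characters
    return text.encode(encoding)      # unicode text -> bytes for the parser


# Stand-in for the fetched page content (latin1-encoded bytes).
page = u"<p>föö</p>".encode("latin1")
result = replace_in_encoded(page, u"föö", u"bär")
```

The parser still receives a byte string in the original encoding, so the
helper hides the round trip but does not avoid it.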

This round trip is incredibly ugly IMHO, as I would really like the parser
to just accept UTF-8 input. However, when I omit the re-encoding to latin1,
I get:

  File "CachedWebParser.py", line 13, in __init__
    self.__process(website)
  File "CachedWebParser.py", line 55, in __process
    prs.feed(website)
  File "/usr/lib64/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib64/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib64/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0:
ordinal not in range(128)
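
For reference, the byte 0xfc named in the error is "ü" in latin1, and the
ascii codec rejects every byte at or above 0x80. In Python 2 such an ASCII
decode is attempted implicitly when a byte string is combined with a
unicode object, which is presumably what happens inside sgmllib here. The
sketch below triggers the same error with an explicit decode; the variable
names are illustrative only.

```python
# -*- coding: utf-8 -*-

raw = b"\xfc"                       # the byte from the error message

# Interpreted as latin1, the byte is the single character u-umlaut.
as_latin1 = raw.decode("latin1")

# Interpreted as ASCII, the byte is out of range and decoding fails.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
```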

It is annoying, IMHO, that the built-in HTML parser cannot cope with UTF-8
input, which should (again, IMHO) be the absolute standard for such a
new language.

Can I do something about it?

Regards,
Johannes

-- 
"My countersuit against you will then be for deliberate mendacity,
slander of God, the Bible and me, and deliberate blasphemy."
         -- Prophet and visionary Hans Joss aka HJP in de.sci.physik
                         <[EMAIL PROTECTED]>
--
http://mail.python.org/mailman/listinfo/python-list
