Hello group,

I'm trying to use a class derived from htmllib.HTMLParser to parse a web page that I fetched with httplib (HTTPConnection.request(), then getresponse().read()). The problem: as soon as I feed the parser Unicode text, it chokes. The code is something like this:

    prs = self.parserclass(formatter.NullFormatter())
    prs.init()
    prs.feed(website)
    self.__result = prs.get()
    prs.close()

When I use "website" exactly as fetched, everything is fine. However, I want to make some modifications before I parse it, namely replacements involving non-ASCII characters, in the style:

    website = website.replace(u"föö", u"bär")

Therefore, after fetching the page content, I have to decode it to Unicode first, modify it, and encode it back:

    website = website.decode("latin1")
    website = website.replace(u"föö", u"bär")
    website = website.encode("latin1")

This is incredibly ugly IMHO, as I would really like the parser to just accept Unicode input. However, when I omit the re-encoding to latin1, I get:

    File "CachedWebParser.py", line 13, in __init__
      self.__process(website)
    File "CachedWebParser.py", line 55, in __process
      prs.feed(website)
    File "/usr/lib64/python2.5/sgmllib.py", line 99, in feed
      self.goahead(0)
    File "/usr/lib64/python2.5/sgmllib.py", line 133, in goahead
      k = self.parse_starttag(i)
    File "/usr/lib64/python2.5/sgmllib.py", line 285, in parse_starttag
      self._convert_ref, attrvalue)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)

Annoying, IMHO, that the built-in HTML parser cannot cope with Unicode input, which should (again, IMHO) be the absolute standard for such a new language. Can I do something about it?

Regards,
Johannes
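
For reference, the decode / replace / re-encode round trip described above can be sketched as a small helper. This is only a minimal sketch under assumptions: replace_in_page is a made-up name, and latin1 is assumed to be the encoding the server actually sent. The non-ASCII literals are written as escapes so the snippet does not depend on the encoding of the source file.

```python
# Sketch of the decode / modify / re-encode round trip.
# replace_in_page is a hypothetical helper name, not part of htmllib or httplib.

def replace_in_page(raw, old, new, encoding="latin1"):
    """Decode a fetched byte string, replace text, and encode it back."""
    text = raw.decode(encoding)    # byte string -> unicode
    text = text.replace(old, new)  # do the replacement on unicode only
    return text.encode(encoding)   # unicode -> byte string for the parser


# Stand-in for the fetched page: "föö" encoded as latin1 bytes.
page = u"<p>f\xf6\xf6</p>".encode("latin1")

# Replace "föö" with "bär"; the result is a byte string again.
fixed = replace_in_page(page, u"f\xf6\xf6", u"b\xe4r")
```

The result is a byte string in the original encoding, so it can be handed to prs.feed() in the same way as the unmodified page.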