Robin Haswell wrote:

> Could someone explain to me what I'm doing wrong here, so I can hope to
> throw light on the myriad of similar problems I'm having? Thanks :-)
>
> Python 2.4.1 (#2, May 6 2005, 11:22:24)
> [GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import sys
> >>> sys.getdefaultencoding()
> 'utf-8'

that's bad.  do not hack the default encoding.  it'll only make you sorry
when you try to port your code to some other python installation, or use a
library that relies on the factory settings being what they're supposed to
be.  do not hack the default encoding.

back to your code:

> >>> import htmlentitydefs
> >>> char = htmlentitydefs.entitydefs["copy"] # this is an HTML &copy; - a copyright symbol
> >>> print char
> ©

that's a standard (8-bit) string:

    >>> type(char)
    <type 'str'>
    >>> ord(char)
    169
    >>> len(char)
    1

one byte that contains the value 169.  looks like ISO-8859-1 (Latin-1) to
me.  let's see what the documentation says:

    entitydefs

        A dictionary mapping XHTML 1.0 entity definitions to their
        replacement text in ISO Latin-1.

alright, so it's an ISO Latin-1 string.

> >>> str = u"Apple"
> >>> print str
> Apple

    >>> type(str)
    <type 'unicode'>
    >>> len(str)
    5

that's a 5-character unicode string.

> >>> str + char
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0:
> unexpected code byte

you're trying to combine an 8-bit string with a Unicode string, and you've
told Python (by hacking the site module) to treat all 8-bit strings as if
they contain UTF-8.  UTF-8 != ISO-Latin-1.

so, you can of course convert the string you got from the entitydefs dict
to a unicode string before you combine the two strings

    >>> unicode(char, "iso-8859-1") + str
    u'\xa9Apple'

but the htmlentitydefs module offers a better alternative:

    name2codepoint

        A dictionary that maps HTML entity names to the Unicode
        codepoints.  New in version 2.3.

which allows you to do

    >>> char = unichr(htmlentitydefs.name2codepoint["copy"])
    >>> char
    u'\xa9'
    >>> char + str
    u'\xa9Apple'

without having to deal with things like

    >>> len(htmlentitydefs.entitydefs["copy"])
    1
    >>> len(htmlentitydefs.entitydefs["rarr"])
    7

> Basically my app is a search engine - I'm grabbing content from pages
> using HTMLParser and storing it in a database but I'm running in to these
> problems all over the shop (from decoding the entities to calling
> str.lower()) - I don't know what encoding my pages are coming in as, I'm
> just happy enough to accept that they're either UTF-8 or latin-1 with
> entities.

UTF-8 and Latin-1 are two different things, so your (international) users
will hate you if you don't do this right.

> It's even worse that I've written the same app in PHP before with none of
> these problems - and PHP4 doesn't even support Unicode.

a PHP4 application without I18N problems?  I'm not sure I believe you... ;-)

</F>
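
for anyone reading this on Python 3, where htmlentitydefs has been renamed html.entities and unichr is spelled chr, the name2codepoint approach above can be sketched roughly as follows.  the helper names entity_to_text and decode_page are made up for this illustration, and the "try UTF-8 first, then fall back to Latin-1" rule is only one possible reading of "either UTF-8 or latin-1", not something the original code is known to do.

```python
from html.entities import name2codepoint


def entity_to_text(name):
    """Return the one-character string for a named HTML entity.

    Hypothetical helper: looks the name up in name2codepoint and
    converts the code point to text, so "copy" and "rarr" both
    come back as a single character.
    """
    return chr(name2codepoint[name])


def decode_page(raw_bytes):
    """Decode page bytes to text (hypothetical helper).

    Assumption: the input is either UTF-8 or Latin-1.  Strict UTF-8
    is tried first; Latin-1 accepts any byte sequence, so it is only
    used as the fallback.
    """
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode("iso-8859-1")


# combine text with text, never bytes with text
label = entity_to_text("copy") + "Apple"
print(repr(label))

# the same visible text arriving in two different encodings
print(repr(decode_page(b"\xa9Apple")))                 # Latin-1 bytes
print(repr(decode_page("\xa9Apple".encode("utf-8"))))  # UTF-8 bytes
```

the point of the sketch is the same as in the reply: decode bytes to text once, at the edge of the program, and do all entity handling and string operations on text after that.  note that a Latin-1 page whose bytes happen to form valid UTF-8 will be misread by the fallback rule, so an encoding declared by the page or the HTTP headers should win when one is available.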