On Wed, 09 Jul 2008 16:39:24 -0700, bsagert wrote:

> Some web feeds use decimal character entities that seem to confuse
> Python (or me).

I guess they confuse you.  Python is fine.

> For example, the string "doesn't" may be coded as "doesn&#8217;t" which
> should produce a right leaning apostrophe. Python hates decimal entities
> beyond 128 so it chokes unless you do something like
> string.encode('utf-8').

Python neither hates nor chokes on these entities.  It just refuses to
guess which encoding you want if you try to write *unicode* objects into
a file.  Files contain byte values, not characters.

> Even then, what should have been a right-leaning apostrophe ends up as
> "â€™". The following script does just that. Look for the string "The
> Canuck iPhone: Apple doesnâ€™t care" after running it.

Then you didn't tell the application you used to look at the result that
the text is UTF-8 encoded.  I guess you are using Windows and the
application expects cp1252 encoded text, because a UTF-8 encoded
apostrophe looks like 'â€™' in cp1252.

Choose the encoding you want the result to have and everything is fine.
Unless you stumble over a feed using characters which can't be encoded in
the encoding of your choice.  That's why UTF-8 might have been a good
idea.

Ciao,
	Marc 'BlackJack' Rintsch
-- 
http://mail.python.org/mailman/listinfo/python-list
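
The cp1252 effect described above can be reproduced in a few lines.  This
is only a minimal sketch, not the script from the original question: it
assumes Python 3 (where `str` is the unicode type, unlike the Python 2 of
the thread), uses the standard library's `html.unescape` to resolve the
decimal entity, and all variable names are made up for illustration.

```python
import html

# A decimal character reference as it might appear in a feed.
raw = "doesn&#8217;t"

# Resolve the reference into the real character U+2019
# (RIGHT SINGLE QUOTATION MARK).
text = html.unescape(raw)

# Files hold bytes, so an encoding has to be chosen explicitly.
data = text.encode("utf-8")

# A viewer that wrongly assumes cp1252 turns the three UTF-8 bytes
# of the apostrophe into three separate characters.
misread = data.decode("cp1252")

# Decoding with the encoding that was actually used round-trips.
reread = data.decode("utf-8")

# An encoding that lacks the character cannot represent it at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(repr(text))
print(data)
print(repr(misread))
print(ascii_ok)
```

The bytes written are the same in both cases; only the encoding assumed
when reading them back decides whether one apostrophe or three stray
characters show up.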