Re: Python and decimal character entities over 128.

Marc 'BlackJack' Rintsch Wed, 09 Jul 2008 22:11:30 -0700

On Wed, 09 Jul 2008 16:39:24 -0700, bsagert wrote:

> Some web feeds use decimal character entities that seem to confuse
> Python (or me).


I guess they confuse you.  Python is fine.

> For example, the string "doesn't" may be coded as "doesn&#8217;t" which
> should produce a right leaning apostrophe. Python hates decimal entities
> beyond 128 so it chokes unless you do something like
> string.encode('utf-8').

Python neither hates nor chokes on these entities.  It just refuses to
guess which encoding you want when you try to write *unicode* objects to
a file.  Files contain byte values, not characters.
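A minimal sketch of that point, assuming a current Python 3 where `str`
plays the role of the *unicode* type discussed here. The file name
`feed.txt` and the temporary directory are made up for illustration. The
entity is decoded to a character first, and the encoding is named
explicitly when the file is opened:

```python
import html
import os
import tempfile

# Decimal entity 8217 is U+2019 (RIGHT SINGLE QUOTATION MARK).
text = html.unescape("doesn&#8217;t")

# Hypothetical output location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "feed.txt")

# Name the encoding instead of leaving Python to guess one.
with open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Reading the file back in binary mode shows the bytes that were stored.
with open(path, "rb") as handle:
    raw = handle.read()

print(repr(text))
print(repr(raw))
```

The single character U+2019 ends up as three bytes in the file, which is
why the encoding has to be chosen somewhere.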

> Even then, what should have been a right-leaning apostrophe ends up as
> "â€™". The following script does just that. Look for the string "The
> Canuck iPhone: Apple doesnâ€™t care" after running it.

Then you didn't tell the application you used to look at the result that
the text is UTF-8 encoded. I guess you are using Windows and
the application expects cp1252 encoded text, because a UTF-8 encoded
apostrophe looks like 'â€™' in cp1252.
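The effect can be reproduced directly. This is only a sketch in Python 3
syntax; the variable names are mine:

```python
apostrophe = "\u2019"  # the right single quotation mark from the feed

# The three bytes that UTF-8 uses for this one character.
utf8_bytes = apostrophe.encode("utf-8")

# What a viewer shows when it wrongly assumes the bytes are cp1252.
misread = utf8_bytes.decode("cp1252")

# What a viewer shows when it is told the bytes are UTF-8.
correct = utf8_bytes.decode("utf-8")

print(repr(utf8_bytes), misread, correct)
```

The bytes in the file are the same in both cases; only the interpretation
by the viewing application differs.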

Choose the encoding you want the result to have and everything is fine,
unless you stumble over a feed using characters which can't be encoded
in the encoding of your choice.  That's why UTF-8, which can encode every
Unicode character, might be a good idea.
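To illustrate the failure case, here is a small sketch (Python 3 again,
with Latin-1 picked by me as an example of a too-narrow encoding). It also
shows the standard `xmlcharrefreplace` error handler, which falls back to
decimal entities for anything the target encoding can't represent:

```python
text = "doesn\u2019t"

# Latin-1 has no code for U+2019, so a strict encode fails.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can represent every Unicode character, so this always works.
utf8_bytes = text.encode("utf-8")

# Fallback: keep ASCII and write unencodable characters as entities.
as_entities = text.encode("ascii", errors="xmlcharrefreplace")

print(latin1_ok, repr(utf8_bytes), repr(as_entities))
```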

Ciao,
        Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list
