Re: inserting Unicode character in dictionary - Python

Joe Strout Sun, 19 Oct 2008 05:58:43 -0700

On Oct 18, 2008, at 1:20 AM, Martin v. Löwis wrote:

Do you then have a proper UTF-8 string,
but the problem is that none of the standard Python library methods know
how to properly interpret UTF-8?


There is (probably) no such thing as a "proper UTF-8 string" (in the
sense in which you probably mean it).

To be clear, I mean a string that is valid UTF-8 (not all strings of bytes are, of course).

Python doesn't have a data type
for "UTF-8 string". It only has a data type "byte string". It's up to
the application whether it gets interpreted in a consistent manner.
Libraries are (typically) encoding-agnostic, i.e. they work for UTF-8
encoded strings the same way as for, say, Big-5 encoded strings.

Oi -- so if I ask for length, I get the number of bytes, not the number of characters. If I slice and dice, I could end up splitting characters in half. It is, as you say, just a string of bytes, not a string of characters.
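
A minimal sketch of that difference, written for Python 3 (where bytes and str are separate types) so that it runs on a current interpreter; the variable names are made up for illustration, and the same effect applies to a Python 2 byte string holding UTF-8 data:

```python
# One accented character occupies two bytes in UTF-8.
text = "caf\u00e9"             # the word "café" as 4 characters
data = text.encode("utf-8")    # the same word as a UTF-8 byte string

char_count = len(text)         # counts characters
byte_count = len(data)         # counts bytes

# Slicing the byte string at a byte offset can cut a character in half.
broken = data[:4]              # keeps only the first byte of the two-byte "é"
try:
    broken.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False           # the truncated sequence is not valid UTF-8

print(char_count, byte_count, split_ok)
```

Decoding to a character string first, and slicing that, avoids the problem.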

4. In Python 3.0, this silliness goes away, because all strings are
Unicode by default.
You still need to make sure that the editor's encoding and the declared
encoding match.

Well, if no encoding is declared, it (quite sensibly) assumes UTF-8, so for my purposes this boils down to using a UTF-8 editor -- which I always do anyway. But do I still have to put a "u" before my string literals in order to have them treated as characters rather than bytes?

I'm hoping that the answer is "no" -- most string literals in a source file are text (which should be Unicode text, these days); a raw byte string would be the exceptional case, and I'd be happy to use the "r" prefix for those.
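
For reference, a short sketch of how literal prefixes behave in Python 3 as released; the variable names are only illustrative. An unprefixed literal is already Unicode text, byte strings take a "b" prefix, and "r" only switches off backslash escapes without changing the type:

```python
plain = "na\u00efve"   # no prefix: a str of Unicode characters
raw = r"a\nb"          # "r" prefix: still a str, but \n stays as two characters
data = b"abc"          # "b" prefix: a bytes object, not text

print(type(plain).__name__, len(plain))   # character count
print(type(raw).__name__, len(raw))       # backslash and "n" counted separately
print(type(data).__name__, len(data))     # byte count
```

So the exceptional case (bytes) is the one that needs marking, though with "b" rather than "r".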


Best,
- Joe

--
http://mail.python.org/mailman/listinfo/python-list
