On Oct 18, 2008, at 1:20 AM, Martin v. Löwis wrote:
>> Do you then have a proper UTF-8 string, but the problem is that none
>> of the standard Python library methods know how to properly interpret
>> UTF-8?
>
> There is (probably) no such thing as a "proper UTF-8 string" (in the
> sense in which you probably mean it).

To be clear, I mean a string that is valid UTF-8 (not all strings of
bytes are, of course).
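
As a rough illustration of what "valid UTF-8" means here, one way to test a byte string is simply to try decoding it. This is a minimal sketch in Python 3 syntax; the helper name is_valid_utf8 is made up for this example and is not a standard library function.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


good = "Löwis".encode("utf-8")  # the two bytes C3 B6 encode "ö"
bad = b"L\xf6wis"               # Latin-1 encoding of the same name; 0xF6 never occurs in UTF-8

print(is_valid_utf8(good))
print(is_valid_utf8(bad))
```
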
> Python doesn't have a data type for "UTF-8 string". It only has a data
> type "byte string". It's up to the application whether it gets
> interpreted in a consistent manner. Libraries are (typically)
> encoding-agnostic, i.e. they work for UTF-8 encoded strings the same
> way as for, say, Big-5 encoded strings.

Oi -- so if I ask for length, I get the number of bytes, not the
number of characters. If I slice and dice, I could end up splitting
characters in half. It is, as you say, just a string of bytes, not a
string of characters.
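
To make the length and slicing point concrete, here is a small sketch. It is written in Python 3 syntax, where the bytes type plays the role of the 2.x byte string; the variable names are only for illustration.

```python
text = "Löwis"              # 5 characters
raw = text.encode("utf-8")  # the same name as a UTF-8 byte string

print(len(text))  # counts characters
print(len(raw))   # counts bytes; "ö" takes two bytes in UTF-8

# Slicing the byte string after two bytes cuts "ö" in half,
# so the result is no longer valid UTF-8.
head = raw[:2]
try:
    head.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(split_ok)
```
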
>> 4. In Python 3.0, this silliness goes away, because all strings are
>> Unicode by default.
>
> You still need to make sure that the editor's encoding and the declared
> encoding match.

Well, if no encoding is declared, it (quite sensibly) assumes
UTF-8, so for my purposes this boils down to using a UTF-8 editor --
which I always do anyway. But do I still have to put a "u" before my
string literals in order to have them treated as characters rather than
bytes?
I'm hoping that the answer is "no" -- most string literals in a source
file are text (which should be Unicode text, these days); a raw byte
string would be the exceptional case, and I'd be happy to use the "r"
prefix for those.
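
For reference, a short sketch of how the literal prefixes behave in Python 3, as far as I understand it: a plain literal is already Unicode text, byte strings are marked with "b", and "r" only switches off backslash escapes (the result is still text, not bytes). The variable names are just for illustration.

```python
text = "Löwis"          # str: Unicode text, no "u" prefix needed
data = b"L\xc3\xb6wis"  # bytes: the "b" prefix marks a byte string
raw = r"C:\new"         # str: "r" leaves the backslash alone, so no newline here

print(type(text).__name__)
print(type(data).__name__)
print(type(raw).__name__)
print(data.decode("utf-8") == text)
```
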
Best,
- Joe
--
http://mail.python.org/mailman/listinfo/python-list