Martin Michlmayr <[EMAIL PROTECTED]> wrote:

> (If anyone knows how to convert a string to UTF-8 in Python regardless
> of whether it's UTF-8 or Latin or ASCII, and to convert a string to
> ASCII/Latin regardless to whether it's UTF-8 or Latin, speak up now...)

I would say it is impossible, be it in Python or anything else: there are many byte sequences that are at the same time a valid Latin-1-encoded string and a valid UTF-8-encoded string (the two strings are different, but encoding one in Latin-1 and the other in UTF-8 happens to produce the same byte sequence).

Example:

  /tmp % od -tx1 test-file
  0000000 c3 a9 0a
  0000003

If you read the c3 a9 string as Latin-1, you get (from iso-8859-1(7)):

  LATIN CAPITAL LETTER A WITH TILDE
  COPYRIGHT SIGN

But if you consider this very same byte sequence as a UTF-8-encoded string, you read it as the single character:

  LATIN SMALL LETTER E WITH ACUTE (U+00E9)

As a general rule, if you want to convert reliably between two charsets/encodings, you'd better know precisely how the input is encoded.

-- 
Florent
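
For anyone who wants to check the c3 a9 example themselves, here is a minimal sketch. It assumes Python 3, where byte strings and text strings are separate types; the variable names are only illustrative. It decodes the same two bytes both ways and then shows that conversion is easy once you decide which encoding the input really uses.

```python
import unicodedata

# The two payload bytes from the od dump (the trailing newline 0a is left out).
raw = bytes([0xC3, 0xA9])

# Both decodings succeed, so the bytes alone cannot say which one was meant.
as_latin1 = raw.decode("latin-1")  # one character per byte
as_utf8 = raw.decode("utf-8")      # the two bytes form a single character

latin1_names = [unicodedata.name(ch) for ch in as_latin1]
utf8_names = [unicodedata.name(ch) for ch in as_utf8]
print(latin1_names)
print(utf8_names)

# Once the input encoding is known, converting is a decode followed by an encode.
utf8_from_latin1 = as_latin1.encode("utf-8")   # input was really Latin-1
latin1_from_utf8 = as_utf8.encode("latin-1")   # input was really UTF-8
print(utf8_from_latin1.hex(" "))
print(latin1_from_utf8.hex(" "))
```

The two conversions give different results for the same input bytes, which is the whole problem: picking the wrong source encoding silently produces different text rather than an error.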

