> What I'd like to understand better is the "compatibility hierarchy" of
> known encodings, in the positive sense that if a string decodes
> successfully with encoding A, then it is also possible that it will
> encode with encodings B, C; and in the negative sense that if a
> string fails to decode with encoding A, then for sure it will also
> fail to decode with encodings B, C. Any ideas if such an analysis of
> the relationships between encodings exists?

Most certainly. You'll have to learn a lot about many encodings, though, to really understand the relationships.

Many encodings X are "ASCII supersets" in the sense that if a string contains only ASCII characters, its encoding in ASCII is the same as its encoding in X. ISO-8859-X, ISO-2022-X, koi8-x, and UTF-8 fall in this category.

Other encodings are "ASCII supersets" only in the sense that they include all characters of ASCII, but encode them differently. EBCDIC, UCS-2/4, and UTF-16/32 fall in that category.

Some encodings are 7-bit, so their encoded output also decodes as ASCII (producing mojibake if the input wasn't ASCII). ISO-2022-X is an example.

Some encodings are 8-bit, so they can decode (nearly) arbitrary bytes (again producing mojibake if the input wasn't in that encoding). ISO-8859-X are examples, as are some of the EBCDIC encodings, and koi8-x. Strictly speaking, a few ISO-8859 parts leave some byte values unassigned, so a strict decoder can still reject those bytes.

Also, things will usually decode successfully (but meaninglessly) as UTF-16 if the number of bytes in the input is even; an unpaired surrogate is the main thing that makes it fail. UTF-32 is stricter: the length must be a multiple of four, and every four-byte unit must be a valid code point, so arbitrary bytes usually fail.

HTH,
Martin
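
For anyone who wants to check these categories directly, here is a minimal sketch using only the codecs that ship with Python. The helper names (`is_ascii_transparent`, `decodes`) and the sample strings are made up for illustration, and the codec names are the Python aliases I am assuming correspond to the families mentioned above (`cp500` standing in for EBCDIC, `iso2022_jp` for ISO-2022-X). It is a probe, not a full compatibility analysis.

```python
ASCII_TEXT = "Hello, world"  # hypothetical sample containing only ASCII characters


def is_ascii_transparent(codec: str, text: str = ASCII_TEXT) -> bool:
    """True if ASCII-only text encodes to the same bytes as it does in ASCII."""
    return text.encode(codec) == text.encode("ascii")


def decodes(codec: str, data: bytes) -> bool:
    """True if `data` decodes under `codec` without raising an error."""
    try:
        data.decode(codec)
    except UnicodeDecodeError:
        return False
    return True


# 1. ASCII supersets that keep the ASCII bytes vs. ones that encode them differently.
for codec in ("iso-8859-1", "koi8-r", "utf-8", "iso2022_jp", "cp500", "utf-16-le"):
    print(f"{codec:12} ASCII-transparent: {is_ascii_transparent(codec)}")

# 2. A 7-bit encoding: its output contains no byte >= 0x80, so it "decodes" as ASCII.
jp_bytes = "日本語".encode("iso2022_jp")
print("iso2022_jp output is 7-bit:", all(b < 0x80 for b in jp_bytes))
print("...and decodes as ASCII:", decodes("ascii", jp_bytes))

# 3. 8-bit encodings: try every possible byte value.
all_bytes = bytes(range(256))
for codec in ("ascii", "utf-8", "iso-8859-1", "iso-8859-6", "koi8-r"):
    print(f"{codec:12} decodes all 256 byte values: {decodes(codec, all_bytes)}")

# 4. UTF-16 and UTF-32: length and code-point validity matter.
print("utf-16, even length:       ", decodes("utf-16-le", b"\x01\x02\x03\x04"))
print("utf-16, odd length:        ", decodes("utf-16-le", b"\x01\x02\x03"))
print("utf-16, unpaired surrogate:", decodes("utf-16-le", b"\x00\xd8\x41\x00"))
print("utf-32, out-of-range value:", decodes("utf-32-le", b"\xff\xff\xff\xff"))
```

Running the same two helpers over a larger list of codec names and a set of sample byte strings gives a rough empirical picture of which encodings accept which inputs, though a successful decode says nothing about whether the result is meaningful.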