Glenn Linderman writes:

 > Some bytes may decode into characters without needing to be
 > smuggled... maybe not in text-protocols like email, but in the
 > general case. So then some of the bytes that should be interpreted
 > as binary data are not in a disjoint set from characters.
True, but irrelevant. The point is that whoever chose the codec is responsible for getting it right: not only for choosing the right encoding, but also for the assumption that the input data was pure encoded text. The rest of the program can now assume that choice was made correctly, and process text as text. The program cannot be blamed for assuming that the person who chose the codec knew what they were about, and so characters can be *assumed* to have been decoded from bytes representing characters.

This was not true in Python 2, where it was common practice to represent text internally by its encoded bytes, implicitly assuming that only one encoding would be encountered in each invocation of the program. That assumption was never true, and with the spread of the Internet and then the WWW it became a major issue. And that's why we invented Python 3: to let text be text, without the encumbrance of always being aware of encodings, converting when different encodings collide, and so on.
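A minimal sketch of the distinction, assuming the "smuggling" Glenn refers to is the surrogateescape error handler: bytes that happen to decode cleanly come out as ordinary characters, while undecodable bytes are carried through as lone surrogates, so from the codec's point of view the two sets are not disjoint.

    # Mixed input: valid UTF-8 for "café", one stray 0xFF byte, plain ASCII.
    data = b"caf\xc3\xa9 \xff plain ASCII"

    # surrogateescape smuggles the undecodable byte as the lone surrogate
    # U+DCFF, while the decodable bytes become ordinary characters.
    text = data.decode("utf-8", errors="surrogateescape")
    print(ascii(text))   # 'caf\xe9 \udcff plain ASCII'

    # Encoding with the same error handler round-trips the original bytes.
    assert text.encode("utf-8", errors="surrogateescape") == data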