On 6/8/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> AFAIK, the only strings the Unicode standard absolutely prohibits
> emitting are those containing code points guaranteed not to be
> characters by the standard.
The ones it absolutely prohibits in interchange are surrogates; they
are also ill-formed in both UTF-16 and UTF-8. The pragmatic reason is
that if you encode them anyway, as Python's codecs do, strings won't
always survive a round-trip through such pseudo-UTF-16, because
multiple code point sequences map to the same byte sequence. For some
reason Python's UTF-8 encoder introduces the same ambiguity, even
though there's no need to with pseudo-UTF-8. (The first transcript at
the end of this message sketches the problem.)

On UCS-2 (narrow) builds, even core string processing handles
surrogates inconsistently. Sometimes pseudo-UCS-2 is assumed,
sometimes pseudo-UTF-16, and the two are incompatible: pseudo-UTF-16
can't always represent surrogates, but pseudo-UCS-2 can; conversely,
pseudo-UCS-2 can't represent code points outside the BMP, but
pseudo-UTF-16 can. There's no way to always do the right thing as long
as the two are mixed, but somebody somewhere probably depends on the
current behavior. (See the second transcript below.)

Other than surrogates, there are two classes of characters with
"restricted interchange". One is reserved characters, which need to be
preserved if found in text, for compatibility with future versions of
the standard. The other is noncharacters, which are "reserved for
internal use, such as for sentinel values". These should obviously be
allowed, since a user may want to use them internally in a Python
program. (The third sketch below gives an example.)

> So there's nothing "wrong by definition" about defining strings as
> sequences of code points, and string operations in code-point-wise
> fashion. It just makes that library for Unicode more expensive to
> design and operate, and will require auditing and reimplementation of
> common libraries (including the standard library) by every program
> that requires strict Unicode conformance.

It's not perfect, but that's the state of the art. AFAIK this (or
worse) is what the other implementations do. Even the Unicode standard
(section 2.7, "Unicode Strings") explains that strings generally work
that way:

    A Unicode string datatype is simply an ordered sequence of code
    units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit
    code units, a Unicode 16-bit string is an ordered sequence of
    16-bit code units, and a Unicode 32-bit string is an ordered
    sequence of 32-bit code units. Depending on the programming
    environment, a Unicode string may or may not also be required to
    be in the corresponding Unicode encoding form. For example,
    strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but
    are not necessarily well-formed UTF-16 sequences.
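To make the round-trip ambiguity concrete, here is roughly what a
session on a UCS-4 (wide) build of Python 2.5 looks like. Exact output
may vary by version and build, so treat this as a sketch rather than a
verbatim transcript:

    >>> s = u'\ud800\udc00'   # two lone surrogate code points
    >>> t = u'\U00010000'     # one code point outside the BMP
    >>> len(s), len(t)
    (2, 1)
    >>> s.encode('utf-16-le') == t.encode('utf-16-le')
    True
    >>> s.encode('utf-8') == t.encode('utf-8')
    True
    >>> s.encode('utf-16-le').decode('utf-16-le') == s
    False

Two distinct code point sequences, one byte sequence: whichever string
the decoder picks, the other one doesn't survive the round-trip.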
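On a UCS-2 (narrow) build the mix shows up inside the core itself:
len() and indexing count 16-bit code units (pseudo-UCS-2), while the
UTF-8 codec pairs surrogates up (pseudo-UTF-16) and unichr() refuses
astral code points entirely. Again a sketch, assuming Python 2.5:

    >>> u = u'\U00010000'
    >>> len(u)                # pseudo-UCS-2: counts code units
    2
    >>> u[0]                  # indexing can split a surrogate pair
    u'\ud800'
    >>> u.encode('utf-8')     # the codec pairs them up again
    '\xf0\x90\x80\x80'
    >>> unichr(0x10000)
    Traceback (most recent call last):
      ...
    ValueError: unichr() arg not in range(0x10000) (narrow Python build)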
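As for noncharacters, the "internal use" the standard has in mind is
exactly the kind of thing a Python program might do with them today,
e.g. an in-band sentinel that can never collide with well-formed
interchanged text. A trivial, hypothetical sketch (the pack/unpack
names are made up for illustration):

    SENTINEL = u'\uffff'   # noncharacter: reserved for internal use

    def pack(fields):
        # Join on a code point guaranteed never to appear in
        # well-formed interchanged text.
        return SENTINEL.join(fields)

    def unpack(record):
        return record.split(SENTINEL)

    # unpack(pack([u'alpha', u'beta'])) == [u'alpha', u'beta']

If Python rejected noncharacters in unicode strings, this legitimate,
standard-sanctioned use would break.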