Rauli Ruohonen writes: > The ones it absolutely prohibits in interchange are surrogates.
Excuse me? Surrogates are code points with a specific interpretation if it is "purported that the stream is in UTF-16". Otherwise, Unicode 4.0 explicitly says that there is nothing illegal about an isolated surrogate (p.75, where an example is given of how such a surrogate might occur). That surrogate may not be interpreted as an abstract character (C4, p.58), but it is not a non-character (Table 2-2, p.25). I agree that it's unfortunate that some parts of Python treat Unicode strings objects purely as sequences of Unicode code points, and others purport (apparently without checking) that such strings are in UTF-16. Unicode conformance is not part of the Python language. That's life. But let's try to avoid creating difficulties that don't exist in the standard. > > So there's nothing "wrong by definition" about defining strings as > > sequences of code points, and string operations in code-point-wise > > fashion. > It's not perfect, but that's the state of the art. AFAIK this (or worse) > is what the other implementations do. My point was precisely that I don't object to this implementation. I want Unicode-ly-correct behavior to be a goal of the language, the community disagrees, and Guido disagrees. That's that. Thanks you for starting work on implementation; let's concentrate on that. _______________________________________________ Python-3000 mailing list Python-3000@python.org http://mail.python.org/mailman/listinfo/python-3000 Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com