On 8/31/2011 1:10 PM, Guido van Rossum wrote:

> This is why I find the issue of Python, the language (and stdlib), as
> a whole "conforming to the Unicode standard" such a troublesome
> concept -- I think it is something that an application may claim, but
> the language should make much more modest claims, such as "the regular
> expression syntax supports features X, Y and Z from the Unicode
> recommendation XXX", or "the UTF-8 codec will never emit a sequence of
> bytes that is invalid according to Unicode specification YYY". (As long
> as the Unicode references are also versioned or dated.)

This will be a great improvement. It was both embarrassing and frustrating to have to respond to Tom C.'s (and others') reports with "Our unicode type is too vaguely documented to tell whether you are reporting a bug or making a feature request."

>> But if you can observe (valid) surrogate pairs it is still UTF-16.
...
> Ok, I dig this, to some extent. However, saying it is UCS-2 is equally
> bad.

As I said on the tracker, our narrow builds are in-between (while moving closer to UTF-16), and both terms are deceptive, at least to some.
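
A quick way to see that in-between-ness on a pre-3.3 CPython (a small
sketch; the output depends on which build you run it on):

import sys
s = '\U0001D11E'  # MUSICAL SYMBOL G CLEF, a non-BMP code point
print(sys.maxunicode)  # 65535 on a narrow build, 1114111 on a wide one
print(len(s))  # 2 on a narrow build (a surrogate pair), 1 on a wide build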

> At the same time I think it would be useful if certain string
> operations like .lower() worked in such a way that *if* the input were
> valid UTF-16, *then* the output would also be, while *if* the input
> contained an invalid surrogate, the result would simply be something
> that is no worse (in particular, those are all mapped to themselves).
> We could even go further and have .lower() and friends look at
> graphemes (multi-code-point characters) if the Unicode std has a
> useful definition of e.g. lowercasing graphemes that differed from
> lowercasing code points.
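
For what it is worth, .lower() already passes lone surrogates through
unchanged, which is the no-worse behavior you describe (a quick check):

s = 'ABC\ud800'  # three ASCII letters plus a lone surrogate
print(s.lower() == 'abc\ud800')  # True: the surrogate maps to itself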

> An analogy is actually found in .lower() on 8-bit strings in Python 2:
> it assumes the string contains ASCII, and non-ASCII characters are
> mapped to themselves. If your string contains Latin-1 or EBCDIC or
> UTF-8 it will not do the right thing. But that doesn't mean strings
> cannot contain those encodings, it just means that the .lower() method
> is not useful if they do. (Why ASCII? Because that is the system
> encoding in Python 2.)

Good analogy.
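
And it carries forward: bytes.lower() in Python 3 behaves the same way
(a quick illustration):

b = b'Caf\xc3\xa9'  # 'Café' encoded as UTF-8
print(b.lower())  # b'caf\xc3\xa9': ASCII letters lowered, UTF-8 bytes untouched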

> Let's call those things graphemes (Tom C's term, I quite like leaving
> "character" ambiguous) -- they are sequences of multiple code points
> that represent a single "visual squiggle" (the kind of thing that
> you'd want to be swappable in vim with "xp" :-). I agree that APIs are
> needed to manipulate (match, generate, validate, mutilate, etc.)
> things at the grapheme level. I don't agree that this means a separate
> data type is required.

I presume that by 'separate data type' you mean a base-level builtin class like int or str, and that you would allow for wrapper classes built on top of str, since those are not really 'separate'. For the grapheme level and higher, we should certainly start with wrappers, and probably with alternate versions based on different strategies.
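
As a minimal sketch of one such strategy (assuming the simplified rule
that a combining mark attaches to the preceding base character; real
grapheme segmentation per UAX #29 handles many more cases):

import unicodedata

def rough_graphemes(s):
    # Simplified clustering: a combining mark (combining class != 0)
    # joins the preceding base character.  UAX #29 grapheme clusters
    # also cover Hangul jamo, CRLF, and more.
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(rough_graphemes('e\u0301tude'))  # ['e\u0301', 't', 'u', 'd', 'e']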

> There are ever-larger units of information
> encoded in text strings, with ever farther-reaching (and more vague)
> requirements on valid sequences. Do you want to have a data type that
> can represent (only valid) words in a language? Sentences? Novels?
...
> I think that at this point in time the best we can do is claim that
> Python (the language standard) uses either 16-bit code units or 21-bit
> code points in its string datatype, and that, thanks to PEP 393,
> CPython 3.3 and further will always use 21-bit code points (but Jython
> and IronPython may forever use their platform's native 16-bit code
> unit representing string type). And then we add APIs that can be used
> everywhere to look for code points (even if the string contains code
> points), graphemes, or larger constructs. I'd like those APIs to be
> designed using a garbage-in-garbage-out principle, where if the input
> conforms to some Unicode requirement, the output does too, but if the
> input doesn't, the output does what makes most sense. Validation is
> then limited to codecs, and optional calls.
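
That division of labor is visible today: the UTF-8 codec rejects a lone
surrogate that string operations happily pass through.

s = '\ud800'  # a lone surrogate: a legal str element, invalid in UTF-8
try:
    s.encode('utf-8')
except UnicodeEncodeError as e:
    print(e)  # the codec, not the string type, does the validating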

> If you index or slice a string, or create a string from chr() of a
> surrogate or from some other value that the Unicode standard considers
> an illegal code point, you better know what you are doing. I want
> chr(i) to be valid for all values of i in range(2**21),

Actually, it is range(0x110000) == range(1114112), so that UTF-8 needs at most 4 bytes per code point. "21 bits" is really about 20.1 bits, rounded up.
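
A quick check of both claims:

import math
print(0x110000 == 1114112)  # True
print(math.log(0x110000, 2))  # ~20.09: "21 bits" is 20.1 bits rounded up
print(len('\U0010FFFF'.encode('utf-8')))  # 4: the top code point needs 4 bytes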

> so it can be
> used to create a lone surrogate, or (on systems with 16-bit
> "characters") a surrogate pair. And also ord(chr(i)) == i for all i in
> range(2**21).

for i in range(0x110000):  # all 1114112 code points
    if ord(chr(i)) != i:
        print(i)
# prints nothing (on my Windows build, which is narrow)

> I'm not sure about ord() on a 2-character string
> containing a surrogate pair on systems where strings contain 21-bit
> code points; I think it should be an error there, just as ord() on
> other strings of length != 1. But on systems with 16-bit "characters",
> ord() of strings of length 2 containing a valid surrogate pair should
> work.

And it now does, thanks to whoever fixed this (within the last year, I think).
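
A quick check (on a narrow build, the literal below becomes a
2-character surrogate pair):

s = '\U00010400'  # length 2 on a narrow build, length 1 on a wide build
print(ord(s))  # 66560 == 0x10400 on either kind of build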

--
Terry Jan Reedy
