On 8/24/2011 4:22 AM, Stephen J. Turnbull wrote:
> Terry Reedy writes:
>  > The current UCS2 Unicode string implementation, by design, quickly gives
>  > WRONG answers for len(), iteration, indexing, and slicing if a string
>  > contains any non-BMP (surrogate pair) Unicode characters. That may have
>  > been excusable when there essentially were no such extended chars, and
>  > the few there were were almost never used.
>
> Well, no, it gives the right answer according to the design. unicode
> objects do not contain character strings.
Excuse me for believing the fine 3.2 manual that says
"Strings contain Unicode characters." (And to a naive reader, that
implies that string iteration and indexing should produce Unicode
characters.)
> By design, they contain code point strings.
For the purpose of my sentence, that is the same thing, in that code
points correspond to characters (where 'character' includes ASCII control
'characters' and their Unicode analogs). The problem is that on narrow
builds, strings are NOT code point sequences. They are 2-byte code *unit*
sequences. A single non-BMP code point is seen as 2 code units and hence
given a length of 2, not 1. Strings iterate, index, and slice by 2-byte
code units, not by code points.
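To make that concrete, here is a minimal interactive sketch, assuming a
narrow (UCS-2) build such as the stock 3.2 Windows binary (a wide build
gives len 1 and s[0] == s):

    >>> s = '\U00010400'     # one code point, DESERET CAPITAL LETTER LONG I
    >>> len(s)               # counts 16-bit code units, not code points
    2
    >>> s[0], s[1]           # indexing exposes the lone surrogates
    ('\ud801', '\udc00')
    >>> s[:1]                # slicing can cut the pair in half
    '\ud801'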
Python floats try to follow the IEEE standard as interpreted for Python
(Python raises software exceptions rather than distinguishing signalling
from non-signalling hardware signals). Python decimals slavishly follow
the IEEE decimal standard. Python narrow-build unicode breaks the standard
for non-BMP code points and consequently breaks the re module for such
strings, even though it works on wide builds. As sys.maxunicode more or
less says, only the BMP subset is fully supported. Any narrow-build string
with even one non-BMP char violates the standard.
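A small sketch of the kind of breakage I mean, again assuming a narrow
3.2 build (a wide build returns the single character):

    >>> import sys, re
    >>> sys.maxunicode                        # only the BMP is fully supported
    65535
    >>> re.findall('.', '\U00010400')         # '.' matches code units, not characters
    ['\ud801', '\udc00']
    >>> re.match('.$', '\U00010400') is None  # one "character" fails to match '.'
    True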
> Guido has made that absolutely clear on a number
> of occasions.
It is not clear what you mean, but recently on python-ideas he has
reiterated that he intends bytes and strings to be conceptually
different. Bytes are computer-oriented binary arrays; strings are
supposedly human-oriented character/codepoint arrays. Except they are
not for non-BMP characters/codepoints. Narrow build unicode is
effectively an array of two-byte binary units.
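Concretely (still assuming a narrow build), iterating such a string yields
exactly the 16-bit code units of its UTF-16 encoding:

    >>> import struct
    >>> s = '\U00010400'
    >>> [hex(ord(u)) for u in s]                     # what iteration yields
    ['0xd801', '0xdc00']
    >>> struct.unpack('<2H', s.encode('utf-16-le'))  # the same two 16-bit units
    (55297, 56320)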
> And the reasons have very little to do with lack of
> non-BMP characters to trip up the implementation. Changing those
> semantics should have been done before the release of Python 3.
The documentation was changed at least a bit for 3.0, and in any case, as
indicated above, it is easy (especially for new users) to read the docs
in a way under which the current behavior is a bug. I agree that the
implementation should have been changed already.

Currently, the meaning of Python code differs between narrow and wide
builds, and in a way that few users would expect or want. PEP 393
abolishes narrow builds as we now know them and changes the semantics. I
was answering a complaint about that change. If you do not like the PEP, fine.
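To be concrete about the visible change, here is what I expect any
post-PEP-393 build to do, on every platform (a sketch of the intended
behavior, not of the implementation):

    >>> import sys
    >>> sys.maxunicode        # no more 65535 narrow builds
    1114111
    >>> s = '\U00010400'
    >>> len(s)                # length in code points
    1
    >>> s[0] == s             # indexing and slicing by code point
    True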
My separate proposal in my other post is for an alternative
implementation, but with, I presume, pretty much the same visible changes.
> It is not clear to me that it is a good idea to try to decide on "the"
> correct implementation of Unicode strings in Python even today.
If the implementation is invisible to the Python user, as I believe it
should be except via special introspection, and mostly invisible in the
C-API except to those who intentionally poke into the details, then the
implementation can be changed as the consensus on the best implementation
changes.
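For instance, under a PEP-393-style flexible representation (a
hypothetical illustration; the exact byte counts are an internal detail),
only introspection such as sys.getsizeof() would reveal which width was
chosen:

    import sys

    # len(), indexing, and slicing are unaffected by the storage width;
    # only the reported object size hints at 1, 2, or 4 bytes per character.
    for ch in ('a', '\u20ac', '\U00010400'):
        s = ch * 1000
        print(len(s), sys.getsizeof(s))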
> There are a number of approaches that I can think of.
>
> 1. The "too bad if you can't take a joke" approach: do nothing and
>    recommend UTF-32 to those who want len() to DTRT.
> 2. The "slope is slippery" approach: Implement UTF-16 objects as
>    built-ins, and then try to fend off requests for correct treatment
>    of unnormalized composed characters, normalization, compatibility
>    substitutions, bidi, etc etc.
> 3. The "are we not hackers?" approach: Implement a transform that
>    maps characters that are not represented by a single code point
>    into Unicode private space, and then see if anybody really needs
>    more than 6400 non-BMP characters. (Note that this would
>    generalize to composed characters that don't have a one-code-point
>    NFC form and similar non-standardized cases that nonstandard users
>    might want handled.)
> 4. The "42" approach: sadly, I can't think deeply enough to explain it.
>
> There are probably others.
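For what it is worth, approach 3 could be prototyped in pure Python. The
sketch below is mine, not part of any proposal (the names pua_fold and
pua_unfold and the first-come table policy are made up), and it assumes a
narrow build, where a non-BMP character shows up as a surrogate pair:

    # The BMP Private Use Area U+E000..U+F8FF supplies the 6400 stand-ins.
    PUA_START, PUA_SIZE = 0xE000, 6400

    def pua_fold(text, table):
        """Replace each surrogate pair with a BMP private-use stand-in."""
        out = []
        i = 0
        while i < len(text):
            ch = text[i]
            if ('\ud800' <= ch <= '\udbff' and i + 1 < len(text)
                    and '\udc00' <= text[i + 1] <= '\udfff'):
                pair = text[i:i + 2]               # one non-BMP character
                if pair not in table:
                    if len(table) >= PUA_SIZE:
                        raise ValueError('more than 6400 distinct non-BMP chars')
                    table[pair] = chr(PUA_START + len(table))
                out.append(table[pair])
                i += 2
            else:
                out.append(ch)
                i += 1
        return ''.join(out)

    def pua_unfold(text, table):
        """Invert pua_fold, given the same table."""
        reverse = {stand_in: pair for pair, stand_in in table.items()}
        return ''.join(reverse.get(ch, ch) for ch in text)

After folding, len(), indexing, and slicing count characters again, at the
price of carrying the per-document table around and of confusing any code
that looks at the raw code points.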
> It's true that Python is going to need good libraries to provide
> correct handling of Unicode strings (as opposed to unicode objects).
Given that 3.0 unicode (string) objects are defined as Unicode character
strings, I do not see the opposition between the two.
> But it's not clear to me given the wide variety of implementations I
> can imagine that there will be one best implementation, let alone
> which ones are good and Pythonic, and which not so.
--
Terry Jan Reedy