On 8/24/2011 4:22 AM, Stephen J. Turnbull wrote:
> Terry Reedy writes:

>> The current UCS2 Unicode string implementation, by design, quickly gives
>> WRONG answers for len(), iteration, indexing, and slicing if a string
>> contains any non-BMP (surrogate pair) Unicode characters. That may have
>> been excusable when there essentially were no such extended chars, and
>> the few there were were almost never used.

> Well, no, it gives the right answer according to the design.  unicode
> objects do not contain character strings.

Excuse me for believing the fine 3.2 manual, which says "Strings contain Unicode characters." (To a naive reader, that implies that iterating and indexing a string should produce Unicode characters.)

> By design, they contain code point strings.

For the purpose of my sentence, that amounts to the same thing, since code points correspond to characters (where 'character' includes ASCII control 'characters' and their Unicode analogs). The problem is that on narrow builds strings are NOT code point sequences; they are 2-byte code *unit* sequences. A single non-BMP code point is stored as 2 code units and is therefore reported as having length 2, not 1, and strings iterate, index, and slice by 2-byte code units, not by code points.
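A minimal sketch of the difference, using one arbitrary non-BMP character (the exact output depends on the build; 3.3+ with PEP 393 behaves like a wide build):

import sys

s = '\U00010400'   # a single non-BMP code point (DESERET CAPITAL LETTER LONG I)

if sys.maxunicode == 0xFFFF:           # narrow build
    print(len(s))                      # 2 -- counts UTF-16 code units
    print(ascii(s[0]), ascii(s[1]))    # '\ud801' '\udc00' -- a surrogate pair
    print(ascii(s[:1]))                # half a character
else:                                  # wide build, or 3.3+ after PEP 393
    print(len(s))                      # 1 -- counts code points
    print(s[0] == s)                   # True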

Python floats try to follow the IEEE standard as interpreted for Python (Python raises software exceptions rather than distinguishing signalling from non-signalling hardware behavior). Python decimals slavishly follow the IEEE decimal standard. Python narrow-build unicode breaks the Unicode standard for non-BMP code points and consequently breaks the re module even though it works on wide builds. As sys.maxunicode more or less admits, only the BMP subset is fully supported. Any narrow-build string with even one non-BMP char violates the standard.
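A sketch of how the re breakage shows up (same caveat about builds):

import re
import sys

s = '\U00010400'
print(hex(sys.maxunicode))        # 0xffff on a narrow build, 0x10ffff on a wide one
# '.' is supposed to match one character; on a narrow build the engine
# sees code units, so one non-BMP character produces two matches.
print(len(re.findall(r'.', s)))   # 2 on a narrow build, 1 on a wide build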

> Guido has made that absolutely clear on a number
> of occasions.

It is not clear what you mean, but recently on python-ideas he reiterated that he intends bytes and strings to be conceptually different. Bytes are computer-oriented binary arrays; strings are supposedly human-oriented character/code point arrays. Except that they are not when non-BMP characters/code points are involved: narrow-build unicode is effectively an array of two-byte binary units.
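To recover code point semantics on a narrow build, one has to pair up the surrogates by hand; a rough sketch (the helper name is mine, not stdlib):

def code_points(s):
    """Yield the ordinal of each code point, pairing up any surrogates.
    Illustrative only; not part of the standard library."""
    i, n = 0, len(s)
    while i < n:
        o = ord(s[i])
        # a high surrogate followed by a low surrogate encodes one non-BMP code point
        if 0xD800 <= o <= 0xDBFF and i + 1 < n and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF:
            lo = ord(s[i + 1])
            yield 0x10000 + ((o - 0xD800) << 10) + (lo - 0xDC00)
            i += 2
        else:
            yield o
            i += 1

With that, len(list(code_points(s))) counts code points even on a narrow build, which is exactly the bookkeeping users should not have to do themselves.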

> And the reasons have very little to do with lack of
> non-BMP characters to trip up the implementation.  Changing those
> semantics should have been done before the release of Python 3.

The documentation was changed at least a bit for 3.0, and anyway, as indicated above, it is easy (especially for new users) to read the docs in a way that makes the current behavior a bug. I agree that the implementation should have been changed already.

Currently, the meaning of Python code differs on narrow versus wide build, and in a way that few users would expect or want. PEP 393 abolishes narrow builds as we now know them and changes semantics. I was answering a complaint about that change. If you do not like the PEP, fine.

My separate proposal in my other post is for an alternative implementation, but with, I presume, pretty much the same visible changes.

> It is not clear to me that it is a good idea to try to decide on "the"
> correct implementation of Unicode strings in Python even today.

If the implementation is invisible to the Python user, as I believe it should be without special introspection, and mostly invisible in the C-API except to those who intentionally poke into the details, then the implementation can be changed as the consensus on the best implementation changes.

> There are a number of approaches that I can think of.
>
> 1.  The "too bad if you can't take a joke" approach: do nothing and
>      recommend UTF-32 to those who want len() to DTRT.
> 2.  The "slope is slippery" approach: Implement UTF-16 objects as
>      built-ins, and then try to fend off requests for correct treatment
>      of unnormalized composed characters, normalization, compatibility
>      substitutions, bidi, etc etc.
> 3.  The "are we not hackers?" approach: Implement a transform that
>      maps characters that are not represented by a single code point
>      into Unicode private space, and then see if anybody really needs
>      more than 6400 non-BMP characters.  (Note that this would
>      generalize to composed characters that don't have a one-code-point
>      NFC form and similar non-standardized cases that nonstandard users
>      might want handled.)
> 4.  The "42" approach: sadly, I can't think deeply enough to explain it.
>
> There are probably others.
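For concreteness, here is a rough sketch of what your third approach might look like (assuming code point iteration already works, i.e. a wide build or 3.3+; the PUAMapper name and API are made up for illustration):

# Map each distinct non-BMP code point to a BMP private-use code point
# (U+E000..U+F8FF, 6400 slots), so the stored string stays all-BMP.
PUA_START, PUA_END = 0xE000, 0xF8FF

class PUAMapper:
    def __init__(self):
        self._to_pua = {}     # real code point -> private-use stand-in
        self._from_pua = {}   # private-use stand-in -> real code point
        self._next = PUA_START

    def fold(self, s):
        """Replace non-BMP characters with private-use stand-ins."""
        out = []
        for ch in s:
            cp = ord(ch)
            if cp > 0xFFFF:
                if cp not in self._to_pua:
                    if self._next > PUA_END:
                        raise OverflowError('more than 6400 distinct non-BMP characters')
                    self._to_pua[cp] = self._next
                    self._from_pua[self._next] = cp
                    self._next += 1
                cp = self._to_pua[cp]
            out.append(chr(cp))
        return ''.join(out)

    def unfold(self, s):
        """Restore the original characters."""
        return ''.join(chr(self._from_pua.get(ord(ch), ord(ch))) for ch in s)

The obvious cost is that genuine private-use text would collide with the stand-ins.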

> It's true that Python is going to need good libraries to provide
> correct handling of Unicode strings (as opposed to unicode objects).

Given that 3.0 unicode (string) objects are defined as Unicode character strings, I do not see the opposition between the two.

> But it's not clear to me given the wide variety of implementations I
> can imagine that there will be one best implementation, let alone
> which ones are good and Pythonic, and which not so.

--
Terry Jan Reedy
