On 8/24/2011 4:22 AM, Stephen J. Turnbull wrote:
> Terry Reedy writes:
>  > The current UCS2 Unicode string implementation, by design, quickly gives
>  > WRONG answers for len(), iteration, indexing, and slicing if a string
>  > contains any non-BMP (surrogate pair) Unicode characters. That may have
>  > been excusable when there essentially were no such extended chars, and
>  > the few there were were almost never used.
>
> Well, no, it gives the right answer according to the design. unicode
> objects do not contain character strings.
Excuse me for believing the fine 3.2 manual that says
"Strings contain Unicode characters." (And to a naive reader, that
implies that string iteration and indexing should produce Unicode
characters.)
> By design, they contain code point strings.
For the purpose of my sentence, that is the same thing, in that code
points correspond to characters (where 'character' includes ASCII control
'characters' and their Unicode analogs). The problem is that on narrow
builds, strings are NOT code point sequences. They are 2-byte code *unit*
sequences. A single non-BMP code point is seen as 2 code units and hence
given a length of 2, not 1. Strings iterate, index, and slice by 2-byte
code units, not by code points.
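To make that concrete, here is a minimal interactive sketch, assuming a
narrow (UCS-2) build such as the stock 3.2 Windows binary (a wide build
gives len 1 and s[0] == s):

    >>> s = '\U00010400'     # one code point, DESERET CAPITAL LETTER LONG I
    >>> len(s)               # counts 16-bit code units, not code points
    2
    >>> s[0], s[1]           # indexing exposes the lone surrogates
    ('\ud801', '\udc00')
    >>> s[:1]                # slicing can cut the pair in half
    '\ud801'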
Python floats try to follow the IEEE standard as interpreted for Python
(Python raises software exceptions rather than distinguishing signalling
from non-signalling hardware signals). Python decimals slavishly follow
the IEEE decimal standard. Python narrow-build unicode breaks the standard
for non-BMP code points and consequently breaks the re module for such
strings, even though it works on wide builds. As sys.maxunicode more or
less says, only the BMP subset is fully supported. Any narrow-build string
with even one non-BMP char violates the standard.
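A small sketch of the kind of breakage I mean, again assuming a narrow
3.2 build (a wide build returns the single character):

    >>> import sys, re
    >>> sys.maxunicode                        # only the BMP is fully supported
    65535
    >>> re.findall('.', '\U00010400')         # '.' matches code units, not characters
    ['\ud801', '\udc00']
    >>> re.match('.$', '\U00010400') is None  # one "character" fails to match '.'
    True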
> Guido has made that absolutely clear on a number
> of occasions.
It is not clear what you mean, but recently on python-ideas he has
reiterated that he intends bytes and strings to be conceptually
different. Bytes are computer-oriented binary arrays; strings are
supposedly human-oriented character/codepoint arrays. Except they are
not for non-BMP characters/codepoints. Narrow build unicode is
effectively an array of two-byte binary units.
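Concretely (still assuming a narrow build), iterating such a string yields
exactly the 16-bit code units of its UTF-16 encoding:

    >>> import struct
    >>> s = '\U00010400'
    >>> [hex(ord(u)) for u in s]                     # what iteration yields
    ['0xd801', '0xdc00']
    >>> struct.unpack('<2H', s.encode('utf-16-le'))  # the same two 16-bit units
    (55297, 56320)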
> And the reasons have very little to do with lack of
> non-BMP characters to trip up the implementation. Changing those
> semantics should have been done before the release of Python 3.
The documentation was changed at least a bit for 3.0, and in any case, as
indicated above, it is easy (especially for new users) to read the docs
in a way under which the current behavior is a bug. I agree that the
implementation should have been changed already.

Currently, the meaning of Python code differs between narrow and wide
builds, and in a way that few users would expect or want. PEP 393
abolishes narrow builds as we now know them and changes the semantics. I
was answering a complaint about that change. If you do not like the PEP, fine.
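To be concrete about the visible change, here is what I expect any
post-PEP-393 build to do, on every platform (a sketch of the intended
behavior, not of the implementation):

    >>> import sys
    >>> sys.maxunicode        # no more 65535 narrow builds
    1114111
    >>> s = '\U00010400'
    >>> len(s)                # length in code points
    1
    >>> s[0] == s             # indexing and slicing by code point
    True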
My separate proposal in my other post is for an alternative
implementation, but with, I presume, pretty much the same visible changes.
> It is not clear to me that it is a good idea to try to decide on "the"
> correct implementation of Unicode strings in Python even today.
If the implementation is invisible to the Python user, as I believe it
should be except via special introspection, and mostly invisible in the
C-API except to those who intentionally poke into the details, then the
implementation can be changed as the consensus on the best implementation
changes.
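For instance, under a PEP-393-style flexible representation (a
hypothetical illustration; the exact byte counts are an internal detail),
only introspection such as sys.getsizeof() would reveal which width was
chosen:

    import sys

    # len(), indexing, and slicing are unaffected by the storage width;
    # only the reported object size hints at 1, 2, or 4 bytes per character.
    for ch in ('a', '\u20ac', '\U00010400'):
        s = ch * 1000
        print(len(s), sys.getsizeof(s))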
> There are a number of approaches that I can think of.
>
> 1. The "too bad if you can't take a joke" approach: do nothing and
>    recommend UTF-32 to those who want len() to DTRT.
> 2. The "slope is slippery" approach: Implement UTF-16 objects as
>    built-ins, and then try to fend off requests for correct treatment
>    of unnormalized composed characters, normalization, compatibility
>    substitutions, bidi, etc etc.
> 3. The "are we not hackers?" approach: Implement a transform that
>    maps characters that are not represented by a single code point
>    into Unicode private space, and then see if anybody really needs
>    more than 6400 non-BMP characters. (Note that this would
>    generalize to composed characters that don't have a one-code-point
>    NFC form and similar non-standardized cases that nonstandard users
>    might want handled.)
> 4. The "42" approach: sadly, I can't think deeply enough to explain it.
>
> There are probably others.
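For what it is worth, approach 3 could be prototyped in pure Python. The
sketch below is mine, not part of any proposal (the names pua_fold and
pua_unfold and the first-come table policy are made up), and it assumes a
narrow build, where a non-BMP character shows up as a surrogate pair:

    # The BMP Private Use Area U+E000..U+F8FF supplies the 6400 stand-ins.
    PUA_START, PUA_SIZE = 0xE000, 6400

    def pua_fold(text, table):
        """Replace each surrogate pair with a BMP private-use stand-in."""
        out = []
        i = 0
        while i < len(text):
            ch = text[i]
            if ('\ud800' <= ch <= '\udbff' and i + 1 < len(text)
                    and '\udc00' <= text[i + 1] <= '\udfff'):
                pair = text[i:i + 2]               # one non-BMP character
                if pair not in table:
                    if len(table) >= PUA_SIZE:
                        raise ValueError('more than 6400 distinct non-BMP chars')
                    table[pair] = chr(PUA_START + len(table))
                out.append(table[pair])
                i += 2
            else:
                out.append(ch)
                i += 1
        return ''.join(out)

    def pua_unfold(text, table):
        """Invert pua_fold, given the same table."""
        reverse = {stand_in: pair for pair, stand_in in table.items()}
        return ''.join(reverse.get(ch, ch) for ch in text)

After folding, len(), indexing, and slicing count characters again, at the
price of carrying the per-document table around and of confusing any code
that looks at the raw code points.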
> It's true that Python is going to need good libraries to provide
> correct handling of Unicode strings (as opposed to unicode objects).
Given that 3.0 unicode (string) objects are defined as Unicode character
strings, I do not see the opposition between the two.
> But it's not clear to me given the wide variety of implementations I
> can imagine that there will be one best implementation, let alone
> which ones are good and Pythonic, and which not so.
--
Terry Jan Reedy