On 8/24/2011 12:34 PM, Stephen J. Turnbull wrote:
> Terry Reedy writes:
>> Excuse me for believing the fine 3.2 manual that says
>> "Strings contain Unicode characters."
> The manual is wrong, then, subject to a pronouncement to the contrary,
Please suggest a re-wording then, as it is a bug for doc and behavior to
disagree.
>> For the purpose of my sentence, the same thing in that code points
>> correspond to characters,
> Not in Unicode, they do not. By definition, a small number of code
> points (eg, U+FFFF) *never* did and *never* will correspond to
> characters.
On computers, characters are represented by code points. What about the
other way around? http://www.unicode.org/glossary/#C says
code point:
1) i in range(0x110000) <broad definition>
2) "A value, or position, for a character" <narrow definition>
(To muddy the waters more, 'character' has multiple definitions also.)
You are using 1), I am using 2) ;-(.
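To make the difference concrete, a rough interactive sketch (just an
illustration, with U+FFFF standing in for the non-character code points):

    import unicodedata

    chr(0x10FFFF)                  # accepted: the last value in range(0x110000)
    try:
        chr(0x110000)              # rejected: outside the code space
    except ValueError as err:
        print(err)                 # chr() arg not in range(0x110000)

    # U+FFFF is a code point under 1) but not a character:
    print(unicodedata.category('\uffff'))   # 'Cn', a permanent non-character
    # U+0041 is a code point under both definitions:
    print(unicodedata.name('\u0041'))       # LATIN CAPITAL LETTER A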
>> Any narrow build string with even 1 non-BMP char violates the
>> standard.
> Yup. That's by design.
> [...]
> Sure. Nevertheless, practicality beat purity long ago, and that
> decision has never been rescinded AFAIK.
I think you have it backwards. I see the current situation as the purity
of the C code beating the practicality, for the user, of getting right
answers.
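To make 'right answers' concrete, here is roughly what a user sees today
with a single non-BMP character (a sketch; sys.maxunicode tells the two
builds apart):

    import sys

    s = '\U00010400'               # one character, DESERET CAPITAL LETTER LONG I
    if sys.maxunicode == 0xFFFF:   # narrow build: stored as a surrogate pair
        print(len(s))              # 2
        print(repr(s[0]))          # '\ud801', half a character
    else:                          # wide build
        print(len(s))              # 1
        print(repr(s[0]))          # the character itself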
> The thing is, that 90% of applications are not really going to care
> about full conformance to the Unicode standard.
I remember when Intel argued that 99% of applications were not going to
be affected when the math coprocessor in its then-new chips occasionally
gave 'non-standard' answers with certain divisors.
>> Currently, the meaning of Python code differs on narrow versus wide
>> build, and in a way that few users would expect or want.
> Let them become developers, then, and show us how to do it better.
I posted a proposal with a link to a prototype implementation in Python.
It pretty well solves the problem of narrow builds acting differently
from wide builds with respect to the basic operations of len(),
iteration, indexing, and slicing.
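The gist of the approach, sketched from memory rather than copied from
the posted prototype (the class name and details here are invented for
illustration), is to scan each string once for surrogate pairs and then
translate character indexes through that table:

    class WideView:
        """Index a narrow-build string by character rather than by code unit."""

        def __init__(self, s):
            self._s = s
            # code-unit positions of high surrogates (starts of surrogate pairs)
            self._pairs = [i for i, c in enumerate(s)
                           if '\ud800' <= c <= '\udbff']

        def __len__(self):
            # character count, not code-unit count
            return len(self._s) - len(self._pairs)

        def _unit_index(self, char_index):
            # translate a character index into a code-unit index
            j = char_index
            for p in self._pairs:
                if p < j:
                    j += 1
                else:
                    break
            return j

        def __getitem__(self, i):
            j = self._unit_index(i)
            c = self._s[j]
            if '\ud800' <= c <= '\udbff':
                return self._s[j:j + 2]    # return the whole pair as one item
            return c

On a wide build the pair list is always empty and the wrapper changes
nothing; on a narrow build, len() and indexing then count whole
characters, and slicing would translate both endpoints the same way.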
> No, I do like the PEP. However, it is only a step, a rather
> conservative one in some ways, toward conformance to the Unicode
> character model. In particular, it does nothing to resolve the fact
> that len() will give different answers for character count depending
> on normalization, and that slicing and indexing will allow you to cut
> characters in half (even in NFC, since not all composed characters
> have fully composed forms).
I believe my scheme could be extended to solve that also. It would
require more pre-processing and more knowledge of normalization than I
currently have. I have the impression that the grapheme problem goes
further than just normalization.
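For instance, a quick illustration of the point (nothing my prototype
handles yet): 'q' plus a combining acute has no precomposed form, so even
NFC leaves two code points.

    import unicodedata

    s = 'q\u0301'                        # 'q' + COMBINING ACUTE ACCENT
    nfc = unicodedata.normalize('NFC', s)
    print(len(nfc))                      # 2: there is no precomposed q-acute
    print(repr(nfc[:1]))                 # 'q': the slice cuts the character in half

    t = 'e\u0301'                        # 'e' + COMBINING ACUTE ACCENT
    print(len(unicodedata.normalize('NFC', t)))   # 1: composes to U+00E9
    print(len(unicodedata.normalize('NFD', t)))   # 2: decomposed again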
--
Terry Jan Reedy