> So let me explain it. I see two different sequences of code points:
> 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> claim they are equivalent. They are two different sequences of code
> points.

If they were sequences of integers, or sequences of bytes, I'd agree
with you.  But they are explicitly sequences of characters, not
sequences of code points.  There should be one internal normalized form
for strings.
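
To illustrate what a single normalized form buys you, here's a minimal
sketch using the standard unicodedata module (the choice of NFC is just
one possible normalized form):

    >>> import unicodedata
    >>> precomposed = 'L\u00F6wis'   # 'o with diaeresis' as one code point
    >>> decomposed  = 'Lo\u0308wis'  # 'o' plus combining diaeresis
    >>> precomposed == decomposed
    False
    >>> unicodedata.normalize('NFC', decomposed) == precomposed
    True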

> We should not hide that Python's unicode string object can
> store each sequence of code points equally well, and that when viewed
> as a sequence they are different: the first has len() == 5, the second
> has len() == 6!

We should definitely not expose that difference!
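
Concretely, this is the difference in question (same two literals as
above):

    >>> len('L\u00F6wis')
    5
    >>> len('Lo\u0308wis')
    6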

> When read from a file they are different.

A file is in UTF-8, or UTF-16, or whatever -- it contains a string
coerced to a sequence of bytes.  Whatever reads that file should in
fact either preserve that sequence of bytes (in which case it's not a
string), or coerce it to a Unicode string, in which case the file
representation is immaterial and the Python normalized form is used
internally.
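
A sketch of those two options (the filename is hypothetical, and the
explicit NFC call is an assumption on my part -- Python does not itself
normalize on decode):

    import unicodedata

    # Option 1: preserve the exact byte sequence -- this is bytes,
    # not a string.
    with open('input.txt', 'rb') as f:
        raw = f.read()

    # Option 2: coerce to a Unicode string.  Once decoded, the on-disk
    # encoding is immaterial; normalize to the internal form.
    with open('input.txt', encoding='utf-8') as f:
        text = unicodedata.normalize('NFC', f.read())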

> I might be
> writing either literal with the expectation to get exactly that
> sequence of code points, in order to use it as a test case or as input
> for another program that requires specific input.

In that case you should write it as a sequence of integers, because
that's what you're dealing with.
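
For example (a sketch, using the decomposed 'Löwis' sequence from
above):

    # Spell the exact code points out as integers, so no equivalence
    # semantics can get in the way of the test data.
    codepoints = [0x4C, 0x6F, 0x0308, 0x77, 0x69, 0x73]  # L, o, U+0308, w, i, s
    test_input = ''.join(chr(cp) for cp in codepoints)
    assert len(test_input) == 6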

> There's a simpler solution. The unicode (or str, in Py3k) data type
> represents a sequence of code points, not a sequence of characters.
> This has always been the case, and will continue to be the case.

Bad idea, IMO.

Bill