On Jul 3, 2008, at 10:46 AM, Jeroen Ruigrok van der Werven wrote:

On [20080703 15:58], Guido van Rossum ([EMAIL PROTECTED]) wrote:
You seem to be suggesting that len(u"\U00012345") should return 1 on
a system that internally uses UTF-16 and hence represents this string
as a surrogate pair.

From a Unicode and UTF-16 point of view that makes the most sense. So yes, I
am suggesting that.


I think this is misguided.

IMO, basically every programming language gets string handling wrong (maybe with the exception of the unreleased Perl 6; it had some interesting moves in this area, but I haven't really been paying attention). Everyone treats strings as arrays, but they are used quite differently. A programmer hardly ever needs to index a string by an arbitrary offset counted in codepoints, and the length in codepoints is not very useful either. Constant-time access to arbitrary codepoints in a string is pretty much unimportant. What *is* of the utmost importance is constant-time access to previously returned positions in the string.

I'd like to have 3 levels of access available:
1) "byte"-level. In a new implementation I'd probably choose to make all my strings stored in UTF-8, but UTF-16 is fine too.
2) codepoint-level.
3) grapheme-level.

You should be able to iterate over the string at any of the levels, ask for the nearest codepoint/grapheme boundary to the left or right of an index at a different level, etc.
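
To make the three levels concrete, here is a rough sketch in Python 3 (where len() on a str counts codepoints). The helper names utf16_units and approx_grapheme_count are made up for this illustration, and the grapheme count is only an approximation based on combining classes, not full Unicode text segmentation:

```python
import unicodedata

def utf16_units(s):
    """Number of 16-bit code units needed to encode s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2

def approx_grapheme_count(s):
    """Rough grapheme count: every codepoint with combining class 0
    is treated as the start of a new grapheme. This is an assumption
    made for illustration; real segmentation has more rules."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

# "A" + combining tilde, followed by a codepoint outside the BMP.
s = "A\N{COMBINING TILDE}\U00012345"

counts = {
    "utf8_bytes": len(s.encode("utf-8")),
    "utf16_units": utf16_units(s),
    "codepoints": len(s),
    "graphemes_approx": approx_grapheme_count(s),
}
print(counts)
```

The same string has four different "lengths" depending on the level you ask at, which is why a single integer len() is not very informative.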

Python could probably still be made to work kinda like this. I think a language designed as such in the first place could be nicer, with opaque index objects into the string rather than integers, and such, but...whatever.

Let's assume Python is changed to always store strings in UTF-16.

All it would take is adding a few more functions to the str object to operate on the higher levels. Wherever I say "pos" I mean an integer index into the string, at the UTF-16 level. That may sometimes be unaligned with the boundary of the representation you're asking about, and behavior in that case needs to be specified as well.

.nextcodepoint(curpos, how_many=1) -> returns an index into the string how_many codepoints to the right (or left if negative) of the index curpos.
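
A minimal sketch of how such a function might behave, written as a free function over a list of UTF-16 code units rather than as a real str method. The names to_utf16_units and next_codepoint are hypothetical, and the handling of an unaligned curpos (snap to the next boundary in the direction of travel, counting that as one step) is just one possible choice:

```python
import struct

def to_utf16_units(s):
    """Return s as a list of 16-bit UTF-16 code units."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def is_high(u):
    return 0xD800 <= u <= 0xDBFF

def is_low(u):
    return 0xDC00 <= u <= 0xDFFF

def next_codepoint(units, curpos, how_many=1):
    """Move how_many codepoints right (or left if negative) of curpos.
    Indexes are UTF-16 code unit offsets; the result is clamped to
    the range 0..len(units)."""
    pos = curpos
    n = len(units)
    remaining = how_many
    while remaining > 0 and pos < n:
        # A well-formed surrogate pair counts as one codepoint.
        if is_high(units[pos]) and pos + 1 < n and is_low(units[pos + 1]):
            pos += 2
        else:
            pos += 1
        remaining -= 1
    while remaining < 0 and pos > 0:
        if pos >= 2 and is_low(units[pos - 1]) and is_high(units[pos - 2]):
            pos -= 2
        else:
            pos -= 1
        remaining += 1
    return pos

units = to_utf16_units("a\U00012345b")  # 4 code units, 3 codepoints
print(next_codepoint(units, 1))      # steps over the surrogate pair
print(next_codepoint(units, 4, -2))  # two codepoints back from the end
```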

.nextgrapheme(curpos, how_many=1) -> returns an index into the string how_many graphemes to the right (or left if negative) of the index curpos.
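
The grapheme version could look something like the sketch below. To keep it short it works on a Python 3 str with codepoint indexes instead of UTF-16 offsets, and it approximates a grapheme as a base codepoint plus any following codepoints with a nonzero combining class. Both of those are simplifying assumptions, and next_grapheme is a hypothetical name:

```python
import unicodedata

def next_grapheme(s, curpos, how_many=1):
    """Move how_many approximate graphemes right (or left if negative)
    of curpos. Indexes are codepoint offsets into s; the result is
    clamped to the range 0..len(s)."""
    pos = curpos
    n = len(s)
    remaining = how_many
    while remaining > 0 and pos < n:
        pos += 1
        # Skip combining marks that belong to the current grapheme.
        while pos < n and unicodedata.combining(s[pos]) != 0:
            pos += 1
        remaining -= 1
    while remaining < 0 and pos > 0:
        pos -= 1
        # Back up until we sit on a base (non-combining) codepoint.
        while pos > 0 and unicodedata.combining(s[pos]) != 0:
            pos -= 1
        remaining += 1
    return pos

s = "A\N{COMBINING TILDE}\N{COMBINING MACRON BELOW}bc"
print(next_grapheme(s, 0))      # skips "A" and both combining marks
print(next_grapheme(s, 5, -2))  # two graphemes back from the end
```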

.codepoints(from_pos=0, to_pos=None) -> return an iterator of codepoints from 'from_pos' to 'to_pos'. I think codepoints could be represented as strings themselves (so usually one character, sometimes two character strings).
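
As a sketch of that iterator, assuming the string is available as a list of UTF-16 code units: each codepoint comes back as a tuple of one or two code units, standing in for the "one or two character strings" mentioned above. The function name and the tuple representation are choices made only for this illustration:

```python
def codepoints(units, from_pos=0, to_pos=None):
    """Yield each codepoint in units[from_pos:to_pos] as a tuple of
    one code unit, or two for a well-formed surrogate pair."""
    if to_pos is None:
        to_pos = len(units)
    pos = from_pos
    while pos < to_pos:
        u = units[pos]
        if (0xD800 <= u <= 0xDBFF and pos + 1 < to_pos
                and 0xDC00 <= units[pos + 1] <= 0xDFFF):
            yield (u, units[pos + 1])
            pos += 2
        else:
            # BMP codepoint, or a lone surrogate passed through as-is.
            yield (u,)
            pos += 1

# "a", U+12345 as a surrogate pair, "b"
units = [0x61, 0xD808, 0xDF45, 0x62]
result = list(codepoints(units))
print(result)
```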

.graphemes(from_pos=0, to_pos=None) -> return an iterator of graphemes from 'from_pos' to 'to_pos'. Also could be represented by strings. The returned graphemes should probably be normalized.
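
A rough sketch of the grapheme iterator, again over a Python 3 str with codepoint indexes rather than UTF-16 offsets. Grapheme boundaries are approximated with unicodedata.combining (an assumption, not full Unicode segmentation), and NFC is picked as the normalization form only because some form has to be chosen:

```python
import unicodedata

def graphemes(s, from_pos=0, to_pos=None):
    """Yield approximate graphemes from s[from_pos:to_pos], each as an
    NFC-normalized string."""
    cluster = ""
    for ch in s[from_pos:to_pos]:
        # A codepoint with combining class 0 starts a new grapheme.
        if cluster and unicodedata.combining(ch) == 0:
            yield unicodedata.normalize("NFC", cluster)
            cluster = ""
        cluster += ch
    if cluster:
        yield unicodedata.normalize("NFC", cluster)

parts = list(graphemes("A\N{COMBINING TILDE}b"))
print(parts)
```

Because of the normalization, "A" plus a combining tilde comes back as the single precomposed codepoint U+00C3, so callers see one consistent spelling of each grapheme.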

There are a few more desirable operations for manipulating strings at the grapheme level. They are needed because, unlike an encoded codepoint in UTF-8 or UTF-16, a grapheme can have a proper prefix that is itself a valid grapheme. So you want a find (and everything else that implicitly does a find operation, like split, replace, strip, etc.) which requires that both endpoints of its match are on a grapheme boundary. [[Probably the easiest way to implement this would be in the regexp engine.]]


A concrete example of that: u'A\N{COMBINING TILDE}\N{COMBINING MACRON BELOW}'.find(u'A\N{COMBINING TILDE}') returns 0. But you want a way to ask for only an *actual* "A with tilde", not an "A with tilde and macron below".
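
A minimal sketch of such a boundary-aware find in Python 3. It is done with plain str.find plus a boundary check rather than in the regexp engine, grapheme_find and is_grapheme_boundary are hypothetical names, and the boundary test is the same combining-class approximation, so treat it as an illustration only:

```python
import unicodedata

def is_grapheme_boundary(s, i):
    """True if index i is at either end of s or sits just before a
    codepoint that does not combine with the previous one."""
    return i <= 0 or i >= len(s) or unicodedata.combining(s[i]) == 0

def grapheme_find(s, sub, start=0):
    """Like s.find(sub, start), but only accept matches whose two
    endpoints both fall on (approximate) grapheme boundaries."""
    i = s.find(sub, start)
    while i != -1:
        if is_grapheme_boundary(s, i) and is_grapheme_boundary(s, i + len(sub)):
            return i
        i = s.find(sub, i + 1)
    return -1

haystack = "A\N{COMBINING TILDE}\N{COMBINING MACRON BELOW}"
needle = "A\N{COMBINING TILDE}"

print(haystack.find(needle))            # plain find matches at 0
print(grapheme_find(haystack, needle))  # boundary-aware find rejects it
```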



Anyhow, I'm not going to tackle this issue or try to push it further, but if someone does tackle it, Python could grow to have the best Unicode support available. :)

James

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev