Re: [Python-Dev] UCS2/UCS4 default

James Y Knight Thu, 03 Jul 2008 10:08:27 -0700

On Jul 3, 2008, at 10:46 AM, Jeroen Ruigrok van der Werven wrote:

-On [20080703 15:58], Guido van Rossum ([EMAIL PROTECTED]) wrote:

You seem to be suggesting that len(u"\U00012345") should return 1 on
a system that internally uses UTF-16 and hence represents this string
as a surrogate pair.

From a Unicode and UTF-16 point of view that makes the most sense. So yes, I
am suggesting that.



I think this is misguided.

IMO, basically every programming language gets string handling wrong. (Maybe with the exception of the unreleased perl6? It had some interesting moves in this area, but I haven't really been paying attention.) Everyone treats strings as arrays, but they are used quite differently. For a string, there is hardly ever a time when a programmer needs to index it with an arbitrary offset in number of codepoints, and the length-in-codepoints is pretty non-useful as well. Constant-time access to arbitrary codepoints in a string is pretty much unimportant. What *is* of utmost importance is constant-time access to previously-returned points in the string.


I'd like to have 3 levels of access available:

1) "byte"-level. In a new implementation I'd probably choose to store all my strings in UTF-8, but UTF-16 is fine too.

2) codepoint-level.
3) grapheme-level.

You should be able to iterate over the string at any of the levels, ask for the nearest codepoint/grapheme boundary to the left or right of an index at a different level, etc.
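To make the three levels concrete, here is a rough sketch in present-day Python. The helper names utf16_units and simple_graphemes are made up for illustration, and the grapheme grouping is a deliberate simplification (a base codepoint plus any combining marks that follow it), not the full Unicode segmentation rules.

```python
import unicodedata

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def simple_graphemes(s):
    """Yield approximate graphemes: a base codepoint plus trailing combining marks.

    This is an assumption-laden stand-in for real grapheme segmentation.
    """
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch) == 0:
            # A non-combining codepoint starts a new cluster.
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster

text = "A\N{COMBINING TILDE}\U00012345"

units = utf16_units(text)                 # level 1: storage units (UTF-16 here)
codepoints = list(text)                   # level 2: codepoints
graphemes = list(simple_graphemes(text))  # level 3: (approximate) graphemes

print(len(units), len(codepoints), len(graphemes))
```

The same string has a different length at each level, which is why a single len() cannot answer every question.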

Python could probably still be made to work kinda like this. I think a language designed as such in the first place could be nicer, with opaque index objects into the string rather than integers, and such, but...whatever.


Let's assume python is changed to always store strings in UTF-16.

All it would take is adding a few more functions to the str object to operate on the higher levels. Wherever I say "pos" I mean an integer index into the string, at the UTF-16 level. That may sometimes be unaligned with the boundary of the representation you're asking about, and behavior in that case needs to be specified as well.

.nextcodepoint(curpos, how_many=1) -> returns an index into the string how_many codepoints to the right (or left if negative) of the index curpos.

.nextgrapheme(curpos, how_many=1) -> returns an index into the string how_many graphemes to the right (or left if negative) of the index curpos.

.codepoints(from_pos=0, to_pos=None) -> return an iterator of codepoints from 'from_pos' to 'to_pos'. I think codepoints could be represented as strings themselves (so usually one character, sometimes two character strings).

.graphemes(from_pos=0, to_pos=None) -> return an iterator of graphemes from 'from_pos' to 'to_pos'. Also could be represented by strings. The returned graphemes should probably be normalized.
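A minimal sketch of the two codepoint-level functions, written as free functions over a list of UTF-16 code units rather than as str methods. The unaligned-index behavior is an assumption on my part: an index in the middle of a surrogate pair moves to the nearest boundary in the requested direction, and that counts as one step. The helpers utf16_units and units_to_str are hypothetical, and since this runs on a current Python, each codepoint comes back as a one-character str.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def units_to_str(units):
    """Decode a list of UTF-16 code units back into a str (hypothetical helper)."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    return raw.decode("utf-16-le", "surrogatepass")

def is_high(u):
    return 0xD800 <= u <= 0xDBFF

def is_low(u):
    return 0xDC00 <= u <= 0xDFFF

def nextcodepoint(units, curpos, how_many=1):
    """Return the index how_many codepoints right (or left if negative) of curpos."""
    pos = curpos
    for _ in range(abs(how_many)):
        if how_many > 0:
            if pos >= len(units):
                raise IndexError("moved past the end of the string")
            # Step over both halves when standing on a complete surrogate pair.
            pair = (is_high(units[pos]) and pos + 1 < len(units)
                    and is_low(units[pos + 1]))
            pos += 2 if pair else 1
        else:
            if pos <= 0:
                raise IndexError("moved before the start of the string")
            # Step back over both halves when a complete pair ends at pos.
            pair = (pos >= 2 and is_low(units[pos - 1])
                    and is_high(units[pos - 2]))
            pos -= 2 if pair else 1
    return pos

def codepoints(units, from_pos=0, to_pos=None):
    """Iterate over the codepoints between two UTF-16 level indexes."""
    end = len(units) if to_pos is None else to_pos
    pos = from_pos
    while pos < end:
        nxt = min(nextcodepoint(units, pos), end)
        yield units_to_str(units[pos:nxt])
        pos = nxt

units = utf16_units("a\U00012345b")   # four code units: a, high, low, b
print(nextcodepoint(units, 1))        # index just past the surrogate pair
print(list(codepoints(units)))
```

The grapheme-level pair would follow the same shape, with the boundary test replaced by a grapheme-break test.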

There are a few more desirable operations, to manipulate strings at the grapheme level (because unlike for UTF-8/UTF-16 codepoints, graphemes don't have the nice property of not containing prefixes which are themselves valid graphemes). So, you want a find (and everything else that implicitly does a find operation, like split, replace, strip, etc) which requires that both endpoints of its match are on a grapheme-boundary. [[Probably the easiest way to implement this would be in the regexp engine.]]

A concrete example of that: u'A\N{COMBINING TILDE}\N{COMBINING MACRON BELOW}'.find(u'A\N{COMBINING TILDE}') returns 0. But you want a way to ask for only an *actual* "A with tilde", not an "A with tilde and macron".
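Here is roughly what such a boundary-respecting find could look like, as a sketch rather than a real proposal. The names is_boundary and grapheme_find are invented, and the boundary test is the crude "not followed by a combining mark" approximation, not proper grapheme segmentation. It is written for a current Python, so indexes count codepoints rather than UTF-16 units.

```python
import unicodedata

def is_boundary(s, i):
    """True if index i falls between two (approximate) graphemes of s."""
    return i <= 0 or i >= len(s) or unicodedata.combining(s[i]) == 0

def grapheme_find(haystack, needle, start=0):
    """Like str.find, but both ends of the match must sit on a grapheme boundary."""
    pos = haystack.find(needle, start)
    while pos != -1:
        if is_boundary(haystack, pos) and is_boundary(haystack, pos + len(needle)):
            return pos
        # This match cuts a grapheme in two; look for the next candidate.
        pos = haystack.find(needle, pos + 1)
    return -1

a_tilde = "A\N{COMBINING TILDE}"
a_tilde_macron = "A\N{COMBINING TILDE}\N{COMBINING MACRON BELOW}"

plain = a_tilde_macron.find(a_tilde)             # matches a prefix of the grapheme
strict = grapheme_find(a_tilde_macron, a_tilde)  # rejects the partial grapheme
print(plain, strict)
```

split, replace, strip and friends could be layered on the same check.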

Anyhow, I'm not going to tackle this issue or try to push it further, but if someone does tackle it, python could grow to have the best unicode available. :)


James

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev