On Wed, 04 Jun 2014 12:53:19 +0100, Robin Becker wrote:
> I believe that we should distinguish between glyph/character indexing
> and string indexing. Even in unicode it may be hard to decide where a
> visual glyph starts and ends. I assume most people would like to assign
> one glyph to one unicode, but that's not always possible with composed
> >>> for a in (u'\xc5',u'A\u030a'):
> ...     for o in (u'\xf6',u'o\u0308'):
> ...         u=a+u'ngstr'+o+u'm'
> ...         print("%s %s" % (repr(u),u))
> u'\xc5ngstr\xf6m' Ångström
> u'\xc5ngstro\u0308m' Ångström
> u'A\u030angstr\xf6m' Ångström
> u'A\u030angstro\u0308m' Ångström
> >>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
> False
> so even unicode doesn't always allow for O(1) glyph indexing.
What you're talking about here is "graphemes", not glyphs. Glyphs are the
little pictures that represent the characters when written down.
Graphemes (technically, "grapheme clusters") are the things which native
speakers of a language believe ought to be considered a single unit.
Think of them as similar to letters. That can be quite tricky to
determine, and is dependent on the language you are speaking. The letters
"ch" are considered two letters in English, but only a single letter in
Czech and Slovak.
I believe that *grapheme-aware* text processing is *far* too complicated
for a programming language to promise. If you think that len() needs to
count graphemes, then what should len("ch") return, 1 or 2? Grapheme
processing is a complicated task best left to powerful libraries built
on top of a sturdy Unicode base.
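
To make the code-point view concrete, here is a minimal Python 3 sketch. The variable names are illustrative and not from the thread. It shows that len() counts code points, that the composed and decomposed spellings compare unequal as raw strings, and that normalising to NFC with the standard unicodedata module reconciles them in this particular case:

```python
import unicodedata

# Two spellings of the same word (names here are illustrative only).
composed = "\xc5ngstr\xf6m"            # precomposed Å and ö
decomposed = "A\u030angstro\u0308m"    # base letters plus combining marks

print(len(composed))                   # 8 code points
print(len(decomposed))                 # 10 code points
print(composed == decomposed)          # False: different code point sequences

# NFC composes base + combining mark where a precomposed form exists.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)                 # True

# len() knows nothing about Czech or Slovak "ch": it sees two code points.
print(len("ch"))                       # 2
```

Normalisation only helps where a precomposed code point exists. It does not turn len() into a grapheme counter, which is the point being made above.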
> I know this is artificial,
But it isn't artificial in the least. Unicode isn't complicated because
it's badly designed, or complicated for the sake of complexity. It's
complicated because human language is complicated. That, and because of
> but this is the same situation as utf8 faces just
> the frequency of occurrence is different. A very large amount of
> computing is still western centric so searching a byte string for latin
> characters is still efficient; searching for an n with a tilde on top
> might not be so easy.
This is a good point, but on balance I disagree. A grapheme-aware library
will probably need richer data structures than simple strings (arrays of
code points). But for the underlying, relatively simple string library,
graphemes are too hard. Code points are simple, and the
language can deal with code points without caring about their semantics.
For instance, in English, I might not want to insert letters between the
q and u of "queen", since in English u (nearly) always follows q. It
would be inappropriate for the programming language string library to
care about that, and similarly it would be inappropriate for it to care
that u'A\u030a' represents a single grapheme Å.
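
As a rough illustration of the kind of layer that could sit on top of a code-point string type, here is a sketch in Python 3. The function name naive_clusters is hypothetical, and the approach is a deliberate simplification: it only attaches combining marks to the preceding base character, and it is not an implementation of the Unicode text segmentation rules. It would, for example, still treat Czech "ch" as two units.

```python
import unicodedata

def naive_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Hypothetical helper for illustration only; real grapheme cluster
    segmentation has many more rules than this.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a nonzero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "A\u030angstro\u0308m"
print(len(word))                   # 10 code points
print(len(naive_clusters(word)))   # 8 clusters
print(naive_clusters("ch"))        # ['c', 'h']
```

The string type itself stays a plain sequence of code points. The clustering is an extra pass that a library can make as sophisticated, or as language-specific, as it needs to be.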