On Thu, Mar 08, 2007 at 10:18:55PM -0500, Daniel B. wrote:
> ???????? wrote:
> ....
> > I have yet to encounter a case where a "character" count is useful.
>
> Well, if in an editor the user tries to move forward three characters,
> you probably want to increment a character count (an offset from
> the beginning of the string).

1. Normally you want to move locally by a (very) small integer number
   of characters, e.g. 1, not to a particular character offset a long
   way away. The latter is a valid operation, and it is expensive in
   UTF-8, but it has no practical applications that I know of except
   when every character occupies exactly one column and you’re trying
   to line up columns. Relative seeking by n characters in UTF-8 is
   O(n), independent of string length, so small relative cursor
   motions like your example are no problem.

2. Even in such an editor, the unit by which you normally want to move
   is “graphemes” and not “characters”. That is, if the cursor is
   positioned prior to ‘ã’ (LATIN SMALL LETTER A + COMBINING TILDE)
   and you press the right arrow, you probably want it to move past
   both characters and not “between” the two. The concept of graphemes
   is slightly more complex in Indic scripts. There are also the cases
   of Korean (decomposed Jamo), Tibetan (stacking letters), etc., which
   can be treated logically just like the A-TILDE example above.

> (No, I don't know how dealing with glyphs instead of just characters
> adds to that.)

Hopefully the above answers a little bit of that uncertainty.

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
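
Points 1 and 2 above can be sketched in C. This is only an illustration
under stated assumptions, not code from any particular editor: the
input is assumed to be valid, NUL-terminated UTF-8, the function names
(utf8_next, utf8_advance, is_combining, cluster_next) are made up for
this example, and the "grapheme" test only recognizes the Combining
Diacritical Marks block U+0300..U+036F. A real editor would need the
full Unicode grapheme cluster rules to cover Indic scripts, Jamo, and
so on.

```c
#include <assert.h>
#include <stddef.h>

/* Advance past one code point.  Assumes s points at the first byte of
 * a character in valid, NUL-terminated UTF-8 (an assumption of this
 * sketch; malformed input is not handled). */
const char *utf8_next(const char *s)
{
    assert(s != NULL);
    if (*s == '\0')
        return s;
    s++;
    /* Continuation bytes have the bit pattern 10xxxxxx. */
    while (((unsigned char)*s & 0xC0) == 0x80)
        s++;
    return s;
}

/* Move forward by n code points.  The cost is O(n) and does not depend
 * on the total length of the string. */
const char *utf8_advance(const char *s, size_t n)
{
    while (n-- > 0 && *s != '\0')
        s = utf8_next(s);
    return s;
}

/* Crude stand-in for a real grapheme test: is the code point at s in
 * U+0300..U+036F?  In UTF-8 that range is 0xCC 0x80 .. 0xCD 0xAF. */
int is_combining(const char *s)
{
    const unsigned char *u = (const unsigned char *)s;
    return (u[0] == 0xCC && u[1] >= 0x80 && u[1] <= 0xBF)
        || (u[0] == 0xCD && u[1] >= 0x80 && u[1] <= 0xAF);
}

/* "Right arrow": step over a base character together with any
 * combining marks that follow it. */
const char *cluster_next(const char *s)
{
    s = utf8_next(s);
    while (is_combining(s))
        s = utf8_next(s);
    return s;
}
```

With the decomposed string "a" + COMBINING TILDE + "bc", utf8_next
stops after the "a" (between the two characters), while cluster_next
lands on the "b", which is the behaviour described in point 2.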
