Re: c++ strings and UTF-8 (other charsets)

Rich Felker Thu, 08 Mar 2007 20:42:08 -0800

On Thu, Mar 08, 2007 at 10:18:55PM -0500, Daniel B. wrote:
> ???????? wrote:
> ....
> > I have yet to encounter a case where a "character" count is useful.
> 
> Well, if in an editor the user tries to move forward three characters,
> you probably want to increment a character count (an offset from
> the beginning of the string).


1. Normally you want to move locally by a (very) small number of
characters, e.g. 1, not to a particular character offset a long way
away. The latter is a valid operation, and it is expensive in UTF-8
because you have to scan from the start of the string, but it has no
practical applications that I know of except when all characters
occupy exactly one column and you’re trying to line up columns.
Relative seeking by n characters in UTF-8 is O(n), independent of
string length, so it is no problem for small relative cursor motion
like your example.
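
To make the O(n) claim concrete, here is a minimal sketch of relative
seeking over a std::string holding UTF-8. The names utf8_advance and
utf8_retreat are made up for illustration, and the code assumes the
input is valid UTF-8 and that pos already sits on a character
boundary; it does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: byte offset reached by moving n characters
// (code points) forward from byte offset pos. Assumes valid UTF-8 and
// that pos is on a character boundary. Stops at the end of the string
// if fewer than n characters remain.
std::size_t utf8_advance(const std::string& s, std::size_t pos, std::size_t n) {
    assert(pos <= s.size());
    while (n > 0 && pos < s.size()) {
        ++pos;  // step over the lead byte
        while (pos < s.size() &&
               is_continuation(static_cast<unsigned char>(s[pos])))
            ++pos;  // step over its continuation bytes
        --n;
    }
    return pos;
}

// Hypothetical helper: the same thing, moving backward. Stops at
// offset 0 if fewer than n characters precede pos.
std::size_t utf8_retreat(const std::string& s, std::size_t pos, std::size_t n) {
    assert(pos <= s.size());
    while (n > 0 && pos > 0) {
        --pos;  // step onto the last byte of the previous character
        while (pos > 0 &&
               is_continuation(static_cast<unsigned char>(s[pos])))
            --pos;  // walk back to its lead byte
        --n;
    }
    return pos;
}
```

Each loop touches only the bytes of the n characters it steps over (at
most 4 bytes each), so the cost depends on n and not on how long the
string is or where in it the cursor sits.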

2. Even in such an editor, the unit you normally want to move by is
the “grapheme” and not the “character”. That is, if the cursor is
positioned before ‘ã’ (LATIN SMALL LETTER A + COMBINING TILDE) and
you press the right arrow, you probably want it to move past both
characters and not stop “between” the two. The concept of graphemes is
slightly more complex in Indic scripts. There are also the cases of
Korean (decomposed Jamo), Tibetan (stacking letters), etc., which can
be treated logically just like the A-TILDE example above.
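
A rough sketch of the A-TILDE case, to show what moving by grapheme
rather than by character looks like. This is a deliberate
simplification: is_combining_mark below only recognizes the Combining
Diacritical Marks block (U+0300..U+036F) as a stand-in for a real
lookup of Unicode character properties, so it does not handle Jamo,
Tibetan or Indic scripts. The function names are invented for this
example, and valid UTF-8 input is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode the code point starting at byte offset pos and store its
// encoded length in len. Assumes valid UTF-8; no error checking beyond
// clamping len so a truncated tail is never read past the end.
char32_t utf8_decode(const std::string& s, std::size_t pos, std::size_t& len) {
    unsigned char b0 = static_cast<unsigned char>(s[pos]);
    char32_t cp;
    if (b0 < 0x80)      { len = 1; cp = b0; }
    else if (b0 < 0xE0) { len = 2; cp = b0 & 0x1F; }
    else if (b0 < 0xF0) { len = 3; cp = b0 & 0x0F; }
    else                { len = 4; cp = b0 & 0x07; }
    if (pos + len > s.size()) len = s.size() - pos;  // truncated input
    for (std::size_t i = 1; i < len; ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos + i]) & 0x3F);
    return cp;
}

// ASSUMPTION: simplified test covering only U+0300..U+036F. A real
// editor would consult the Unicode character database instead.
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: byte offset just past the base character at pos
// and any combining marks that follow it. This is where the cursor
// should land after one press of the right arrow.
std::size_t next_cluster(const std::string& s, std::size_t pos) {
    assert(pos <= s.size());
    if (pos >= s.size()) return s.size();
    std::size_t len = 0;
    utf8_decode(s, pos, len);
    pos += len;  // step over the base character
    while (pos < s.size()) {
        char32_t cp = utf8_decode(s, pos, len);
        if (!is_combining_mark(cp)) break;
        pos += len;  // absorb the combining mark into the cluster
    }
    return pos;
}
```

With the bytes for "a" + U+0303 + "b", a call to next_cluster at
offset 0 steps over both the letter and the tilde in one go, and the
following call steps over "b" alone.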

> (No, I don't know how dealing with glyphs instead of just characters
> adds to that.)

Hopefully the above answers a little bit of that uncertainty.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
