"Philippe Verdy" <[EMAIL PROTECTED]> writes:

>> The point is that indexing should better be O(1).
>
> SCSU is also O(1) in terms of indexing complexity...

It is not. You can't extract the nth code point without scanning the
previous n-1 code points.
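
To make that concrete, here is a minimal Python sketch (my own illustration,
not anything from the SCSU spec) of what indexing a variable-width encoding
costs. It uses UTF-8 because the lead bytes are easy to classify; SCSU is
strictly worse, since its decoder must additionally carry the window state
selected by earlier input:

    def nth_code_point_utf8(data: bytes, n: int) -> str:
        """Return the n-th code point (0-based) of well-formed UTF-8 data.
        There is no way to jump straight to it: we must walk over the
        n preceding sequences first, which is O(n)."""
        index = 0
        for _ in range(n):
            lead = data[index]
            if lead < 0x80:
                index += 1      # 1-byte sequence (ASCII)
            elif lead < 0xE0:
                index += 2      # 2-byte sequence
            elif lead < 0xF0:
                index += 3      # 3-byte sequence
            else:
                index += 4      # 4-byte sequence
        lead = data[index]
        length = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
        return data[index:index + length].decode("utf-8")

    >>> nth_code_point_utf8("żółw".encode("utf-8"), 2)
    'ł'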

> But individual characters do not always have any semantic. For
> languages, the relevant unit is almost always the grapheme cluster,
> not the character (so not its code point...).

How do you determine the semantics of a grapheme cluster? Answer: by
splitting it into code points. A code point is atomic: it cannot be
split any further, and there is only a finite number of them.
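
For example (Python, purely to illustrate): "é" written as a base letter
plus a combining accent is a single grapheme cluster, but its data is two
code points, and those code points are what any property lookup works on:

    import unicodedata

    cluster = "e\u0301"     # 'e' + U+0301 COMBINING ACUTE ACCENT
    print(len(cluster))     # 2 -- two code points in one grapheme cluster
    for cp in cluster:
        print(hex(ord(cp)), unicodedata.name(cp))
    # 0x65 LATIN SMALL LETTER E
    # 0x301 COMBINING ACUTE ACCENT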

When a string is exchanged with another application, a networked
computer, or the OS, it always uses some encoding which is closer to
code points than to grapheme clusters, whether it's UTF-8 or UTF-16 or
ISO-8859-something. If the string were originally stored as an array
of grapheme clusters, it would have to be translated to code points
before further conversion.
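
A quick illustration (again Python, my own example): the encoders work
code point by code point, so a grapheme-cluster array would first have
to be flattened back into that sequence:

    s = "e\u0301"                 # one grapheme cluster, two code points
    print(s.encode("utf-8"))      # b'e\xcc\x81'
    print(s.encode("utf-16-le"))  # b'e\x00\x01\x03'
    # Each code point maps to bytes on its own; the grapheme cluster
    # boundary appears nowhere in the wire format.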

> Which represent will be the best is left to implementers, but I really
> think that compressed schemes are often introduced to increase the
> application performances and reduce the needed resources both in
> memory and for I/O, but also in networking where interoperability
> across systems and bandwidth optimization are also important design
> goals...

UTF-8 is much better for interoperability than SCSU, because it's
already widely supported and SCSU is not.

It's also easier to add support for UTF-8 than for SCSU. UTF-8 is
stateless, SCSU is stateful - this matters a great deal, because a
stateful decoder cannot process pieces of a string independently.
UTF-8 is easier to encode and decode.
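
Statelessness is easy to see in code. Here is a minimal sketch (mine,
assuming well-formed input and doing no validation) of a UTF-8 decode
step: everything needed is in the bytes at hand, so decoding can start
at any lead byte. A SCSU decoder cannot do that, because it has to know
which dynamic windows earlier input selected:

    def decode_one(data: bytes, index: int) -> tuple[int, int]:
        """Decode the UTF-8 sequence starting at index.
        Returns (code point, index of the next sequence).
        No decoder state is carried between calls."""
        b0 = data[index]
        if b0 < 0x80:                                  # 0xxxxxxx
            return b0, index + 1
        if b0 < 0xE0:                                  # 110xxxxx 10xxxxxx
            return ((b0 & 0x1F) << 6) | (data[index + 1] & 0x3F), index + 2
        if b0 < 0xF0:                                  # 1110xxxx + 2 tail bytes
            return (((b0 & 0x0F) << 12)
                    | ((data[index + 1] & 0x3F) << 6)
                    | (data[index + 2] & 0x3F)), index + 3
        return (((b0 & 0x07) << 18)                    # 11110xxx + 3 tail bytes
                | ((data[index + 1] & 0x3F) << 12)
                | ((data[index + 2] & 0x3F) << 6)
                | (data[index + 3] & 0x3F)), index + 4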

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/
