"Philippe Verdy" <[EMAIL PROTECTED]> writes:

> The question is why you would need to extract the nth codepoint so
> blindly.

For example I'm scanning a string backwards (to remove '\n' at the
end, to find and display the last N lines of a buffer, to find the
last '/' or last '.' in a file name). SCSU in general supports only
forward traversal.
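
The backward scans above are trivial with any fixed-width
representation, because each step inspects exactly one code unit. A
minimal sketch in Python (the helper names are mine, just for
illustration):

```python
def strip_trailing_newlines(s):
    """Scan backwards to drop '\n' characters at the end."""
    i = len(s)
    while i > 0 and s[i - 1] == '\n':
        i -= 1
    return s[:i]

def last_slash_index(path):
    """Find the last '/' by scanning backwards; -1 if absent."""
    for i in range(len(path) - 1, -1, -1):
        if path[i] == '/':
            return i
    return -1

# Each iteration looks at one position in O(1). With SCSU the meaning
# of a byte depends on window state established earlier in the stream,
# so neither loop can be written without first decoding forwards from
# the start of the string.
```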

> But remember the context in which this discussion was introduced:
> which UTF would be the best to represent (and store) large sets of
> immutable strings. The discussion about indexes in substrings is not
> relevant in that context.

It is relevant. A general purpose string representation should support
at least a bidirectional iterator, or preferably efficient random access.
Neither is possible with SCSU.

* * *

Now consider scanning forwards. We want to strip the beginning of a
string. For example, the string is an IRC message prefixed with a
command, and we want to take only the message body for further
processing. We have found the end of the prefix and we want to produce
a string from this position to the end (a copy, since strings are
immutable).

With any stateless encoding a suitable library function will compute
the length of the result, allocate memory, and do an equivalent of
memcpy.
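
With a stateless encoding the suffix copy really is just a block copy.
A sketch in Python, with bytes standing in for the raw buffer (the IRC
message is a made-up example):

```python
def suffix_after(buf, pos):
    """Copy of buf from pos to the end. For a stateless encoding this
    is exactly: compute the length, allocate, memcpy -- which is what
    a bytes slice does in one step."""
    return buf[pos:]   # one allocation + one block copy

# e.g. stripping an IRC-style command prefix from a message
msg = b"PRIVMSG #channel :hello"
body = suffix_after(msg, msg.index(b":") + 1)   # b"hello"
```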

With SCSU it's not possible to copy the string without analysing it,
because the prefix might have changed the state, so the suffix is not
correct when treated as a standalone string. If the stripped part is
short and the remaining part is long, it might pay off to scan the
part we want to strip and take the memcpy shortcut if the prefix did
not change the state (which is probably a common case). But in general
we must recompress the whole copied part! We can't even precalculate
its physical size. Decompressing into temporary memory would negate
the benefits of a compressed encoding, so it is better to decompress
and recompress in parallel into a dynamically resizing buffer. This is
ridiculously complex compared to a memcpy.
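
The failure mode can be shown with a toy shift-state encoding (not
SCSU itself, but it exhibits the same problem): byte 0x0E switches to
a "high" mode where each byte b decodes as chr(b + 0x80), and byte
0x0F switches back.

```python
SO, SI = 0x0E, 0x0F   # shift-out / shift-in mode bytes

def decode(buf, mode_high=False):
    """Decode the toy stateful encoding, starting in the given mode."""
    out = []
    for b in buf:
        if b == SO:
            mode_high = True
        elif b == SI:
            mode_high = False
        else:
            out.append(chr(b + 0x80 if mode_high else b))
    return ''.join(out)

data = bytes([SO, 0x41, 0x42, 0x43])
decode(data)                       # '\xc1\xc2\xc3' -- three high chars
# Naively memcpy'ing the suffix from offset 2 drops the SO byte, so
# the copy decodes differently as a standalone string:
decode(data[2:])                   # 'BC' -- wrong
decode(data[2:], mode_high=True)   # '\xc2\xc3' -- right, but only
                                   # because we re-scanned the prefix
                                   # to recover the state
```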

The *only* advantage of SCSU is that it takes little space. Yet in
most programs most strings are ASCII, and SCSU never beats ISO-8859-1,
which is what the implementation of my language uses for strings with
no characters above U+00FF, so it usually does not even have this
advantage.

Disadvantages are everywhere else: every operation which looks at the
contents of a string or produces contents of a string is more complex.
Some operations can't be supported at all with the same asymptotic
complexity, so the API would have to be changed as well to use opaque
iterators instead of indices. It's more complicated both for internal
processing and for interoperability (unless the other end understands
SCSU too, which is unlikely).

Plain immutable character arrays are not completely universal either
(e.g. they are not sufficient for a buffer of a text editor), but they
are appropriate as the default representation for common cases; for
representing filenames, URLs, email addresses, computer language
identifiers, command line option names, lines of a text file, messages
in a dialog in a GUI, names of columns of a database table etc. Most
strings are short and thus performing a physical copy when extracting
a substring is not disastrous. But the complexity SCSU adds is
prohibitive.

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/
