"Philippe Verdy" <[EMAIL PROTECTED]> writes: > The question is why you would need to extract the nth codepoint so > blindly.
For example, I'm scanning a string backwards (to remove '\n' at the end,
to find and display the last N lines of a buffer, to find the last '/'
or the last '.' in a file name). SCSU in general supports traversal only
forwards.

> But remember the context in which this discussion was introduced:
> which UTF would be the best to represent (and store) large sets of
> immutable strings. The discussion about indexes in substrings is not
> relevant in that context.

It is relevant. A general-purpose string representation should support
at least a bidirectional iterator, or preferably efficient random
access. Neither is possible with SCSU.

* * *

Now consider scanning forwards. We want to strip the beginning of a
string; for example, the string is an IRC message prefixed with a
command, and we want to take only the message for further processing.
We have found the end of the prefix and want to produce a string from
this position to the end (a copy, since strings are immutable).

With any stateless encoding a suitable library function will compute
the length of the result, allocate memory, and do the equivalent of a
memcpy.

With SCSU it's not possible to copy the string without analysing it,
because the prefix might have changed the compression state, so the
suffix is not correct when treated as a standalone string. If the
stripped part is short and the remaining part is long, it might pay off
to scan the part being stripped and fall back to a plain memcpy if the
prefix did not change the state (which is probably the common case).
But in general we must recompress the whole copied part! We can't even
precalculate its physical size. Decompressing into temporary memory
would negate the benefits of a compressed encoding, so we had better
decompress and recompress in parallel into a dynamically resizing
buffer. This is ridiculously complex compared to a memcpy.

The *only* advantage of SCSU is that it takes little space.
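To make the stateless case concrete, here is a minimal sketch in C.
The helper `suffix_copy` is hypothetical (not from any library), and it
assumes the byte offset of the suffix has already been found by scanning
the prefix:

```c
#include <stdlib.h>
#include <string.h>

/* Copy the suffix of an immutable string starting at byte offset `pos`.
 * With a stateless encoding (ISO-8859-1, UTF-8, UTF-32) the suffix is
 * already a valid string on its own, so the whole job is one
 * allocation plus one memcpy; no byte of the stripped prefix matters. */
char *suffix_copy(const char *s, size_t len, size_t pos) {
    size_t n = len - pos;
    char *out = malloc(n + 1);
    if (out) {
        memcpy(out, s + pos, n);
        out[n] = '\0';
    }
    return out;
}
```

With SCSU the same operation would additionally have to reconstruct the
compression state established by the stripped prefix, or recompress the
tail, before the copy is a valid standalone string.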
And in most programs most strings are ASCII, where SCSU never beats
ISO-8859-1, which is what the implementation of my language uses for
strings with no characters above U+00FF, so it usually doesn't have
even this advantage.

Disadvantages are everywhere else: every operation which looks at the
contents of a string, or produces the contents of a string, is more
complex. Some operations can't be supported at all with the same
asymptotic complexity, so the API would have to change as well, to use
opaque iterators instead of indices. It's more complicated both for
internal processing and for interoperability (unless the other end
understands SCSU too, which is unlikely).

Plain immutable character arrays are not completely universal either
(e.g. they are not sufficient for the buffer of a text editor), but
they are appropriate as the default representation for common cases:
for representing filenames, URLs, email addresses, computer language
identifiers, command line option names, lines of a text file, messages
in a GUI dialog, names of columns of a database table, etc. Most
strings are short, so performing a physical copy when extracting a
substring is not disastrous. But the complexity of SCSU is just too
high.

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/