On 09/19/2010 07:13 AM, Bruno Haible wrote:
Correct. This is one of the major design decisions Paul, Jim, and I agreed upon in 2001. It is this requirement which forbids converting the input to a wchar_t stream, doing processing with wchar_t objects, and producing a stream of wchar_t objects that are finally converted to multibyte representation again.
Particularly on platforms like Cygwin, where sizeof(wchar_t) is 2, so you already have the complication of dealing with surrogate pairs to represent all possible Unicode characters (that is, Cygwin disobeys the rule that there is a 1-to-1 mapping between characters and wchar_t, since some characters require 2 wchar_t).
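To make the surrogate-pair complication concrete, here is a minimal sketch of how a code point above U+FFFF ends up occupying two 16-bit units. The function name encode_utf16 is made up for illustration; it uses uint16_t rather than wchar_t so that it behaves the same regardless of the platform's wchar_t width.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 code units.
   Returns the number of units written (1 or 2), or 0 if cp is not a
   valid Unicode scalar value.  encode_utf16 is a hypothetical helper,
   not part of any library mentioned in this thread. */
size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    /* Reject values beyond U+10FFFF and lone surrogate code points. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    /* Characters in the Basic Multilingual Plane fit in one unit. */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Everything else is split into a high and a low surrogate,
       each carrying 10 bits of the offset from U+10000. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

Any code that assumes "one unit is one character" has to special-case the return value 2, which is exactly the complication described above.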
It is this requirement which also forbids converting the input to UTF-8, doing processing with Unicode characters, and converting the Unicode character stream to multibyte representation at the end. This approach is acceptable for a word processor that can refuse to open a file, or for more general applications. But for coreutils, where classical behaviour is to get reasonable processing in the "C" locale of files encoded in UTF-8, EUC-JP, or ISO-8859-2, this approach is not workable.
Ah, but Cygwin's approach is to convert invalid byte sequences into the second half of a Unicode surrogate pair. This is still recognizable in UTF-8 processing as an invalid character, but has the advantage that it can still be handled like any other valid UTF-8 encoding for determining how many bytes form each processing unit, and can be mapped 1-to-1 back to the original invalid byte sequence. Thus, any byte sequence of any locale can be converted into this extended UTF-8 scheme, operations performed in UTF-8, then finally mapped back to the original locale, all while preserving the invalid byte sequences in the original locale untouched by the UTF-8 processing.
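A rough sketch of the round trip, to show why the mapping is lossless. The exact layout is an assumption on my part: each invalid byte b (0x80..0xFF) is mapped to the low surrogate U+DC00 + b, which is the same range Python's "surrogateescape" handler uses; I have not checked that Cygwin's layout is identical. All three function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED layout (not confirmed for Cygwin): an invalid byte b in
   0x80..0xFF is represented by the lone low surrogate U+DC00 + b,
   i.e. a code point in U+DC80..U+DCFF. */

/* Map an invalid byte to its escape code point. */
uint32_t escape_invalid_byte(unsigned char b)
{
    assert(b >= 0x80);  /* this sketch only escapes bytes >= 0x80 */
    return 0xDC00u + b;
}

/* True if cp is one of the 128 escape code points. */
int is_escaped_byte(uint32_t cp)
{
    return cp >= 0xDC80 && cp <= 0xDCFF;
}

/* Recover the original byte from an escape code point. */
unsigned char unescape_invalid_byte(uint32_t cp)
{
    assert(is_escaped_byte(cp));
    return (unsigned char)(cp - 0xDC00u);
}
```

Because a lone low surrogate never results from decoding valid input, is_escaped_byte() cannot misfire on real characters, and the final conversion back to the original locale can emit the stored byte unchanged.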
For this reason, gnulib has the modules 'mbchar', 'mbiter', 'mbuiter', 'mbfile', which provide a "multibyte character" datatype that accommodates also invalid byte sequences. Emacs handles this requirement by extending UTF-8. But this approach is unique to Emacs: libunistring and other software support plain UTF-8, not extended UTF-8.
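For readers unfamiliar with the idea, here is a minimal sketch of iteration that tolerates invalid byte sequences, written against standard C mbrtowc() only. It is not the gnulib 'mbiter' API, and count_units is a made-up name; it merely illustrates the policy of treating each undecodable byte as a unit of its own instead of failing.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the processing units in s[0..len): each valid multibyte
   character counts as one unit, and so does each invalid byte.
   Decoding uses the current locale's LC_CTYPE. */
size_t count_units(const char *s, size_t len)
{
    mbstate_t st;
    size_t count = 0;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    while (len > 0) {
        size_t n = mbrtowc(NULL, s, len, &st);

        if (n == (size_t)-1) {
            /* Invalid sequence: consume one byte as its own unit and
               reset the state, which is undefined after an error. */
            n = 1;
            memset(&st, 0, sizeof st);
        } else if (n == (size_t)-2) {
            /* Incomplete character at the end of the buffer: treat
               the remaining bytes as a single invalid unit. */
            n = len;
        } else if (n == 0) {
            /* Null character: assumed to occupy one byte. */
            n = 1;
        }
        s += n;
        len -= n;
        count++;
    }
    return count;
}
```

A caller would invoke setlocale(LC_CTYPE, "") first to decode in the user's locale; without that, the program runs in the "C" locale, where every byte ends up as its own unit either way.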
Does it make sense to add some extended UTF-8 support into libunistring, then?
--
Eric Blake   [email protected]    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
