Re: [coreutils] Re: [PATCH] join: support multi-byte character encodings

Eric Blake Mon, 20 Sep 2010 08:06:33 -0700

On 09/19/2010 07:13 AM, Bruno Haible wrote:

Correct. This is one of the major design decisions Paul, Jim, and I agreed upon
in 2001. It is this requirement which forbids converting the input to a wchar_t
stream, doing processing with wchar_t objects, and producing a stream of wchar_t
objects that are finally converted to multibyte representation again.

Particularly on platforms like Cygwin where sizeof(wchar_t) is 2, so you
already have the complication of dealing with surrogate pairs to
represent all possible Unicode characters (that is, cygwin disobeys the
rule that you have a 1-to-1 mapping between characters and wchar_t,
since there are some characters that require 2 wchar_t).
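
To make the surrogate-pair complication concrete, here is a minimal sketch of
how a code point outside the Basic Multilingual Plane ends up as two 16-bit
units. The function name utf16_encode is hypothetical and is not taken from
Cygwin or gnulib; the arithmetic is the standard UTF-16 encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 code units.
   Returns the number of units written to OUT (1 or 2), or 0 if CP is
   not a Unicode scalar value (out of range, or itself a surrogate).
   Hypothetical helper, for illustration only.  */
static size_t
utf16_encode (uint32_t cp, uint16_t out[2])
{
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;
  if (cp < 0x10000)
    {
      /* Fits in a single 16-bit unit: the 1-to-1 case.  */
      out[0] = (uint16_t) cp;
      return 1;
    }
  /* Split the remaining 20 bits across a high and a low surrogate.  */
  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 + (cp >> 10));
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));
  return 2;
}
```

With a 2-byte wchar_t, any code that assumes "one wchar_t is one character"
miscounts every character for which this function returns 2.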


It is this requirement which also forbids converting the input to UTF-8, doing
processing with Unicode characters, and converting the Unicode character stream
to multibyte representation at the end. This approach is acceptable for a word
processor that can refuse to open a file, or for more general applications.
But for coreutils, where classical behaviour is to get reasonable processing in
the "C" locale of files encoded in UTF-8, EUC-JP, or ISO-8859-2, this approach
cannot be done.

Ah, but cygwin's approach is to convert invalid byte sequences into the
second half of a Unicode surrogate pair. This is still recognizable in
UTF-8 processing as an invalid character, but has the advantage that it
can still be handled like any other valid UTF-8 encoding for determining
how many bytes form each processing unit, and can be mapped 1-to-1 back
to the original invalid byte sequence. Thus, any byte sequence of any
locale can be converted into this extended UTF-8 scheme, operations
performed in UTF-8, then finally mapped back to the original locale, all
while preserving the invalid byte sequences in the original locale
untouched by the UTF-8 processing.
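
A rough sketch of that round trip, for input that is nominally UTF-8. The
thread does not say which code points cygwin actually uses, so the mapping
here (invalid byte b becomes the lone low surrogate U+DC00 + b, written as
three UTF-8-style bytes) is an assumption, and all function names are
hypothetical.

```c
#include <stddef.h>

/* Length of the well-formed UTF-8 sequence at S (N bytes available),
   or 0 if S does not start one.  Overlong forms and encoded
   surrogates are rejected.  */
static size_t
utf8_valid_len (const unsigned char *s, size_t n)
{
  unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of byte 2 */
  size_t len;

  if (n == 0)
    return 0;
  if (s[0] < 0x80)
    return 1;
  else if (s[0] >= 0xC2 && s[0] <= 0xDF)
    len = 2;
  else if (s[0] >= 0xE0 && s[0] <= 0xEF)
    {
      len = 3;
      if (s[0] == 0xE0) lo = 0xA0;      /* no overlong forms */
      if (s[0] == 0xED) hi = 0x9F;      /* no surrogates */
    }
  else if (s[0] >= 0xF0 && s[0] <= 0xF4)
    {
      len = 4;
      if (s[0] == 0xF0) lo = 0x90;      /* no overlong forms */
      if (s[0] == 0xF4) hi = 0x8F;      /* nothing above U+10FFFF */
    }
  else
    return 0;
  if (n < len || s[1] < lo || s[1] > hi)
    return 0;
  for (size_t i = 2; i < len; i++)
    if ((s[i] & 0xC0) != 0x80)
      return 0;
  return len;
}

/* Copy IN to OUT, replacing each invalid byte b with the 3-byte
   encoding of U+DC00 + b (assumed mapping).  OUT needs room for
   3 * N bytes.  Returns the number of bytes written.  */
static size_t
escape_invalid (const unsigned char *in, size_t n, unsigned char *out)
{
  size_t o = 0;
  for (size_t i = 0; i < n; )
    {
      size_t len = utf8_valid_len (in + i, n - i);
      if (len > 0)
        while (len-- > 0)
          out[o++] = in[i++];
      else
        {
          unsigned char b = in[i++];
          out[o++] = 0xED;
          out[o++] = (unsigned char) (0xB0 | (b >> 6));
          out[o++] = (unsigned char) (0x80 | (b & 0x3F));
        }
    }
  return o;
}

/* Inverse of escape_invalid.  OUT needs room for N bytes.  */
static size_t
unescape_invalid (const unsigned char *in, size_t n, unsigned char *out)
{
  size_t o = 0;
  for (size_t i = 0; i < n; )
    if (n - i >= 3 && in[i] == 0xED
        && (in[i + 1] & 0xFC) == 0xB0 && (in[i + 2] & 0xC0) == 0x80)
      {
        out[o++] = (unsigned char) (((in[i + 1] & 0x03) << 6)
                                    | (in[i + 2] & 0x3F));
        i += 3;
      }
    else
      out[o++] = in[i++];
  return o;
}
```

Because the validator rejects encoded surrogates, an input that already
contains the bytes ED B0..B3 xx is escaped byte by byte rather than passed
through, which is what keeps the mapping reversible.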

For this reason, gnulib has the modules 'mbchar', 'mbiter', 'mbuiter', 'mbfile',
which provide a "multibyte character" datatype that accommodates also invalid
byte sequences.
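
As a rough illustration of such a datatype, the sketch below uses only the
standard C function mbrtowc. The struct and function names are hypothetical
and are not the gnulib API; the point is only that an undecodable byte is
returned as a unit of its own, flagged invalid, instead of aborting the scan.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* One "multibyte character": either a decoded character or a run of
   raw bytes that could not be decoded in the current locale.  */
struct mb_unit
{
  const char *ptr;   /* start of the unit in the input buffer */
  size_t bytes;      /* number of bytes in the unit, always >= 1 */
  bool valid;        /* true if the bytes decode to a character */
  wchar_t wc;        /* decoded character, meaningful only if valid */
};

/* Read the next unit from S, which has N > 0 bytes remaining.  */
static struct mb_unit
next_unit (const char *s, size_t n, mbstate_t *st)
{
  struct mb_unit u = { s, 1, false, 0 };
  size_t r = mbrtowc (&u.wc, s, n, st);

  if (r == (size_t) -1)
    {
      /* Invalid sequence: keep one byte as-is and reset the state.  */
      memset (st, 0, sizeof *st);
    }
  else if (r == (size_t) -2)
    {
      /* Truncated sequence at end of input: keep the rest as-is.  */
      memset (st, 0, sizeof *st);
      u.bytes = n;
    }
  else
    {
      u.bytes = (r == 0) ? 1 : r;   /* a NUL character is one byte */
      u.valid = true;
    }
  return u;
}
```

A caller advances by u.bytes after each call and can copy invalid units to
the output unchanged, so the original byte sequence survives processing.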

Emacs handles this requirement by extending UTF-8. But this approach is unique
to Emacs: libunistring and other software support plain UTF-8, not extended
UTF-8.

Does it make sense to add some extended UTF-8 support into libunistring,
then?


--
Eric Blake   [email protected]    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
