> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> Linux:
>
> $ cat t.txt | iconv -f utf-8 -t utf-16 | hexdump
> 0000000 0501 0d01 1901 1701 2f01 6101 7301 6b01
> 0000010 1e20 1c20 7e01 0a00
> 0000018
Read byte for byte as printed, this is UTF-16LE (the
little-endian serialisation of UTF-16). It does *not*
conform to 10646 (which only allows big-endian
serialisations), but it does conform to Unicode.
An initial U+FEFF in UTF-16LE (or UTF-16BE) is interpreted
as a character (ZWNBSP) and must be kept.
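A caveat on reading the dumps: hexdump without options prints
16-bit words in the byte order of the machine it runs on, so the
printed digits need not be in stream order. A small Python sketch
(the helper name words_to_bytes is mine, and which kind of host
produced the dump is an assumption either way) rebuilds the byte
stream under both readings:

```python
import struct

# 16-bit words exactly as printed in the Linux dump quoted above.
WORDS = "0501 0d01 1901 1701 2f01 6101 7301 6b01 1e20 1c20 7e01 0a00".split()


def words_to_bytes(words, host_order):
    """Rebuild the byte stream from hexdump's 16-bit words.

    host_order is "<" if the dumping machine was little-endian and
    ">" if it was big-endian (these are struct's byte-order prefixes).
    """
    return b"".join(struct.pack(host_order + "H", int(w, 16)) for w in words)


# Assumption 1: hexdump ran on a little-endian host (e.g. x86).
le_host = words_to_bytes(WORDS, "<")
# Assumption 2: hexdump ran on a big-endian host.
be_host = words_to_bytes(WORDS, ">")

print(le_host[:4].hex())  # first two code units in stream order, reading 1
print(be_host[:4].hex())  # first two code units in stream order, reading 2

# Each reading decodes to sensible text under only one serialisation.
print(repr(le_host.decode("utf-16-be")))
print(repr(be_host.decode("utf-16-le")))
```

If the dump came from a little-endian host, the stream starts
01 05 and is big-endian UTF-16; if it came from a big-endian host,
the stream starts 05 01 and is little-endian. Running hexdump -C
prints the bytes in stream order and avoids the ambiguity.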
> FreeBSD:
>
> $ cat t.txt | iconv -f utf-8 -t utf-16 | hexdump
> 0000000 fffe 0501 0d01 1901 1701 2f01 6101 7301
> 0000010 6b01 1e20 1c20 7e01 0a00
> 000001a
This is UTF-16 with a byte-order mark, little-endian,
assuming that there was no U+FEFF at the beginning of
the source file (if there was, this would be UTF-16LE).
The byte-order mark (optional if big-endian) is to be
removed after detecting the byte order.
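The difference between the three labels can be sketched in Python.
This is only an illustration of the rules as I read them in RFC 2781,
not what any iconv does internally; the function name decode_utf16 is
made up for the example:

```python
import codecs


def decode_utf16(data: bytes, label: str) -> str:
    """Decode UTF-16 bytes according to their label (hypothetical helper)."""
    label = label.lower()
    if label == "utf-16be":
        # Explicit label: no signature handling, a leading U+FEFF is kept
        # as a ZWNBSP character.
        return data.decode("utf-16-be")
    if label == "utf-16le":
        return data.decode("utf-16-le")
    if label != "utf-16":
        raise ValueError("unsupported label: " + label)
    # Plain "UTF-16": look for a byte-order mark and remove it once the
    # byte order is known.
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return data[2:].decode("utf-16-le")
    # No byte-order mark: RFC 2781 says to interpret the text as big-endian.
    return data.decode("utf-16-be")


# The same first character (U+0105) under each label.
print(repr(decode_utf16(b"\xfe\xff\x01\x05", "UTF-16")))    # BOM removed
print(repr(decode_utf16(b"\xfe\xff\x01\x05", "UTF-16BE")))  # U+FEFF kept
```

Only the plain "UTF-16" label treats the first two bytes as a
signature; under UTF-16BE and UTF-16LE they are ordinary text.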
Whether byte-order marks (or more generally: "signatures")
are a good or a bad idea is a matter of opinion. E.g.
Microsoft these days puts a "signature" even in UTF-8 encoded
files. However, XML specifies that a byte-order mark is
to be used for UTF-16 coded XML files, though it is not
strictly necessary (encoding declarations are good, though;
THOSE should have been required).
See also IETF RFC 2781.
Further, a "WORD JOINER" is on its way into 10646 and
Unicode. WORD JOINER does the job of ZWNBSP, and only that;
it is never a "signature".
Back to the question at hand:
My opinion is that iconv should accept the label UTF-16BE
and act according to IETF RFC 2781 for that label. Thus,

    iconv -f utf-8 -t utf-16be

should give the same big-endian, signatureless UTF-16
encoding on every platform that has iconv.
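For comparison, this is the behaviour I have in mind, shown with
Python's codecs rather than iconv (so it says nothing about what a
given iconv actually does). The sample text is the first three
characters of the dumps above:

```python
# First three characters from the dumps: U+0105, U+010D, U+0119.
text = "\u0105\u010d\u0119"

# Explicit big-endian label: fixed byte order, no signature.
be = text.encode("utf-16-be")
print(be.hex())

# Plain "utf-16" label: Python prepends a byte-order mark and uses the
# byte order of the machine, so the result may differ between platforms.
generic = text.encode("utf-16")
print(generic[:2].hex())
```

The first result is the same bytes on every machine; the second
starts with a signature whose form depends on where it was produced.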
/kent k
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/