Re: patch applied (ghc-6.12/packages/base): Strip any Byte Order Mark (BOM) from the front of decoded streams.

Simon Marlow Mon, 05 Oct 2009 03:12:32 -0700

On 03/10/2009 13:50, Duncan Coutts wrote:

On Sat, 2009-10-03 at 04:50 -0700, Ian Lynagh wrote:

Wed Sep 30 01:42:29 PDT 2009  [email protected]
   * Strip any Byte Order Mark (BOM) from the front of decoded streams.
   Ignore-this: d0d0c3ae87b31d71ef1627c8e1786445
   When decoding to UTF-32, Solaris iconv inserts a BOM at the front
   of the stream, but Linux iconv doesn't.


     M ./GHC/IO/Handle/Internals.hs -6 +27


I agree with Simon that this is not the correct fix.

As Simon suspected, Solaris iconv does indeed insert a BOM if you ask to
convert into UTF-32. Arguably this is actually the correct behaviour.
Also, as Simon suspected, it does not insert a BOM if you ask to convert
into UTF-32BE or LE.

So the better solution is to ask iconv for UTF-32BE or UTF-32LE
depending on the host byte order. This should also work correctly on
Linux so there doesn't need to be Solaris #ifdeffery (just host order
#ifdeffery which is needed anyway).
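
A minimal sketch of the host-order selection described above, written in Python purely for illustration; it is not the GHC change itself. The helper names `host_utf32_codec` and `encode_host_utf32` are hypothetical, while `sys.byteorder` and the `utf-32-be` / `utf-32-le` codec names are standard Python.

```python
import sys


def host_utf32_codec() -> str:
    """Return the explicit-endian UTF-32 codec name matching the host byte order."""
    return "utf-32-be" if sys.byteorder == "big" else "utf-32-le"


def encode_host_utf32(text: str) -> bytes:
    """Encode text as UTF-32 in host byte order.

    The explicit-endian codecs never emit a BOM, so the result is
    exactly four bytes per code point.
    """
    return text.encode(host_utf32_codec())


data = encode_host_utf32("foo\n")
# Four code points, four bytes each, and no BOM in front.
assert len(data) == 16
# Reading the first four bytes in host order yields the code point for 'f'.
assert int.from_bytes(data[:4], sys.byteorder) == ord("f")
```

Naming the byte order explicitly removes the dependence on how a given converter interprets the bare name "UTF-32".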

Demo: (Solaris iconv on big endian CPU)

echo foo | iconv -f UTF-8 -t UTF-32 | hexdump -c

0000000  \0  \0 376 377  \0  \0  \0   f  \0  \0  \0   o  \0  \0  \0   o
0000010  \0  \0  \0  \n

echo foo | iconv -f UTF-8 -t UTF-32BE | hexdump -c
0000000  \0  \0  \0   f  \0  \0  \0   o  \0  \0  \0   o  \0  \0  \0  \n

The ambiguity with "UTF-32" is whether you are asking for UTF-32 in the
host byte order, or if you are asking for UTF-32 suitable for external
storage (which should use the BOM). The GNU iconv takes the former
interpretation while Solaris iconv takes the latter.
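
The same ambiguity can be observed outside iconv. As an illustration only, Python's generic `utf-32` codec takes the external-storage interpretation and prepends a BOM in host byte order, while `utf-32-be` emits none. This shows Python's behaviour and is not a claim about any particular iconv implementation.

```python
import codecs
import sys

generic = "foo\n".encode("utf-32")      # BOM followed by four code points
explicit = "foo\n".encode("utf-32-be")  # four code points only

# The generic codec writes its BOM in the byte order of the host.
host_bom = codecs.BOM_UTF32_BE if sys.byteorder == "big" else codecs.BOM_UTF32_LE

assert generic.startswith(host_bom)
assert len(generic) == 20   # 4-byte BOM + 4 * 4 bytes of text
assert len(explicit) == 16  # no BOM
# Same byte sequence as the UTF-32BE hexdump in the demo above.
assert explicit == b"\x00\x00\x00f\x00\x00\x00o\x00\x00\x00o\x00\x00\x00\n"
```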

GNU iconv also adds a BOM for UTF-32. The difference is that Solaris iconv
apparently interprets "UCS-4" as "UTF-32", whereas GNU iconv interprets it
as "UTF-32BE". I happened to be using "UCS-4" in the IO library for
historical reasons; I never got around to changing it to UTF-32. So all
this is my fault :-(


Ian, could you back this patch out please?

I'll switch UCS-4 to UTF-32BE/LE and push that today.

Cheers,
        Simon
_______________________________________________
Cvs-libraries mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/cvs-libraries
