On 03/10/2009 13:50, Duncan Coutts wrote:
On Sat, 2009-10-03 at 04:50 -0700, Ian Lynagh wrote:
Wed Sep 30 01:42:29 PDT 2009 [email protected]
* Strip any Byte Order Mark (BOM) from the front of decoded streams.
Ignore-this: d0d0c3ae87b31d71ef1627c8e1786445
When decoding to UTF-32, Solaris iconv inserts a BOM at the front
of the stream, but Linux iconv doesn't.
M ./GHC/IO/Handle/Internals.hs -6 +27
I agree with Simon that this is not the correct fix.
As Simon suspected, Solaris iconv does indeed insert a BOM if you ask to
convert into UTF-32. Arguably this is actually the correct behaviour.
Also, as Simon suspected, it does not insert a BOM if you ask to convert
into UTF-32BE or LE.
So the better solution is to ask iconv for UTF-32BE or UTF-32LE
depending on the host byte order. This should also work correctly on
Linux so there doesn't need to be Solaris #ifdeffery (just host order
#ifdeffery which is needed anyway).
Demo (Solaris iconv on a big-endian CPU):
$ echo foo | iconv -f UTF-8 -t UTF-32 | hexdump -c
0000000  \0  \0 376 377  \0  \0  \0   f  \0  \0  \0   o  \0  \0  \0   o
0000010  \0  \0  \0  \n
$ echo foo | iconv -f UTF-8 -t UTF-32BE | hexdump -c
0000000  \0  \0  \0   f  \0  \0  \0   o  \0  \0  \0   o  \0  \0  \0  \n
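The same byte layout can be cross-checked without a Solaris box. The sketch below uses Python's built-in codecs as a stand-in for iconv, which is an assumption made purely for illustration: Python's generic "utf-32" codec writes a BOM in host byte order, while "utf-32-be" writes none. It says nothing about what any particular iconv does.

```python
import codecs

text = "foo\n"

# Generic label: Python's codec writes a BOM first, in host byte order.
generic = text.encode("utf-32")

# Explicit big-endian label: no BOM, just four bytes per code point.
explicit_be = text.encode("utf-32-be")

# True if the generic encoding starts with either UTF-32 BOM.
has_bom = generic[:4] in (codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)

print(has_bom)
print(len(generic), len(explicit_be))
print(explicit_be.hex())
```

The four extra bytes in the generic encoding are the BOM, which is exactly what a decoder expecting raw host-order code points would misread as a character.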
The ambiguity with "UTF-32" is whether you are asking for UTF-32 in the
host byte order, or if you are asking for UTF-32 suitable for external
storage (which should use the BOM). GNU iconv takes the former
interpretation while Solaris iconv takes the latter.
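A minimal sketch of the host-order selection proposed above, written in Python only to show the idea. The helper name host_utf32_name and the PY_CODEC table are hypothetical, and this is not the GHC IO library code.

```python
import struct
import sys


def host_utf32_name() -> str:
    """Return the explicit-endian UTF-32 name matching the host byte order."""
    # sys.byteorder is either "big" or "little".
    return "UTF-32BE" if sys.byteorder == "big" else "UTF-32LE"


# Python's own spelling for the same two encodings (illustration only).
PY_CODEC = {"UTF-32BE": "utf-32-be", "UTF-32LE": "utf-32-le"}

name = host_utf32_name()
encoded = "f".encode(PY_CODEC[name])

# With an explicit byte order there is no BOM, so the result equals the
# code point packed as a single native-order 32-bit integer.
native = struct.pack("=I", ord("f"))

print(name, encoded == native)
```

Because the name always carries an explicit byte order, the caller never has to strip a BOM, whichever interpretation of plain "UTF-32" the platform takes.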
GNU iconv also adds a BOM for UTF-32. The difference is that Solaris
iconv apparently interprets "UCS-4" as "UTF-32", whereas GNU iconv
interprets it as "UTF-32BE". I happened to be using "UCS-4" in the IO
library for historical reasons and never got around to changing it to
UTF-32. So all this is my fault :-(
Ian, could you back this patch out please?
I'll switch UCS-4 to UTF-32BE/LE and push that today.
Cheers,
Simon