Re: [Haskell-cafe] UTF-8 BOM

Mark Lentczner Wed, 05 Jan 2011 21:45:55 -0800

On Jan 4, 2011, at 5:41 PM, Antoine Latter wrote:

> Are you thinking that the BOM should be automatically stripped from
> UTF8 text at some low level, if present?

It should not. Whether or not a U+FEFF can be stripped depends on the context 
in which it is found. There is no way that lower-level code, even file 
primitives, can know this context.

> I'm thinking that it would be correct behavior to drop the BOM from
> the start of a UTF8 stream, even at a pretty low level. The FAQ seems
> to allow it as a means of identifying the stream as UTF8 (although it
> isn't a reliable means of identifying a stream as UTF8).

§3.9 and §3.10 of the Unicode standard go into more depth on the issue and make 
things clearer. A leading U+FEFF is considered "not part of the text", and 
dropped, only in the case that the encoding is UTF-16 or UTF-32. In all other 
cases (including the -BE and -LE variants of UTF-16 and UTF-32) the U+FEFF 
character is retained.

The FAQ states that a leading byte sequence of EF BB BF in a stream indicates 
that the stream is UTF-8, though it doesn't go so far as to say that it can be 
stripped. Since Unicode doesn't want to encourage the use of BOM in UTF-8 (see 
end of §3.10), I imagine they don't want to promulgate it as a useful encoding 
indicator.

So it might be reasonable, when opening a file in UTF-16 mode (not UTF-16BE or 
UTF-16LE), for the system to read the initial bytes, determine the byte order, 
and remove the BOM if present[1]. But it isn't safe or correct to do this for 
UTF-8.
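
As a rough illustration of the UTF-16 case only, the byte-order check could be 
sketched as below. The names ByteOrder and sniffUtf16 are hypothetical, not 
part of any library, and the input is a plain list of bytes to keep the sketch 
self-contained. When no BOM is present it assumes big-endian, which is what the 
Unicode standard specifies for the UTF-16 encoding scheme in the absence of a 
higher-level protocol.

```haskell
import Data.Word (Word8)

-- | Byte order of a UTF-16 stream (hypothetical type, for illustration only).
data ByteOrder = BigEndian | LittleEndian
  deriving (Eq, Show)

-- | Inspect the first bytes of a stream opened as plain "UTF-16"
-- (not UTF-16BE or UTF-16LE). Returns the byte order together with the
-- remaining bytes, with a leading BOM removed if one was found.
sniffUtf16 :: [Word8] -> (ByteOrder, [Word8])
sniffUtf16 (0xFE : 0xFF : rest) = (BigEndian, rest)     -- FE FF: U+FEFF, big-endian
sniffUtf16 (0xFF : 0xFE : rest) = (LittleEndian, rest)  -- FF FE: U+FEFF, little-endian
sniffUtf16 bytes                = (BigEndian, bytes)    -- no BOM: assume big-endian
```

Note that nothing here applies to UTF-8, UTF-16BE or UTF-16LE, where a leading 
U+FEFF stays in the text.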

On Jan 4, 2011, at 5:08 PM, Tony Morris wrote:

> I am reading files with System.IO.readFile. Some of these files start
> with a UTF-8 Byte Order Marker (0xef 0xbb 0xbf). For some functions that
> process this String, this causes choking so I drop the BOM as shown
> below.

If you mean functions in the standard libs, they shouldn't have any problems 
with the BOM character. If they do, these are bugs.

On the other hand, if you know the context of the files, and know for certain 
that the leading BOM is intended only as an encoding indicator, then by all 
means strip it off. But only you can know if this is true for your application; 
the system cannot. If so, your code doesn't look hackish to me at all. I'd only 
perhaps tidy up dropBOM a bit (but this is purely a stylistic choice):

readBomFile :: FilePath -> IO String
readBomFile p = dropBom `fmap` readFile p
  where
    dropBom ('\xFEFF':s) = s    -- U+FEFF at the start is a BOM
    dropBom s = s               -- no BOM: leave the text untouched

I'd keep dropBom private to readBomFile to ensure that it isn't used on 
arbitrary strings, since it is really only valid at the start of an encoded 
stream.
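
An alternative, for an application that has already decided the leading BOM is 
only an encoding indicator, is to let the handle's decoder skip it. The sketch 
below assumes a version of base whose System.IO exports the utf8_bom encoding, 
which ignores a BOM at the start of the stream on input. The name readUtf8Bom 
is hypothetical. Unlike readFile, which uses the locale's encoding, this 
always decodes the file as UTF-8.

```haskell
import System.IO

-- | Read a file as UTF-8, skipping a leading BOM if there is one.
-- readUtf8Bom is a hypothetical name used for this sketch.
readUtf8Bom :: FilePath -> IO String
readUtf8Bom p = do
  h <- openFile p ReadMode
  hSetEncoding h utf8_bom   -- decode as UTF-8; a BOM at the start is ignored
  hGetContents h            -- lazy read; the handle closes once all input is consumed
```

This is still a per-file choice made by the application, not something done 
behind its back by the I/O layer.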

- Mark

[1] Software reading a single text stream that has been split across files 
would have a problem here. But this is perhaps an obscure and unlikely case.