On Tue, Jan 4, 2011 at 7:08 PM, Tony Morris <[email protected]> wrote:
> I am reading files with System.IO.readFile. Some of these files start
> with a UTF-8 Byte Order Marker (0xef 0xbb 0xbf). For some functions that
> process this String, this causes choking so I drop the BOM as shown
> below. This feels particularly hacky, but I am not in control of many of
> these functions (that perhaps could use ByteString with a better solution).
>
> I'm wondering if there is a better way of achieving this goal. Thanks
> for any tips.
>
>
> dropBOM ::
>   String
>   -> String
> dropBOM [] =
>   []
> dropBOM s@(x:xs) =
>   let unicodeMarker = '\65279' -- UTF-8 BOM
>   in if x == unicodeMarker then xs else s
>
> readBOMFile ::
>   FilePath
>   -> IO String
> readBOMFile p =
>   dropBOM `fmap` readFile p
>
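
For comparison, here is a rough sketch of doing the stripping at the handle level instead of on the String. It assumes the files really are UTF-8 and relies on the utf8_bom encoding exported by System.IO, which skips a leading BOM on input. The name readUtf8BomFile is made up for illustration; it is not part of any library.

```haskell
import System.IO

-- Hypothetical helper: read a whole file as UTF-8, ignoring a leading
-- BOM (0xEF 0xBB 0xBF) if one is present.
readUtf8BomFile :: FilePath -> IO String
readUtf8BomFile p = do
  h <- openFile p ReadMode
  -- utf8_bom decodes as UTF-8 and drops the BOM at the start of the stream.
  hSetEncoding h utf8_bom
  -- Lazy read; the handle is closed once the contents are fully consumed.
  hGetContents h
```

Unlike plain readFile, this always decodes as UTF-8 rather than using the locale encoding, so it is only a fit when that is what you want.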

Are you thinking that the BOM should be automatically stripped from
UTF8 text at some low level, if present?

I was thinking about it, and I was deeply conflicted about the idea.
Then I read the unicode.org BOM faq[1], and I'm still conflicted.

I'm thinking that it would be correct behavior to drop the BOM from
the start of a UTF8 stream, even at a pretty low level. The FAQ seems
to allow it as a means of identifying the stream as UTF8 (although it
isn't a reliable means of identifying a stream as UTF8).

But I'm no unicode expert.

Antoine

[1] http://unicode.org/faq/utf_bom.html

> --
> Tony Morris
> http://tmorris.net/

_______________________________________________
Haskell-Cafe mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/haskell-cafe
