Firstly, I personally would love to insist on using UTF-8 and be done with it. I see no reason to bother with other character encodings.
2011/4/4 Roel van Dijk <[email protected]> > 2011/4/4 Colin Adams <[email protected]>: > > Not from looking with your eyes perhaps. Does that matter? Your text > editor, > > and the compiler, can surely figure it out for themselves. > I am not aware of any algorithm that can reliably infer the character > encoding used by just looking at the raw data. Why would people bother > with stuff like <?xml version="1.0" encoding="UTF-8"?> if > automatically figuring out the encoding was easy? > > There *is* an algorithm for determining the encoding of an XML file based on a combination of the BOM (Byte Order Marker) and an assumption that the file will start with a XML declaration (i.e., <?xml ... ?>). But this isn't capable of determining every possible encoding on the planet, just distinguishing amongst varieties of UTF-(8|16|32)/(big|little) endian and EBCIDC. It cannot tell the difference between UTF-8, Latin-1, and Windows-1255 (Hebrew), for example. > > There aren't many Unicode encoding formats > From casually scanning some articles about encodings I can count at > least 70 character encodings [1]. > > I think the implication of "Unicode encoding formats" is something in the UTF family. An encoding like Latin-1 or Windows-1255 can be losslessly translated into Unicode codepoints, but it's not exactly an encoding of Unicode, but rather a subset of Unicode. > > and there aren't very many possibilities for the > > leading characters of a Haskell source file, are there? > Since a Haskell program is a sequence of Unicode code points the > programmer can choose from up to 1,112,064 characters. Many of these > can legitimately be part of the interface of a module, as function > names, operators or names of types. > > My guess is that a large subset of Haskell modules start with one of left brace (starting with comment or language pragma), m (for starting with module), or some whitespace character. So it *might* be feasible to take a guess at things. But as I said before: I like UTF-8. Is there anyone out there who has a compelling reason for writing their Haskell source in EBCDIC? Michael
_______________________________________________ Haskell-Cafe mailing list [email protected] http://www.haskell.org/mailman/listinfo/haskell-cafe
