Per the Haskell Prime process I would like to make an official proposal [1].
* Proposal

The Haskell 2010 language specification states that: "Haskell uses the
Unicode character set" [2]. It does not state what encoding should be
used. This means, strictly speaking, it is not possible to reliably
exchange Haskell source files on the byte level.

I propose to make UTF-8 the only allowed encoding for Haskell source
files. Implementations must discard an initial Byte Order Mark (BOM) if
present [3].

* Pros

- Ensures that Haskell source can be reliably exchanged on the byte
  level.
- Disallows implicit ISO-8859-* encodings in source code, ensuring
  portability.
- Little or no implementation burden for compiler writers.

* Cons

- Existing code relying on a non-UTF-8, locale- or
  implementation-specific encoding will need conversion. (Only relevant
  for Hugs-only code.)

* Implementation status

** GHC

"GHC assumes that source files are ASCII or UTF-8 only, other encodings
are not recognised. However, invalid UTF-8 sequences will be ignored in
comments, so it is possible to use other encodings such as Latin-1, as
long as the non-comment source code is ASCII only." [4]

From this I deduce that all current code accepted by GHC is compatible
with UTF-8, with the possible exception of comments in another encoding,
which GHC ignores. No working code will be broken.

** JHC

"JHC allows unrestricted use of the Unicode character set in Haskell
source, treating input as UTF-8." [5]

** Hugs

Hugs treats input as being in the encoding specified by the current
locale, but permits Unicode only in comments and character and string
literals. [6]

* Related proposal

There is one related proposal, now 5 years old:
"SourceEncodingDetection" [5]. It proposes detecting the encoding with
an algorithm that can distinguish between UTF-8, UTF-16 and (not always)
UTF-32. It can also detect the endianness of the document, if
applicable.

I think choosing just UTF-8 is a better choice than a detection
algorithm. It places less burden on implementation writers and is even
more portable.

* Next step

Discussion!
There was already some discussion on the haskell-cafe mailing list [7].

Attached is a patch for the Haskell Report which adds a note stating
that source encodings must be UTF-8.

Regards,
Roel van Dijk

[1] - http://hackage.haskell.org/trac/haskell-prime/wiki/Process
[2] - http://www.haskell.org/onlinereport/haskell2010/haskellch2.html#x7-150002.1
[3] - http://www.unicode.org/faq/utf_bom.html#bom5
[4] - http://www.haskell.org/ghc/docs/7.0-latest/html/users_guide/separate-compilation.html#source-files
[5] - http://hackage.haskell.org/trac/haskell-prime/wiki/SourceEncodingDetection
[6] - http://cvs.haskell.org/Hugs/pages/users_guide/locale.html
[7] - http://article.gmane.org/gmane.comp.lang.haskell.cafe/87815
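To make the proposed rule concrete, here is a minimal sketch of what
"UTF-8 only, discard an initial BOM" could look like for an
implementation. It is an illustration only and is not taken from the
attached patch or from any existing compiler. The names stripBom,
decodeUtf8 and decodeSource are hypothetical, and the sketch assumes a
strict reading in which any ill-formed byte sequence rejects the whole
file (GHC, as quoted above, is more lenient inside comments).

```haskell
import Control.Monad (guard)
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- | Drop a leading UTF-8 encoded BOM (bytes EF BB BF), if present.
-- Only an *initial* BOM is discarded; later occurrences are kept.
stripBom :: [Word8] -> [Word8]
stripBom (0xEF : 0xBB : 0xBF : rest) = rest
stripBom bytes = bytes

-- | Strict UTF-8 decoder. Returns Nothing on any ill-formed sequence:
-- stray continuation bytes, truncated sequences, overlong encodings,
-- surrogates, or code points above U+10FFFF.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (b : bs)
  | b < 0x80 = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b >= 0xC2 && b <= 0xDF = multi 1 0x80 (fromIntegral b .&. 0x1F)
  | b >= 0xE0 && b <= 0xEF = multi 2 0x800 (fromIntegral b .&. 0x0F)
  | b >= 0xF0 && b <= 0xF4 = multi 3 0x10000 (fromIntegral b .&. 0x07)
  | otherwise = Nothing
  where
    -- n: number of continuation bytes expected,
    -- minCp: smallest code point allowed for this length (rejects overlongs),
    -- lead: payload bits taken from the lead byte.
    multi n minCp lead = do
      let (conts, rest) = splitAt n bs
      guard (length conts == n && all isCont conts)
      let cp = foldl (\acc c -> (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F))
                     lead conts
      guard (cp >= minCp && cp <= 0x10FFFF)
      guard (not (cp >= 0xD800 && cp <= 0xDFFF))
      (chr cp :) <$> decodeUtf8 rest
    isCont c = c .&. 0xC0 == 0x80

-- | The rule from the proposal: discard an initial BOM, then require UTF-8.
decodeSource :: [Word8] -> Maybe String
decodeSource = decodeUtf8 . stripBom
```

With this sketch, a file with and without the BOM decodes to the same
text, while a lone Latin-1 byte such as 0xE9 is rejected rather than
silently reinterpreted.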
Attachment: utf8_encoding.dpatch (binary data)