Hello,

So I looked at what GHC does with Unicode, and to me it seems quite reasonable:
* The alphabet is Unicode code points, so a valid Haskell program is simply a list of those.
* Combining characters are not allowed in identifiers, so there is no need for complex normalization rules: programs should always use the "short" (precomposed) form of a character, or be rejected.
* Combining characters may appear in string literals, and there they are left as is, without any modification (so some string literals may contain more code points than the number of glyphs displayed in a text editor). A small sketch at the end of this message illustrates the distinction.

Perhaps this is simply what the report already states (I haven't checked, for which I apologize) but, if not, perhaps we should clarify things.

-Iavor

PS: I don't think that there is any need to specify a particular representation for the Unicode code points (e.g., UTF-8) in the language standard.

On Fri, Mar 16, 2012 at 6:23 PM, Iavor Diatchki <iavor.diatc...@gmail.com> wrote:
> Hello,
> I am also not an expert, but I got curious and did a bit of Wikipedia
> reading. Based on what I understood, here are two (related) questions
> that it might be nice to clarify in a future version of the report:
>
> 1. What is the alphabet used by the grammar in the Haskell report? My
> understanding is that the intention is that the alphabet is Unicode
> code points (sometimes referred to as Unicode characters). There is no
> way to refer to specific code points by escaping, as in Java (the link
> that Gaby shared); you just have to write the code points directly
> (and there are plenty of encodings for doing that, e.g. UTF-8).
>
> 2. Do we respect "Unicode equivalence"
> (http://en.wikipedia.org/wiki/Canonical_equivalence) in Haskell source
> code? The issue here is that, apparently, some sequences of Unicode
> code points are supposed to be morally the same. For example, there
> are two different ways to write the Spanish letter ñ: it has its own
> code point, but it can also be made by writing "n" followed by a
> combining character that puts the tilde on top.
>
> I would guess that implementing "Unicode equivalence" would not be
> too hard---supposedly the Unicode standard specifies a text
> normalization procedure. However, this would complicate the report
> specification, because the alphabet would then be not just Unicode
> code points, but equivalence classes of code point sequences.
>
> Thoughts?
>
> -Iavor
>
> On Fri, Mar 16, 2012 at 4:49 PM, Ian Lynagh <ig...@earth.li> wrote:
>>
>> Hi Gaby,
>>
>> On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
>>>
>>> OK, thanks! I guess a takeaway from this discussion is that what
>>> counts as punctuation is far less well defined than it appears...
>>
>> I'm not really sure what you're asking. Haskell's uniSymbol includes all
>> Unicode characters (should that be code points? I'm not a Unicode expert)
>> in the punctuation category; I'm not sure what the best reference is,
>> but e.g. table 12 in
>> http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
>> lists a number of Px categories, and a meta-category P "Punctuation".
>>
>> Thanks
>> Ian
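PPS: to make the string-literal point concrete, here is a minimal sketch using only Data.Char from base. The module layout and the helper hasCombiningMark are mine, purely for illustration; they are not anything GHC or the report defines.

  import Data.Char (generalCategory, GeneralCategory(NonSpacingMark))

  -- The precomposed letter ñ (U+00F1) is a single code point.
  precomposed :: String
  precomposed = "\x00F1"

  -- The canonically equivalent decomposed form is two code points:
  -- 'n' followed by COMBINING TILDE (U+0303).
  decomposed :: String
  decomposed = "n\x0303"

  -- Crude test for "contains a combining character": non-spacing
  -- combining marks have general category Mn (NonSpacingMark).
  hasCombiningMark :: String -> Bool
  hasCombiningMark = any ((== NonSpacingMark) . generalCategory)

  main :: IO ()
  main = do
    print (length precomposed)          -- 1
    print (length decomposed)           -- 2
    print (precomposed == decomposed)   -- False: no normalization is applied,
                                        -- so the literals denote different strings
    print (hasCombiningMark decomposed) -- True

Both literals display as "ñ" in most editors, which is exactly why it seems worth spelling out in the report that the program text is a sequence of code points taken as written.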