Re: [Haskell-cafe] surrogate code points in a Char

Mark Lentczner Tue, 24 Nov 2009 22:51:14 -0800

On Nov 18, 2009, at 7:28 AM, Manlio Perillo wrote:
> The Unicode Standard (version 4.0, section 3.9, D31 - pag 76) says:
> 
> """Because surrogate code points are not included in the set of Unicode
> scalar values, UTF-32 code units in the range 0000D800 .. 0000DFFF are
> ill-formed"""


The current version of Unicode is 5.1. This text is now in D90, though it is
otherwise the same. My references below are to the 5.1 documents (freely
available online at http://www.unicode.org/versions/Unicode5.1.0/ ).

> However GHC does not reject this code units:
> 
> Prelude> print '\x0000D800'
> '\55296'
> 
> Is this a correct behaviour?

I don't think you should consider Char to be UTF-32.

Think of Char as representing a Unicode code point. Unicode code points are
defined as all integers in the range \x0 through \x10FFFF, inclusive. Values
in the range \xD800 through \xDFFF are all valid code points. (§2.4 in general;
§3.4, D9, D10)
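
As a quick check of the point above, here is a minimal sketch showing that GHC's Char covers the whole code point range, surrogates included. The names lowestCodePoint, highestCodePoint and highSurrogate are made up for this illustration.

```haskell
import Data.Char (chr, ord)

-- The bounds of Char, as integers. In GHC these are 0x0 and 0x10FFFF,
-- matching the Unicode code point range.
lowestCodePoint, highestCodePoint :: Int
lowestCodePoint  = ord (minBound :: Char)
highestCodePoint = ord (maxBound :: Char)

-- A surrogate code point is an ordinary Char value: chr accepts it
-- because it lies inside the code point range.
highSurrogate :: Char
highSurrogate = chr 0xD800
```

Loading this in GHCi and evaluating `ord highSurrogate` gives 55296, the same number shown in the quoted session.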

Not all Unicode code points are "Unicode scalar values". Only Unicode scalar
values can be encoded in the standard Unicode encodings. Unicode scalar values
are defined as \x0 through \xD7FF and \xE000 through \x10FFFF: all code points
except the surrogate range. (§3.9, D76)
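
The distinction can be written down as a pair of predicates. This is only a sketch; isSurrogate and isScalarValue are hypothetical helper names, not functions from the standard libraries.

```haskell
-- | True for code points in the surrogate range \xD800 through \xDFFF.
isSurrogate :: Char -> Bool
isSurrogate c = c >= '\xD800' && c <= '\xDFFF'

-- | True for Unicode scalar values. Every Char is already a code point
-- in \x0 through \x10FFFF, so only the surrogates need excluding.
isScalarValue :: Char -> Bool
isScalarValue = not . isSurrogate
```

With these, `isScalarValue '\xD7FF'` and `isScalarValue '\xE000'` hold, while `isScalarValue '\xD800'` does not.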

Not all code points are characters. In particular, \xFFFE and \xFFFF are
"Noncharacters": They are representable in Unicode encodings, but should not be
interchanged. Less well known is the range \xFDD0 through \xFDEF, which are also
noncharacters. (§2.4, Table 2-3; §3.4, D14, §16.7)
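
A noncharacter test might look like the sketch below. isNoncharacter is a hypothetical name. Note one addition beyond the values listed above: the standard also treats the last two code points of every plane (\x1FFFE, \x1FFFF, ... \x10FFFE, \x10FFFF) as noncharacters, and the bit mask covers those as well.

```haskell
import Data.Bits ((.&.))
import Data.Char (ord)

-- | True for the 66 noncharacter code points: \xFDD0 through \xFDEF,
-- plus the last two code points of each of the 17 planes.
isNoncharacter :: Char -> Bool
isNoncharacter c =
     (n >= 0xFDD0 && n <= 0xFDEF)
     -- Low 16 bits equal to FFFE or FFFF, in any plane.
  || (n .&. 0xFFFE) == 0xFFFE
  where
    n = ord c
```

Filtering the whole range with `length (filter isNoncharacter [minBound .. maxBound])` should count 66 such code points.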

In particular, note the stance of Unicode toward application internal forms:
        "Applications are free to use any of these noncharacter code points
        internally but should never attempt to exchange them." - §16.7 ¶3

Accordingly, it is fine for Haskell's Char to support these values, as they are
code points. The place to impose any special handling is in Haskell's various
Unicode encoding libraries: When decoding, code points \xD800 through \xDFFF
cannot be received, and noncharacters can be either retained or silently
dropped (Unicode conformance allows this). When encoding, code points \xD800
through \xDFFF and noncharacters should either raise an error or be silently
dropped.
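
To make the policy concrete, here is a rough sketch of how an encoding library could apply it at its boundary. None of these names come from an existing library: EncodeError, checkEncodable, dropUnencodable and decodeUtf32Unit are all invented for illustration, and the decoder arbitrarily picks the "retain noncharacters" option.

```haskell
import Data.Bits ((.&.))
import Data.Char (chr, ord)
import Data.Word (Word32)

isSurrogate :: Char -> Bool
isSurrogate c = c >= '\xD800' && c <= '\xDFFF'

isNoncharacter :: Char -> Bool
isNoncharacter c = (n >= 0xFDD0 && n <= 0xFDEF) || (n .&. 0xFFFE) == 0xFFFE
  where n = ord c

-- Why a Char was refused by the strict encoder check.
data EncodeError
  = SurrogateCodePoint Char
  | NoncharacterCodePoint Char
  deriving (Eq, Show)

-- Strict encoding policy: report an error for anything that must not
-- be interchanged.
checkEncodable :: Char -> Either EncodeError Char
checkEncodable c
  | isSurrogate c    = Left (SurrogateCodePoint c)
  | isNoncharacter c = Left (NoncharacterCodePoint c)
  | otherwise        = Right c

-- Lenient encoding policy: silently drop the same code points.
dropUnencodable :: String -> String
dropUnencodable = filter (\c -> not (isSurrogate c || isNoncharacter c))

-- Decoding a single UTF-32 code unit: values above \x10FFFF and
-- surrogates are ill-formed; noncharacters are retained here.
decodeUtf32Unit :: Word32 -> Maybe Char
decodeUtf32Unit w
  | w > 0x10FFFF               = Nothing
  | w >= 0xD800 && w <= 0xDFFF = Nothing
  | otherwise                  = Just (chr (fromIntegral w))
```

Under this sketch, `decodeUtf32Unit 0xD800` is Nothing, which is the rejection the original question expected, while '\xD800' itself remains a perfectly usable Char inside the program.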

        - Mark

Mark Lentczner
http://www.ozonehouse.com/mark/
[email protected]



