Re: [Haskell-cafe] surrogate code points in a Char

2009-11-25 Thread Colin Adams
2009/11/25 Mark Lentczner ma...@glyphic.com:
> The current version of Unicode is 5.1. This text is now in D90, though
> otherwise the same. My references below are to the 5.1 documents (freely
> available on line at: http://www.unicode.org/versions/Unicode5.1.0/ )


It's been 5.2 for over a month now, I think.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] surrogate code points in a Char

2009-11-24 Thread Mark Lentczner

On Nov 18, 2009, at 7:28 AM, Manlio Perillo wrote:
> The Unicode Standard (version 4.0, section 3.9, D31 - page 76) says:
>
> Because surrogate code points are not included in the set of Unicode
> scalar values, UTF-32 code units in the range D800 .. DFFF are
> ill-formed

The current version of Unicode is 5.1. This text is now in D90, though 
otherwise the same. My references below are to the 5.1 documents (freely 
available on line at: http://www.unicode.org/versions/Unicode5.1.0/ )

> However GHC does not reject these code units:
>
> Prelude> print '\xD800'
> '\55296'
>
> Is this correct behaviour?

I don't think you should consider Char to be UTF-32.

Think of Char as representing a Unicode code point. Unicode code points are
defined as all integers in the range \x0 through \x10FFFF, inclusive. Values
in the range \xD800 through \xDFFF are all valid code points. (§2.4 in general;
§3.4, D9, D10)

Not all Unicode code points are Unicode scalar values. Only Unicode scalar
values can be encoded in the standard Unicode encodings. Unicode scalar values
are defined as \x0 through \xD7FF and \xE000 through \x10FFFF: all code points
except those in the surrogate range. (§3.9, D76)
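
To make the code point / scalar value distinction concrete, here is a minimal sketch in Haskell. The names isSurrogate, isScalarValue, codePointCount and scalarValueCount are hypothetical helpers made up for this illustration; they are not part of Data.Char.

```haskell
import Data.Char (ord)

-- Hypothetical helper: True for code points in the surrogate range
-- U+D800 .. U+DFFF.
isSurrogate :: Char -> Bool
isSurrogate c = c >= '\xD800' && c <= '\xDFFF'

-- Hypothetical helper: a Unicode scalar value is any code point that is
-- not a surrogate.
isScalarValue :: Char -> Bool
isScalarValue = not . isSurrogate

-- Char covers every code point, U+0000 .. U+10FFFF.
codePointCount, scalarValueCount :: Int
codePointCount   = ord maxBound + 1                                      -- 1114112
scalarValueCount = length (filter isScalarValue [minBound .. maxBound])  -- 1112064
```

The two counts differ by exactly 2048, the size of the surrogate range.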

Not all code points are characters. In particular, \xFFFE and \xFFFF are
noncharacters: they are representable in Unicode encodings, but should not be
interchanged. Less well known is the range \xFDD0 through \xFDEF, which are also
noncharacters. (§2.4, Table 2-3; §3.4, D14; §16.7)
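
A noncharacter test can be sketched in a few lines. One addition beyond what is written above: the Unicode Standard also reserves the last two code points of every supplementary plane (\x1FFFE, \x1FFFF, ... \x10FFFE, \x10FFFF), so the sketch covers those too. The name isNoncharacter is a hypothetical helper, not a library function.

```haskell
import Data.Bits ((.&.))
import Data.Char (ord)

-- Hypothetical helper: True for the noncharacter code points, i.e.
-- U+FDD0 .. U+FDEF plus the last two code points of each of the 17 planes
-- (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF).
isNoncharacter :: Char -> Bool
isNoncharacter c =
     (n >= 0xFDD0 && n <= 0xFDEF)
  || ((n .&. 0xFFFE) == 0xFFFE)
  where
    n = ord c

-- 32 code points in the FDD0 block plus 2 per plane: 66 in total.
noncharacterCount :: Int
noncharacterCount = length (filter isNoncharacter [minBound .. maxBound])
```

Note that these are still scalar values, so unlike surrogates they can legally appear in UTF-8, UTF-16 and UTF-32.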

In particular, note the stance of Unicode toward application internal forms:
"Applications are free to use any of these noncharacter code points
internally but should never attempt to exchange them." - §16.7 ¶3

Accordingly, it is fine for Haskell's Char to support these values, as they are 
code points. The place to impose any special handling is in Haskell's various 
Unicode encoding libraries: When decoding, code points \xD800 through \xDFFF 
cannot be received, and noncharacters can be either retained or silently 
dropped (Unicode conformance allows this). When encoding, code points \xD800
through \xDFFF and noncharacters should either raise an error or be silently
dropped.
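
As a rough illustration of pushing the check into the encoding layer, here is a sketch of a per-character UTF-8 encoder and a UTF-32 code unit decoder that refuse surrogates. It takes the "error" option rather than "silently drop", and leaves noncharacter handling out for brevity. All names (EncodeError, DecodeError, encodeUtf8Char, decodeUtf32Unit) are hypothetical and do not come from any existing library.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8, Word32)

-- Hypothetical error types for this sketch.
data EncodeError = SurrogateCodePoint Char
  deriving (Eq, Show)

data DecodeError = SurrogateCodeUnit Word32 | OutOfRange Word32
  deriving (Eq, Show)

-- Encode one Char as UTF-8, refusing surrogate code points.
encodeUtf8Char :: Char -> Either EncodeError [Word8]
encodeUtf8Char c
  | n >= 0xD800 && n <= 0xDFFF = Left (SurrogateCodePoint c)
  | n < 0x80    = Right [fromIntegral n]
  | n < 0x800   = Right [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = Right [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = Right [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- Payload bits for the lead byte (what is left after shifting).
    hi s   = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Decode one UTF-32 code unit, rejecting surrogates and values past U+10FFFF.
decodeUtf32Unit :: Word32 -> Either DecodeError Char
decodeUtf32Unit w
  | w >= 0xD800 && w <= 0xDFFF = Left (SurrogateCodeUnit w)
  | w > 0x10FFFF               = Left (OutOfRange w)
  | otherwise                  = Right (chr (fromIntegral w))
```

With this split, Char itself stays a plain code point and only the boundary with the outside world has to know about well-formedness.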

- Mark

Mark Lentczner
http://www.ozonehouse.com/mark/
m...@glyphic.com





[Haskell-cafe] surrogate code points in a Char

2009-11-18 Thread Manlio Perillo
Hi.

The Unicode Standard (version 4.0, section 3.9, D31 - page 76) says:

Because surrogate code points are not included in the set of Unicode
scalar values, UTF-32 code units in the range D800 .. DFFF are
ill-formed

However GHC does not reject these code units:

Prelude> print '\xD800'
'\55296'


Is this correct behaviour?
Note that Python, too (2.5.4, UCS4 build, Linux Debian), accepts these
code units.



Thanks  Manlio


Re: [Haskell-cafe] surrogate code points in a Char

2009-11-18 Thread Edward Kmett
Enforcing a gap in the middle of the range of Char would be exceedingly
awkward to propagate through all of the libraries. Off the top of my head:

1.) Functions like succ and pred which currently work on Char as an
enumeration would have to jump over the gap, to be truly anal retentive
about the mapping
2.) The toEnum and fromEnum would need to make the gap vanish as well,
ruining the ability to treat toEnum/fromEnum as chr/ord
3.) Every application would take a performance hit
4.) What to do in the presence of an encoding error is even more uncertain.
All you can do is throw an exception that can only be caught in IO.
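
Points 1 and 2 can be seen in a small sketch. The first definition only checks how Char behaves today; succScalar and scalarIndex are hypothetical functions showing roughly what a gap-aware Enum would have to do, not anything that exists in base.

```haskell
import Data.Char (ord)

-- Today: fromEnum/toEnum agree with ord/chr everywhere, including at the
-- edges of the surrogate range.
enumMatchesOrd :: Bool
enumMatchesOrd =
  all (\c -> fromEnum c == ord c && toEnum (ord c) == c)
      ['\xD7FF', '\xD800', '\xDFFF', '\xE000']

-- Hypothetical gap-aware successor: jumps from U+D7FF straight to U+E000.
succScalar :: Char -> Maybe Char
succScalar c
  | c == maxBound = Nothing
  | c == '\xD7FF' = Just '\xE000'
  | otherwise     = Just (succ c)

-- Hypothetical gap-free index: closes the 2048-wide hole, so it no longer
-- agrees with ord above U+D7FF and is undefined on surrogates.
scalarIndex :: Char -> Maybe Int
scalarIndex c
  | n < 0xD800  = Just n
  | n <= 0xDFFF = Nothing
  | otherwise   = Just (n - 0x800)
  where
    n = ord c
```

For comparison, with the current Enum instance succ '\xD7FF' is simply '\xD800'.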

A couple of less defensible considerations:

5.) It would break alternative encodings like utf-8b which use the invalid
code points in the surrogate pair range to encode ill-formed bytes in the
input stream to allow 'cut and paste'-safe round tripping of
utf-8b -> Char -> utf-8b even in the presence of invalid binary data/codepoints.
6.) Not all data is properly encoded. Consider Unicode data you get back
from Oracle, which isn't really encoded in UTF-8, but is instead CESU-8,
which encodes codepoints in the higher planes as a surrogate pair, then utf-8
encodes the surrogate pair.
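
To illustrate point 6, here is a sketch of how CESU-8 ends up with surrogates inside UTF-8-shaped bytes. The function names (toSurrogatePair, threeByte, cesu8Supplementary) are hypothetical and written only for this example; note that the intermediate values are Chars in the surrogate range, which is exactly what a gap in Char would forbid.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Split a supplementary code point (above U+FFFF) into its UTF-16
-- surrogate pair; Nothing for code points in the BMP.
toSurrogatePair :: Char -> Maybe (Char, Char)
toSurrogatePair c
  | n < 0x10000 = Nothing
  | otherwise   = Just ( chr (0xD800 + (m `shiftR` 10))
                       , chr (0xDC00 + (m .&. 0x3FF)) )
  where
    n = ord c
    m = n - 0x10000

-- The three-byte UTF-8 bit pattern, applied blindly (surrogates included).
-- Only meaningful for code points in U+0800 .. U+FFFF.
threeByte :: Char -> [Word8]
threeByte c =
  [ 0xE0 .|. fromIntegral (n `shiftR` 12)
  , 0x80 .|. fromIntegral ((n `shiftR` 6) .&. 0x3F)
  , 0x80 .|. fromIntegral (n .&. 0x3F)
  ]
  where
    n = ord c

-- CESU-8 for a supplementary code point: six bytes instead of UTF-8's four.
cesu8Supplementary :: Char -> Maybe [Word8]
cesu8Supplementary c = do
  (hi, lo) <- toSurrogatePair c
  return (threeByte hi ++ threeByte lo)
```

For example, cesu8Supplementary '\x1F600' gives the bytes ED A0 BD ED B8 80, where proper UTF-8 would be F0 9F 98 80.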

So, I suppose the answer would be it is functioning as designed, because the
current behavior is the least bad option. =)

-Edward Kmett

On Wed, Nov 18, 2009 at 10:28 AM, Manlio Perillo
manlio_peri...@libero.it wrote:

> Hi.
>
> The Unicode Standard (version 4.0, section 3.9, D31 - page 76) says:
>
> Because surrogate code points are not included in the set of Unicode
> scalar values, UTF-32 code units in the range D800 .. DFFF are
> ill-formed
>
> However GHC does not reject these code units:
>
> Prelude> print '\xD800'
> '\55296'
>
>
> Is this correct behaviour?
> Note that Python, too (2.5.4, UCS4 build, Linux Debian), accepts these
> code units.
>
>
>
> Thanks  Manlio