Haskell 1.4 and Unicode
I have some questions regarding Haskell 1.4 and Unicode. My source materials for these questions are "The Haskell 1.4 Report" and the files

    ftp://ftp.unicode.org/Public/2.0-Update/ReadMe-2.0.14.txt
    ftp://ftp.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt

It's possible that question 2 below would be resolved if I actually read the Unicode book; if so, I apologize in advance.

1) I assume that layout processing occurs after Unicode preprocessing; otherwise, you can't even find the lexemes. If so, are all Unicode characters assumed to be the same width?

2) The Report uses the following classes of characters:

    uniWhite  - any Unicode character defined as whitespace
    nonbrkspc - ???
    UNIsmall  - any Unicode lowercase letter
    UNIlarge  - any uppercase or titlecase Unicode letter
    UNIsymbol - any Unicode symbol or punctuation
    UNIdigit  - a Unicode numeric

The file ReadMe-2.0.14.txt above defines the following classes of characters:

    Normative
      Mn = Mark, Non-Spacing
      Mc = Mark, Spacing Combining
      Me = Mark, Enclosing
      Nd = Number, Decimal Digit
      Nl = Number, Letter
      No = Number, Other
      Zs = Separator, Space
      Zl = Separator, Line
      Zp = Separator, Paragraph
      Cc = Other, Control
      Cf = Other, Format
      Cs = Other, Surrogate
      Co = Other, Private Use
      Cn = Other, Not Assigned

    Informative
      Lu = Letter, Uppercase
      Ll = Letter, Lowercase
      Lt = Letter, Titlecase
      Lm = Letter, Modifier
      Lo = Letter, Other
      Pc = Punctuation, Connector
      Pd = Punctuation, Dash
      Ps = Punctuation, Open
      Pe = Punctuation, Close
      Po = Punctuation, Other
      Sm = Symbol, Math
      Sc = Symbol, Currency
      Sk = Symbol, Modifier
      So = Symbol, Other

It's not obvious how the Unicode-defined classes map onto the classes in the Report. My guess is:

    uniWhite  == classes Zs, Zl, Zp
    UNIsmall  == class Ll
    UNIlarge  == classes Lu, Lt
    UNIsymbol == classes Sm, Sc, Sk, So
    UNIdigit  == classes Nd, Nl, No
    nonbrkspc == "NO-BREAK SPACE" (\h00a0)

However, it would also seem quite reasonable to include class Lo (which includes things like "Hebrew letter Alef") in UNIsmall or UNIlarge, and to include some of the Punctuation classes in UNIsymbol.

3) What does it mean that Char can include any Unicode character? If I compile and run the following program on my vanilla American UNIX box:

    main = putChar '\x2473' {- print a "circled number twenty" -}

to get a program "ctwenty", and I run

    ./ctwenty | od -c

(od prints out each byte of output), what will I see? Will the following program

    main = getChar >>= (print . fromEnum)

ever print out a number greater than 256?

If the answers to the above questions are "implementation dependent", what are some of the behaviors that implementations might plausibly have?

Carl Witty
[EMAIL PROTECTED]
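For illustration only, the guessed mapping in the message above can be written down as a small classifier. This is a sketch under stated assumptions: it uses `generalCategory` from the `Data.Char` module of a present-day GHC `base` (which did not exist in Haskell 1.4 and reflects a much newer Unicode database than 2.0.14), and the type `ReportClass` and the function `reportClass` are hypothetical names, not anything defined by the Report.

```haskell
import Data.Char (GeneralCategory(..), generalCategory)

-- Hypothetical names standing in for the Report's character classes.
data ReportClass
  = UniWhite | UniSmall | UniLarge | UniSymbol | UniDigit | Unclassified
  deriving (Eq, Show)

-- The mapping guessed in the message above. Categories that the guess
-- leaves out (Lo, Lm, all punctuation, marks, controls, ...) fall
-- through to Unclassified.
reportClass :: Char -> ReportClass
reportClass c = case generalCategory c of
  Space              -> UniWhite   -- Zs
  LineSeparator      -> UniWhite   -- Zl
  ParagraphSeparator -> UniWhite   -- Zp
  LowercaseLetter    -> UniSmall   -- Ll
  UppercaseLetter    -> UniLarge   -- Lu
  TitlecaseLetter    -> UniLarge   -- Lt
  MathSymbol         -> UniSymbol  -- Sm
  CurrencySymbol     -> UniSymbol  -- Sc
  ModifierSymbol     -> UniSymbol  -- Sk
  OtherSymbol        -> UniSymbol  -- So
  DecimalNumber      -> UniDigit   -- Nd
  LetterNumber       -> UniDigit   -- Nl
  OtherNumber        -> UniDigit   -- No
  _                  -> Unclassified
```

Under this guess, Hebrew letter Alef (`'\x05D0'`, class Lo) and ASCII punctuation such as `'!'` (class Po) both come out as `Unclassified`, which is exactly the gap the message asks about.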
Re: Haskell 1.4 and Unicode
Unicode was added at the last moment, so there are likely to be some discrepancies.

> 1) I assume that layout processing occurs after Unicode preprocessing; otherwise, you can't even find the lexemes. If so, are all Unicode characters assumed to be the same width?

I think that's what is intended.

> However, it would also seem quite reasonable to include class Lo (which includes things like "Hebrew letter Alef") in UNIsmall or UNIlarge; and to include some of the Punctuation classes in UNIsymbol.

It's hard to put Lo in a sensible place since Haskell relies on the upper/lower distinction. Therefore Lo is not included in upper or lower.

> 3) What does it mean that Char can include any Unicode character?

It means that within a Haskell program Char can hold a Unicode character.

> If I compile and run the following program on my vanilla American UNIX box:
>     main = putChar '\x2473' {- print a "circled number twenty" -}
> to get a program "ctwenty", and I run
>     ./ctwenty | od -c
> (od prints out each byte of output), what will I see? Will the following program
>     main = getChar >>= (print . fromEnum)
> ever print out a number greater than 256?

The I/O library has not been converted to Unicode, so I would expect implementations to silently truncate Unicode characters to 8 bits. To output (or input) Unicode characters sensibly you need to encode them somehow. Hbc comes with encode/decode functions (in the Char library) for three encodings: two bytes per Char, UTF-8, and the Java encoding (\u).

-- Lennart

    dogbert% cat ctwenty.hs
    import Char
    main = putStr (encodeUnicode "\x2473")
    dogbert% hbc ctwenty.hs -o ctwenty
    dogbert% ./ctwenty | od -c
    0000000   $   s
    0000002

    dogbert% cat ctwenty.hs
    import Char
    main = putStr (encodeUTF8 "\x2473")
    dogbert% hbc ctwenty.hs -o ctwenty
    dogbert% ./ctwenty | od -c
    0000000 342 221 263
    0000003

    dogbert% cat ctwenty.hs
    import Char
    main = putStr (encodeEscape "\x2473")
    dogbert% hbc ctwenty.hs -o ctwenty
    dogbert% ./ctwenty | od -c
    0000000   \   u   2   4   7   3
    0000006
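To make the UTF-8 transcript above easier to check by hand, here is a minimal sketch of a UTF-8 encoder for a single character. It is not hbc's `encodeUTF8`; the name `utf8Bytes` is hypothetical, the code assumes a present-day GHC `base` (`Data.Bits`, `Data.Word`), and it does no validation (for example, it does not reject surrogate code points).

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: encode one Char as its UTF-8 byte sequence.
utf8Bytes :: Char -> [Word8]
utf8Bytes c
  | n < 0x80    = [byte n]
  | n < 0x800   = [byte (0xC0 .|. (n `shiftR` 6)), cont n]
  | n < 0x10000 = [ byte (0xE0 .|. (n `shiftR` 12))
                  , cont (n `shiftR` 6)
                  , cont n ]
  | otherwise   = [ byte (0xF0 .|. (n `shiftR` 18))
                  , cont (n `shiftR` 12)
                  , cont (n `shiftR` 6)
                  , cont n ]
  where
    n = ord c
    -- Narrow an Int that is already known to fit in one byte.
    byte = fromIntegral :: Int -> Word8
    -- Continuation byte: 10xxxxxx carrying the low six bits of x.
    cont x = byte (0x80 .|. (x .&. 0x3F))
```

For `'\x2473'` this gives the bytes 0xE2, 0x91, 0xB3, which are 342, 221, 263 in octal, matching the `od -c` output for the UTF-8 case.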
Re: Haskell 1.4 and Unicode
I had option 1 in mind when that part of the report was written. We should clarify this in the next revision. And thanks for your analysis of the problem!

John
small wart in the Report's description of the layout rule
The Haskell Report says:

> To facilitate the use of layout at the top level of a module (an implementation may allow several modules may reside in one file), the keyword module and the end-of-file token are assumed to occur in column 0 (whereas normally the first column is 1). Otherwise, all top-level declarations would have to be indented.

I've read this many times without thinking about it; however, now that I have thought about it, it doesn't make sense. Following a module, the keyword "module" is "an illegal lexeme...encountered at a point where a close brace would be legal"; therefore, the close brace is properly inserted no matter what column "module" occurs in.

Therefore, I suggest that the above paragraph be removed from the Report.

Carl Witty
[EMAIL PROTECTED]
Re: Haskell 1.4 and Unicode
Carl R. Witty wrote:

> 1) I assume that layout processing occurs after Unicode preprocessing; otherwise, you can't even find the lexemes. If so, are all Unicode characters assumed to be the same width?

Unicode characters ***cannot in any way*** be considered as being of the same display width. Many characters have intrinsic width properties, like "halfwidth Katakana", "fullwidth ASCII", "ideographic space", "thin space", "zero width space", and so on (most of which are compatibility characters, i.e. present only for conversion reasons).

But more importantly there are combining characters which "modify" a "base character". For instance Å (A with ring above) can be given as an A followed by a combining ring above, i.e. two Unicode characters. (For this and many others there is also a 'precomposed' character.) For many scripts vowels are combining characters. And there may be an indefinitely long (in principle, but three is a lot) sequence of combining characters after each non-combining character.

What about bidirectional scripts? Especially for the Arabic script, which is a cursive (joined) script, where in addition vowels are combining characters.

Furthermore, Unicode characters in the "extended range" (no characters allocated yet) are encoded using two *non-character* 16-bit codes (when using UTF-16, which is the preferred encoding for Unicode).

What would "Unicode preprocessing" be? UTF-16 decoding? Java-ish escape sequence decoding? ...

> 3) What does it mean that Char can include any Unicode character?

I think it *does not* mean that a Char can hold any Unicode character. I think it *does* mean that it can hold any single (UTF-16) 16-bit value, which is something quite different. To store an arbitrary Unicode character 'straight off', one would need at least 21 bits to cover the UTF-16 range. ISO/IEC 10646-1 allows for up to 31 bits, but nobody(?) is planning to need all that. Some use 32-bit values to store Unicode characters. Perfectly allowed by 10646, though not by Unicode proper. Following Unicode proper one would always use a sequence of UTF-16 codes, in order to be able to treat a "user perceived character" as a single entity, both for UTF-16 reasons and for combining-sequence reasons, independently of how the "user perceived character" was given as Unicode characters.

/kent k

PS Java gets some Unicode things wrong too, including that Java's UTF-8 encoding is non-conforming (to both Unicode 2.0 and ISO/IEC 10646-1 Amd. 2).
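The 21-bit figure in the message above follows from how UTF-16 combines two 16-bit surrogate codes into one code point. A minimal sketch, assuming a present-day GHC `base` in which `Char` covers the full code point range up to 0x10FFFF (not the Haskell 1.4 `Char` under discussion); the name `fromSurrogates` is hypothetical.

```haskell
import Data.Char (chr)
import Data.Word (Word16)

-- Hypothetical helper: combine a high and a low surrogate into a single
-- code point. Returns Nothing if the two values are not a valid pair.
fromSurrogates :: Word16 -> Word16 -> Maybe Char
fromSurrogates hi lo
  | isHigh && isLow =
      Just (chr (0x10000
                 + (fromIntegral hi - 0xD800) * 0x400
                 + (fromIntegral lo - 0xDC00)))
  | otherwise = Nothing
  where
    isHigh = hi >= 0xD800 && hi <= 0xDBFF  -- high (leading) surrogate
    isLow  = lo >= 0xDC00 && lo <= 0xDFFF  -- low (trailing) surrogate
```

The largest pair, 0xDBFF followed by 0xDFFF, yields 0x10FFFF, which needs 21 bits; a single 16-bit value can therefore never hold every character reachable through UTF-16.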