Unicode was added at the last moment, so there are likely to
be some discrepancies.
> 1) I assume that layout processing occurs after Unicode preprocessing;
> otherwise, you can't even find the lexemes. If so, are all Unicode
> characters assumed to be the same width?
I think that's what is intended.
> However, it would also seem quite reasonable to include class Lo
> (which includes things like "Hebrew letter Alef") in UNIsmall or
> UNIlarge; and to include some of the Punctuation classes in UNIsymbol.
It's hard to put Lo in a sensible place since Haskell relies on
the upper/lower distinction. Therefore Lo is not included
in upper or lower.
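
For what it's worth, here is a small sketch of the "neither upper nor lower" situation. It assumes a present-day GHC and its Data.Char module (generalCategory, isUpper, isLower), not the Char library discussed in this thread; the names alef, LetterCase and letterCase are made up for the illustration.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isLower, isUpper)

-- Hebrew letter Alef, U+05D0; its Unicode general category is Lo.
alef :: Char
alef = '\x05D0'

-- Hypothetical three-way classification: upper, lower, or neither.
data LetterCase = Upper | Lower | Neither
  deriving (Eq, Show)

-- Classify a character by case; Lo characters fall through to Neither.
letterCase :: Char -> LetterCase
letterCase c
  | isUpper c = Upper
  | isLower c = Lower
  | otherwise = Neither
```

In GHCi, generalCategory alef gives OtherLetter and letterCase alef gives Neither, so such a character can start neither a variable name nor a constructor name.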
> 3) What does it mean that Char can include any Unicode character?
It means that within a Haskell program Char can hold a Unicode character.
> If I compile and run the following program on my vanilla American UNIX
> box:
>
> main = putChar '\x2473' {- print a "circled number twenty" -}
>
> to get a program "ctwenty", and I run
>
> ./ctwenty | od -c
>
> (od prints out each byte of output), what will I see?
>
> Will the following program
>
> main = getChar >>= (print . fromEnum)
>
> ever print out a number greater than 256?
The I/O library has not been converted to Unicode. So I would
expect implementations to silently truncate Unicode characters
to 8 bits.
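
As a rough model of what such truncation would do (this is my guess at the behaviour, not code from any implementation, and truncate8 is a made-up name), assuming a present-day GHC with Data.Char:

```haskell
import Data.Char (chr, ord)

-- Keep only the low 8 bits of a character's code point.
-- This models the silent truncation guessed at above; it is not
-- taken from any particular implementation's I/O library.
truncate8 :: Char -> Char
truncate8 c = chr (ord c `mod` 256)
```

Under this model truncate8 '\x2473' is 's' (0x2473 mod 256 = 0x73), so the "circled number twenty" program would emit a single byte 's'.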
To do sensible output (or input) of Unicode characters you need to
encode them somehow. Hbc comes with encode/decode functions (in the
Char library) for three encodings: two bytes per Char, UTF-8, and the
Java encoding (\uXXXX).
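
To make the UTF-8 case concrete, here is a minimal sketch of an encoder that returns a String with one Char per output byte. It is not hbc's encodeUTF8; utf8Encode and encodeChar are hypothetical names, and the code assumes a present-day GHC with Data.Bits and Data.Char.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- Encode a string as UTF-8, one Char (in the range 0..255) per output byte.
utf8Encode :: String -> String
utf8Encode = concatMap encodeChar

-- Encode one code point as 1 to 4 bytes, depending on its size.
encodeChar :: Char -> String
encodeChar c
  | n < 0x80    = [chr n]
  | n < 0x800   = [lead 0xC0 6, cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading byte: length tag plus the topmost payload bits.
    lead tag s = chr (tag .|. (n `shiftR` s))
    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont s = chr (0x80 .|. ((n `shiftR` s) .&. 0x3F))
```

For "\x2473" this yields the three bytes 0xE2 0x91 0xB3, which od -c prints in octal as 342 221 263.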
-- Lennart
dogbert% cat ctwenty.hs
import Char
main = putStr (encodeUnicode "\x2473")
dogbert% hbc ctwenty.hs -o ctwenty
dogbert% ./ctwenty | od -c
0000000 $ s
0000002
dogbert% cat ctwenty.hs
import Char
main = putStr (encodeUTF8 "\x2473")
dogbert% hbc ctwenty.hs -o ctwenty
dogbert% ./ctwenty | od -c
0000000 342 221 263
0000003
dogbert% cat ctwenty.hs
import Char
main = putStr (encodeEscape "\x2473")
dogbert% hbc ctwenty.hs -o ctwenty
dogbert% ./ctwenty | od -c
0000000 \ u 2 4 7 3
0000006
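
The other two encodings shown above can be sketched in the same way. Again, these are not hbc's encodeUnicode and encodeEscape; twoBytes, javaEscape and hex4 are hypothetical names. The byte order (high byte first) is read off the od output above, and passing ASCII characters through unescaped is purely my assumption. Code points above 0xFFFF are not handled.

```haskell
import Data.Char (chr, intToDigit, ord)

-- Two bytes per Char, high byte first (matches the "$ s" output above).
twoBytes :: String -> String
twoBytes = concatMap split
  where
    split c = [chr ((ord c `div` 256) `mod` 256), chr (ord c `mod` 256)]

-- Java-style escapes: \uXXXX for non-ASCII characters.
-- Assumption: ASCII characters are passed through unchanged.
javaEscape :: String -> String
javaEscape = concatMap esc
  where
    esc c
      | ord c < 0x80 = [c]
      | otherwise    = "\\u" ++ hex4 (ord c)

-- Four hexadecimal digits of a number below 0x10000.
hex4 :: Int -> String
hex4 n = [intToDigit ((n `div` d) `mod` 16) | d <- [4096, 256, 16, 1]]
```

With these definitions twoBytes "\x2473" is the two characters '$' and 's', and javaEscape "\x2473" is the six characters \u2473.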