Haskell 1.4 and Unicode

1997-11-07 Thread Carl R. Witty

I have some questions regarding Haskell 1.4 and Unicode.  My source
materials for these questions are "The Haskell 1.4 Report" and the
files

ftp://ftp.unicode.org/Public/2.0-Update/ReadMe-2.0.14.txt   
  and
ftp://ftp.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt

It's possible that question 2 below would be resolved if I actually
read the Unicode book; if so, I apologize in advance.

1) I assume that layout processing occurs after Unicode preprocessing;
otherwise, you can't even find the lexemes.  If so, are all Unicode
characters assumed to be the same width?

2) The Report uses the following classes of characters:
uniWhite - any UNIcode character defined as whitespace
nonbrkspc ???
UNIsmall - any Unicode lowercase letter
UNIlarge - any uppercase or titlecase Unicode letter
UNIsymbol - any Unicode symbol or punctuation
UNIdigit - any Unicode numeric

The file ReadMe-2.0.14.txt above defines the following classes of
characters:

Normative
Mn = Mark, Non-Spacing
Mc = Mark, Spacing Combining
Me = Mark, Enclosing

Nd = Number, Decimal Digit
Nl = Number, Letter
No = Number, Other

Zs = Separator, Space
Zl = Separator, Line
Zp = Separator, Paragraph

Cc = Other, Control
Cf = Other, Format
Cs = Other, Surrogate
Co = Other, Private Use
Cn = Other, Not Assigned

Informative
Lu = Letter, Uppercase
Ll = Letter, Lowercase
Lt = Letter, Titlecase
Lm = Letter, Modifier
Lo = Letter, Other

Pc = Punctuation, Connector
Pd = Punctuation, Dash
Ps = Punctuation, Open
Pe = Punctuation, Close
Po = Punctuation, Other

Sm = Symbol, Math
Sc = Symbol, Currency
Sk = Symbol, Modifier
So = Symbol, Other

It's not obvious how the Unicode-defined classes map onto the classes
in the Report.  My guess is:

uniWhite == classes Zs, Zl, Zp
UNIsmall == class Ll
UNIlarge == classes Lu, Lt
UNIsymbol == classes Sm, Sc, Sk, So
UNIdigit == classes Nd, Nl, No
nonbrkspc == "NO-BREAK SPACE" (\x00a0)

However, it would also seem quite reasonable to include class Lo
(which includes things like "Hebrew letter Alef") in UNIsmall or
UNIlarge; and to include some of the Punctuation classes in UNIsymbol.
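My guessed mapping above can be phrased as predicates over the Unicode general categories. The following sketch uses GHC's Data.Char (which postdates Haskell 1.4, so this is a modern restatement, not anything an implementation of the day provided); the predicate names are mine:

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- A sketch of the guessed mapping from the Report's lexical
-- classes onto Unicode general categories.
isUniWhite, isUniSmall, isUniLarge, isUniSymbol, isUniDigit :: Char -> Bool
isUniWhite  c = generalCategory c `elem` [Space, LineSeparator, ParagraphSeparator]
isUniSmall  c = generalCategory c == LowercaseLetter
isUniLarge  c = generalCategory c `elem` [UppercaseLetter, TitlecaseLetter]
isUniSymbol c = generalCategory c `elem` [MathSymbol, CurrencySymbol, ModifierSymbol, OtherSymbol]
isUniDigit  c = generalCategory c `elem` [DecimalNumber, LetterNumber, OtherNumber]
```

Note that under this mapping class Lo characters satisfy none of the predicates, which is exactly the ambiguity raised above.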

3) What does it mean that Char can include any Unicode character?

If I compile and run the following program on my vanilla American UNIX
box:

main = putChar '\x2473' {- print a "circled number twenty" -}

to get a program "ctwenty", and I run

./ctwenty | od -c

(od prints out each byte of output), what will I see?

Will the following program

main = getChar >>= (print . fromEnum)

ever print out a number greater than 255?

If the answers to the above questions are "implementation dependent",
what are some of the behaviors that implementations might plausibly
have?

Carl Witty
[EMAIL PROTECTED]






Re: Haskell 1.4 and Unicode

1997-11-07 Thread Lennart Augustsson


Unicode was added at the last moment, so there are likely to
be some discrepancies.

 1) I assume that layout processing occurs after Unicode preprocessing;
 otherwise, you can't even find the lexemes.  If so, are all Unicode
 characters assumed to be the same width?
I think that's what is intended.

 However, it would also seem quite reasonable to include class Lo
 (which includes things like "Hebrew letter Alef") in UNIsmall or
 UNIlarge; and to include some of the Punctuation classes in UNIsymbol.
It's hard to put Lo in a sensible place since Haskell relies on
the upper/lower distinction.  Therefore Lo is not included
in upper or lower.

 3) What does it mean that Char can include any Unicode character?
It means that within a Haskell program Char can hold a Unicode character.

 If I compile and run the following program on my vanilla American UNIX
 box:
 
   main = putChar '\x2473' {- print a "circled number twenty" -}
 
 to get a program "ctwenty", and I run
 
   ./ctwenty | od -c
 
 (od prints out each byte of output), what will I see?
 
 Will the following program
 
   main = getChar >>= (print . fromEnum)
 
 ever print out a number greater than 255?
The I/O library has not been converted to Unicode.  So I would
expect implementations to silently truncate Unicode characters
to 8 bits.
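If an implementation does truncate to 8 bits, the circled number twenty would come out as a plain 's', since 0x2473 mod 256 = 0x73.  A sketch of that hypothetical behavior (my guess at what such an implementation does, not anything the Report specifies):

```haskell
import Data.Char (chr, ord)

-- Hypothetical 8-bit truncation that an implementation's putChar
-- might perform on a Unicode Char (a guess, not specified anywhere).
truncate8 :: Char -> Char
truncate8 = chr . (`mod` 256) . ord
```

So under this guess, ./ctwenty | od -c would show a single s byte.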

To do sensible output (or input) of Unicode characters you need to
encode them somehow.  Hbc comes with encode/decode functions (in the
Char library) for three encodings: two bytes per Char, UTF-8, and the
Java encoding (\u).

-- Lennart

dogbert% cat ctwenty.hs
import Char
main = putStr (encodeUnicode "\x2473")
dogbert% hbc ctwenty.hs -o ctwenty
dogbert% ./ctwenty | od -c
0000000   $   s
0000002

dogbert% cat ctwenty.hs
import Char
main = putStr (encodeUTF8 "\x2473")
dogbert% hbc ctwenty.hs -o ctwenty
dogbert% ./ctwenty | od -c
0000000 342 221 263
0000003

dogbert% cat ctwenty.hs
import Char
main = putStr (encodeEscape "\x2473")
dogbert% hbc ctwenty.hs -o ctwenty
dogbert% ./ctwenty | od -c
0000000   \   u   2   4   7   3
0000006
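The UTF-8 transcript above can be reproduced portably.  The following sketch handles BMP code points only (no surrogates, nothing above 0xFFFF), and the name utf8Bytes is mine, not hbc's:

```haskell
import Data.Bits ((.&.), (.|.), shiftR)
import Data.Char (ord)

-- UTF-8 encoding of a single BMP code point (sketch; code points
-- above 0xFFFF are not handled).
utf8Bytes :: Char -> [Int]
utf8Bytes ch
  | n < 0x80  = [n]                                   -- 1 byte:  0xxxxxxx
  | n < 0x800 = [ 0xC0 .|. (n `shiftR` 6)             -- 2 bytes: 110xxxxx
                , 0x80 .|. (n .&. 0x3F) ]             --          10xxxxxx
  | otherwise = [ 0xE0 .|. (n `shiftR` 12)            -- 3 bytes: 1110xxxx
                , 0x80 .|. ((n `shiftR` 6) .&. 0x3F)  --          10xxxxxx
                , 0x80 .|. (n .&. 0x3F) ]             --          10xxxxxx
  where n = ord ch
```

utf8Bytes '\x2473' gives [0xE2, 0x91, 0xB3], i.e. octal 342 221 263, matching the od dump above.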






Re: Haskell 1.4 and Unicode

1997-11-07 Thread John C. Peterson

I had option 1 in mind when that part of the report was written.  We
should clarify this in the next revision.

And thanks for your analysis of the problem!

   John








small wart in the Report's description of the layout rule

1997-11-07 Thread Carl R. Witty

The Haskell Report says:

To facilitate the use of layout at the top level of a module (an
implementation may allow several modules to reside in one file), the
keyword module and the end-of-file token are assumed to occur in
column 0 (whereas normally the first column is 1).  Otherwise, all
top-level declarations would have to be indented.

I've read this many times without thinking about it; however, once I
thought about it, it doesn't make sense.  Following a module, the
keyword "module" is "an illegal lexeme...encountered at a point where
a close brace would be legal"; therefore, the close brace is properly
inserted no matter what column "module" occurs in.  Therefore, I
suggest that the above paragraph be removed from the Report.

Carl Witty
[EMAIL PROTECTED]






Re: Haskell 1.4 and Unicode

1997-11-07 Thread Kent Karlsson

Carl R. Witty wrote:

 1) I assume that layout processing occurs after Unicode preprocessing;
 otherwise, you can't even find the lexemes.  If so, are all Unicode
 characters assumed to be the same width?

Unicode characters ***cannot in any way*** be considered as being of
the same display width.  Many characters have intrinsic width properties,
like "halfwidth Katakana", "fullwidth ASCII", "ideographic space",
"thin space", "zero width space", and so on (most of which are
compatibility characters, i.e. present only for conversion reasons).
But more importantly there are combining characters which "modify"
a "base character".  For instance Å (A with ring above) can be given
as an A followed by a combining ring above, i.e. two Unicode characters.
(For this and many others there is also a 'precomposed' character.)
For many scripts vowels are combining characters.  And there may be an
indefinitely long sequence of combining characters after each
non-combining character (in principle; in practice three is a lot).

What about bidirectional scripts?  Especially for the Arabic
script which is a cursive (joined) script, where in addition
vowels are combining characters.

Furthermore, Unicode characters in the "extended range" (no characters
allocated yet) are encoded using two *non-character* 16-bit codes
(when using UTF-16, which is the preferred encoding for Unicode).

What would "Unicode preprocessing" be?  UTF-16 decoding?
Java-ish escape sequence decoding?

...
 3) What does it mean that Char can include any Unicode character?

I think it *does not* mean that a Char can hold any Unicode
character.  I think it *does* mean that it can hold any single
(UTF-16) 16-bit value.  Which is something quite different.  To store
an arbitrary Unicode character 'straight off', one would need at
least 21 bits to cover the UTF-16 range.  ISO/IEC 10646-1 allows
for up to 31 bits, but nobody(?) is planning to need all that.
Some use 32-bit values to store Unicode characters.  Perfectly
allowed by 10646, though not by Unicode proper.  Following Unicode
proper one would always use sequences of UTF-16 codes, in order to
be able to treat a "user perceived character" as a single entity
both for UTF-16 reasons and for combining-sequence reasons,
independently of how the "user perceived character" was given as
Unicode characters.
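The surrogate mechanism behind UTF-16 can be sketched directly: a code point above the BMP is offset by 0x10000 and split into a high surrogate (D800-DBFF) carrying the top ten bits and a low surrogate (DC00-DFFF) carrying the bottom ten.  A sketch (the function name is mine):

```haskell
import Data.Bits ((.&.), shiftR)

-- UTF-16 encoding of a single code point (sketch).  BMP code
-- points map to themselves; anything above the BMP becomes a
-- high/low surrogate pair.
toUTF16 :: Int -> [Int]
toUTF16 n
  | n < 0x10000 = [n]
  | otherwise   = [ 0xD800 + (m `shiftR` 10)  -- high surrogate
                  , 0xDC00 + (m .&. 0x3FF) ]  -- low surrogate
  where m = n - 0x10000
```

This is why a 16-bit Char cannot hold an arbitrary Unicode character: anything above the BMP occupies two 16-bit codes.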

/kent k

PS
Java gets some Unicode things wrong too, including that Java's
UTF-8 encoding is non-conforming (to both Unicode 2.0 and ISO/IEC
10646-1 Amd. 2).