Re: Unicode

2000-05-16 Thread George Russell

Marcin 'Qrczak' Kowalczyk wrote:
 As for the language standard: I hope that Char will be allowed or
 required to have >=30 bits instead of current 16; but never more than
 Int, to be able to use ord and chr safely.
Er, does it have to?  The Java Virtual Machine implements Unicode with
16 bits.  (OK, so I suppose that means it can't cope with Korean or Chinese.)
So requiring Char to be >=30 bits would stop anyone implementing a
conformant Haskell on the JVM.  (I feel strongly about this, having been
involved with MLj, which compiles ML to JVM; Standard ML requires 8-bit
chars, a requirement we decided to ignore.)




RE: Unicode

2000-05-16 Thread Simon Marlow

  OTOH, it wouldn't be hard to change GHC's Char datatype to be a
  full 32-bit integral data type.
 
 Could we do it please?
 
 It will not break anything if done slowly. I imagine that
 {read,write}CharOffAddr and _ccall_ will still use only 8 bits of
 Char. But once Char is wide, it will become possible to design
 libraries dealing with text conversion, to prepare for future
 international I/O, together with the Foreign libraries.

I agree it should be done.  But not for 4.07; we can start breaking the tree
as soon as I've forked the 4.07 branch though (hopefully today...).

We have some other small wibbles to deal with: currently a Char never
resides in the heap, because there are only 256 possible Chars, so we declare
them all statically in the RTS.  Once Char is wider, we'll have to check
whether the Char falls in the allowed range before using this table (that's
fairly easy; we already do this for Int).

Cheers,
Simon




Re: Unicode

2000-05-16 Thread Frank Atanassow

George Russell writes:
  Marcin 'Qrczak' Kowalczyk wrote:
   As for the language standard: I hope that Char will be allowed or
   required to have >=30 bits instead of current 16; but never more than
   Int, to be able to use ord and chr safely.
  Er, does it have to?  The Java Virtual Machine implements Unicode with
  16 bits.  (OK, so I suppose that means it can't cope with Korean or Chinese.)

Just to set the record straight:

Many CJK (Chinese-Japanese-Korean) characters are encodable in 16 bits. I am
not so familiar with the Chinese or Korean situations, but in Japan there is a
nationally standardized subset of about 2000 characters called the Jyouyou
("often-used") kanji, which newspapers and most printed books are mostly
supposed to respect. These are all strictly contained in the 16-bit space. One
only needs more than 16 bits for foreign characters (say, Chinese), older
literary works and such-like. Even then, Japanese has two phonetic alphabets
as well, and you can usually substitute phonetic characters for non-Jyouyou
kanji---in fact, since these kanji are considered difficult, one often _does_
supplement the ideographic representation with a phonetic one. Of course,
using only phonetic characters in such cases would look unprofessional in some
contexts, and it forces the reader to guess at which word was meant...

For Korean and especially Chinese, the situation is not so pat. Korean's
phonetic alphabet is of course wholly contained within the 16-bit space, but
the Chinese, as a rule, don't use phonetic characters. Koreans rely on their
phonetic alphabet more than the Japanese do, but they still tend to use (I
believe) more esoteric Chinese ideographic characters than the Japanese
do. And the Chinese have a much larger set of ideographic characters in common
use than either of the other two. I'm not sure what percentage is contained in
the 16-bit space; it's probably enough to communicate on most non-specialized
subjects fairly comfortably, but it is safe to say that the Chinese would be
the first to demand more encoding space.

In summary, 16 bits is enough to encode most modern texts if you don't mind
fudging a bit, but for high-quality productions, historical and/or specialized
texts, CJK users will want 32 bits.

Of course, you can always come up with specialized schemes involving stateful
encodings and/or "block-swapping" (using the Unicode private-use areas, for
example), but then, that subverts the purpose of Unicode.

-- 
Frank Atanassow, Dept. of Computer Science, Utrecht University
Padualaan 14, PO Box 80.089, 3508 TB Utrecht, Netherlands
Tel +31 (030) 253-1012, Fax +31 (030) 251-3791





Re: Unicode

2000-05-16 Thread Marcin 'Qrczak' Kowalczyk

Tue, 16 May 2000 10:44:28 +0200, George Russell [EMAIL PROTECTED] writes:

  As for the language standard: I hope that Char will be allowed or
  required to have >=30 bits instead of current 16; but never more than
  Int, to be able to use ord and chr safely.
 
 Er, does it have to?  The Java Virtual Machine implements Unicode with
 16 bits.  (OK, so I suppose that means it can't cope with Korean or Chinese.)
 So requiring Char to be >=30 bits would stop anyone implementing a
 conformant Haskell on the JVM.

OK, "allowed", not "required"; currently it is not even allowed.
The minimum should probably be 16, and the maximum the size of Int.

Oops, ord will have to be allowed to return negative numbers when
the size of Char is equal to the size of Int. Another solution is to
make Char at least one bit smaller than Int, perhaps also requiring
it to be no larger than 31 bits. ISO-10646 describes a space of 31
bits, and UTF-8 is able to encode up to 31 bits, so a UTF-8 encoder
would be portable without worrying about Char values that don't fit,
and a decoder could easily check whether a character is representable
in Char: ord maxBound would be guaranteed to be positive.
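To make the encoder side concrete, here is a minimal sketch under the
original UTF-8 rules, where sequences may run to 6 bytes and cover all
31 bits (encodeUtf8 and representable are my names, for illustration only):

import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one ISO-10646 code point (up to 31 bits) as UTF-8 bytes.
encodeUtf8 :: Int -> [Word8]
encodeUtf8 c
  | c < 0         = error "negative code point"
  | c < 0x80      = [fromIntegral c]
  | c < 0x800     = bytes 0xC0 1
  | c < 0x10000   = bytes 0xE0 2
  | c < 0x200000  = bytes 0xF0 3
  | c < 0x4000000 = bytes 0xF8 4
  | otherwise     = bytes 0xFC 5   -- up to 2^31 - 1
  where
    -- a leading byte carrying the top bits, followed by n
    -- continuation bytes of 6 bits each, tagged 10xxxxxx
    bytes mark n =
      fromIntegral (mark .|. (c `shiftR` (6 * n)))
        : [ fromIntegral (0x80 .|. ((c `shiftR` (6 * k)) .&. 0x3F))
          | k <- [n - 1, n - 2 .. 0] ]

-- The decoder-side check mentioned above: a decoded code point fits
-- in Char as long as it does not exceed ord maxBound.
representable :: Int -> Bool
representable n = n >= 0 && n <= ord maxBound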

Choices I see:
- 30 <= Int, 16 <= Char <= 31, Char <  Int
- 30 <= Int, 16 <= Char,       Char <  Int
- 30 <= Int, 16 <= Char,       Char <= Int
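A quick illustration (using 8-bit types as stand-ins for an N-bit Char
and Int) of why the last choice forces ord to return negative numbers:
when the code point's top bit lands in the sign bit of an equally wide
Int, the value wraps.

import Data.Int (Int8)
import Data.Word (Word8)

-- 200 fits in an unsigned 8-bit "Char", but squeezed into a signed
-- 8-bit "Int" of the same width it wraps around to -56.
main :: IO ()
main = print (fromIntegral (200 :: Word8) :: Int8)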

-- 
 __("Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/  GCS/M d- s+:-- a23 C+++$ UL++$ P+++ L++$ E-
  ^^  W++ N+++ o? K? w(---) O? M- V? PS-- PE++ Y? PGP+ t
QRCZAK  5? X- R tv-- b+++ DI D- G+ e h! r--%++ y-





Re: Unicode

2000-05-16 Thread Marcin 'Qrczak' Kowalczyk

Tue, 16 May 2000 12:26:12 +0200 (MET DST), Frank Atanassow [EMAIL PROTECTED] writes:

 Of course, you can always come up with specialized schemes involving stateful
 encodings and/or "block-swapping" (using the Unicode private-use areas, for
 example), but then, that subverts the purpose of Unicode.

There is already a standard UTF-16 encoding that fits 2^20 more characters
into the 16-bit space, by encoding characters >= 2^16 as pairs of "characters"
from the range D800..DFFF, which are otherwise unused in Unicode.

Programmers should not be expected to care about this; most will not
anyway. Libraries will handle this format in external UTF-16-encoded
strings.
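A minimal sketch of what such a library routine would do (toUtf16 is my
name, for illustration only):

import Data.Bits (shiftR, (.&.))
import Data.Word (Word16)

-- Code points below 2^16 are stored directly; larger ones are split
-- into a high and a low surrogate in the range D800..DFFF.
toUtf16 :: Int -> [Word16]
toUtf16 c
  | c < 0x10000 = [fromIntegral c]
  | otherwise   = [ fromIntegral (0xD800 + (c' `shiftR` 10))  -- high surrogate
                  , fromIntegral (0xDC00 + (c' .&. 0x3FF))    -- low surrogate
                  ]
  where c' = c - 0x10000

Decoding just reverses this: a unit in D800..DBFF supplies the top 10
bits, the following unit in DC00..DFFF the bottom 10, plus the 2^16 offset.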

UTF-8 is usually a better choice for external encoding; UTF-16 should
be rarely used.

-- 
 __("Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/  GCS/M d- s+:-- a23 C+++$ UL++$ P+++ L++$ E-
  ^^  W++ N+++ o? K? w(---) O? M- V? PS-- PE++ Y? PGP+ t
QRCZAK  5? X- R tv-- b+++ DI D- G+ e h! r--%++ y-





Re: Unicode

2000-05-16 Thread Manuel M. T. Chakravarty

Frank Atanassow [EMAIL PROTECTED] wrote,

 George Russell writes:
   Marcin 'Qrczak' Kowalczyk wrote:
As for the language standard: I hope that Char will be allowed or
required to have >=30 bits instead of current 16; but never more than
Int, to be able to use ord and chr safely.
   Er, does it have to?  The Java Virtual Machine implements Unicode with
   16 bits.  (OK, so I suppose that means it can't cope
   with Korean or Chinese.) 
 
 Just to set the record straight:
 
 Many CJK (Chinese-Japanese-Korean) characters are
 encodable in 16 bits. I am not so familiar with the
 Chinese or Korean situations, but in Japan there is a
 nationally standardized subset of about 2000 characters
 called the Jyouyou ("often-used") kanji, which newspapers
 and most printed books are mostly supposed to
 respect. These are all strictly contained in the 16-bit
 space. One only needs more than 16 bits for foreign
 characters (say, Chinese), older literary works and
 such-like. Even then, Japanese has two phonetic
 alphabets as well, and you can usually substitute
 phonetic characters for non-Jyouyou kanji---in fact,
 since these kanji are considered difficult, one often
 _does_ supplement the ideographic representation with a
 phonetic one. Of course, using only phonetic characters
 in such cases would look unprofessional in some contexts,
 and it forces the reader to guess at which word was
 meant...

The problem with restricting yourself to the Jouyou-Kanji is
that you have a hard time with names (of persons and
places).  Many exotic and otherwise unused Kanji are used in
names (for historical reasons), and as the Kanji
representation of a name is the official identifier, it is
rather bad form to write a person's name in Kana (the
phonetic alphabets).

Cheers,
Manuel