>> André Pönitz wrote:

>> > As far as I know there are already more than 2^16 chinese characters if
>> > all historic variants are taken into account. 2^16 gets tight if
>> > artificial scripts like Klingon are included. If one starts again with
>> > "code pages" and similar, all the old cruft is back. 32 bits is the way
>> > to go...

Well, only if there is a need to have a fixed character width for some
reason.

Remember that Unicode has been in use for a while, that there are
various representation formats standardized by Unicode for various
purposes, and that there are libraries available to deal with string
lengths and similar issues. Unless there is a significant speed hit to
be expected from not having fixed bit lengths of characters, there is
no reason to use 32 bits at all, as there are reasonable, more compact
representations of Unicode.

In a somewhat abbreviated manner: everything that has to do with index
positions in a string, and, especially, backwards searching gets a
performance hit of an extra O(n) in the worst case with all Unicode
representations except 32-bit UCS-4.
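
To make the index-position cost concrete, here is a rough Python sketch.
The two function names are made up for this mail and are not from any
library; the sample string is arbitrary. Finding the n-th character in a
UTF-8 buffer means walking the bytes from the start, while in UCS-4 it is
a multiplication:

```python
def utf8_offset_of(data, index):
    """Byte offset of code point number `index` in UTF-8 `data`.

    Has to scan from the start of the buffer, so it is O(n).
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes; any other
        # byte starts a new code point.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError(index)


def ucs4_offset_of(data, index):
    """Same lookup for UCS-4: plain arithmetic, O(1)."""
    if not 0 <= index < len(data) // 4:
        raise IndexError(index)
    return 4 * index


# Code points that need 1, 2, 3, 4 and 1 bytes in UTF-8.
text = "a\u00e9\u20ac\U00010400z"
utf8 = text.encode("utf-8")
ucs4 = text.encode("utf-32-be")  # big-endian, no byte order mark
```

With this sample, the last character sits at byte 10 of the UTF-8 buffer
(found by scanning) and at byte 16 of the UCS-4 buffer (found as 4 * 4).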

KO> But it's probably very true that just using a 32-bit encoding with
KO> *mostly* one-to-one mapping between characters and dwords is easy
KO> to use. But not always easy to use.

Do you have a counterexample of a Unicode character that can't be
mapped to a single UCS-4-encoded dword?

KO> - for arabic languages, at least, there is no one-to-one relationship between
KO> ucs4 dwords and character spaces / unique cursor positions -- so that benefit 
KO> of ucs4 in the general case is dead gone methinks

Characters, spaces and unique cursor positions have very little to do
with each other. You have to distinguish between characters and
glyphs. Unicode operates on the character level, as does most
text processing. Characters are what is represented internally in the
backing store; there is no difference in representation for the
various Arabic contextual forms of a character. Unicode U+0645 ARABIC
LETTER MEEM will always be represented as U+0645, regardless of where
it appears in the word.
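
A small Python illustration of the MEEM point, using the standard
`unicodedata` module. The sample word is my own choice: it contains MEEM
once in initial and once in medial position, which a renderer draws with
different shapes, yet the backing store holds the same code point both
times:

```python
import unicodedata

# meem, hah, meem, dal: MEEM occurs at index 0 (initial) and 2 (medial).
word = "\u0645\u062d\u0645\u062f"

names = [unicodedata.name(ch) for ch in word]
meem_positions = [i for i, ch in enumerate(word) if ch == "\u0645"]
```

Both occurrences compare equal and carry the same character name; the
contextual shaping only happens when the string is turned into glyphs.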

Glyphs are what gets displayed. This happens in the frontend and
should not affect the representation-related design decisions at all.
The frontend offers a function to convert a character string into a
glyph image, using whatever script-related information is implemented
in the respective toolkit. On X, Gnome has a really good toolkit here,
Qt 3 is still OK, and xforms is not really brilliant IIRC.

KO> - since that essentially breaks the simple "full information about one
KO> character/composite glyph per 32 bits" assumption, one could as well go with 
KO> utf8, right?

No.

With UCS-4, every character has the same width (bit width, that is):
each Unicode character is 32 bits wide, without exception. It is the
only Unicode format that offers this, and this fixed-width property
is the main reason why people use UCS-4 at all.
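
As a quick check of the width claim, this Python sketch encodes a handful
of characters (an arbitrary selection of mine, including one outside the
16-bit range) and counts the bytes each one needs. The helper `widths` is
just a name made up for the example:

```python
# Latin A, e-acute, Arabic meem, euro sign, and one character beyond U+FFFF.
samples = ["A", "\u00e9", "\u0645", "\u20ac", "\U00010400"]


def widths(encoding):
    """Number of bytes each sample character needs in `encoding`."""
    return [len(ch.encode(encoding)) for ch in samples]


utf8_widths = widths("utf-8")       # varies from 1 to 4 bytes
utf16_widths = widths("utf-16-be")  # 2 bytes, or 4 for a surrogate pair
ucs4_widths = widths("utf-32-be")   # always 4 bytes
```

Only the 32-bit encoding gives the same width for every sample.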

Composite glyphs are a completely different matter, they are just
glyphs that consist of two distinct Unicode characters, such as "e" +
"´" to form "é".

The only assumption that is gone is that two characters should be
displayed in separate places on screen. But, as I said, this is a
toolkit problem that ideally we don't have to care about at all, and
it is not substantially different from using proportional fonts with
the present system, where you don't have a fixed one-to-one mapping
from character to screen position either.
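
The composite case can be sketched in Python like this, assuming the
accent meant above is U+0301 COMBINING ACUTE ACCENT. The string holds two
characters that end up in one screen position; NFC normalization happens
to have a precomposed character for this particular pair, which is not
true of every combination:

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

# A non-zero combining class marks a character that attaches to the
# preceding base character instead of taking its own position.
accent_is_combining = unicodedata.combining("\u0301") != 0
```

Either way, each character in both strings still occupies exactly one
32-bit unit in UCS-4; only the mapping to screen positions changes.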

KO> It seems that utf8 is still more compact than 32 bits per unicode table entry, 
KO> in the worst case.

UTF-8 has a maximum character width of 4 bytes, so even in the worst
case it is no larger than UCS-4. The main disadvantage of UTF-8 is
that there is a possibility of accidentally generating malformed byte
sequences, which means a lot of extra special cases in debugging. On
the positive side, it is compatible with ASCII and with existing
8-bit systems.
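
Here is a short Python sketch of what "malformed" means in practice. The
helper name `is_well_formed_utf8` is invented for this example; it simply
leans on Python's strict UTF-8 decoder:

```python
def is_well_formed_utf8(data):
    """True if `data` decodes cleanly as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


good = "\u00e9".encode("utf-8")  # two bytes: lead byte + continuation
truncated = good[:1]             # lead byte with its continuation cut off
stray = good[1:]                 # continuation byte without a lead byte
overlong = b"\xc0\xaf"           # overlong form of "/", not permitted

# Plain ASCII text is byte-for-byte identical in UTF-8.
ascii_compatible = "plain text".encode("utf-8") == "plain text".encode("ascii")

# The highest code point, U+10FFFF, needs the maximum of 4 bytes.
max_width = len("\U0010ffff".encode("utf-8"))
```

Cutting a buffer at an arbitrary byte offset is the usual way such
sequences sneak in, which cannot happen with a fixed-width encoding as
long as one cuts at multiples of four.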

Cheers -
  Philipp Reichmuth

--
Having been erased, / The document you're seeking / Must now be retyped
