Re: Unicode, character ambiguities

H. Peter Anvin Fri, 11 Jan 2002 13:03:42 -0800

Followup to:  <000801c19a78$28df7250$b4e21081@chalmers95a69n>
By author:    "Kent Karlsson" <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
> 
> Source separation rule also for the 8859 series of standards gives that
> they had to be separately encoded.
> 
> But even so, they had to be separated: similar-looking uppercase forms
> have different corresponding lowercase forms.  So as not to make case
> mapping horribly difficult (it's hard enough as it is!), Latin, Greek,
> and Cyrillic had to be non-unified.
>


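Using that rule, there should be a TURKISH CAPITAL I which lowercases
to U+0131 LATIN SMALL LETTER DOTLESS I, and similarly, there should be
a TURKISH LOWER CASE I which uppercases to U+0130 LATIN CAPITAL LETTER
I WITH DOT ABOVE.  Neither exists; Turkish shares U+0049 and U+0069
with every other language written in the Latin script.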
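
To make the case-mapping point concrete, here is a minimal Python
sketch.  The first half shows the default (locale-independent)
mappings that Python's str.lower() and str.upper() apply; the second
half sketches Turkish-style tailoring.  The helpers turkish_lower()
and turkish_upper() are hypothetical names made up for this
illustration, not part of any library, and they cover only the four
dotted/dotless I characters.

```python
# Default Unicode case mappings: each script keeps its own pairs, so
# look-alike capitals in Latin, Greek and Cyrillic lowercase differently.
assert "\u0041".lower() == "\u0061"   # LATIN A     -> LATIN a
assert "\u0391".lower() == "\u03b1"   # GREEK ALPHA -> GREEK alpha
assert "\u0410".lower() == "\u0430"   # CYRILLIC A  -> CYRILLIC a

# Turkish has no separate capital I or small i: it shares U+0049 and
# U+0069 with the rest of Latin, so the default mappings are wrong for it.
assert "I".lower() == "i"             # Turkish wants U+0131 (dotless i)
assert "i".upper() == "I"             # Turkish wants U+0130 (dotted I)


def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase with the Turkish I rules applied first."""
    # U+0130 -> i and I -> U+0131 must be handled before the generic mapping.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


def turkish_upper(text: str) -> str:
    """Hypothetical helper: uppercase with the Turkish I rules applied first."""
    # i -> U+0130 and U+0131 -> I must be handled before the generic mapping.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()
```

In other words, the tailoring has to live in the case-mapping code
rather than in the character encoding, which is the cost of not having
encoded separate Turkish letters.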

That said, I agree with you on the source separation rule.  I also
maintain that there was not much pressure to unify the alphabetic
scripts, simply because of the small number of codepoints concerned:
all of Latin, Greek and Cyrillic including modifiers, combining
characters, numbers and unassigned codepoints still account for fewer
than 2000 codepoints; all the alphabetic scripts in the BMP combined
account for only 8192 codepoints (not counting the Arabic presentation
forms and halfwidth/fullwidth compatibility variants, but counting
unallocated codepoints.)
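
As a rough check on those figures, the block sizes can be added up as
in the sketch below.  Which blocks belong in the Latin/Greek/Cyrillic
tally is an assumption made for this illustration, since the exact
list is not given above; likewise, reading 8192 as the range
U+0000..U+1FFF is an assumption.  Block names follow the current
Unicode block list.

```python
# (name, first codepoint, last codepoint); the selection is an assumption.
LGC_BLOCKS = [
    ("Basic Latin",                 0x0000, 0x007F),
    ("Latin-1 Supplement",          0x0080, 0x00FF),
    ("Latin Extended-A",            0x0100, 0x017F),
    ("Latin Extended-B",            0x0180, 0x024F),
    ("IPA Extensions",              0x0250, 0x02AF),
    ("Spacing Modifier Letters",    0x02B0, 0x02FF),
    ("Combining Diacritical Marks", 0x0300, 0x036F),
    ("Greek and Coptic",            0x0370, 0x03FF),
    ("Cyrillic",                    0x0400, 0x04FF),
    ("Latin Extended Additional",   0x1E00, 0x1EFF),
    ("Greek Extended",              0x1F00, 0x1FFF),
]


def block_size(first: int, last: int) -> int:
    """Number of codepoints in an inclusive range, assigned or not."""
    return last - first + 1


# Total for the assumed Latin/Greek/Cyrillic blocks.
lgc_total = sum(block_size(first, last) for _, first, last in LGC_BLOCKS)

# Assumed reading of "8192": everything from U+0000 through U+1FFF.
general_scripts_size = block_size(0x0000, 0x1FFF)

print(lgc_total)             # 1792
print(general_scripts_size)  # 8192
```

Under these assumptions the first total stays below 2000, and the
second is 0x2000, one eighth of the 65536 codepoints in the BMP.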

        -hpa
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt    <[EMAIL PROTECTED]>
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
