Kaixo!

On Fri, Jan 11, 2002 at 02:02:05PM +0000, Edmund GRIMLEY EVANS wrote:

> You've described the situation, but you haven't answered the question.
> 
> The obvious alternative would be to have 6 characters: upper and lower
> case versions of "ordinary I", "Turkish/Azeri dotted I" and
> "Turkish/Azeri dotless I".

OK, so, yes, those 6 characters have been unified into 4.
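In Unicode terms the four surviving characters are U+0049 I, U+0069 i, U+0130 İ and U+0131 ı, and the asymmetry shows up in the default (locale-independent) case mappings. A quick Python sketch:

```python
# The four Unicode "i" characters after unification:
#   U+0049 I (ordinary capital)     U+0069 i (ordinary small)
#   U+0130 İ (capital I with dot)   U+0131 ı (small dotless i)
for ch in "Ii\u0130\u0131":
    print(f"U+{ord(ch):04X} {ch}")

# The default case mappings are locale-independent, so:
print("i".upper())        # 'I' -- correct for English, wrong for Turkish
print("\u0131".upper())   # 'I' -- dotless ı uppercases to ordinary I
print("\u0130".lower())   # 'i' + U+0307 combining dot above (full mapping)
```

Note that only the "ordinary" pair round-trips without locale knowledge; İ and ı both fold into it.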
 
> It would be interesting to know whether this alternative is ever used,
> in some encoding, was ever considered for Unicode, etc.

I have never heard of any Turkish encoding with 3 kinds of "i" characters, only 2; and I have never heard of Turkish encodings incompatible with ASCII/EBCDIC.
So that unification was in fact done by the Turkish encodings themselves, and Unicode just had to pick up what already existed.

Now, it is indeed a situation very similar to the one discussed here
about CJK unification: Turkish readers expect "i" and "I" (the "ordinary" ones)
always to be rendered with and without a dot, respectively.
The other two have the dot explicitly added or removed, so they will always
look correct to Turkish readers; however, that is not the case for the
"ordinary" pair: for example, a font that uses small caps for lower case will
have an "i" (lowercase) that is actually dotless.
Of course, in such a case Turkish people would need a Turkish-specific font
for proper display.
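Locale-aware software therefore has to tailor the case mappings for Turkish before applying the default ones. A hypothetical minimal helper (the function names are my own, not any standard API) might look like:

```python
# Hypothetical helpers: Turkish-tailored case mappings, applied before
# the default Unicode mappings so the i/İ and ı/I pairs stay matched.
def turkish_upper(s: str) -> str:
    # i -> İ (U+0130) done by hand; ı -> I is already the default mapping
    return s.replace("i", "\u0130").upper()

def turkish_lower(s: str) -> str:
    # İ -> i done by hand (avoiding the default İ -> i + combining dot);
    # I -> ı (U+0131) done by hand before the default mapping runs
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()

print(turkish_upper("istanbul"))    # 'İSTANBUL'
print(turkish_lower("DİYARBAKIR"))  # 'diyarbakır'
```

Real libraries do this via locale tailoring (e.g. ICU with a Turkish locale); the sketch only shows why the tailoring is needed.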

Now, a mixed Turkish/English (or other language) text written in small caps
with a Turkish font will have the "i" of the English text look a bit different
from what one might expect, but still readable.
The other way around, with a non-Turkish font, the situation is worse, as the
Turkish text will not simply look different, but wrong.

The similarity with CJK unification holds for the first case (a mixed-language
text will have the portions written in other languages look perhaps a bit
different from what you would expect in a text written purely in those
languages, but still readable without problems).
For the second case I don't think a CJK analogue exists (a unified
CJK character that is displayed in a given font with the glyph corresponding
to what another font uses for a different codepoint).

To say a bit more about Japanese and Unicode:

1. for Japanese-only texts (or Japanese plus other non-CJK languages) there
   isn't any problem at all.
2. the problem arises for texts mixing Japanese with at least one of
   traditional Chinese, simplified Chinese, Korean hanja, or old Vietnamese
   characters.
   In the case of plain text, with only one font (or a fontset) usable, the
   choices are:
  - use the user's native font (a Japanese font) as the default, and have some
    (or maybe even all) characters of the foreign text rendered with the
    Japanese font;
  - use a foreign font as the default, and have some (or maybe all) of the
    Japanese characters rendered with it.
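The choice between those two options is essentially a per-codepoint font-fallback policy. A hypothetical sketch (the font names and coverage sets are invented for illustration, not real font data):

```python
# Hypothetical sketch of per-codepoint font fallback for plain text:
# try the user's preferred font first, then the others in order.
JAPANESE_FONT = {"name": "japanese", "covers": set("\u6f22\u5b57\u304b\u306a")}
CHINESE_FONT = {"name": "chinese", "covers": set("\u6f22\u5b57\u6c49")}

def pick_font(ch: str, preferred: list[dict]) -> str:
    for font in preferred:
        if ch in font["covers"]:
            return font["name"]
    return "missing-glyph"

# A Japanese user puts the Japanese font first: unified characters like
# 漢 (U+6F22) get Japanese glyphs, and only characters absent from the
# Japanese font, like simplified 汉 (U+6C49), fall back to the other font.
order = [JAPANESE_FONT, CHINESE_FONT]
print([pick_font(c, order) for c in "\u6f22\u6c49"])  # ['japanese', 'chinese']
```

Swapping the order of the list gives the second choice above; the policy, not the encoding, decides which glyphs you see.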

I don't think there is any readability problem in case 2, whatever the
choice of font (and it is up to the user to choose the one he likes
better): he will be able to read it; after all, he can read both
Japanese and one of those foreign languages, can't he?
If he couldn't read those languages, then the discussion about the glyphs
used to render them is pointless (and what's more: I'm inclined to believe
that in such a case the user would prefer to have all the text displayed in a
font familiar to him. For example, I don't read German, so in a mixed text I
would much prefer to have the German portions displayed in the same font
rather than in a Fraktur font).

The Fraktur thread is interesting indeed, as it is the key to the problem:
some people want to be able to have an English/Gaelic/German plain text
with the English portions in Arial, the Gaelic in some Celtic-style font, and
the German in Fraktur.

My opinion is that such font or font-style specifications don't belong
in plain text.

I also think that it is not always a good idea to do that (a text intermixing
Arial/Uncial/Fraktur will be very ugly, imho), so it should not be done by
default.

And for people for whom rendering and layout are so important that they want
to apply precise fonts to given portions of the text (depending on the
language, or on any other criterion): for them there are word processors,
markup languages for web pages, etc.
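That per-language font choice is exactly what markup is for; for instance, in HTML one can tag the language of each portion and let a stylesheet pick the font (the font names below are purely illustrative):

```html
<!-- Per-language fonts belong in markup plus styling, not in plain text -->
<style>
  p         { font-family: Arial, sans-serif; }
  :lang(ga) { font-family: "Some Uncial Font"; }  /* illustrative name */
  :lang(de) { font-family: "Some Fraktur Font"; } /* illustrative name */
</style>
<p lang="en">English text,
   <span lang="ga">téacs Gaeilge,</span>
   <span lang="de">deutscher Text.</span></p>
```

The plain text extracted from such a page carries none of this, which is the point: the styling lives in a layer above.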

Plain text must remain simple; its purpose is to be the simplest possible
format, the smallest common denominator for all cases; so most probably
people implementing plain-text interfaces (for GSM phones, for watches,
for coffee machines, etc.) will keep it simple and never implement support
for language tags and the like.

As for the Ogg sound file format, which started the thread, I think it would
be useless to introduce language tags in the standard definition.
They most probably won't be implemented by the majority of programs
or appliances; worse: as they would complicate the definition (it would no
longer be as simple as "the text is encoded in UTF-8" but would go on for
lines and lines explaining what to do with the tags, with examples, etc.),
some people will find it too complex and simply ignore the standard,
inventing their own solutions and so creating incompatibilities.
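The current Vorbis comment convention really is that simple: each comment is one "FIELD=value" string encoded in UTF-8. A minimal decoding sketch (the field content here is made up):

```python
# Minimal sketch: an Ogg Vorbis comment is just "FIELD=value" in UTF-8,
# split at the first '='. No language tags, no markup.
raw = "TITLE=\u00dc\u00e7 nokta".encode("utf-8")  # as read from the file

field, _, value = raw.decode("utf-8").partition("=")
print(field, value)  # TITLE Üç nokta
```

Anything beyond "decode as UTF-8, split at the first '='" would already be more specification than most implementors want to read.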



-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/            PGP Key available, key ID: 0x8F0E4975

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
