Re: Unicode: endpoint of evolution of encodings?

Danilo Segan Thu, 18 Nov 2004 16:47:55 -0800

Hi Pablo,

Today at 23:17, Pablo Saratxaga wrote:


> It is indeed a good feature to do so;
> but the *smallest* unit for which language information is useful
> are *words*, not characters/letters.

Indeed.  But how do you achieve that?  It's easiest to have characters
hold language information.  Or would you keep language markers
intertwined with the text itself?

>> So, "jota" would still make sense in Spanish, whatever it was
>> pronounced as, but not much sense in English (since it's not a word
>> there).  I think this is a good property to know.
>
> No, it is useless. The letter "j", alone, is the same letter on all
> languages using the latin script. There is absolutely no gain in
> creating differences based on language (plus, I know of no language
> where there is a word consisting of the single letter "j").

I disagree: there's a lot to be gained by creating differences based
on language, and I already gave examples of what could be gained.

On the topic of letters, is "Š" the same letter in all of Croatian,
English and Spanish (they all use Latin script, after all :)?

> "disambiguating" letters depending on the language is a very bad idea,
> because it destroys the interexchangeability of documents.

Uhm, how come?  Care to elaborate?  Why would using another (standard,
provided it becomes one) encoding destroy interchangeability of
documents?  It would destroy it as much as using UTF-16 instead of
UTF-8 would, or as much as using Unicode over ISO-8859-1 would: your
software would have to know how to interpret it and map between them.
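To make the comparison concrete, here is a minimal Python sketch (the
sample word is my own, picked only because it fits in ISO-8859-1).  The
same text gives three different byte sequences, yet software that knows
which encoding it is reading recovers the identical string each time:

```python
# Same text, three encodings: the bytes differ, the decoded text does not.
text = "café"  # illustrative sample; every letter exists in ISO-8859-1

utf8 = text.encode("utf-8")          # "é" takes two bytes
utf16 = text.encode("utf-16-le")     # every letter here takes two bytes
latin1 = text.encode("iso-8859-1")   # every letter takes one byte

# Decoding with the matching codec maps all three back to one string.
decoded = {
    utf8.decode("utf-8"),
    utf16.decode("utf-16-le"),
    latin1.decode("iso-8859-1"),
}
```

Interchange only breaks when the reader guesses the wrong codec, which
is a labelling problem rather than a property of any one encoding.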

> You have problems to do google searches in Serbian because a text
> can be in two different scripts;

I'm actually more concerned with the display and input problem,
rather than doing Google searches (I mentioned Google only to show
that people care about language more, yet Google is not able
to deduce such information correctly with the current state of
encodings).  I want to type "letters", and display it using any of
the scripts simply by changing a font.  I'm a native Serbian speaker,
and most native Serbian speakers tend to think of it as a display
property (you certainly know that, since I know you're well clued in
on Serbian problems :).
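As a rough illustration of "script as a display property", here is a
small Python sketch that renders Serbian Latin text in Cyrillic.  The
table and the function name to_cyrillic are my own, and the greedy
digraph rule is a simplification: a few words (e.g. "injekcija") have
"nj" as two separate letters and would need an exception list.

```python
# Sketch: Serbian Latin -> Cyrillic. Hypothetical helper, not a standard API.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}
SINGLE = {
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "đ": "ђ", "e": "е",
    "ž": "ж", "z": "з", "i": "и", "j": "ј", "k": "к", "l": "л", "m": "м",
    "n": "н", "o": "о", "p": "п", "r": "р", "s": "с", "t": "т", "ć": "ћ",
    "u": "у", "f": "ф", "h": "х", "c": "ц", "č": "ч", "š": "ш",
}

def to_cyrillic(text: str) -> str:
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in DIGRAPHS:
            # Greedy match: treat "lj", "nj", "dž" as one letter.
            cyr = DIGRAPHS[pair.lower()]
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = SINGLE.get(ch.lower())
        if cyr is None:
            out.append(ch)  # digits, punctuation, foreign letters pass through
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)
```

For example, to_cyrillic("Ljubav i konj") gives "Љубав и коњ".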

If Unicode fails for my language, how can you claim that it's
completely correct?  It's simply not: it might "mostly" work, but
that doesn't make it correct.  If your car exploded "only" every
100th time you started it, would you drive it at all?

> now with your idea of disambiguating letters it means that the same
> problem will exist for almost all languages (minus the very few ones
> using a unique script), it would even be worse, as a same English
> text, for example, could be encoded in dozens of different (eg: in
> English-letters, Spanish-letters, Portuguese-letters,
> French-letters, German-letters, Italian-letters, Indonesian-letters,
> Polish-letters, Irish-letters, Welsh-letters, Danish-letters,...).

Well, it's not up to an encoding to enforce its own correct usage.
After all, one can type text using the "small caps" region of the
Unicode standard, but how often does that happen?

The real issue is estimating how common such misuse would be.  I
believe most people input their text with correct language selection,
but this cannot be proved either right or wrong without doing a
real-world, large-scale experiment.

After all, it would still be trivial to dump language data from
characters and to map all data to a glyph repository such as Unicode
or AFII, depending on the application.  If the problem were that
visible, no search engine would have problems with it (but it could
still make use of language properties when the user explicitly asks
for it).
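A minimal Python sketch of what "dumping language data" could look
like.  No such encoding exists; the TaggedChar model, the language
codes and the helper names are all made up for illustration:

```python
# Sketch of a hypothetical language-tagged text model (not a real encoding).
from typing import List, NamedTuple, Set

class TaggedChar(NamedTuple):
    lang: str  # assumed language tag, e.g. "es" or "en"
    char: str  # one letter as it would appear in plain Unicode

def tag(text: str, lang: str) -> List[TaggedChar]:
    """Attach the same language tag to every character of text."""
    return [TaggedChar(lang, c) for c in text]

def strip_language(tagged: List[TaggedChar]) -> str:
    """Drop the language data, leaving an ordinary Unicode string."""
    return "".join(t.char for t in tagged)

def languages_used(tagged: List[TaggedChar]) -> Set[str]:
    """What a language-aware search could still ask about."""
    return {t.lang for t in tagged}

doc = tag("jota", "es") + tag(" ", "und") + tag("jot", "en")
plain = strip_language(doc)
```

Going from tagged text to plain text is a one-liner; going the other
way is the part that needs the information to have been kept.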

>> We must agree that these differences Unicode went after are
>> glyph-based, rather than character-based.
>
> They are character based.
> With a character defined as an atomic element of a script (there are of
> course a lot of exceptions due to historical reasons, but that is the
> basic idea).

Ok, read my "character" as "letter", if you use this definition of a
character.  So yes, Unicode is a collection of script symbols, which
you call characters, and I call glyphs :)

But, that was not the intention of Unicode.

FWIW, Unicode definition of a character[1] would allow (even prefer)
my interpretation as well:

> Character. (1) The smallest component of written language that has
> semantic value; refers to the abstract meaning and/or shape, rather
> than a specific shape (see also glyph), though in code tables some
> form of visual representation is essential for the reader's
> understanding. (2) Synonym for abstract character. (3) The basic
> unit of encoding for the Unicode character encoding. (4) The English
> name for the ideographic written elements of Chinese origin. (See
> ideograph (2).) 

There's no mention of script here (there is of "language"), and I'd
certainly consider "smallest component of written language that has
semantic value" a letter.  See also [2], where it is clearly pointed
out that a letter is closely tied to a character (i.e. character is
the encompassing/superset concept, not a different concept as you try
to put it).

Your view on "character" more closely resembles "grapheme" according
to [3] (and (2) in there explicitly states that users commonly think
of a grapheme as a character, but that's not what Unicode considers it
to be).

[1] http://www.unicode.org/glossary/#character
[2] http://www.unicode.org/glossary/#letter
[3] http://www.unicode.org/glossary/#grapheme
 
> So, unicode is a collection of *scripts*, each script is separate and
> independent of the others, and each script is a collection of characters

But that's completely false, and you know it: scripts are not
independent, except somewhat in their graphic/display properties!
Scripts commonly have mappings between them, depending on the
_language_ of use!  Think of Pin-Yin as a relation between otherwise
unrelated scripts.  Many languages are multi-script.  Are you saying
that digraphs (Ǆ, ǆ, Ǉ, ǉ, Ǌ, ǌ) are completely independent of
Serbian Cyrillic script?

> belonging to that script (there are some special characters, like
> generic punctuation and ascii digits, that can be used in conjunction
> with most scripts, but outside the shared punctuation characters, the
> different characters are exclusive to a given script, even if there are
> similarities in some cases with other characters of another script).

> The basic concept to encode writing is the script, that is so when
> electronically encoding text simply because that is so when writing
> text by hand or press.

But script also depends on the language, and that's my entire point.
You can claim that's not so as much as you wish, but there're many
differences between Serbian Cyrillic and Russian Cyrillic: they even
use completely different glyphs ("Ð" and "Ñ") for arguably the same
sound.

Writing text by hand and press matrices are not really good
examples: many Cyrillic or Greek (e.g. uppercase Greek in TeX)
characters can be produced using Latin forms (i.e. this is more an
example of glyph usage than of characters).  I.e. it proves nothing,
except that it doesn't prove anything :)

>> I say that "a" and "А" are same characters in Serbian,
>
> They are not.
> They may be the same *letter* in Serbian.
> But a letter is not a character (in Spanish, "ch" is a letter (yes, I'm
> a traditionalist), as well in Serbian "lj" and "nj" are letters;
> however the involved characters are "c", "h", "l", "n", "j".
> Note also how in cyrillic script "љ" and "њ" are single characters,
> note also that "Лј" and "Нј" are not single characters.

Ok, we've used different definitions of a "character".  If I accept
your definition, then you're correct.  What I meant is, of course,
that they're the same letter (using your definition).
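The letter/character split can be seen directly in Unicode data.  A
small Python sketch using the standard unicodedata module (the choice
of U+01C9 and U+0459 as samples is mine):

```python
# One letter, one or two characters: Latin "lj" vs. Cyrillic "љ".
import unicodedata

latin_digraph = "\u01C9"  # LATIN SMALL LETTER LJ, a single code point
cyrillic_lje = "\u0459"   # CYRILLIC SMALL LETTER LJE, a single code point

# Compatibility normalization splits the Latin digraph into "l" + "j".
decomposed = unicodedata.normalize("NFKC", latin_digraph)

# The Cyrillic letter has no decomposition and stays as it is.
cyrillic_normalized = unicodedata.normalize("NFKC", cyrillic_lje)
```

Here decomposed is the two-character string "lj", while
cyrillic_normalized is still the single character "љ".  Nothing in the
character data links the two, even though Serbian treats them as one
and the same letter.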

> You wrongly see latin and cyrillic variants of Serbian as simple
> differences in shape of the same characters; that is not so,
> you should instead look at it as two orthographic variants.

If characters are defined as script elements, then sure (after all,
I'm not so dumb as to claim that something defined as a script element
is independent of the script).  I was clearly talking about characters
as letters, or elements used to write down a language.

If, OTOH, characters are defined as "the smallest component of written
language that has semantic value; refers to the abstract meaning
and/or shape, rather than a specific shape" (from Unicode Glossary,
cited above), then I'm not wrong at all: "А"/"a" are both smallest
components of written Serbian that have the same semantic value, and
refer to the same abstract meaning, but not the same shape (ok, they're
coincidentally the same shapes as well; I could have used Д/d
instead).  I.e. they're one and the same character.

You're trying to pull a trick on me with your word usage :)
But this only points out all the problems with Unicode (i.e. a policy
such as "no more precomposed glyphs" means that some characters will
not be encoded, but that glyphs of those characters would be
attainable through composing mechanisms).  So, Unicode is a glyph
repository, no matter what tricks you try to pull :)

Finally, I'm not saying Unicode is fundamentally bad, only that there
could be something even better for encoding textual data.


Cheers,
Danilo

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
