Re: Unicode: endpoint of evolution of encodings?

Danilo Segan Fri, 19 Nov 2004 12:11:19 -0800

Today at 12:52, Pablo Saratxaga wrote:

>> encodings).  I want to type "letters", and display it using any of
>> the scripts simply by changing a font.  I'm native Serbian, and most
>> native Serbian speakers tend to think of it as a display property (you
>
> Do they?
> Non-native names are written differently in cyrillic and in latin;
> for example "Chirac" vs "Ширак", or do people actually write
> "Žak Širak"
> instead of "Jacques Chirac" when writing an article about French
> president in Serbian in latin script?


Yes, I actually never saw "Chirac" in any Serbian text.  You can see
such usage mostly on IRC, e-mails etc (but you'd also see things like
"btw", "lol" and similar there, which means it's not real Serbian :).
I have yet to see a single newspaper or TV which would write "Chirac"
supposedly in Serbian.  It's Croatian where that's the rule.

>> Ok, read my "character" as "letter", if you use this definition of a
>> character.  So yes, Unicode is a collection of script symbols, which
>> you call characters, and I call glyphs :)
>
> No, a character is an abstract entity.
> A glyph is not abstract, it is the visual representation of a character
> (or group of characters) with given font, style, weight, properties.
>
> A "letter" is a language-related abstract concept
> A "character" is a script-related abstract concept
> A "glyph" is related both to language and script and also to a given font.

Please stop insisting on your definitions.  No definition is
"absolutely correct", we can define words as we please, and use them
appropriately.  I already said I understood your definitions, and
indicated that misunderstandings are arising from our different
definitions.

When we talk about Unicode, we probably ought to use their
definitions, if for nothing else, to make it easier to communicate.

>> If characters are defined as script elements, then sure (after all,
>
> That is what it is imho.

See above.

>> is independent of the script).  I was clearly talking about characters
>> as letters, or elements used to write down a language.
>> 
>> If, OTOH, characters are defined as "the smallest component of written
>> language that has semantic value; refers to the abstract meaning
>> and/or shape, rather than a specific shape" (from Unicode Glossary,
>> cited above), then I'm not wrong at all: "а"/"a" both are smallest
>> components of written Serbian that have same semantic value,
>
> Yes, but unicode is not about Serbian only; so you cannot interpret
> that definition with such a narrow view.

Of course it's not about Serbian only, it's about any language.  It's
about Azeri (remember, 3 scripts have been used for it during 20th
century), and any other *language*.

Btw, my view is definitely not at all "narrow", it's in full
compliance with the definition, specifically suitable for it.  You're
again trying to pull out some tricks.  Please stop with that.

> Also, note how the semantic value is not about "language", but
> about "*written* language", that implies script, imho.

Now, you should have paid attention to the entire glossary.  See "script"
and "writing system".  "Written language" is not defined, so it should
be treated as two separate words (one modifies the meaning of the
other).  Thus, "written language" is more likely to relate to
"writing system" ("A set of rules for using one or *more* scripts
to write a particular language.")  Note the usage of "write a
particular language": by using a "writing system", we get
"written language".  It is specifically indicated that 

>> refer to same abstract meaning, but not the same shape (ok, they're 
>> coincidentally the same shapes as well; I could have used д/d
>> instead).  I.e. they're the one and single character.
>
> But if you widen your interpretation to include another single
> language (eg Russian, or English, or whatever) it won't work anymore;
> particularly "л" is *not* the same abstract meaning as "l" in English.

Indeed, that's the entire point.  We *need* language information to be
able to know what any single character represents (you know, that
"abstract meaning" blah blah and stuff).  There're many more
languages which are able to use multiple scripts.  I talk about
Serbian simply because I'm most familiar with it.
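
To make the point concrete, here is a minimal Python sketch (standard
library only) showing that the letters discussed above are separate code
points whose names carry the script, not the language.  The helper name
script_of is made up for this illustration, and reading the script off
the first word of the character name is just a convenient shortcut, not
an official API.

```python
import unicodedata

def script_of(ch):
    # Shortcut for illustration: take the first word of the character name.
    return unicodedata.name(ch).split()[0]

# Cyrillic а / Latin a, Cyrillic л / Latin l
for ch in ("\u0430", "a", "\u043B", "l"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  ({script_of(ch)})")
```

Nothing in the code point itself says whether U+043B is Serbian or
Russian; that information has to come from somewhere else.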

>> attainable through composing mechanisms).  So, Unicode is a glyph
>> repository, no matter what tricks you try to pull out :)
>
> I would accept it is a collection of graphemes (plus a few combining
> and modification characters), but not a glyph collection.

You don't give up, do you? :)

If we use Unicode definition of a grapheme:

   "A minimally distinctive unit of writing in the context of a
   particular writing system"

and if Unicode was really a collection of "graphemes", there would be
no rule "no more precomposed characters in Unicode", since that's what
many precomposed characters are!

If precomposed characters are not distinctive units of their
respective writing systems, then "i" is not a distinctive unit either
(it can be composed out of "combining dot above" and "dotless i" in
Unicode).
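
A small Python sketch of how precomposed characters relate to their
composed forms, using the standard unicodedata module.  One caveat: as
far as the normalization data goes, "dotless i" plus "combining dot
above" is a composition for display only; normalization does not turn
that sequence into a plain "i".  The helper name codepoints is made up
for this example.

```python
import unicodedata

def codepoints(s):
    # Render a string as a list of U+XXXX code points.
    return " ".join(f"U+{ord(c):04X}" for c in s)

precomposed = "\u010D"                      # LATIN SMALL LETTER C WITH CARON
decomposed = unicodedata.normalize("NFD", precomposed)
recomposed = unicodedata.normalize("NFC", decomposed)
print(codepoints(precomposed), "->", codepoints(decomposed))

# Dotless i followed by combining dot above: NFC leaves it as two code points.
dotless_sequence = "\u0131\u0307"
nfc_sequence = unicodedata.normalize("NFC", dotless_sequence)
print(codepoints(dotless_sequence), "->", codepoints(nfc_sequence))
```

So a precomposed letter round-trips through its base letter plus
combining mark, while "i" itself has no decomposition in the data.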

Hey, I'd love it if you were right, and Unicode really was a
"collection of graphemes", but it's simply not true.  I don't see why
you're trying to defend Unicode so much.  It has some useful and good
properties, but it's not The Perfection itself.

> But note also that several of the encoded characters are there for
> compatibility, and should not be used (that is the case of the latin
> digraphs "lj" etc, you should not use them, you should use "l" and "j"
> separately.

Yeah, I know that Unicode is a mess because of compatibility (I
understand that this mess is what actually got it used in practice,
not any of its magnificent properties, of which there are plenty),
and that's not what I'm complaining about.
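
For reference, a short Python sketch of what "compatibility" means for
the Latin digraph characters: compatibility normalization (NFKC) expands
each of them into two ordinary letters.  The helper name expand is
invented for this example.

```python
import unicodedata

# The three casing forms of the "lj" compatibility digraph.
DIGRAPHS = ["\u01C7", "\u01C8", "\u01C9"]   # LJ, Lj, lj as single code points

def expand(text):
    # NFKC applies compatibility decompositions, splitting the digraphs.
    return unicodedata.normalize("NFKC", text)

for ch in DIGRAPHS:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {expand(ch)!r}")
```

Note that the digraph exists in three casing forms as separate code
points, which is the same lowercase, initial-uppercase, all-uppercase
split that comes up for casing below.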

It's just that I recognize that characters have more properties which
could be "encoded".

> Ah, and note that, in case of the digraphs, there is not any single
> casing pair with cyrillic; proper casing of "lj" depends on context
> (in cyrillic you only have a lowercase and an uppercase: љ/Љ, while in
> latin you have *three* possibilities: lj/Lj/LJ;
> in cyrillic you can have љанка Љанка ЉАНКА, in latin it is
> ljanka Ljanka LJANKA;
> there is no one-to-one matching, but two-to-three, so, you
> cannot achieve your goal of "encoding of Serbian independent of cyrillic
> or latin display" either, unless you encode three casing states for
> each letter: lowercase, initial-uppercase, all-uppercase.

This can be trivially solved with ligatures, which are supported on
any sufficiently modern system even today (i.e. if Љ is to the left or
to the right of any other uppercase letter, use another form), not to
mention ancient software such as TeX :)
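
The adjacency rule can be sketched at the text level too.  The Python
below is only an illustration, not how a font ligature table works: it
maps Serbian Cyrillic to Latin and picks "Lj" or "LJ" for an uppercase
digraph letter depending on whether a neighbouring character is
uppercase.  The names CYR, LAT, LOWER and to_latin are made up for this
sketch, and a digraph letter standing alone comes out in the "Lj" form,
which is an arbitrary choice.

```python
# Serbian Cyrillic alphabet and its Latin counterparts, in the same order.
CYR = "абвгдђежзијклљмнњопрстћуфхцчџш"
LAT = ["a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i", "j", "k", "l", "lj",
       "m", "n", "nj", "o", "p", "r", "s", "t", "ć", "u", "f", "h", "c", "č",
       "dž", "š"]
LOWER = dict(zip(CYR, LAT))

def to_latin(text):
    out = []
    for i, ch in enumerate(text):
        low = ch.lower()
        if low not in LOWER:
            out.append(ch)                  # not a Serbian Cyrillic letter
            continue
        lat = LOWER[low]
        if ch == low:
            out.append(lat)                 # lowercase stays lowercase
        elif len(lat) == 1:
            out.append(lat.upper())         # simple one-to-one uppercase
        else:
            # Digraph: all-caps if a neighbour is uppercase, else initial cap.
            prev_up = i > 0 and text[i - 1].isupper()
            next_up = i + 1 < len(text) and text[i + 1].isupper()
            out.append(lat.upper() if (prev_up or next_up) else lat.capitalize())
    return "".join(out)

print(to_latin("љанка Љанка ЉАНКА"))
```

The two Cyrillic casing states plus one bit of context are enough to
recover the three Latin forms in these examples.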

What I mean to say is that I can even today choose a different font
which will display Serbian in Latin script correctly, even though it
is encoded with Unicode Cyrillic region.  So, it's a display property.

Cheers,
Danilo

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
