Hi,

Please don't use HTML mail; I have problems reading it, and it messes up the encoding for me (since I have to resort to a sort of "view source").
Today at 1:31, srintuar wrote:

> This may be more of a practical issue: for some scripts such as Korean,
> representing every possible character and partial character could
> require a very large amount of codespace. We only have the precomposed
> characters now for compatibility with platforms that simply don't support
> composition whatsoever (all too common still, sadly).

I understand that, and with the evolution of computers and encodings, I hope that will change -- that's the entire point :)

> For example: do these both work under your mailreader?

No, Emacs (Gnus) doesn't have "combining" features, so it just lists the characters one after another. Pango (I use Gnome for everything else) should do it at least a bit better, but the results will probably not be the same.

> When you have multilingual documents you can more easily see why
> that is impractical. There is no easy way for a piece of software to
> know that some words are Spanish and some are English. If the two
> languages had no overlapping codepoints whatsoever you could very
> easily end up with English text encoded with Spanish codepoints
> and vice versa.

Yes, I'm aware of the practical problems of inputting text properly. But most text would be input with such information already known; i.e. Spanish natives probably have a "Spanish keyboard layout" active, just like I have a "Serbian keyboard layout" active when I type Serbian. That means this would cause problems in two cases:

- already existing text
- text input by non-natives whose language uses the same script

I'd argue that the first problem would be the more common one (just like we now have problems switching from ISO-8859-* and other 8-bit encodings to UTF-8). Still, that doesn't mean it's impossible (using UTF-8 would have been considered highly impractical 15 years ago), only that it simply doesn't work right now.
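To make the precomposed-vs-composed point concrete, here is a quick Python sketch using the stdlib unicodedata module: the precomposed Hangul syllable HAN is a single codepoint, while its NFD form spells out the individual conjoining jamo, and NFC composes them back.

```python
import unicodedata

# Precomposed Hangul syllable HAN (U+D55C): one codepoint.
pre = "\ud55c"

# NFD decomposes it into its three conjoining jamo
# (choseong, jungseong, jongseong).
dec = unicodedata.normalize("NFD", pre)

print(len(pre))   # 1 codepoint precomposed
print(len(dec))   # 3 codepoints decomposed
print(unicodedata.normalize("NFC", dec) == pre)  # True: round-trips
```

This is exactly why the codespace question matters: Korean needs thousands of precomposed syllables, but only a couple hundred jamo.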
> Even characters which look different in different scripts but are
> logically identical get unified, so unicode right now is diametrically
> opposed to the position you are describing, and for good reason.

Unicode already has a bunch of "equivalents" (look e.g. at the digraphs "LJ", "Lj", ...). This is the same thing, only to a bigger extent.

> Certainly the character is used differently. However, I would assert
> that it is indeed the same character. Both English and Spanish
> use latin script.

Well, it depends on how we define "character". If it's a symbol used to write down *speech*, then it is not the same character, since the two indicate different speech patterns. If it's a glyph of some script, then it is the same character. As I said, coming from a multi-script society, I lean more toward thinking of the script as a display property, not a property of the character.

FWIW, I'm perfectly aware that Serbian can be written down using IPA, Chinese glyphs (I guess; I don't know anything about them), the Arabic alphabet, etc. After all, there are countries which switched between Arabic, Cyrillic and Latin script within 50-100 years (Azerbaijan). Is there any reason for text written in Arabic script 70 years ago to be unreadable to today's people, simply because of a script change?

> This I think will never happen: codepoints that carry language
> information are no longer codepoints. Remember: characters are not
> only used for language, they can also be map symbols, mathematical
> operators, fancy shapes, etc.

Heh, but that's exactly the opposite. Have you never heard of "mathematical language" or the "language of mathematics"? That's what mathematical symbols represent. TeX is a good example of this: it uses completely different code points (look up the "family" concept) for "a" in mathematics and for "a" in regular text; of course, due to its heritage, it is stream oriented (you enter "math mode" and leave "math mode", but you type the same ASCII "a").
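Incidentally, Unicode itself has something close to TeX's families: the Mathematical Alphanumeric Symbols block gives the math-italic "a" a codepoint of its own, distinct from ASCII "a", with only a compatibility mapping back to the plain letter. A small Python sketch:

```python
import unicodedata

math_a = "\U0001D44E"  # MATHEMATICAL ITALIC SMALL A
text_a = "a"           # LATIN SMALL LETTER A

print(math_a == text_a)                 # False: distinct codepoints
print(unicodedata.name(math_a))         # MATHEMATICAL ITALIC SMALL A
# NFKC compatibility normalization folds the math letter
# back into the plain ASCII one.
print(unicodedata.normalize("NFKC", math_a) == text_a)  # True
```

So even within Unicode, "the same letter used for a different purpose" can get its own codepoints, related to the base letter only through normalization.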
Same with map symbols (they represent the language of map makers, or whoever), etc.

> Also, imagine the chaos for OCR programs: you'd have to tell them
> ahead of time which language they are supposed to read in.

Yeah, that's a tough thing to do :) Until a while ago, we had to set the encoding manually for anything that wasn't ISO-8859-1 (Latin-1). With OCR, there is so much else to do that setting a language "for all following pages" is trivial. I'm certain nobody would mind that (and most people would rarely need to change it, since people tend to use one or two languages).

> Also, instead of latin->cyrillic converters you have a proliferation of
> English->French, English->British English, Spanish->Italian
> converters instead. (overall a much worse place to be)

Why do you think so? It'd be possible to have two kinds of converters: visual converters (glyph-by-glyph, e.g. "j" maps to "j"), and phonetic converters (e.g. Spanish "j" maps to "h" or whatever). The first would be trivial (even simpler than current latin-to-cyrillic converters), the other less so (but still possible). At the very least, readers would know which language they'd need in order to correctly read out the text they're viewing (if provided with enough visual clues).

> I do agree that it is merely a first attempt at an Über-encoding,
> however I have yet to hear of any way that it could be fundamentally
> improved upon.

I hope that you, besides all the practical issues as of now, agree that my suggestion has some good properties for an encoding. I'm not saying that it is what the next best encoding should be, but that it's easy to come up with encodings which have some desirable properties.

> Perhaps eliminating all precomposed glyphs would be one such
> improvement, but unicode already supports NFD, so it is already
> possible to use it as such.
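To make the "visual" (glyph-by-glyph) converter concrete, here is a minimal Python sketch; the mapping table is deliberately partial and only illustrative (a real Serbian latin->cyrillic converter needs the full alphabet and proper case handling), but it shows that even the "trivial" case has to watch out for digraphs like "lj" and "nj":

```python
# Partial, lowercase-only Serbian latin -> cyrillic table (a sketch).
DIGRAPHS = {"lj": "љ", "nj": "њ"}
SINGLES = {"a": "а", "b": "б", "v": "в", "g": "г", "d": "д",
           "e": "е", "z": "з", "i": "и", "j": "ј", "k": "к",
           "l": "л", "m": "м", "n": "н", "o": "о", "p": "п",
           "r": "р", "s": "с", "t": "т", "u": "у"}

def to_cyrillic(text):
    # Greedily match digraphs first, then single letters;
    # anything unknown passes through unchanged.
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2].lower()
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            out.append(SINGLES.get(text[i].lower(), text[i]))
            i += 1
    return "".join(out)

print(to_cyrillic("ljubav"))  # љубав
```

The greedy digraph matching is naive (e.g. loanwords where "nj" is really n+j would convert wrongly), which is exactly why even this "trivial" direction benefits from language information in the text.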
Ok, so we finally agree that Unicode/UTF-8 is probably not the end point in the evolution of encodings, even though it seems to support anything people may come up with right now :)

Cheers,
Danilo

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/