Hi,

Please don't use HTML mail; I have problems reading it, and it messes up the encoding for me (since I have to resort to a sort of "view source").
Today at 1:31, srintuar wrote:

> This may be more of a practical issue: for some scripts such as Korean,
> representing every possible character and partial character could
> require a very large amount of codespace. We only have the precomposed
> characters now for compatibility with platforms that simply don't support
> composition whatsoever (all too common still, sadly).

I understand that, and with the evolution of computers and encodings, I hope that will change -- that's the entire point :)

> For example: do these both work under your mailreader?

No, Emacs (Gnus) doesn't have "combining" features, so it just lists the characters one after another. Pango (I use Gnome for everything else) should do it at least a bit better, but the results will probably not be the same.

> When you have multilingual documents you can more easily see why
> that is impractical. There is no easy way for a piece of software to
> know that some words are Spanish and some are English. If the two
> languages had no overlapping codepoints whatsoever you could very
> easily end up with English text encoded with Spanish codepoints
> and vice versa.

Yes, I'm aware of the practical problems of inputting text properly. But most text would be input with such information already known; i.e. Spanish natives probably have a "Spanish keyboard layout" active, just like I have a "Serbian keyboard layout" active when I type Serbian. That means this would cause problems in two cases:

- already existing text
- text input by non-natives whose language uses the same script

I'd argue that the first problem would be the more common one (just like we now have problems switching from ISO-8859-* and other 8-bit encodings to UTF-8). Still, that doesn't mean it's impossible (using UTF-8 would have been considered highly impractical 15 years ago), only that it simply doesn't work right now.
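To make the precomposed-vs-composed point concrete, here is a quick Python sketch using the stdlib unicodedata module: the precomposed Hangul syllable HAN is a single codepoint, while its NFD form spells out the individual conjoining jamo, and NFC composes them back.

```python
import unicodedata

# Precomposed Hangul syllable HAN (U+D55C): one codepoint.
pre = "\ud55c"

# NFD decomposes it into its three conjoining jamo
# (choseong, jungseong, jongseong).
dec = unicodedata.normalize("NFD", pre)

print(len(pre))   # 1 codepoint precomposed
print(len(dec))   # 3 codepoints decomposed
print(unicodedata.normalize("NFC", dec) == pre)  # True: round-trips
```

This is exactly why the codespace question matters: Korean needs thousands of precomposed syllables, but only a couple hundred jamo.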
> Even characters which look different in different scripts but are
> logically identical get unified, so unicode right now is diametrically
> opposed to the position you are describing, and for good reason.

Unicode already has a bunch of "equivalents" (look e.g. at the digraphs "LJ", "Lj", ...). This is the same thing, only to a bigger extent.

> Certainly the character is used differently. However, I would assert
> that it is indeed the same character. Both English and Spanish
> use latin script.

Well, it depends on how we define "character". If it's a symbol used to write down *speech*, then it is not the same character, since the two indicate different speech patterns. If it's a glyph of some script, then it is the same character. As I said, coming from a multi-script society, I lean more toward thinking of the script as a display property, not a property of the character.

FWIW, I'm perfectly aware that Serbian can be written down using IPA, Chinese glyphs (I guess; I don't know anything about them), the Arabic alphabet, etc. After all, there are countries which switched between Arabic, Cyrillic and Latin script within 50-100 years (Azerbaijan). Is there any reason for text written in Arabic script 70 years ago to be unreadable to today's people, simply because of a script change?

> This I think will never happen: codepoints that carry language
> information are no longer codepoints. Remember: characters are not
> only used for language, they can also be map symbols, mathematical
> operators, fancy shapes, etc.

Heh, but that's exactly the opposite. Have you never heard of "mathematical language" or the "language of mathematics"? That's what mathematical symbols represent. TeX is a good example of this: it uses completely different code points (look up the "family" concept) for "a" in mathematics and for "a" in regular text; of course, due to its heritage, it is stream oriented (you enter "math mode" and leave "math mode", but you type the same ASCII "a").
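Incidentally, Unicode itself has something close to TeX's families: the Mathematical Alphanumeric Symbols block gives the math-italic "a" a codepoint of its own, distinct from ASCII "a", with only a compatibility mapping back to the plain letter. A small Python sketch:

```python
import unicodedata

math_a = "\U0001D44E"  # MATHEMATICAL ITALIC SMALL A
text_a = "a"           # LATIN SMALL LETTER A

print(math_a == text_a)                 # False: distinct codepoints
print(unicodedata.name(math_a))         # MATHEMATICAL ITALIC SMALL A
# NFKC compatibility normalization folds the math letter
# back into the plain ASCII one.
print(unicodedata.normalize("NFKC", math_a) == text_a)  # True
```

So even within Unicode, "the same letter used for a different purpose" can get its own codepoints, related to the base letter only through normalization.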
Same with map symbols (they represent the language of map makers, or whoever), etc.

> Also, imagine the chaos for OCR programs: you'd have to tell them
> ahead of time which language they are supposed to read in.

Yeah, that's a tough thing to do :) Until a while ago, we had to set the encoding manually for anything that wasn't ISO-8859-1 (Latin-1). With OCR, there is so much else to do that setting a language "for all following pages" is trivial. I'm certain nobody would mind that (and most people would rarely need to change it, since people tend to use one or two languages).

> Also, instead of latin->cyrillic converters you have a proliferation of
> English->French, English->British English, Spanish->Italian
> converters instead. (overall a much worse place to be)

Why do you think so? It'd be possible to have two kinds of converters: visual converters (glyph-by-glyph, e.g. "j" maps to "j"), and phonetic converters (e.g. Spanish "j" maps to "h" or whatever). The first would be trivial (even simpler than current latin-to-cyrillic converters), the other less so (but still possible). At the very least, readers would know which language they'd need in order to correctly read out the text they're viewing (if provided with enough visual clues).

> I do agree that it is merely a first attempt at an Über-encoding,
> however I have yet to hear of any way that it could be fundamentally
> improved upon.

I hope that you, besides all the practical issues as of now, agree that my suggestion has some good properties for an encoding. I'm not saying that it is what the next best encoding should be, but that it's easy to come up with encodings which have some desirable properties.

> Perhaps eliminating all precomposed glyphs would be one such
> improvement, but unicode already supports NFD, so it is already
> possible to use it as such.
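To make the "visual" (glyph-by-glyph) converter concrete, here is a minimal Python sketch; the mapping table is deliberately partial and only illustrative (a real Serbian latin->cyrillic converter needs the full alphabet and proper case handling), but it shows that even the "trivial" case has to watch out for digraphs like "lj" and "nj":

```python
# Partial, lowercase-only Serbian latin -> cyrillic table (a sketch).
DIGRAPHS = {"lj": "љ", "nj": "њ"}
SINGLES = {"a": "а", "b": "б", "v": "в", "g": "г", "d": "д",
           "e": "е", "z": "з", "i": "и", "j": "ј", "k": "к",
           "l": "л", "m": "м", "n": "н", "o": "о", "p": "п",
           "r": "р", "s": "с", "t": "т", "u": "у"}

def to_cyrillic(text):
    # Greedily match digraphs first, then single letters;
    # anything unknown passes through unchanged.
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2].lower()
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            out.append(SINGLES.get(text[i].lower(), text[i]))
            i += 1
    return "".join(out)

print(to_cyrillic("ljubav"))  # љубав
```

The greedy digraph matching is naive (e.g. loanwords where "nj" is really n+j would convert wrongly), which is exactly why even this "trivial" direction benefits from language information in the text.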
Ok, so we finally agree that Unicode/UTF-8 is probably not the end point in the evolution of encodings, even though it seems to support anything people may come up with right now :)

Cheers,
Danilo

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/