Danilo Segan wrote:

New policies such as "no more precomposed glyphs" also indicate that we're talking about a glyph repository, not a character repository (i.e. "no more precomposed glyphs, since you can get those glyphs by combining existing glyphs", even though they may have an entirely different meaning and be separate characters in their own right).

This may be more of a practical issue: for some scripts, such as Korean, representing every possible character and partial character could require a very large amount of codespace. We only have the precomposed characters now for compatibility with platforms that simply don't support composition whatsoever (all too common still, sadly).

For example, do these both work under your mail reader?

NFC: Tiếng Việt
NFD: Tiếng Việt

(For me, under Mozilla Mail, the second one looks slightly different, which means it's not working perfectly.)
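To make the NFC/NFD comparison above concrete, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `code_points` is made up for illustration, and the Korean syllable is just one sample chosen to show the same decomposition for Hangul.

```python
import unicodedata

def code_points(text: str) -> list[str]:
    """Return the code points of text in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

# "Tieng Viet" with its tone marks, written with escapes so the source
# file itself does not depend on how an editor normalizes it.
precomposed = "Ti\u1EBFng Vi\u1EC7t"

nfc = unicodedata.normalize("NFC", precomposed)  # precomposed letters
nfd = unicodedata.normalize("NFD", precomposed)  # base letters + combining marks

print(len(nfc), code_points(nfc))
print(len(nfd), code_points(nfd))

# Both forms are canonically equivalent: recomposing NFD gives NFC back.
print(unicodedata.normalize("NFC", nfd) == nfc)

# One Hangul syllable decomposes into its individual jamo under NFD.
hangul = "\uD55C"
print(code_points(unicodedata.normalize("NFD", hangul)))
```

The two strings compare unequal code point by code point, which is why software that does not normalize or compose can render or match them differently.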
When you have multilingual documents you can more easily see why that is impractical. There is no easy way for a piece of software to know that some words are Spanish and some are English. If the two languages had no overlapping codepoints whatsoever, you could very easily end up with English text encoded with Spanish codepoints and vice versa. Even characters which look different in different scripts but are logically identical get unified, so Unicode right now is diametrically opposed to the position you are describing, and for good reason.

Unicode, as it is, is closer to a common glyph repository (AFII anyone?) than a character repository (OK, backwards compatibility is also responsible for this, because of things like English ligatures, etc).

FWIW, I'd assert that "j" in Spanish is not the same thing as "j" in English (and that one is easily proved), apart from them being represented with the same *glyph*.

Certainly the character is used differently. However, I would assert that it is indeed the same character. Both English and Spanish use the Latin script.

Of course, with Unicode, it's current practice to add language marks in a text stream instead (e.g. with XML using 'xml:lang' tagging, or global metadata such as MIME header fields), but that defeats one big advantage of Unicode: you can read it with a stateless machine, from any point in the stream (a big win for network applications). This holds even for a transformation format such as UTF-8 (Unicode itself is no big deal without UTF-8, IMHO).

Ideal case, IMO, would be a character encoding standard where one can deduce all properties of a character from itself (i.e. when my parser runs across "б" it will know that it is a "lowercase b in Serbian" or "lowercase b in Russian").

This I think will never happen: codepoints that carry language information are no longer codepoints. Remember: characters are not only used for language; they can also be map symbols, mathematical operators, fancy shapes, etc.
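The remark above about reading a stream statelessly from any point can be illustrated for UTF-8 specifically. This is only a sketch: the function name `resync` and the sample string are assumptions for the example. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so a reader dropped into the middle of a stream can skip them to find the next character boundary.

```python
def resync(data: bytes, offset: int) -> int:
    """Return the first index >= offset that is not a UTF-8 continuation byte.

    Continuation bytes look like 0b10xxxxxx; lead bytes and ASCII never do,
    so no state from earlier in the stream is needed.
    """
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset

# Sample text; the two accented letters take three bytes each in UTF-8.
stream = "Ti\u1EBFng Vi\u1EC7t".encode("utf-8")

# Offset 3 falls in the middle of the first three-byte sequence.
start = resync(stream, 3)
tail = stream[start:].decode("utf-8")
print(start, tail)
```

At most one partial character is lost at the point of entry; everything after the next boundary decodes normally.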
Also, imagine the chaos for OCR programs: you'd have to tell them ahead of time which language they are supposed to read in. Also, instead of Latin ↔ Cyrillic converters you would have a proliferation of English ↔ French, English ↔ British English, and Spanish ↔ Italian converters. (Overall a much worse place to be.)

I do agree that it is merely a first attempt at an Über-encoding, however I don't get what this has to do with Unicode/UTF-8 being The Ultimate among encodings.

It's only the best we've got so far. I have yet to hear of any way that it could be fundamentally improved upon. Perhaps eliminating all precomposed glyphs would be one such improvement, but Unicode already supports NFD, so it is already possible to use it as such.

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
