I don't see how the roughly one million available code points in the Unicode Standard could possibly accommodate a grammatically accurate vocabulary of all the world's languages.
You have misread the message from Tim: he wanted to use "code points" above U+10FFFF, within the full 32-bit space (meaning more than 4 billion code points, when Unicode and ISO/IEC 10646 only allow about 1.1 million...)
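To make the numbers concrete, here is a small sketch (it just restates the limits above; nothing else is assumed). Python's `chr()` enforces the U+10FFFF ceiling:

```python
# Unicode and ISO/IEC 10646 stop at U+10FFFF; a full 32-bit space would not.
print(0x110000)             # 1114112 code points actually available
print(2 ** 32)              # 4294967296 values in a 32-bit space

print(len(chr(0x10FFFF)))   # the highest valid code point is accepted...
try:
    chr(0x110000)           # ...but one past the ceiling is rejected
except ValueError as err:
    print("rejected:", err)
```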
He wanted to use that to encode whole words as single code points, as a possible compression scheme. But he forgets that a word's component letters can be affected by style or by the rendering process.
Also a "font" or renderer would be unable to draw the text without having the equivalent of an indexed dictionnary of all words on the planet!
If compression is the goal, he forgets that the space gained by such a scheme would be very modest compared to generic data compressors like deflate or bzip2, which can compress the same texts more efficiently without even needing such a large dictionary (a dictionary in perpetual evolution, driven by every speaker of every language, without any prior standard agreement anywhere!).
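A quick sanity check of that claim (a toy sketch: the sample text and the 3-bytes-per-word cost are my assumptions, not anyone's real proposal):

```python
import bz2
import zlib

# Deliberately repetitive sample; real prose compresses less, but the point stands.
text = ("the quick brown fox jumps over the lazy dog " * 50).encode()

# Hypothetical word-as-code-point scheme: say 3 bytes per word, which
# already presumes both ends share an enormous, ever-changing dictionary.
word_coded = 3 * len(text.split())

print("original bytes:   ", len(text))
print("3-byte word codes:", word_coded)
print("deflate (zlib):   ", len(zlib.compress(text)))
print("bzip2:            ", len(bz2.compress(text)))
# The generic compressors beat the word scheme with no shared dictionary at all.
```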
Forget his idea; it is technically impossible to do. At best you could create protocols that compact some widely used words (this is what WAP does for widely used HTML elements and attributes), but that is still not a standard outside of that limited context.
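In that spirit, here is a minimal sketch of a WAP-style token table (the tag set and byte values are made up for illustration; WAP's binary XML format defines its own tables):

```python
# Protocol-local token table: meaningful only between parties that agree on it.
TOKENS = {"html": 0x01, "head": 0x02, "body": 0x03, "p": 0x04}
NAMES = {code: name for name, code in TOKENS.items()}

def compact(tags):
    """Replace each known tag name with its one-byte token."""
    return bytes(TOKENS[tag] for tag in tags)

def expand(data):
    """Recover the tag names from the token bytes."""
    return [NAMES[byte] for byte in data]

packed = compact(["html", "head", "body", "p"])
print(packed.hex())     # 01020304
print(expand(packed))   # the original tag names come back unchanged
```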
Suppose that Unicode encoded the common English words "the", "an", "is", etc. A protocol could then decide that these words are unimportant and filter them out. What happens when the same "words" appear in non-English languages where they are semantically significant? Those words would go missing. To work around this problem, each code point would have to designate the word as used in one language and not the others, so "an" would get different codes depending on whether it is used in English or in another language.
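The ambiguity is easy to show (a sketch: the code-point values below are invented for illustration): "an" is an English article, but it is also a German preposition meaning "at/on", so one code per spelling cannot work.

```python
# A single code per spelling collides across languages, so the hypothetical
# scheme is forced into (language, word) pairs; the values here are invented.
word_codes = {
    ("en", "an"): 0x40000,   # English indefinite article
    ("de", "an"): 0x40001,   # German preposition "an" ("at/on")
}

assert word_codes[("en", "an")] != word_codes[("de", "an")]
print("distinct codes for the same spelling:", len(word_codes))
```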
The last problem is that too many languages do not have well-established, computerized lexical dictionaries, and the grammatical rules for composing words are not always known. Nor can the number of words in a single language be bounded by a known maximum (German is a good example: its compound words are virtually unlimited!)
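That open-endedness is easy to demonstrate (an illustrative sketch; the stems and the naive joining rule are gross simplifications of real German morphology):

```python
import itertools

# Three noun stems already yield 120 compound candidates of up to 4 parts;
# the candidate set grows exponentially, so no finite dictionary closes it.
stems = ["Haus", "Schlüssel", "Bund"]   # house, key, bundle

def compounds(stems, max_parts):
    for n in range(1, max_parts + 1):
        for combo in itertools.product(stems, repeat=n):
            # Interior parts of a German compound are written lowercase.
            yield combo[0] + "".join(part.lower() for part in combo[1:])

candidates = list(compounds(stems, 4))
print(len(candidates))                      # 3 + 9 + 27 + 81 = 120
print("Hausschlüsselbund" in candidates)    # a plausible real compound
```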
So forget this idea: Unicode will not create a standard to encode words. Words will be represented by modeling them onto a script system made of simpler sets of "letters" or "ideographs", plus punctuation and diacritics. Representing words with those letters is an orthographic system, specific to each language, that Unicode will not standardize.

