So here is the idea: why not use the unused part (2^31 - 2^21 =
2,145,386,496) to encode all the words of all the languages as well. You
could then send any word with a few bytes. This would reduce the
bandwidth necessary to send text. (You need at most six bytes to address
all 2^31 code points, and with a knowledge of word frequencies could
assign the most frequently used words to code points that require
smaller numbers of bytes.)
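
To make the suggested scheme concrete, here is a toy Python sketch. It
assumes the original six-byte form of UTF-8 (RFC 2279) as the wire
encoding, and the frequency-ranked lexicon is made up for illustration;
neither detail is from the proposal itself.

    # Give each word a code point in the unused region starting at 2^21,
    # most frequent words first, then serialize UTF-8 style.

    def utf8_length(n):
        """Bytes needed to carry code point n in RFC 2279-style UTF-8."""
        for k, limit in enumerate(
                (1 << 7, 1 << 11, 1 << 16, 1 << 21, 1 << 26, 1 << 31), 1):
            if n < limit:
                return k
        raise ValueError("beyond the 31-bit range")

    def utf8_encode(n):
        """Encode code point n as one to six bytes, UTF-8 style."""
        k = utf8_length(n)
        if k == 1:
            return bytes([n])
        tail = []
        for _ in range(k - 1):                 # low six-bit groups trail
            tail.append(0x80 | (n & 0x3F))
            n >>= 6
        lead = ((0xFF << (8 - k)) & 0xFF) | n  # k one-bits, then payload
        return bytes([lead] + tail[::-1])

    UNUSED_FLOOR = 1 << 21  # first of the 2^31 - 2^21 spare code points

    for rank, word in enumerate(["the", "of", "and", "zygomorphic"]):
        cp = UNUSED_FLOOR + rank
        print(word, "->", utf8_encode(cp).hex(), "-", utf8_length(cp), "bytes")

Note that even rank 0 costs five bytes: everything below 2^21 is already
spoken for, so no word ever gets one of the short encodings.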

This is called text compression, and it already works pretty well - better than the suggested scheme would, I think, given where the code points are.
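
For a rough sense of why, compare that five-byte floor with what a stock
compressor does on ordinary text. The sample below is artificial and its
repetition flatters zlib, but even fresh prose typically gzips down to
two or three bytes per word:

    import zlib

    sample = ("Compression exploits exactly the word-frequency knowledge "
              "the proposal wants to hard-wire into code points. ") * 40
    words = len(sample.split())
    packed = zlib.compress(sample.encode("utf-8"), 9)
    print(round(len(packed) / words, 2), "bytes per word after zlib")
    print(5, "bytes per word floor for code points at or above 2^21")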


As to encoding all the words in all the languages, 2 billion code points probably isn't enough: counting scientific terms, some estimates for English alone run to 2 million words. Multiply by all the languages, and you're already within a factor of two or so of the available space.
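
Back of the envelope, with figures that are my guesses rather than data:

    available = 2**31 - 2**21   # the 2,145,386,496 spare code points
    per_language = 2_000_000    # high-end estimate for English, assumed typical
    languages = 500             # languages with sizable written lexicons (a guess)
    print(per_language * languages / available)  # ~0.47: a factor of two away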

This ignores the fact that languages grow much more quickly than you'd imagine. I can't find the reference, but Ken Church, I think, did some estimates using newswire data and found that vocabulary growth does not seem to asymptote - even the growth =factor= doesn't asymptote.
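
One way to watch the effect: count distinct word types as tokens stream
by. On real corpora the curve tracks Heaps' law, V(n) roughly K*n**beta,
and keeps climbing without leveling off. The tokenizer below is crude,
and "newswire.txt" is a placeholder name, not an actual dataset:

    import re

    def vocab_growth(text, checkpoints=(10**4, 10**5, 10**6)):
        """Distinct word types seen after each checkpoint of running tokens."""
        seen, marks = set(), {}
        for i, token in enumerate(re.findall(r"[a-z']+", text.lower()), 1):
            seen.add(token)
            if i in checkpoints:
                marks[i] = len(seen)
        return marks

    # e.g. vocab_growth(open("newswire.txt").read())
    # Heaps' law says each tenfold increase in tokens multiplies the
    # vocabulary by about 10**beta, beta typically 0.4-0.6 - no ceiling.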

Finally, this assumes that everyone could agree on what a word is. Many languages have no explicit word segmentation, e.g., Chinese, Japanese, Thai. Sorry, I can't find this reference either, but someone had native speakers segment Chinese text for word boundaries, and there was substantial disagreement. Even in English, I suspect there would be some disagreement, e.g., "freeform" vs "free-form" vs "free form".
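
Even a toy English tokenizer has to take a position on that last
example; two defensible regular expressions give two different word
inventories:

    import re

    text = "freeform free-form free form"
    print(re.findall(r"\w+", text))     # hyphen splits: free-form -> two words
    print(re.findall(r"[\w-]+", text))  # hyphen joins: free-form -> one word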

We can't even always agree on what a character is.

- John Burger
  MITRE



