So here is the idea: why not use the unused part (2^31 - 2^21 =
2,145,386,496) to encode all the words of all the languages as well. You
could then send any word with a few bytes. This would reduce the
bandwidth necessary to send text. (You need at most six bytes to address
all 2^31 code points, and with a knowledge of word frequencies could
assign the most frequently used words to code points that require
smaller numbers of bytes.)
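
To make the suggested scheme concrete, here is a toy Python sketch. It
assumes the original six-byte form of UTF-8 (RFC 2279) as the wire
encoding, and the frequency-ranked lexicon is made up for illustration;
neither detail is from the proposal itself.

    # Give each word a code point in the unused region starting at 2^21,
    # most frequent words first, then serialize UTF-8 style.

    def utf8_length(n):
        """Bytes needed to carry code point n in RFC 2279-style UTF-8."""
        for k, limit in enumerate(
                (1 << 7, 1 << 11, 1 << 16, 1 << 21, 1 << 26, 1 << 31), 1):
            if n < limit:
                return k
        raise ValueError("beyond the 31-bit range")

    def utf8_encode(n):
        """Encode code point n as one to six bytes, UTF-8 style."""
        k = utf8_length(n)
        if k == 1:
            return bytes([n])
        tail = []
        for _ in range(k - 1):                 # low six-bit groups trail
            tail.append(0x80 | (n & 0x3F))
            n >>= 6
        lead = ((0xFF << (8 - k)) & 0xFF) | n  # k one-bits, then payload
        return bytes([lead] + tail[::-1])

    UNUSED_FLOOR = 1 << 21  # first of the 2^31 - 2^21 spare code points

    for rank, word in enumerate(["the", "of", "and", "zygomorphic"]):
        cp = UNUSED_FLOOR + rank
        print(word, "->", utf8_encode(cp).hex(), "-", utf8_length(cp), "bytes")

Note that even rank 0 costs five bytes: everything below 2^21 is already
spoken for, so no word ever gets one of the short encodings.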

This is called text compression, and it already works pretty well - better than the suggested scheme would, I think, given where the code points are.
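
For a rough sense of why, compare that five-byte floor with what a stock
compressor does on ordinary text. The sample below is artificial and its
repetition flatters zlib, but even fresh prose typically gzips down to
two or three bytes per word:

    import zlib

    sample = ("Compression exploits exactly the word-frequency knowledge "
              "the proposal wants to hard-wire into code points. ") * 40
    words = len(sample.split())
    packed = zlib.compress(sample.encode("utf-8"), 9)
    print(round(len(packed) / words, 2), "bytes per word after zlib")
    print(5, "bytes per word floor for code points at or above 2^21")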


As to encoding all the words in all the languages, 2 billion code points probably isn't enough: counting scientific terms, some estimates for English alone run to 2 million words. Multiply by all the languages, and you're already within a factor of two or so of the available space.
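
Back of the envelope, with figures that are my guesses rather than data:

    available = 2**31 - 2**21   # the 2,145,386,496 spare code points
    per_language = 2_000_000    # high-end estimate for English, assumed typical
    languages = 500             # languages with sizable written lexicons (a guess)
    print(per_language * languages / available)  # ~0.47: a factor of two away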

This ignores the fact that languages grow much more quickly than you'd imagine. I can't find the reference, but Ken Church, I think, did some estimates using newswire data and found that vocabulary growth does not seem to asymptote - even the growth =factor= doesn't asymptote.
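
One way to watch the effect: count distinct word types as tokens stream
by. On real corpora the curve tracks Heaps' law, V(n) roughly K*n**beta,
and keeps climbing without leveling off. The tokenizer below is crude,
and "newswire.txt" is a placeholder name, not an actual dataset:

    import re

    def vocab_growth(text, checkpoints=(10**4, 10**5, 10**6)):
        """Distinct word types seen after each checkpoint of running tokens."""
        seen, marks = set(), {}
        for i, token in enumerate(re.findall(r"[a-z']+", text.lower()), 1):
            seen.add(token)
            if i in checkpoints:
                marks[i] = len(seen)
        return marks

    # e.g. vocab_growth(open("newswire.txt").read())
    # Heaps' law says each tenfold increase in tokens multiplies the
    # vocabulary by about 10**beta, beta typically 0.4-0.6 - no ceiling.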

Finally, this assumes that everyone could agree on what a word is. Many languages have no explicit word segmentation, e.g., Chinese, Japanese, Thai. Sorry, I can't find this reference either, but someone had native speakers segment Chinese text for word boundaries, and there was substantial disagreement. Even in English, I suspect there would be some disagreement, e.g., "freeform" vs "free-form" vs "free form".
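
Even a toy English tokenizer has to take a position on that last
example; two defensible regular expressions give two different word
inventories:

    import re

    text = "freeform free-form free form"
    print(re.findall(r"\w+", text))     # hyphen splits: free-form -> two words
    print(re.findall(r"[\w-]+", text))  # hyphen joins: free-form -> one word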

We can't even always agree on what a character is.

- John Burger
  MITRE



