I don't see how the one million available codepoints in the Unicode Standard could possibly accommodate a grammatically accurate vocabulary of all the world's languages. You're overlooking the question of which versions of words -- 'color' or 'colour' in English, for instance -- would be used in such a system -- or shall we have all of them? There's also the matter of words that change depending on their grammatical usage: 'teach', meaning 'house' in Gaelic, becomes 'tí' in certain cases; 'cep', meaning 'pocket' in Turkish, becomes 'cebim' when it's my pocket, 'cebin' when it's your pocket, 'cebi' when it's a third person's pocket and 'cebimiz' when it's our pocket -- although heaven knows what sort of garment might accommodate 'our pocket' at this laboured stage of the point I'm making.

Mind you it was a nice idea that had me dreaming for a bit this morning -- until the caffeine kicked in, that is.


Tim Finney wrote:

Dear All

This is off topic, so feel free to ignore it.

The other day I was telling a co-worker about Unicode and how the UTF-8
encoding system works. During the far ranging discussions that followed
(we are public servants), my co-worker suggested encoding entire words
in Unicode.

This sounds like heresy to all of us who know that Unicode is meant only
for characters. But wait a minute... Aren't there a whole lot of
codepoints that will never be used? 2^31 is a big number. I imagine that
it could contain all of the words of all of the languages as well as all
of their characters. According to Markus Kuhn's Unicode FAQ
(http://www.cl.cam.ac.uk/~mgk25/unicode.html), "Current plans are that
there will never be characters assigned outside the 21-bit code space
from 0x000000 to 0x10FFFF, which covers a bit over one million potential
future characters".
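A quick back-of-the-envelope check of these numbers (my sketch, using the 2^31 space of the original six-byte UTF-8 design against the 21-bit space Unicode commits to):

```python
# Code space addressable by the original (pre-2003, RFC 2279) UTF-8 design.
utf8_space = 2 ** 31      # 2,147,483,648

# The 21-bit space Unicode plans never to exceed.
unicode_space = 2 ** 21   # 2,097,152

# Room left over if the rest were opened up for words.
print(utf8_space - unicode_space)  # 2,145,386,496
```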

So here is the idea: why not use the unused part (2^31 - 2^21 =
2,145,386,496) to encode all the words of all the languages as well. You
could then send any word with a few bytes. This would reduce the
bandwidth necessary to send text. (You need at most six bytes to address
all 2^31 code points, and with a knowledge of word frequencies could
assign the most frequently used words to code points that require
smaller numbers of bytes.) Whether text represents a significant
proportion of bandwidth use is an important question, but because
bandwidth = money, this idea could save quite a lot, even if text only
represents a small proportion of the total bandwidth. Phone companies
could use encoded words for transmitting SMS messages, thereby saving
money on new mobile tower installations, although they are going to put
in 3G (video-capable) networks anyway.
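To make the byte arithmetic concrete, here is a sketch of how many bytes the original (RFC 2279) six-byte UTF-8 scheme would spend per code point, and what a frequency-ranked word assignment might look like. The word list and the starting code point are hypothetical, purely for illustration:

```python
def utf8_length(code_point):
    """Bytes needed under the original (RFC 2279) UTF-8 design,
    which ran to six bytes and covered all 2**31 code points."""
    for max_cp, nbytes in [(0x7F, 1), (0x7FF, 2), (0xFFFF, 3),
                           (0x1FFFFF, 4), (0x3FFFFFF, 5), (0x7FFFFFFF, 6)]:
        if code_point <= max_cp:
            return nbytes
    raise ValueError("outside the 31-bit UTF-8 range")

# Hypothetical scheme: rank words by frequency, then hand out code points
# just above U+10FFFF so the most common words get the shortest encodings.
words = ["the", "of", "and", "internationalization"]  # made-up frequency order
first_word_cp = 0x110000  # hypothetical start, just past the Unicode ceiling
for rank, word in enumerate(words):
    cp = first_word_cp + rank
    print(word, "->", hex(cp), utf8_length(cp), "bytes")
```

Note that the first couple of million word code points would land in the four-byte range, so a short English word like 'the' would actually get no shorter, while a long word in any script would compress considerably.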

All of the machinery (Unicode, UTF-8, web crawlers that can work out
what words are used most often) is already there.

Someone must have already thought of this? If not, my co-worker, Zack
Alach, deserves the kudos.

Best

Tim Finney





