Re: Unicode for words?

2004-12-07 Thread Richard Cook
On Dec 5, 2004, at 07:02 PM, Doug Ewell wrote: A word-based encoding for English could automatically assume spaces where they are appropriate. The sentence: What means this, my lord? would have seven encodable elements: the five words, the comma, and the question mark. Spaces would be
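The element count in that example can be checked with a minimal sketch. It assumes a deliberately simple tokenizer (runs of letters or apostrophes are words, every other non-space character is its own element); the function name `encodable_elements` is hypothetical and does not come from the thread.

```python
import re

def encodable_elements(sentence):
    """Split a sentence into words and single punctuation marks.

    Spaces are not returned: in the scheme described, they are implied
    between elements rather than encoded.
    """
    return re.findall(r"[A-Za-z']+|[^\sA-Za-z']", sentence)

elements = encodable_elements("What means this, my lord?")
print(elements)       # the five words, the comma, and the question mark
print(len(elements))  # 7
```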

Re: Unicode for words?

2004-12-07 Thread Doug Ewell
Richard Cook rscook at socrates dot berkeley dot edu wrote: Well, why stop with words, my lord? Why not just encode all sentences, paragraphs, pages, chapters, books, libraries, or your higher level unit of choice, for that matter. ... Whether you choose to associate a single glyph with your

Unicode for words?

2004-12-05 Thread Tim Finney
Dear All, This is off topic, so feel free to ignore it. The other day I was telling a co-worker about Unicode and how the UTF-8 encoding system works. During the far-ranging discussions that followed (we are public servants), my co-worker suggested encoding entire words in Unicode. This sounds like

Re: Unicode for words?

2004-12-05 Thread D. Starner
Tim Finney writes: This would reduce the bandwidth necessary to send text. Would it really? Ignoring all the other details (being limited to English, for one), would words that might take up to six bytes in UTF-8 really compete with the normal encoding, with most words taking
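A rough way to test that doubt is to compare byte counts directly. The sketch below assumes each word or punctuation mark would cost a fixed five or six bytes as a high code point, and uses a sample sentence from elsewhere in this thread with its seven elements counted by hand; the helper names are hypothetical.

```python
def utf8_bytes(text):
    """Number of bytes the text occupies in ordinary UTF-8."""
    return len(text.encode("utf-8"))

def word_code_bytes(element_count, bytes_per_code):
    """Bytes needed if every element is sent as one fixed-size word code."""
    return element_count * bytes_per_code

sentence = "What means this, my lord?"
element_count = 7  # five words, a comma, and a question mark

plain = utf8_bytes(sentence)
print(plain)                              # 25
print(word_code_bytes(element_count, 5))  # 35
print(word_code_bytes(element_count, 6))  # 42
```

For short English words, spelling the text out character by character is the smaller of the two here, even with the spaces included.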

Re: Unicode for words?

2004-12-05 Thread Richard Cook
On Dec 5, 2004, at 12:27 AM, Tim Finney wrote: my co-worker suggested encoding entire words in Unicode. The word is considerably less well-defined than the character. The set of words is open-ended. If you'd like to see where you go when you start trying to encode words, take a look at CJK

Re: Unicode for words?

2004-12-05 Thread Ray Mullan
entire words in Unicode. This sounds like heresy to all of us who know that Unicode is meant only for characters. But wait a minute... Aren't there a whole lot of codepoints that will never be used? 2^31 is a big number. I imagine that it could contain all of the words of all of the languages as well

Re: Unicode for words?

2004-12-05 Thread Philippe Verdy
From: Ray Mullan: I don't see how the one million available codepoints in the Unicode Standard could possibly accommodate a grammatically accurate vocabulary of all the world's languages. You have misread the message from Tim: he wanted to use code points above U+10FFFF within

Re: Unicode for words?

2004-12-05 Thread D. Starner
Philippe Verdy writes: Suppose that Unicode encodes the common English words "the", "an", "is", etc... then a protocol could decide that these words are not important and will filter them. Drop the part of the sentence before "then". A protocol could delete "the", "an", etc. right now. In fact, I

RE: Unicode for words?

2004-12-05 Thread Hohberger, Clive
. Starner Sent: Sunday, December 05, 2004 11:49 AM Subject: Re: Unicode for words? Philippe Verdy writes: Suppose that Unicode encodes the common English words "the", "an", "is", etc... then a protocol could decide that these words are not important and will filter them. Drop

Re: Unicode for words?

2004-12-05 Thread Philippe Verdy
Don't misinterpret my words or arguments here: the purpose of the question was strictly about which UTF or other transformation would be good for interoperability and storage, and whether it would be a good idea to encode words with standard codes. So in my view, it is completely unneeded to

Re: Unicode for words?

2004-12-05 Thread D. Starner
Philippe Verdy writes: Drop the part of the sentence before "then". A protocol could delete "the", "an", etc. right now. In fact, I suspect several library systems do drop "the", etc. right now. Not that this makes it a good idea, but that's a lousy argument. If such a

Re: Unicode for words?

2004-12-05 Thread John D. Burger
So here is the idea: why not use the unused part (2^31 - 2^21 = 2,145,386,496) to encode all the words of all the languages as well. You could then send any word with a few bytes. This would reduce the bandwidth necessary to send text. (You need at most six bytes to address all 2^31 code points,
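The arithmetic in that proposal can be checked with a short sketch. It assumes the original UTF-8 definition, which allowed sequences of up to six bytes covering 31 bits (current UTF-8 stops at four bytes and U+10FFFF); the function name `utf8_length_original` is hypothetical.

```python
# Largest code point reachable with 1, 2, 3, 4, 5 and 6 bytes under the
# original six-byte UTF-8 scheme (an assumption: current UTF-8 ends at 4 bytes).
LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]

def utf8_length_original(code_point):
    """Return how many bytes the original UTF-8 scheme needs for a code point."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for length, limit in enumerate(LIMITS, start=1):
        if code_point <= limit:
            return length
    raise ValueError("code point does not fit in 31 bits")

unused = 2**31 - 2**21
print(f"{unused:,}")                     # 2,145,386,496
print(utf8_length_original(2**21))       # 5: first value past the 21-bit range
print(utf8_length_original(2**31 - 1))   # 6: last addressable value
```

Under these assumptions, every value in the proposed word range would cost five or six bytes.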

Re: Unicode for words?

2004-12-05 Thread Doug Ewell
Hohberger, Clive CHohberger at zebra dot com wrote: When I went back and recoded those same words with leading or trailing spaces (denoted here by _) as: _the, the_, _and, and_, etc. as single bytes, I found a huge gain in efficiency in terms of the number of bytes to encode the same English
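A toy version of that kind of recoding might look like the sketch below. It is not the scheme described in the message: the token list, the byte values from 0x80 upward, and the function names are all assumptions made for illustration, and the input is assumed to be plain ASCII so the high byte values are free.

```python
# Hypothetical token table: common words with a leading or trailing space.
TOKENS = [" the", "the ", " and", "and "]
CODES = {token: 0x80 + i for i, token in enumerate(TOKENS)}
DECODE = {code: token for token, code in CODES.items()}

def encode(text):
    """Replace each listed token with one byte; copy other ASCII characters."""
    out = bytearray()
    i = 0
    while i < len(text):
        for token in TOKENS:
            if text.startswith(token, i):
                out.append(CODES[token])
                i += len(token)
                break
        else:
            # No token starts here, so emit the character unchanged.
            out.extend(text[i].encode("ascii"))
            i += 1
    return bytes(out)

def decode(data):
    """Expand token bytes back to their strings; pass ASCII bytes through."""
    return "".join(DECODE.get(byte, chr(byte)) for byte in data)

text = "the cat and the dog"
packed = encode(text)
print(len(text), len(packed))  # 19 10
print(decode(packed) == text)  # True
```

Because each token byte expands to exactly the string it replaced, the round trip is lossless even when a token matches inside a longer word.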