On Dec 5, 2004, at 07:02 PM, Doug Ewell wrote:
A word-based encoding for English could automatically assume spaces
where they are appropriate. The sentence:
What means this, my lord?
would have seven encodable elements: the five words, the comma, and the
question mark. Spaces would be
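
A minimal sketch of the element count described above, assuming that an "encodable element" is either a run of word characters or a single punctuation mark. The function name and the regular expression are illustrative choices, not part of the original proposal.

```python
import re

def encodable_elements(sentence):
    # Words and individual punctuation marks become elements;
    # spaces are not encoded and would be implied on decoding.
    return re.findall(r"\w+|[^\w\s]", sentence)

elements = encodable_elements("What means this, my lord?")
print(elements)
print(len(elements))
```
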
Richard Cook rscook at socrates dot berkeley dot edu wrote:
Well, why stop with words, my lord? Why not just encode all sentences,
paragraphs, pages, chapters, books, libraries, or your higher level
unit of choice, for that matter.
...
Whether you choose to associate a single glyph with your
Dear All
This is off topic, so feel free to ignore it.
The other day I was telling a co-worker about Unicode and how the UTF-8
encoding system works. During the far ranging discussions that followed
(we are public servants), my co-worker suggested encoding entire words
in Unicode.
This sounds like
Tim Finney [EMAIL PROTECTED] writes:
This would reduce the
bandwidth necessary to send text.
Would it really? Ignoring all the other details (being limited
to English, for one), would words that might take up to six bytes
in UTF-8 really compete with the normal encoding, with most words
taking
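
The comparison being asked about can be tried on a short sample. This is only a sketch: it assumes every word or punctuation mark gets a code point above 2^21, which costs at least 5 bytes in the original six-byte UTF-8 design (RFC 2279). The constant and function names are hypothetical.

```python
import re

# Assumption: best case of 5 bytes per word code (code points from 2**21
# up to 2**26 - 1); larger code points would cost 6 bytes.
WORD_CODE_BYTES = 5

def compare(sentence):
    # Plain encoding: every character, including spaces, is sent as-is.
    plain = len(sentence.encode("utf-8"))
    # Word encoding: one code per word or punctuation mark, spaces implied.
    elements = re.findall(r"\w+|[^\w\s]", sentence)
    coded = WORD_CODE_BYTES * len(elements)
    return plain, coded

plain, coded = compare("What means this, my lord?")
print(plain, coded)
```
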
On Dec 5, 2004, at 12:27 AM, Tim Finney wrote:
my co-worker suggested encoding entire words in Unicode.
The word is considerably less well-defined than the character. The
set of words is open-ended. If you'd like to see where you go when you
start trying to encode words, take a look at CJK
entire words
in Unicode.
This sounds like heresy to all of us who know that Unicode is meant only
for characters. But wait a minute... Aren't there a whole lot of
codepoints that will never be used? 2^31 is a big number. I imagine that
it could contain all of the words of all of the languages as well
From: Ray Mullan [EMAIL PROTECTED]
I don't see how the one million available codepoints in the Unicode
Standard could possibly accommodate a grammatically accurate vocabulary of
all the world's languages.
You have misread the message from Tim: he wanted to use code points above
U+10 within
Philippe Verdy writes:
Suppose that Unicode encodes the common English words the, an, is,
etc... then a protocol could decide that these words are not important
and will filter them.
Drop the part of the sentence before then. A protocol could delete the,
an, etc. right now. In fact, I
. Starner
Sent: Sunday, December 05, 2004 11:49 AM
To: [EMAIL PROTECTED]
Subject: Re: Unicode for words?
Philippe Verdy writes:
Suppose that Unicode encodes the common English words the, an, is,
etc... then a protocol could decide that these words are not important
and will filter them.
Drop
Don't misinterpret my words or arguments here: the purpose of the question
was strictly about which UTF or other transformation would be good for
interoperability, and storage, and whether it would be a good idea to encode
words with standard codes.
So in my view, it is completely unneeded to
Philippe Verdy [EMAIL PROTECTED] writes:
Drop the part of the sentence before then. A protocol could delete the,
an, etc. right now. In fact, I suspect several library systems do drop
the, etc. right now. Not that this makes it a good idea, but that's a
lousy argument.
If such a
So here is the idea: why not use the unused part (2^31 - 2^21 =
2,145,386,496) to encode all the words of all the languages as well.
You could then send any word with a few bytes. This would reduce the
bandwidth necessary to send text. (You need at most six bytes to
address all 2^31 code points,
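
The arithmetic can be checked directly. The sketch below assumes the original UTF-8 design with sequences of one to six bytes (RFC 2279), which covers code points up to 2^31 - 1; the function name is illustrative.

```python
def utf8_len_original(cp):
    """Byte length of a code point under the original 1-to-6-byte UTF-8 scheme."""
    if cp < 0 or cp >= 2**31:
        raise ValueError("code point outside the 31-bit range")
    # Exclusive upper limits for sequences of 1, 2, 3, 4, 5 and 6 bytes.
    limits = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)
    for length, limit in enumerate(limits, start=1):
        if cp < limit:
            return length

unused = 2**31 - 2**21
print(unused)
print(utf8_len_original(2**21), utf8_len_original(2**31 - 1))
```

Under this scheme every code point in the proposed range needs five or six bytes, since four-byte sequences stop just below 2^21.
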
Hohberger, Clive CHohberger at zebra dot com wrote:
When I went back and recoded those same words with leading or trailing
spaces (denoted here by _) as: _the, the_ _and, and_, etc.
as single bytes, I found a huge gain in efficiency in terms of the
number of bytes to encode the same English
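
A rough sketch of this kind of recoding, not the actual scheme used in the experiment: the byte values assigned to each space-attached word are hypothetical, chosen from the range above 0x7F that plain ASCII text never uses.

```python
# Hypothetical assignments for "the_", "_the", "and_", "_and"
# (the underscore in the message stands for a space).
CODES = {
    b"the ": b"\x80",
    b" the": b"\x81",
    b"and ": b"\x82",
    b" and": b"\x83",
}

def compress(text):
    # Replace each space-attached word with its single-byte code.
    data = text.encode("ascii")
    for word, code in CODES.items():
        data = data.replace(word, code)
    return data

def expand(data):
    # Codes never occur in ASCII text, so the replacement is reversible.
    for word, code in CODES.items():
        data = data.replace(code, word)
    return data.decode("ascii")

sample = "the cat and the dog sat by the fire"
packed = compress(sample)
print(len(sample), len(packed))
```
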
13 matches