Several years ago I wrote an English-language compression routine for bar codes 
that encoded "the", "an", "and", etc. as single-byte values. What I then 
discovered is that the biggest single remaining waste of codewords in English 
text is the spaces between words! 

When I went back and recoded those same words with leading or trailing spaces 
(denoted here by "_") as "_the", "the_", "_and", "and_", etc., each as a single 
byte, I found a huge gain in efficiency in terms of the number of bytes needed 
to encode the same English text. Next, when you look at the most common 
word-starting letters and encode them as "_s", "_t", etc., and the most common 
word-ending letters and encode them as "r_", "d_", etc., you gain additional 
efficiency in a 256-codeword alphabet/word encoding for English.
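
To make the comparison concrete, here is a minimal sketch in Python. It is not 
the original bar code routine: the greedy longest-match strategy, the sample 
sentence, and the two tiny codebooks (BARE and SPACED) are all assumptions 
chosen for illustration. Every codebook entry and every leftover character is 
counted as one codeword.

```python
def greedy_encode(text, codebook):
    """Split text into codewords using greedy longest-match.

    Each codebook entry counts as one codeword; any character not covered
    by an entry is emitted as its own one-character codeword.
    """
    # Try longer entries first so "and" wins over "an", for example.
    entries = sorted(codebook, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for entry in entries:
            if text.startswith(entry, i):
                tokens.append(entry)
                i += len(entry)
                break
        else:
            # No entry matched here: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Illustrative sample text and codebooks (assumptions, not the real tables).
TEXT = "the cat and the dog sat in the sun and the rain"
BARE = ["the", "and"]                       # words only
SPACED = [" the", "the ", " and", "and "]   # "_the", "the_", "_and", "and_"

bare_tokens = greedy_encode(TEXT, BARE)
spaced_tokens = greedy_encode(TEXT, SPACED)

# Both encodings are lossless: joining the codewords restores the text.
assert "".join(bare_tokens) == TEXT
assert "".join(spaced_tokens) == TEXT

print(len(TEXT), len(bare_tokens), len(spaced_tokens))
```

For this sample sentence the 47 characters take 35 codewords with the bare 
words and 29 once the space is folded into the word entries. The same function 
accepts two-character entries such as " s" or "d " if you want to try the 
letter-plus-space codewords as well.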

What this said to me is that, from a coding-efficiency viewpoint, we need to 
think of words in an alphabetic language as a sequence of letters with the 
space as either a prefix or a terminator character, rather than treating the 
space as a separator character between words represented as alphabetic strings.

Clive Hohberger


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Behalf Of D. Starner
Sent: Sunday, December 05, 2004 11:49 AM
To: [EMAIL PROTECTED]
Subject: Re: Unicode for words?


"Philippe Verdy" writes:

> Suppose that Unicode encodes the common English words "the", "an", "is", 
> etc... then a protocol could decide that these words are not important and 
> will filter them. 

Drop the part of the sentence before "then". A protocol could delete "the", 
"an", etc. right now. In fact, I suspect several library systems do drop 
"the", etc. right now. Not that this makes it a good idea, but that's a lousy 
argument.