Several years ago I wrote an English-language compression routine for bar codes which encoded "the", "an", "and", etc. as single byte values. What I then discovered is that the single biggest remaining waste of codewords in English text is the spaces between words!
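A minimal sketch of that observation, not the original bar code routine: the word table, the code range starting at 128, and the sample sentence are all assumptions made for illustration. Only "the", "an" and "and" come from the description above.

```python
# Hypothetical word table: "the", "an", "and" are named in the post,
# the remaining entries are assumed for this sketch.
WORDS = ["the", "and", "an", "of", "to", "in"]

# Assumption: codes 128 and up stand for whole words, and codes
# below 128 are plain ASCII characters.
WORD_CODE = {w: 128 + i for i, w in enumerate(WORDS)}


def encode(text):
    """Encode ASCII text, replacing table words with single codewords."""
    out = []
    i = 0
    while i < len(text):
        if text[i].isalpha():
            # Collect one whole alphabetic word.
            j = i
            while j < len(text) and text[j].isalpha():
                j += 1
            word = text[i:j]
            if word in WORD_CODE:
                out.append(WORD_CODE[word])       # one codeword per table word
            else:
                out.extend(word.encode("ascii"))  # spell the word out
            i = j
        else:
            out.append(ord(text[i]))              # spaces and punctuation as-is
            i += 1
    return out


sample = "the cat and the dog sat in the sun"
codes = encode(sample)
spaces = codes.count(ord(" "))
print(len(sample), len(codes), spaces)
```

For this sample the 34 input characters shrink to 25 codewords, and 8 of those 25 are nothing but spaces.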
When I went back and recoded those same words with leading or trailing spaces (denoted here by "_") as "_the", "the_", "_and", "and_", etc., each as a single byte, I found a huge gain in efficiency, measured as the number of bytes needed to encode the same English text. Next, when you look at the most common word-starting letters and encode them as "_s", "_t", etc., and the most common word-terminating letters and encode them as "r_", "d_", etc., you gain additional efficiency in a 256-codeword alphabet/word encoding for English.

What this told me is that, from a coding-efficiency viewpoint, we need to think of words in an alphabetic language as a sequence of letters with the space as either a prefix or a terminator character, rather than as a separator character between words represented as alphabetic strings.

Clive Hohberger

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of D. Starner
Sent: Sunday, December 05, 2004 11:49 AM
To: [EMAIL PROTECTED]
Subject: Re: Unicode for words?

"Philippe Verdy" writes:
> Suppose that Unicode encodes the common English words "the", "an", "is",
> etc... then a protocol could decide that these words are not important
> and will filter them.

Drop the part of the sentence before "then". A protocol could delete "the", "an", etc. right now. In fact, I suspect several library systems do drop "the", etc. right now. Not that this makes it a good idea, but that's a lousy argument.