Hohberger, Clive <CHohberger at zebra dot com> wrote:

> When I went back and recoded those same words with leading or trailing
> spaces (denoted here by "_") as: "_the", "the_" "_and", "and_", etc.
> as single bytes, I found a huge gain in efficiency in terms of the
> number of bytes to encode the same English text. Next, when you look
> at the most common word starting letters and encode them as "_s" and
> "_t", etc., and the most common word terminator letters and encode
> them as "r_", "d_", etc., you gain additional efficiency in a 256-
> codeword alphabet/word encoding for English.
>
> What it said to me is that from a coding efficiency viewpoint is that
> we need to think of words in an alphabetic language at a sequence of
> letters with the space as either a prefix or terminator character,
> rather than the space as a separator character between words
> represented as alphabetic strings.

A word-based encoding for English could automatically assume spaces where they are appropriate. The sentence:

"What means this, my lord?"

would have seven encodable elements: the five words, the comma, and the question mark. Spaces would be automatically filled in as needed, not explicitly encoded. This implies "standard" English punctuation and spacing conventions, however that is defined. For French conventions, there would probably be a space before the question mark as well.

Such an encoding would probably also include logic to capitalize the first word of each sentence, plus the ability to override this logic for proper names and non-capitalized sentences.

There might also be unification of conjugations and declensions (and similar for other languages) to conserve space. "Boy" and "boys" might be encoded with the same code point, with contextual clues elsewhere in the sentence to disambiguate the two.

And, of course, there would have to be an escape mechanism to ordinary character-based encoding, because such a system will never contain every word one might wish to encode, even just for English (think proper names again), and because "standard" punctuation and spacing rules don't always apply. This is similar to the situation with sign languages, which are word- and phrase-based but allow a fallback to fingerspelling.

None of this, however interesting it may be, has anything to do with Unicode. Unicode is a system for encoding characters, not words or pictures or ideas.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
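
For illustration only, here is a minimal Python sketch of the decoding side of such a word-based scheme: spaces are inferred, the first word of each sentence is capitalized, and a raw escape falls back to literal text. Everything in it is an assumption made for the example, not part of any real encoding: the function name `decode`, the `("raw", text)` escape tuple, the punctuation sets, and the `space_before_closing` parameter standing in for French-style spacing (with a plain space rather than a no-break space).

```python
# Hypothetical punctuation classes; not taken from any real encoding.
CLOSING = {",", ".", "?", "!", ";", ":"}   # attach directly to the preceding word
SENTENCE_END = {".", "?", "!"}             # next word starts a new sentence


def decode(tokens, space_before_closing=frozenset()):
    """Rebuild text from word/punctuation tokens, inferring spaces and capitals.

    Each token is either a str (a dictionary word or a punctuation mark) or a
    ("raw", text) tuple, the assumed escape to character-based encoding.
    """
    out = []
    start_of_sentence = True
    for tok in tokens:
        if isinstance(tok, tuple):
            # Escape: emit the text exactly as given, with no auto-capitalization.
            if out:
                out.append(" ")
            out.append(tok[1])
            start_of_sentence = False
        elif tok in CLOSING:
            # No space before closing punctuation unless the convention asks for it.
            if tok in space_before_closing:
                out.append(" ")
            out.append(tok)
            start_of_sentence = tok in SENTENCE_END
        else:
            # Ordinary word: the space is implied, never encoded.
            if out:
                out.append(" ")
            out.append(tok[:1].upper() + tok[1:] if start_of_sentence else tok)
            start_of_sentence = False
    return "".join(out)


tokens = ["what", "means", "this", ",", "my", "lord", "?"]
print(len(tokens))                                   # seven encodable elements
print(decode(tokens))                                # English spacing
print(decode(tokens, space_before_closing={"?"}))    # French-style spacing
print(decode(["hello", ",", ("raw", "Doug"), "."]))  # proper name via the escape
```

The escape doubles as the override for the capitalization logic: a proper name or a deliberately lowercase sentence opener is passed through as raw text. Unifying "boy" and "boys" under one code would need contextual analysis that this sketch does not attempt.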