On Dec 5, 2004, at 12:27 AM, Tim Finney wrote:

my co-worker suggested encoding entire words in Unicode.

The "word" is considerably less well-defined than the character, and the set of words is open-ended. If you'd like to see where you end up when you start trying to encode words, take a look at CJK Extension B. CJK ideographs are much like words, in that both are composed of more basic units: English words are composed of letters, while ideographs are composed of strokes.

If you encode only the higher-level constructs, you must still address input and indexing via the lower-level units, so there's no way to escape defining those lower-level units. If you mean to suggest encoding words as shorthand for sequences of already-encoded low-level units, that might work for very specific, well-defined purposes. But whenever someone creates a neologism (and word creation is an ongoing process in all living languages), you need to revisit the encoding process and encode a new unit. This is burdensome, to say the least.

I think that most people who work on encoding like to imagine that it is mostly a finite task. Maintenance of the standard is infinite, but encoding itself should taper off, comparatively, over time, with the exception of CJK ideographs.
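To make the neologism problem concrete, here is a minimal Python sketch. The word table and the function names are hypothetical, invented purely for illustration; nothing like them exists in Unicode. It only contrasts a closed word-level table with character-level encoding, which needs no per-word entry.

```python
# Hypothetical word-level code table: one code per "encoded" word.
# These words and numbers are made up for illustration only.
WORD_CODES = {"the": 1, "cat": 2, "sat": 3}


def encode_by_word(text):
    """Return one code per word; raises KeyError for a word not in the table."""
    return [WORD_CODES[word] for word in text.split()]


def encode_by_character(text):
    """Return one Unicode code point per character; no per-word table needed."""
    return [ord(ch) for ch in text]


def words_needing_new_codes(text):
    """List the words that would force a revision of the word table."""
    return [word for word in text.split() if word not in WORD_CODES]


if __name__ == "__main__":
    print(encode_by_word("the cat sat"))
    # A newly coined word has no entry, so the table must be extended first.
    print(words_needing_new_codes("the cat blogged"))
    # The same new word is already representable from its letters.
    print(encode_by_character("blogged"))
```

Every new coinage shows up in words_needing_new_codes until someone assigns it a code, whereas the character-level function handles it as soon as its letters are encoded.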




