At 22:34 +0100 2001-11-03, Werner LEMBERG wrote:
>
>> Ai.ai, ai! Kludge upon kludge! Surely it really IS a better idea to
>> start back at the beginning and rewrite the underlying Lisp engine
>> to handle UTF-8, and do it right.
>
>Maybe a misunderstanding. The underlying Lisp engine *never* sees
>UTF-8. We are talking about buffer and string representations.

I didn't keep all the copies of past correspondence, but I understood one from a week or so ago to say that part of the reason for not doing an entire rewrite was that the Lisp engine didn't talk UTF-8 and, moreover, couldn't easily be made to.

>> To pick up on part of this conversation, UNLESS you use a
>> fixed-length internal code for ALL Unicode characters (and I suspect
>> the problem is with the underlying Lisp that makes it expensive to
>> do the obvious and use UCS-32),
>
>A 22bit integer is used for that purpose.

That comment in another letter reinforced my belief that the Lisp engine was the trouble. I ASSUMED that the Lisp atom was a 32-bit word, and that the missing ten bits were taken up with tags, etc. The point is, however, that maybe 22 bits is OK for this round, but what do you do in a year or two when the higher planes get more populated, and someone wants to use Emacs for some quick and dirty editing of a scholarly work on cuneiform, say?

>> If you clean emacs up so that UTF-8 is its native character set,
>> then you have only ONE, CLEAN interface to design around. It should
>> handle ASCII (and ISO Latin-1?) transparently, as it is a clean
>> subset. That in itself should keep 90% of users quiet and
>> satisfied.
>
>Again, what we are talking about here is nothing the casual user will
>ever see.

That is largely, but not completely, true, as near as I can tell from previous correspondence. What prompted my wail was the contortions you appear to be going through to arrive at this transparency, when it is anything but transparent underneath. I am much more concerned that, having done all that, the result will be totally (as opposed to nearly) unmaintainable. I was trying to say that a clean rewrite is probably overdue anyway, and in the end it might just be quicker, especially since you can then move most of the translation out to the I/O interface, to be done during saves and restores, rather than doing it inside the bowels of the editor on the fly.

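To make the "translate only at the I/O interface" idea concrete, here is a minimal sketch in Python. It is purely illustrative and says nothing about how Emacs actually works: the names `INTERNAL`, `load`, and `save` are invented for this example, and the translators are simply Python's standard codecs. The buffer is assumed to hold UTF-8 at all times, and every other encoding is touched only on restore and on save.

```python
# Hypothetical sketch: these names are illustrative, not Emacs internals.
INTERNAL = "utf-8"  # the single internal buffer encoding assumed here


def load(raw: bytes, external: str) -> bytes:
    """On restore: translate file bytes from `external` into the UTF-8 buffer."""
    return raw.decode(external).encode(INTERNAL)


def save(buffer: bytes, external: str) -> bytes:
    """On save: translate the UTF-8 buffer back out to `external`."""
    return buffer.decode(INTERNAL).encode(external)


# A Latin-1 file survives the round trip through the UTF-8 buffer unchanged.
latin1_file = "na\u00efve caf\u00e9".encode("latin-1")
buf = load(latin1_file, "latin-1")
assert save(buf, "latin-1") == latin1_file

# ASCII is a strict subset of UTF-8, so it passes through byte for byte.
assert load(b"plain ASCII", "ascii") == b"plain ASCII"
```

Everything between `load` and `save` sees only UTF-8, so supporting one more external encoding means adding one more pair of translators at the boundary and nothing inside the editor.
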
Moreover, if the internal code is canonical UTF-8, and not some historical (essentially) single-byte-oriented code, you will have a 1:n translator problem rather than what is essentially an n^2 translator problem, the latter arising because the internal code is such a bad fit for everything.

A final point is that by hiving off the translators to I/O, you can delay writing many of them, since there are already some pretty reasonable UTF-8 <-> Encoding-X translators available that can probably be adapted as UN*X pipes.

George
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
