At 09:06 +0900 2001-11-02, Kenichi Handa wrote:

> Markus Kuhn <[EMAIL PROTECTED]> writes:
>
>> Question 1:
>>
>> There is a contradiction in the above: a 4-byte UTF-8 word has only
>> space for 6*3+3 = 21 payload bits, so how do you plan to fit 22 bits
>> in this?
>
> Oops, sorry, it is just my mistake. I meant 5 bytes.
>
>> b) Instead of UTF-8, use your own variant (let's call it UTF-E1),
>> which uses for example the following 4 multi-byte sequences:
>>
>>   0xxxxxxx
>>   110xxxxx 10xxxxxx
>>   1110xxxx 10xxxxxx 10xxxxxx
>>   1111xxxx 10xxxxxx 10xxxxxx 10xxxxxx
>
> Interesting idea! But I think we don't have to save just one byte
> for very rarely used characters.
>
>> But if you really want to deviate from UTF-8, then it is worth
>> examining more fully what properties/tradeoffs of UTF-8 are actually
>> needed for the new Emacs buffer multibyte encoding. UTF-8 is ASCII
>> compatible, preserves the UCS-4BE strcmp result, and is
>> self-synchronizing. Is all that needed inside an Emacs buffer?
>> Would, for example, a simpler 21-bit encoding (let's call it UTF-E2)
>> without self-synchronization but with all the other properties, such
>> as
>>
>>   0xxxxxxx
>>   1xxxxxxx 1xxxxxxx 1xxxxxxx
>>
>> be better suited (it would require slightly modified string-search
>> algorithms, for instance)?
>
> As we need 22 bits, we must encode all non-ASCII chars in 4 bytes
> with the above idea. Isn't that too much?
>
>> c) With 21-bit words, you support the range 0x00_00_00 to
>> 0x1F_FF_FF. But as Unicode and ISO have promised that they will
>> never use any code points above U-10FFFF, you have, even in a 21-bit
>> word, the 0xF_00_00 = 983040 code positions 0x11_00_00 to 0x1F_FF_FF
>> available for private use by Emacs. Aren't almost a million
>> private-use positions more than good enough for what Emacs could
>> need privately?
>
> CCCII will require 884736 (= 96*96*96) code positions, even though
> it is very sparse.
>
>> Question 2:
>>
>> Many encodings (such as UTF-8 and others) have many possible
>> malformed sequences that a normal decoder would reject. What will
>> the UTF-8 -> Emacs converter do if it runs into one of these?
>>
>> Suggestion: It would seem good to have, in the 21/22-bit Emacs
>> space, 256 special characters allocated for representing bytes that
>> came from malformed sequences. They would be displayed to the user
>> in some \hex notation, they can be edited like any normal
>> characters, and there are even keyboard functions for inserting new
>> malformed UTF-8 bytes. The Emacs -> UTF-8 encoder will insert these
>> bytes into the produced byte stream such that a UTF-8 -> Emacs ->
>> UTF-8 roundtrip becomes a completely 100% binary-transparent
>> operation.
>
> I mostly agree. Currently, for such an invalid byte, I think we can
> use a little trick of representing raw 0x80..0xFF by this sequence:
>
>   1100000x 10xxxxxx
>
> (following-char) will return 0x80..0xFF at such a place, so they
> can't be distinguished from normal Unicode characters, but it won't
> be a big problem.
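The bit layouts quoted above are easier to weigh as code. Below is a
minimal sketch in C of encoders for the two encodings Kuhn proposes;
the names utf_e1_encode and utf_e2_encode are mine, not anything in
the Emacs sources, and a 22-bit character space is assumed as in the
discussion.

    #include <stdint.h>
    #include <stddef.h>

    /* UTF-E1: like UTF-8, but the 4-byte form 1111xxxx 10xxxxxx
       10xxxxxx 10xxxxxx carries 4+6+6+6 = 22 payload bits.
       Returns the number of bytes written, or 0 if c does not fit. */
    static size_t
    utf_e1_encode (uint32_t c, unsigned char *out)
    {
      if (c < 0x80)             /* 0xxxxxxx */
        {
          out[0] = (unsigned char) c;
          return 1;
        }
      if (c < 0x800)            /* 110xxxxx 10xxxxxx */
        {
          out[0] = 0xC0 | (c >> 6);
          out[1] = 0x80 | (c & 0x3F);
          return 2;
        }
      if (c < 0x10000)          /* 1110xxxx 10xxxxxx 10xxxxxx */
        {
          out[0] = 0xE0 | (c >> 12);
          out[1] = 0x80 | ((c >> 6) & 0x3F);
          out[2] = 0x80 | (c & 0x3F);
          return 3;
        }
      if (c < 0x400000)         /* 1111xxxx 10xxxxxx 10xxxxxx 10xxxxxx */
        {
          out[0] = 0xF0 | (c >> 18);
          out[1] = 0x80 | ((c >> 12) & 0x3F);
          out[2] = 0x80 | ((c >> 6) & 0x3F);
          out[3] = 0x80 | (c & 0x3F);
          return 4;
        }
      return 0;
    }

    /* UTF-E2: 0xxxxxxx, or 1xxxxxxx 1xxxxxxx 1xxxxxxx carrying
       7+7+7 = 21 payload bits.  Handa's objection is visible here:
       at 7 bits per byte, a 22-bit character space would force a
       fourth byte onto every non-ASCII character. */
    static size_t
    utf_e2_encode (uint32_t c, unsigned char *out)
    {
      if (c < 0x80)
        {
          out[0] = (unsigned char) c;
          return 1;
        }
      if (c < 0x200000)
        {
          out[0] = 0x80 | (c >> 14);
          out[1] = 0x80 | ((c >> 7) & 0x7F);
          out[2] = 0x80 | (c & 0x7F);
          return 3;
        }
      return 0;
    }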
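Kuhn's aside that UTF-E2 "would require slightly modified
string-search algorithms" is really about self-synchronization, and a
short sketch (again C, with a function name of my own invention) makes
the difference concrete. In UTF-8 every continuation byte matches
10xxxxxx, so from an arbitrary byte position a scanner can step to the
next character boundary without decoding from the start of the buffer;
in UTF-E2 every byte of a multibyte character matches 1xxxxxxx, so no
such local test exists.

    #include <stddef.h>

    /* Advance from byte position pos to the start of the next
       character by skipping continuation bytes (10xxxxxx). */
    static size_t
    utf8_next_boundary (const unsigned char *buf, size_t len, size_t pos)
    {
      pos++;
      while (pos < len && (buf[pos] & 0xC0) == 0x80)
        pos++;
      return pos;
    }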
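Handa's closing trick also becomes clearer as code. A sketch under the
same assumptions (C; hypothetical names; my reading of his scheme): a
raw byte 0x80..0xFF from a malformed sequence is stored as 1100000x
10xxxxxx, whose seven payload bits are the low seven bits of the byte.
The lead bytes 0xC0 and 0xC1 never occur in valid UTF-8 (any two-byte
sequence starting with them would be overlong), so the escape cannot
collide with legitimately encoded text at the byte level; the
collision is only at the character level, which is the ambiguity he
concedes.

    /* Store raw byte b (0x80..0xFF) as the two-byte escape. */
    static void
    raw_byte_store (unsigned char b, unsigned char *out)
    {
      out[0] = 0xC0 | ((b >> 6) & 1);   /* 1100000x */
      out[1] = 0x80 | (b & 0x3F);       /* 10xxxxxx */
    }

    /* The character value seen at such a position, i.e. what
       (following-char) would return: 0x80 plus the seven payload
       bits, which reconstructs the original byte.  Re-emitting that
       byte on output gives Kuhn's binary-transparent roundtrip, but
       the value itself is indistinguishable from U+0080..U+00FF. */
    static unsigned int
    raw_byte_char (const unsigned char *in)
    {
      return 0x80 | ((in[0] & 1) << 6) | (in[1] & 0x3F);
    }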
Ai.ai, ai! Kludge upon kludge! Surely it really IS a better idea to
start back at the beginning and rewrite the underlying Lisp engine to
handle UTF-8, and do it right.

To pick up on part of this conversation: UNLESS you use a fixed-length
internal code for ALL Unicode characters (and I suspect the problem is
with the underlying Lisp that makes it expensive to do the obvious
thing and use UCS-4), OR you use UTF-8 or some other self-synchronising
variable-length code, then emacs (or the maintainer) will lose its
(his/her) way.

Emacs is already so complex that it is almost impossible to maintain.
That is partly because of its enormous feature set, partly because it
just grew like Topsy and has little coherence in its design, and
partly because it already has MANY kludges in it to handle multiple
character sets and/or locales.

I joined this group a few weeks ago because of an interest in
cross-platform, cross-OS multilingual text editing and document
processing, and I have been following this conversation with mounting
astonishment over the contortions you guys have been going through to
make emacs jump through hoops.

If you clean emacs up so that UTF-8 is its native character set, then
you have only ONE, CLEAN interface to design around. It should handle
ASCII transparently, as ASCII is a clean subset of UTF-8 (ISO Latin-1
is not, byte for byte, though its characters map directly to
U+0000..U+00FF). That in itself should keep 90% of users quiet and
satisfied.

You handle all those other (obsolete) codes by providing off-line pipe
translators to and from files. Or, if off-line isn't satisfactory,
then do your mapping with front- and back-end mappers, as you seem to
be doing now anyway, except that your internal codes are so broken
that you can't find any place to attach more string and sealing wax to
patch them.

George

--
Dr George W Gerrity                     Phone: +61 2 6386 2679
P O Box 158                             Fax:   +61 2 6386 3431
Harden, NSW 2587, AUSTRALIA             Time:  +10 hours (ref GMT)
PGP RSA Public Key Fingerprint: 73EF 318A DFF5 EB8A 6810 49AC 0763 AF07
