At 09:06 +0900 2001-11-02, Kenichi Handa wrote:

> Markus Kuhn <[EMAIL PROTECTED]> writes:
>
>> Question 1:
>>
>> There is a contradiction in the above: a 4-byte UTF-8 word has only
>> space for 6*3+3 = 21 payload bits, so how do you plan to fit 22 bits
>> in this?
>
> Oops, sorry, it is just my mistake. I meant 5 bytes.
>
>> b) Instead of UTF-8, use your own variant (let's call it UTF-E1),
>> which uses for example the following 4 multi-byte sequences:
>>
>>   0xxxxxxx
>>   110xxxxx 10xxxxxx
>>   1110xxxx 10xxxxxx 10xxxxxx
>>   1111xxxx 10xxxxxx 10xxxxxx 10xxxxxx
>
> Interesting idea! But I think we don't have to save just one byte
> for very rarely used characters.
>
>> But if you really want to deviate from UTF-8, then it is worth
>> examining more fully what properties/tradeoffs of UTF-8 are actually
>> needed for the new Emacs buffer multibyte encoding. UTF-8 is ASCII
>> compatible, preserves the UCS-4BE strcmp result, and is
>> self-synchronizing. Is all that needed inside an Emacs buffer?
>> Would, for example, a simpler 21-bit encoding (let's call it UTF-E2)
>> without self-synchronization but with all the other properties, such
>> as
>>
>>   0xxxxxxx
>>   1xxxxxxx 1xxxxxxx 1xxxxxxx
>>
>> be better suited (it would require slightly modified string-search
>> algorithms, for instance)?
>
> As we need 22 bits, we must encode all non-ASCII chars in 4 bytes
> with the above idea. Isn't that too much?
>
>> c) With 21-bit words, you support the range 0x00_00_00 to
>> 0x1F_FF_FF. But as Unicode and ISO have promised that they will
>> never use any code points above U-10FFFF, you have, even in a 21-bit
>> word, the 0xF_00_00 = 983040 code positions 0x11_00_00 to 0x1F_FF_FF
>> available for private use by Emacs. Aren't almost a million
>> private-use positions more than good enough for what Emacs could
>> need privately?
>
> CCCII will require 884736 (= 96*96*96) code positions, even though
> it is very sparse.
>
>> Question 2:
>>
>> Many encodings (such as UTF-8 and others) have many possible
>> malformed sequences that a normal decoder would reject. What will
>> the UTF-8 -> Emacs converter do if it runs into one of these?
>>
>> Suggestion: It would seem good to have, in the 21/22-bit Emacs
>> space, 256 special characters allocated for representing bytes that
>> came from malformed sequences. They would be displayed to the user
>> in some \hex notation, they can be edited like any normal
>> characters, and there are even keyboard functions for inserting new
>> malformed UTF-8 bytes. The Emacs -> UTF-8 encoder will insert these
>> bytes into the produced byte stream such that a UTF-8 -> Emacs ->
>> UTF-8 roundtrip becomes a completely 100% binary-transparent
>> operation.
>
> I mostly agree. Currently, for such an invalid byte, I think we can
> use a little trick of representing raw 0x80..0xFF by this sequence:
>
>   1100000x 10xxxxxx
>
> (following-char) will return 0x80..0xFF at such a place, so they
> can't be distinguished from normal Unicode characters, but it won't
> be a big problem.
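The bit layouts quoted above are easier to weigh as code. Below is a
minimal sketch in C of encoders for the two encodings Kuhn proposes;
the names utf_e1_encode and utf_e2_encode are mine, not anything in
the Emacs sources, and a 22-bit character space is assumed as in the
discussion.

    #include <stdint.h>
    #include <stddef.h>

    /* UTF-E1: like UTF-8, but the 4-byte form 1111xxxx 10xxxxxx
       10xxxxxx 10xxxxxx carries 4+6+6+6 = 22 payload bits.
       Returns the number of bytes written, or 0 if c does not fit. */
    static size_t
    utf_e1_encode (uint32_t c, unsigned char *out)
    {
      if (c < 0x80)             /* 0xxxxxxx */
        {
          out[0] = (unsigned char) c;
          return 1;
        }
      if (c < 0x800)            /* 110xxxxx 10xxxxxx */
        {
          out[0] = 0xC0 | (c >> 6);
          out[1] = 0x80 | (c & 0x3F);
          return 2;
        }
      if (c < 0x10000)          /* 1110xxxx 10xxxxxx 10xxxxxx */
        {
          out[0] = 0xE0 | (c >> 12);
          out[1] = 0x80 | ((c >> 6) & 0x3F);
          out[2] = 0x80 | (c & 0x3F);
          return 3;
        }
      if (c < 0x400000)         /* 1111xxxx 10xxxxxx 10xxxxxx 10xxxxxx */
        {
          out[0] = 0xF0 | (c >> 18);
          out[1] = 0x80 | ((c >> 12) & 0x3F);
          out[2] = 0x80 | ((c >> 6) & 0x3F);
          out[3] = 0x80 | (c & 0x3F);
          return 4;
        }
      return 0;
    }

    /* UTF-E2: 0xxxxxxx, or 1xxxxxxx 1xxxxxxx 1xxxxxxx carrying
       7+7+7 = 21 payload bits.  Handa's objection is visible here:
       at 7 bits per byte, a 22-bit character space would force a
       fourth byte onto every non-ASCII character. */
    static size_t
    utf_e2_encode (uint32_t c, unsigned char *out)
    {
      if (c < 0x80)
        {
          out[0] = (unsigned char) c;
          return 1;
        }
      if (c < 0x200000)
        {
          out[0] = 0x80 | (c >> 14);
          out[1] = 0x80 | ((c >> 7) & 0x7F);
          out[2] = 0x80 | (c & 0x7F);
          return 3;
        }
      return 0;
    }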
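Kuhn's aside that UTF-E2 "would require slightly modified
string-search algorithms" is really about self-synchronization, and a
short sketch (again C, with a function name of my own invention) makes
the difference concrete. In UTF-8 every continuation byte matches
10xxxxxx, so from an arbitrary byte position a scanner can step to the
next character boundary without decoding from the start of the buffer;
in UTF-E2 every byte of a multibyte character matches 1xxxxxxx, so no
such local test exists.

    #include <stddef.h>

    /* Advance from byte position pos to the start of the next
       character by skipping continuation bytes (10xxxxxx). */
    static size_t
    utf8_next_boundary (const unsigned char *buf, size_t len, size_t pos)
    {
      pos++;
      while (pos < len && (buf[pos] & 0xC0) == 0x80)
        pos++;
      return pos;
    }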
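Handa's closing trick also becomes clearer as code. A sketch under the
same assumptions (C; hypothetical names; my reading of his scheme): a
raw byte 0x80..0xFF from a malformed sequence is stored as 1100000x
10xxxxxx, whose seven payload bits are the low seven bits of the byte.
The lead bytes 0xC0 and 0xC1 never occur in valid UTF-8 (any two-byte
sequence starting with them would be overlong), so the escape cannot
collide with legitimately encoded text at the byte level; the
collision is only at the character level, which is the ambiguity he
concedes.

    /* Store raw byte b (0x80..0xFF) as the two-byte escape. */
    static void
    raw_byte_store (unsigned char b, unsigned char *out)
    {
      out[0] = 0xC0 | ((b >> 6) & 1);   /* 1100000x */
      out[1] = 0x80 | (b & 0x3F);       /* 10xxxxxx */
    }

    /* The character value seen at such a position, i.e. what
       (following-char) would return: 0x80 plus the seven payload
       bits, which reconstructs the original byte.  Re-emitting that
       byte on output gives Kuhn's binary-transparent roundtrip, but
       the value itself is indistinguishable from U+0080..U+00FF. */
    static unsigned int
    raw_byte_char (const unsigned char *in)
    {
      return 0x80 | ((in[0] & 1) << 6) | (in[1] & 0x3F);
    }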
Ai.ai, ai! Kludge upon kludge! Surely it really IS a better idea to
start back at the beginning and rewrite the underlying Lisp engine to
handle UTF-8, and do it right.

To pick up on part of this conversation: UNLESS you use a fixed-length
internal code for ALL Unicode characters (and I suspect the problem is
with the underlying Lisp that makes it expensive to do the obvious
thing and use UCS-4), OR you use UTF-8 or some other self-synchronising
variable-length code, then emacs (or the maintainer) will lose its
(his/her) way.

Emacs is already so complex that it is almost impossible to maintain.
That is partly because of its enormous feature set, partly because it
just grew like Topsy and has little coherence in its design, and
partly because it already has MANY kludges in it to handle multiple
character sets and/or locales.

I joined this group a few weeks ago because of an interest in
cross-platform, cross-OS multilingual text editing and document
processing, and I have been following this conversation with mounting
astonishment over the contortions you guys have been going through to
make emacs jump through hoops.

If you clean emacs up so that UTF-8 is its native character set, then
you have only ONE, CLEAN interface to design around. It should handle
ASCII transparently, as ASCII is a clean subset of UTF-8 (ISO Latin-1
is not, byte for byte, though its characters map directly to
U+0000..U+00FF). That in itself should keep 90% of users quiet and
satisfied.

You handle all those other (obsolete) codes by providing off-line pipe
translators to and from files. Or, if off-line isn't satisfactory,
then do your mapping with front- and back-end mappers, as you seem to
be doing now anyway, except that your internal codes are so broken
that you can't find any place to attach more string and sealing wax to
patch them.

George

--
Dr George W Gerrity                     Phone: +61 2 6386 2679
P O Box 158                             Fax:   +61 2 6386 3431
Harden, NSW 2587, AUSTRALIA             Time:  +10 hours (ref GMT)
PGP RSA Public Key Fingerprint: 73EF 318A DFF5 EB8A 6810 49AC 0763 AF07
