Re: current idea

Kenichi Handa Thu, 01 Nov 2001 15:35:54 -0800

Markus Kuhn <[EMAIL PROTECTED]> writes:
> Question 1:

> There is a contradiction in the above: A 4-byte UTF-8 word has only space
> for 6*3+3=21 payload bits, so how do you plan to fit 22 bits in this?


Oops, sorry, that was just my mistake.  I meant 5-byte.
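
The bit counts in this exchange can be checked with a small sketch. It assumes the original UTF-8 layout with sequences of up to 6 bytes; the function name `utf8_payload_bits` is only illustrative and does not come from Emacs.

```python
def utf8_payload_bits(nbytes: int) -> int:
    """Payload bits in an n-byte sequence of the original (1..6 byte) UTF-8 layout."""
    if nbytes == 1:
        return 7  # 0xxxxxxx
    if not 2 <= nbytes <= 6:
        raise ValueError("sequence length must be 1..6")
    # The lead byte spends n one-bits and a zero on the length marker,
    # leaving 7 - n payload bits; each trailing byte (10xxxxxx) carries 6.
    lead_bits = 7 - nbytes
    return lead_bits + 6 * (nbytes - 1)


# Shortest sequence length that can hold a 22-bit character code.
shortest_for_22_bits = min(n for n in range(1, 7) if utf8_payload_bits(n) >= 22)
```

Under these assumptions a 4-byte sequence carries 3 + 6*3 = 21 bits and a 5-byte sequence carries 2 + 6*4 = 26 bits, so 22 bits first fit in the 5-byte form.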

>   b) Instead of UTF-8, use your own variant (let's call it UTF-E1)
>      which uses for example the following 4 multi-byte sequences:

>        0xxxxxxx
>        110xxxxx 10xxxxxx
>        1110xxxx 10xxxxxx 10xxxxxx
>        1111xxxx 10xxxxxx 10xxxxxx 10xxxxxx

Interesting idea!  But I don't think we need to go that far
just to save one byte for very rarely used characters.
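
For reference, the four UTF-E1 patterns quoted above could be encoded roughly as follows. This is only a sketch of the proposal as quoted: the name `utf_e1_encode` is hypothetical, and the range limits are derived from the bit patterns rather than stated in the thread.

```python
def utf_e1_encode(code: int) -> bytes:
    """Encode a code of up to 22 bits with the four UTF-E1 patterns quoted above."""
    if not 0 <= code < (1 << 22):
        raise ValueError("UTF-E1 as quoted holds at most 22 bits")
    if code < 0x80:        # 0xxxxxxx
        return bytes([code])
    if code < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    if code < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code >> 12),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    # 1111xxxx 10xxxxxx 10xxxxxx 10xxxxxx: 4 + 3*6 = 22 payload bits
    return bytes([0xF0 | (code >> 18),
                  0x80 | ((code >> 12) & 0x3F),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])
```

For codes up to U-10FFFF the output is byte-identical to UTF-8; only codes above 0x1FFFFF use lead bytes 0xF8..0xFF, which is where the scheme departs from UTF-8.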

>      But if you really want to deviate from UTF-8, then it is worth
>      examining more fully, what properties/tradeoffs of UTF-8
>      are actually needed for the new Emacs buffer-multi-byte encoding.
>      UTF-8 is ASCII compatible, preserves the UCS-4BE strcmp result
>      and is self synchronizing. Is all that needed inside an Emacs
>      buffer? Would for example a simpler 21-bit encoding (let's
>      call it UTF-E2) without self-synchronization but all the other
>      properties such as

>        0xxxxxxx
>        1xxxxxxx 1xxxxxxx 1xxxxxxx

>      be better suited (it would require slightly modified
>      string-search algorithms though, for instance)?

As we need 22 bits, the above idea would force us to encode
every non-ASCII char in 4 bytes.  Isn't that too much?
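
A rough sketch of the UTF-E2 layout makes the byte counts concrete. It assumes big-endian payload order (so that byte-wise comparison keeps code order) and adds a hypothetical `width` parameter to model the 4-byte variant discussed in the reply; neither detail is spelled out in the thread.

```python
def utf_e2_encode(code: int, width: int = 3) -> bytes:
    """Encode ASCII as one byte, anything else as `width` bytes of the form 1xxxxxxx."""
    if code < 0:
        raise ValueError("code must be non-negative")
    if code < 0x80:
        return bytes([code])  # 0xxxxxxx
    if code >= 1 << (7 * width):
        raise ValueError("code does not fit in %d payload bits" % (7 * width))
    # Every byte has the high bit set and carries 7 payload bits,
    # most significant group first.
    return bytes(0x80 | ((code >> (7 * i)) & 0x7F) for i in reversed(range(width)))
```

With `width=3` the payload is 3*7 = 21 bits, so a 22-bit code only fits with `width=4`, which makes every non-ASCII character cost 4 bytes.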

>   c) With 21-bit words, you support the range 0x00_00_00 to
>      0x1F_FF_FF. But as Unicode and ISO promised that they will
>      never use any code points above U-10FFFF, you have even in
>      a 21-bit word the 0xF_00_00 = 983040 code positions
>      0x11_00_00 to 0x1F_FF_FF available for private use by emacs.
>      Aren't almost a million private use positions more than good
>      enough for what Emacs could need privately?

CCCII will require a code space of 884736 (= 96*96*96)
positions, even though it is very sparse.
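
The two figures can be put side by side with a few lines of arithmetic. The variable names are only illustrative.

```python
# Positions above the Unicode range that still fit in a 21-bit word.
private_positions = 0x1FFFFF - 0x110000 + 1

# Code space CCCII would occupy when mapped as a 96 x 96 x 96 cube.
cccii_positions = 96 * 96 * 96

# What would be left over for every other private use.
remaining_positions = private_positions - cccii_positions
```

This gives 983040 private positions, of which CCCII alone would claim 884736, leaving 98304 for everything else.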

> Question 2:

> Many encodings (such as UTF-8 and others) have many possible
> malformed sequences that a normal decoder would reject. What will
> the UTF-8 -> Emacs converter do if it runs into one of these?

> Suggestion: It would seem good to have in the 21/22-bit Emacs space 256
> special characters allocated for representing bytes that came from
> malformed sequences. They would be displayed to the user in some \hex
> notation, they can be edited like any normal characters and there are even
> keyboard functions for inserting new malformed UTF-8 bytes. The Emacs ->
> UTF-8 encoder will insert these bytes into the produced bytestream such
> that a UTF-8 -> Emacs -> UTF-8 roundtrip becomes a completely 100%
> binary-transparent operation.

I mostly agree.  Currently, I think we can handle such
invalid bytes with a little trick: represent a raw byte
0x80..0xFF by this sequence:
        1100000x 10xxxxxx

(following-char) will return 0x80..0xFF at such a place,
so they can't be distinguished from normal Unicode
characters, but it won't be a big problem.
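
One possible reading of the trick, as a sketch: the message does not say how the bits of the raw byte are assigned to the x positions, so the code below assumes the low 7 bits of the byte fill them (the top bit is always 1 for 0x80..0xFF and can be restored on decoding). The function names are hypothetical.

```python
def encode_raw_byte(b: int) -> bytes:
    """Represent a raw byte 0x80..0xFF as the two-byte form 1100000x 10xxxxxx."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only raw bytes 0x80..0xFF use this form")
    # Assumption: bit 6 of the byte goes into the lead byte,
    # bits 5..0 go into the trailing byte.
    return bytes([0xC0 | ((b >> 6) & 0x01), 0x80 | (b & 0x3F)])


def decode_raw_byte(seq: bytes) -> int:
    """Recover the raw byte from a 1100000x 10xxxxxx pair."""
    lead, trail = seq  # raises ValueError unless exactly two bytes
    if (lead & 0xFE) != 0xC0 or (trail & 0xC0) != 0x80:
        raise ValueError("not a raw-byte sequence")
    return 0x80 | ((lead & 0x01) << 6) | (trail & 0x3F)
```

The lead bytes 0xC0 and 0xC1 never occur in well-formed UTF-8, because they could only start overlong encodings of ASCII, so these pairs do not collide with any valid character sequence in the buffer.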

---
Ken'ichi HANDA
[EMAIL PROTECTED]
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
