Markus Kuhn <[EMAIL PROTECTED]> writes:
> Question 1:
> There is a contradiction in the above: A 4-byte UTF-8 word has only space
> for 6*3+3=21 payload bits, so how do you plan to fit 22 bits in this?
Oops, sorry, that was just my mistake. I meant 5-byte.
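The bit counts above can be checked with a small sketch. It assumes the original UTF-8 layout that allows sequences of up to 6 bytes; the function name `utf8_payload_bits` is hypothetical and only used for illustration.

```python
def utf8_payload_bits(n_bytes: int) -> int:
    """Payload bits of an n-byte sequence in the original (up to 6-byte) UTF-8 layout."""
    if not 1 <= n_bytes <= 6:
        raise ValueError("UTF-8 sequences are 1 to 6 bytes long in this layout")
    if n_bytes == 1:
        return 7  # 0xxxxxxx
    # The lead byte spends n_bytes + 1 bits on its length marker
    # (e.g. 11110 for a 4-byte sequence); the rest is payload.
    lead_bits = 8 - (n_bytes + 1)
    # Each continuation byte (10xxxxxx) carries 6 payload bits.
    return lead_bits + 6 * (n_bytes - 1)


for n in range(1, 7):
    print(n, utf8_payload_bits(n))
```

Under these assumptions a 4-byte sequence carries 21 bits and a 5-byte sequence 26 bits, so a 22-bit value needs the 5-byte form.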
> b) Instead of UTF-8, use your own variant (let's call it UTF-E1)
> which uses for example the following 4 multi-byte sequences:
> 0xxxxxxx
> 110xxxxx 10xxxxxx
> 1110xxxx 10xxxxxx 10xxxxxx
> 1111xxxx 10xxxxxx 10xxxxxx 10xxxxxx
Interesting idea! But I don't think we need to save just
one byte for very rarely used characters.
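For reference, the quoted UTF-E1 layout could be encoded roughly as follows. This is only a sketch of the four bit patterns quoted above, not code from Emacs; the name `encode_utf_e1` is hypothetical, and the choice to always use the shortest form is an assumption.

```python
def encode_utf_e1(cp: int) -> bytes:
    """Encode a code point with the four UTF-E1 patterns quoted above (sketch)."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:                      # 0xxxxxxx (7 bits)
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx (11 bits)
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx (16 bits)
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x400000:                  # 1111xxxx 10xxxxxx 10xxxxxx 10xxxxxx (22 bits)
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point does not fit in 22 bits")
```

The only difference from UTF-8 is the 4-byte lead byte, which gives up the fifth marker bit to gain one payload bit, so 22-bit values fit in 4 bytes instead of 5.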
> But if you really want to deviate from UTF-8, then it is worth
> examining more fully, what properties/tradeoffs of UTF-8
> are actually needed for the new Emacs buffer-multi-byte encoding.
> UTF-8 is ASCII compatible, preserves the UCS-4BE strcmp result
> and is self synchronizing. Is all that needed inside an Emacs
> buffer? Would for example a simpler 21-bit encoding (let's
> call it UTF-E2) without self-synchronization but all the other
> properties such as
> 0xxxxxxx
> 1xxxxxxx 1xxxxxxx 1xxxxxxx
> be better suited (it would require slightly modified
> string-search algorithms though, for instance)?
As we need 22 bits, we would have to encode all non-ASCII
chars in 4 bytes with the above idea. Isn't that too much?
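The size argument can be made concrete with a sketch of the quoted UTF-E2 idea. It assumes every byte of a multi-byte sequence has its high bit set and carries 7 payload bits, most significant group first; the `width` parameter and the name `encode_utf_e2` are hypothetical additions for illustration.

```python
def encode_utf_e2(cp: int, width: int = 3) -> bytes:
    """Encode cp as one ASCII byte or as `width` bytes of the form 1xxxxxxx (sketch)."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:
        return bytes([cp])             # 0xxxxxxx
    if cp >= 1 << (7 * width):
        raise ValueError(f"code point does not fit in {7 * width} bits")
    # Emit the 7-bit groups most significant first, each with the high bit set.
    return bytes(0x80 | ((cp >> (7 * i)) & 0x7F)
                 for i in reversed(range(width)))


print(encode_utf_e2(0x1FFFFF).hex())           # largest 21-bit value, 3 bytes
print(encode_utf_e2(0x200000, width=4).hex())  # first 22-bit value needs width=4
```

With `width=3` the payload is 3*7 = 21 bits, so a 22-bit code space forces `width=4`, and then every non-ASCII character costs 4 bytes.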
> c) With 21-bit words, you support the range 0x00_00_00 to
> 0x1F_FF_FF. But as Unicode and ISO promised that they will
> never use any code points above U-10FFFF, you have even in
> a 21-bit word the 0xF_00_00 = 983040 code positions
> 0x11_00_00 to 0x1F_FF_FF available for private use by emacs.
> Aren't almost a million private use positions more than good
> enough for what Emacs could need privately?
CCCII will require a code space of 884736 (= 96*96*96)
positions, even though it is very sparse.
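The two figures can be compared directly. The sketch below only redoes the arithmetic from the quoted text and from the CCCII figure above; the variable names are hypothetical.

```python
# Code positions above U-10FFFF that still fit in a 21-bit word.
private_positions = 0x1FFFFF - 0x110000 + 1

# CCCII code space as stated above: three bytes of 96 values each.
cccii_positions = 96 * 96 * 96

# What would be left of the private range after reserving CCCII.
remaining = private_positions - cccii_positions

print(private_positions, cccii_positions, remaining)
```

So CCCII alone would take roughly 90% of the 983040 private positions of a 21-bit word, leaving 98304 for everything else.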
> Question 2:
> Many encodings (such as UTF-8 and others) have many possible
> malformed sequences that a normal decoder would reject. What will
> the UTF-8 -> Emacs converter do if it runs into one of these?
> Suggestion: It would seem good to have in the 21/22-bit Emacs space 256
> special characters allocated for representing bytes that came from
> malformed sequences. They would be displayed to the user in some \hex
> notation, they can be edited like any normal characters and there are even
> keyboard functions for inserting new malformed UTF-8 bytes. The Emacs ->
> UTF-8 encoder will insert these bytes into the produced bytestream such
> that a UTF-8 -> Emacs -> UTF-8 roundtrip becomes a completely 100%
> binary-transparent operation.
I mostly agree. Currently, for such an invalid byte, I
think we can use a little trick of representing raw
0x80..0xFF by this sequence:
1100000x 10xxxxxx
(following-char) will return 0x80..0xFF at such a place,
so they can't be distinguished from normal Unicode
characters, but it won't be a big problem.
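A minimal sketch of this trick follows. It assumes that the 7 payload bits of 1100000x 10xxxxxx hold the low 7 bits of the raw byte, with the high bit implied; that bit assignment and the function names are assumptions for illustration, not something fixed above.

```python
def encode_raw_byte(b: int) -> bytes:
    """Represent a raw byte 0x80..0xFF as the 2-byte form 1100000x 10xxxxxx (sketch)."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only raw bytes 0x80..0xFF are represented this way")
    # Bit 6 of the raw byte goes into the lead byte, bits 5..0 into the trail byte.
    return bytes([0xC0 | ((b >> 6) & 0x01),
                  0x80 | (b & 0x3F)])


def decode_raw_byte(seq: bytes) -> int:
    """Recover the raw byte from a 1100000x 10xxxxxx sequence (sketch)."""
    if len(seq) != 2 or (seq[0] & 0xFE) != 0xC0 or (seq[1] & 0xC0) != 0x80:
        raise ValueError("not a raw-byte sequence")
    # The high bit of the raw byte is implied, so it is set unconditionally.
    return 0x80 | ((seq[0] & 0x01) << 6) | (seq[1] & 0x3F)
```

A valid UTF-8 stream never contains the lead bytes 0xC0 and 0xC1, since they would only start overlong encodings of ASCII, so these sequences stay distinct from real characters at the byte level and the original byte can be written back unchanged.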
---
Ken'ichi HANDA
[EMAIL PROTECTED]
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/