Unicode Support for GNU Emacs
*****************************
This memo documents the current plans for bringing Unicode to GNU
Emacs. It describes the requirements and constraints for the new Emacs
character set, and the transformation format used to encode this
character set in buffers.
It reflects the discussion on the `emacs-unicode' mailing list and
the `Emacs-Unicode-990824' proposal.
Version $Revision: 1.1 $, written by Florian Weimer.
Requirements
============
The internal character code of a character has to fit in 22 bits.
(The remaining bits of a 32-bit host integer are required for tagging.)
The representation of characters in buffers and strings has to be
compact; 22 or more bits per ASCII character are not acceptable.
Latin scripts are unified.
There are strong reservations regarding Han unification. Emacs must
be able to display Han characters using a font which matches the
expectations of CJKV users.
In addition, there are some character sets for which no corresponding
code points have been assigned yet in Unicode.
The Emacs character set should deviate as little as possible from the
Unicode character set (and similarly, from other included character
sets). Each deviation has to be documented, and since documentation is
now widely available [Unicode], it does not make sense to rewrite this
documentation from scratch.
(Up to this point, these requirements were mentioned in previous
discussions on the `emacs-unicode' mailing list.)
We should assume that UTF-8 [RFC 2279] will become the dominant
encoding on GNU systems. Users will want to enable it by default.
Therefore, we have to guarantee the following things:
* Emacs must be able to read any file in UTF-8, even if it contains
invalid UTF-8 sequences.
* If a file is read into Emacs and written again without editing,
the written file must match the original, including possibly broken
UTF-8 sequences.
* If the user instructs Emacs to read a file, edits a certain part,
and writes it back, portions which have not been edited should not
change in any way (even in the presence of broken UTF-8 sequences).
On some proprietary platforms, there is a strong trend towards
UTF-16, and similar requirements apply there (with broken surrogate
pairs instead of broken UTF-8 sequences).
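As an illustration of the kind of transparency required here (not part
of the proposal itself), Python's `surrogateescape' error handler
round-trips invalid UTF-8 byte-for-byte in a similar way; the byte
string below is an arbitrary example:

     # Illustration only: Python's "surrogateescape" error handler shows
     # the required kind of byte-for-byte round trip.  Undecodable bytes
     # are mapped to lone surrogates on input and restored on output.
     raw = b"valid text \xC3\xA9 then a stray byte \xFF and more"
     text = raw.decode("utf-8", errors="surrogateescape")
     assert text.encode("utf-8", errors="surrogateescape") == raw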
Rejected Requirements
=====================
Latin unification means that it is not possible to read an ISO 2022
encoded file (which might contain several scripts from ISO 8859 that
are unified in Unicode) and write it back again so that it matches the
original.
In addition, the shape of accents varies from one Latin script to
another, and those accents are unified in Unicode. This might
introduce slight typographic inaccuracies if the wrong font is chosen;
however, this seems acceptable in a text editor.
Tools Available for Implementation
==================================
We can achieve Latin unification either by carefully unifying the
existing MULE charsets, or by switching to Unicode. Because of other
requirements, in particular documentation, the latter seems to be
desirable.
There are several approaches for working around Han unification:
* plane 14 language tags [Plane14] (now an official part of Unicode)
* text properties
* separate CJKV character sets (in particular for Japanese, Korean,
and Vietnamese users; Chinese seems to be less problematic)
A language tag in each character is not possible because of the
22-bit limit for a character code.
Because of the need for a Han unification workaround, straightforward
UCS-4 cannot be used for the Emacs character set.
The Current Proposal
====================
The GNU Emacs Unicode proposal consists of two parts: a character
set, and an encoding of this character set for use in buffers and
strings. Basic semantics have not been discussed much yet.
The Emacs Character Set
-----------------------
The Unicode-compatible Character Set for Emacs ("UCS-E") is based on
UCS-4. In the following, we use the U+ABCDEF notation (where ABCDEF
are hexadecimal digits) to refer to UCS-4 characters, and the E+ABCDEF
notation to refer to characters in UCS-E.
The character range E+000000 up to E+10FFFF is identical to UCS-4
(U+000000 up to U+10FFFF, 17 planes of 65,536 code points each). This
is exactly the range which is addressable using surrogate pairs and
UTF-16.
However, the planes beyond this range are used differently: planes 17
to 23 are reserved for Emacs (E+110000-E+17FFFF), planes 24 to 31 are
intended for private use (E+180000-E+1FFFFF), and planes 32 to 63 are
partly used for encoding CJK characters, partly for private use
characters (E+200000-E+3FFFFF). This results in the following picture,
with bit masks in the first column:
     00 xxxx xxxxxxxx xxxxxxxx   Unicode U+000000 - U+0FFFFF
     01 0000 xxxxxxxx xxxxxxxx   Unicode U+100000 - U+10FFFF
     01 0ppp xxxxxxxx xxxxxxxx   7 64K planes reserved for Emacs
     01 1ppp xxxxxxxx xxxxxxxx   8 64K planes for private use
     1x xxxx xxxxxxxx xxxxxxxx   for private use, CNS 3-16, and CCCII
Japanese (and Korean and Vietnamese) Han characters may be mapped in
the range E+110000-E+17FFFF (third line in the table above, consisting
of 458,752 code points). Characters needed for representing broken
UTF-8 sequences transparently can reside here, too.
CCCII characters reside in the range E+328000-E+3FFFFF (contained in
the last line of the table above). A previous version of this proposal
allocated code points for characters in CNS planes 3 to 16 in the range
E+308800-E+327FFF (last line as well), but recent versions of the
Unicode standard include some of these characters already.
As a result, private use characters span at least the range
E+180000-E+3087FF (in the fourth and last lines of the table above),
which contains 1,607,680 code points.
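The allocation above can be summarized in a short sketch (Python,
purely illustrative; the region names and the function are ours, not
part of the proposal), which classifies a UCS-E code point according
to the table and ranges given above:

     def ucse_region(cp):
         """Classify a 22-bit UCS-E code point by allocation area."""
         if not 0 <= cp <= 0x3FFFFF:
             raise ValueError("not a 22-bit UCS-E code point: %#x" % cp)
         if cp <= 0x10FFFF:
             return "Unicode (U+%06X)" % cp
         if cp <= 0x17FFFF:
             return "reserved for Emacs (CJKV disunification, raw bytes)"
         if cp <= 0x1FFFFF:
             return "private use (planes 24-31)"
         if cp <= 0x3087FF:
             return "private use (lower part of planes 32-63)"
         if cp <= 0x327FFF:
             return "formerly allocated to CNS planes 3-16"
         return "CCCII"

     print(ucse_region(0x000041))   # Unicode (U+000041)
     print(ucse_region(0x110000))   # reserved for Emacs ...
     print(ucse_region(0x328000))   # CCCII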
The Buffer and String Encoding
------------------------------
For the buffer encoding, we use a modified version of UTF-8
[RFC 2279], which we call "UTF-E".
UTF-E is a variable-length transformation format of UCS-E, encoding
UCS-E code points as a sequence of bytes (octets). It is identical to
UTF-8, with the exception that it is applied to UCS-E instead of UCS-4.
In addition, in order to achieve complete transparency for partially
broken UTF-8 (which might contain surrogates) and UTF-16 (which might
contain invalid surrogate characters), it is necessary that the
transformation preserves all surrogate characters. (However, the
display engine can show them as invalid; no surrogate processing has to
be performed at this point.)
As a result, ASCII characters (in the range E+000000 to E+00007F)
are encoded as themselves, and most Latin characters are encoded using
two bytes. Since UCS-E is a 22-bit character set, at most five bytes
are required per encoded character.
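A minimal encoder sketch (Python, illustrative only) may make the
transformation concrete; it follows the UTF-8 bit patterns of RFC 2279,
merely applied to the 22-bit UCS-E range, and deliberately lets
surrogate code points pass through unchanged:

     def utf_e_encode(cp):
         """Encode one UCS-E code point as a sequence of bytes."""
         if cp < 0 or cp > 0x3FFFFF:
             raise ValueError("not a UCS-E code point: %#x" % cp)
         if cp < 0x80:                     # 1 byte:  0xxxxxxx
             return bytes([cp])
         if cp < 0x800:                    # 2 bytes: 110xxxxx 10xxxxxx
             return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
         if cp < 0x10000:                  # 3 bytes (surrogates included)
             return bytes([0xE0 | cp >> 12,
                           0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
         if cp < 0x200000:                 # 4 bytes
             return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                           0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
         return bytes([0xF8,               # 5 bytes cover all 22 bits
                       0x80 | cp >> 18 & 0x3F, 0x80 | cp >> 12 & 0x3F,
                       0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

     assert utf_e_encode(0x000041) == b"A"      # ASCII stays one byte
     assert len(utf_e_encode(0x328000)) == 5    # CCCII area needs five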
Open Issues
===========
The additional requirements (round-trip compatibility for UTF-8 and
UTF-16, even if the encoding is broken) have to be reviewed.
It is currently not clear if we still need to allocate code points
for all the CNS planes from 3 to 16, since Unicode 3.1 seems to already
include characters from CNS planes 1 to 7, and 15.
In order to check if the allocated ranges are sufficient, we should
start to actually define the non-Unicode portions of the UCS-E
character set now, and not defer it any longer.
The meaning of a private use character should be clarified. We
probably need three ranges: one which can be used by the end user (and
to which UCS planes like plane 17 and beyond can be mapped, for
example), a range for special characters used by Lisp packages, and
another one for future character allocation.
For compactness reasons, we might want to modify UTF-E so that it
encodes the special characters used for broken UTF-8 sequences as
overlong two-byte sequences: the UCS-E character corresponding to an
isolated `80h' character might be encoded as the bytes `C0h 80h', or
`FFh' might result in `C1h BFh'. However, we still have to assign
UCS-E code points distinct from E+000080-E+0000FF, otherwise it will be
complicated to handle these characters correctly at the Lisp level. In
addition, we should probably reserve 256 UCS-E characters, so that we
can represent illegal byte sequences in other encodings, too (and not
just UTF-8).
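A sketch of this idea (Python, illustrative; the code point block
`RAW_BYTE_BASE' is a placeholder inside the Emacs-reserved planes, and
no such block has actually been allocated yet):

     RAW_BYTE_BASE = 0x170000   # hypothetical block of 256 raw-byte chars

     def raw_byte_to_ucse(b):
         """Map an undecodable input byte 80h-FFh to a UCS-E code point."""
         assert 0x80 <= b <= 0xFF
         return RAW_BYTE_BASE + b

     def encode_raw_byte(b):
         """Encode a raw-byte character as an overlong two-byte sequence.

         80h becomes C0h 80h, FFh becomes C1h BFh; the lead bytes C0h
         and C1h never occur in well-formed UTF-8, so these forms do
         not collide with anything else in UTF-E."""
         assert 0x80 <= b <= 0xFF
         v = b - 0x80                   # 00h-7Fh, the "overlong" value
         return bytes([0xC0 | v >> 6, 0x80 | v & 0x3F])

     assert encode_raw_byte(0x80) == b"\xC0\x80"
     assert encode_raw_byte(0xFF) == b"\xC1\xBF"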
Similarly, to save space, Japanese characters could be encoded in
overlong forms, too. However, such tricks should be used with care
because they make the transformation much more complicated.
When UTF-8 or UTF-16 is decoded and converted to UCS-E, it is
possible to use plane 14 language tags present in the input to undo Han
unification. However, this must be done carefully, and the language
tags have to be preserved in the UTF-E encoding, otherwise it would be
impossible to write the original file to disk again. (Perhaps the
encoding/decoding algorithms should be presented in a later revision of
this document.)
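A rough sketch of such a decoding pass (Python, under entirely
hypothetical assumptions: a Japanese-specific copy of the CJK Unified
Ideographs block is assumed to start at `JA_HAN_BASE' in the
Emacs-reserved planes, and no such allocation exists yet) illustrates
both the remapping and the preservation of the tags themselves:

     JA_HAN_BASE = 0x120000          # hypothetical Japanese Han block
     LANG_TAG_INTRO = 0xE0001        # plane 14 language tag introducer
     TAG_CANCEL = 0xE007F            # plane 14 cancel tag

     def disunify(code_points):
         """Remap unified Han characters that follow a `ja' language tag."""
         out, lang, i = [], None, 0
         while i < len(code_points):
             cp = code_points[i]
             if cp == LANG_TAG_INTRO:
                 # Collect the tag (U+E0020..U+E007E map to ASCII) and
                 # copy the tag characters through unchanged.
                 j = i + 1
                 while (j < len(code_points)
                        and 0xE0020 <= code_points[j] <= 0xE007E):
                     j += 1
                 lang = "".join(chr(c - 0xE0000) for c in code_points[i + 1:j])
                 out.extend(code_points[i:j])
                 i = j
                 continue
             if cp == TAG_CANCEL:
                 lang = None            # cancel tag, also copied through
             elif lang == "ja" and 0x4E00 <= cp <= 0x9FFF:
                 cp = JA_HAN_BASE + (cp - 0x4E00)
             out.append(cp)
             i += 1
         return out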
History
=======
This memo is a rough summary of the discussion on the
`emacs-unicode' mailing list. The resulting proposal (called
`Emacs-Unicode-990824' in its last version) was reformulated to be more
explicit, and the UCS-E code point allocation was moved to the
description of UCS-E (it previously was described in terms of UTF-E,
which seems to be suboptimal).
Bibliography
============
[RFC 2279]
F. Yergeau: UTF-8, a transformation format of ISO 10646.
Published as RFC 2279.
[Unicode]
The Unicode web site at <http://www.unicode.org/> offers access to
the complete contents of _The Unicode Standard, Version 3.0_,
which was originally published by Addison Wesley Longman, Inc.
(In fact, the web site is the only source for subsequent Unicode
Standard versions such as 3.1, which have not appeared in print.)
[Plane14]
Tag Characters. Section 13.7 in The Unicode Standard, Version 3.1.
<http://www.unicode.org/unicode/reports/tr27/#tag>