I'm sorry for not responding for so long.
In recent months, I have been working on the design of
Unicode-based Emacs, and to verify that design, I've also
been working on sample implementation code. But this work
is going at a very slow pace. :-(
Here, I'll post a memo about my current ideas. It will be
convenient to save this file and read it in outline mode.
---
Ken'ichi HANDA
[EMAIL PROTECTED]
## Design of Unicode-based Emacs -*- outline -*-
### Ver.1.0 2001.11.1
## CHARACTER ##
### Definition
A character. :-p
#### Implementation detail
A character is represented by a number (a character code), both in
Elisp and C.
The code space is 22-bit. Thus, Emacs can handle 2^22 different
characters. Each character belongs to one or more charsets.
A character in a buffer/string is represented by a multibyte sequence
(UTF-8 format) of at most 4 bytes.
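As a rough illustration of the multibyte form (a sketch only, not the sample implementation), a character code can be turned into UTF-8 as below. The names utf8_char_bytes and utf8_char_string are made up for this example. Note that the standard 4-byte UTF-8 form holds only 21 bits, so the sketch returns 0 for codes above #x1FFFFF; how such codes would be stored is not covered here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the memo): number of bytes the
   standard UTF-8 form of C occupies, or 0 if C needs more than
   4 bytes (C > 0x1FFFFF).  */
size_t
utf8_char_bytes (unsigned int c)
{
  if (c < 0x80) return 1;
  if (c < 0x800) return 2;
  if (c < 0x10000) return 3;
  if (c < 0x200000) return 4;
  return 0;
}

/* Hypothetical helper: store the UTF-8 form of C at STR, which must
   have room for at least 4 bytes.  Return the length stored, or 0
   if C does not fit in 4 bytes.  */
size_t
utf8_char_string (unsigned int c, unsigned char *str)
{
  size_t len = utf8_char_bytes (c);

  assert (str != NULL);
  switch (len)
    {
    case 1:
      str[0] = (unsigned char) c;
      break;
    case 2:
      str[0] = (unsigned char) (0xC0 | (c >> 6));
      str[1] = (unsigned char) (0x80 | (c & 0x3F));
      break;
    case 3:
      str[0] = (unsigned char) (0xE0 | (c >> 12));
      str[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      str[2] = (unsigned char) (0x80 | (c & 0x3F));
      break;
    case 4:
      str[0] = (unsigned char) (0xF0 | (c >> 18));
      str[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
      str[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      str[3] = (unsigned char) (0x80 | (c & 0x3F));
      break;
    default:
      break;
    }
  return len;
}
```

For example, the code #x3042 is stored as the three bytes E3 81 82.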
#### C level APIs in character.h
/* Return a Lisp character whose code is C. */
#define make_char(c)
/* Nonzero iff C is an ASCII byte. */
#define ASCII_BYTE_P(c)
/* Check if Lisp object X is a character or not. */
#define CHECK_CHARACTER(x, i)
/* Nonzero iff C is a valid character code. */
#define CHAR_VALID_P(c)
/* Nonzero iff C is an ASCII character. */
#define ASCII_CHAR_P(c)
/* Nonzero iff C is an ASCII character or a control character. */
#define ASCII_OR_CONTROL_P(c)
/* Nonzero iff C is a character of code less than 0x100. */
#define SINGLE_BYTE_CHAR_P(c)
/* Nonzero if character C has a valid printable glyph. */
#define CHAR_PRINTABLE_P(c)
/* How many bytes C occupies in a multibyte buffer. */
#define CHAR_BYTES(c)
/* How many columns C occupies on a screen. */
#define CHAR_WIDTH(c)
/* Store the multibyte form of the character C in P. The caller should
allocate an area of at least MAX_MULTIBYTE_LENGTH bytes at P in
advance. Returns the length of the multibyte form. */
#define CHAR_STRING(c, p)
/* Like CHAR_STRING, but advance P to the end of the multibyte
form. */
#define CHAR_STRING_ADVANCE(c, p)
/* Like CHAR_STRING_ADVANCE but it is assured that C is 0..255. */
#define BYTE_STRING_ADVANCE(c, p)
/* Nonzero iff BYTE starts a character in a multibyte form. */
#define CHAR_HEAD_P(byte)
/* Nonzero iff BYTE starts a non-ASCII character in a multibyte
form. */
#define NON_ASCII_CHAR_HEAD_P(byte)
/* How many bytes a character that starts with BYTE occupies in a
multibyte form. */
#define BYTES_BY_CHAR_HEAD(byte)
/* The byte length of the multibyte form at unibyte string P ending at
PEND. If P doesn't point to a valid multibyte form, return 0. */
#define MULTIBYTE_LENGTH(p, pend)
/* Like MULTIBYTE_LENGTH but don't check the ending address. */
#define MULTIBYTE_LENGTH_NO_CHECK(p)
/* Return the character code of character whose multibyte form is at P
and the length is LEN. */
#define STRING_CHAR(p)
/* Like STRING_CHAR but set LEN to the length of multibyte form. */
#define STRING_CHAR_AND_LENGTH(str, len)
/* Like STRING_CHAR but advance STR to the end of multibyte form. */
#define STRING_CHAR_ADVANCE(str)
/* Fetch the "next" character from Lisp string STRING at byte position
BYTEIDX, character position CHARIDX. Store it into OUTPUT.
All the args must be side-effect-free.
BYTEIDX and CHARIDX must be lvalues;
we increment them past the character fetched. */
#define FETCH_STRING_CHAR_ADVANCE(OUTPUT, STRING, CHARIDX, BYTEIDX)
/* Like FETCH_STRING_CHAR_ADVANCE but assumes STRING is multibyte. */
#define FETCH_STRING_CHAR_ADVANCE_NO_CHECK(OUTPUT, STRING, CHARIDX, BYTEIDX)
/* Like FETCH_STRING_CHAR_ADVANCE but fetch character from the current
buffer. */
#define FETCH_BUFFER_CHAR_ADVANCE(OUTPUT, CHARIDX, BYTEIDX)
/* Like FETCH_BUFFER_CHAR_ADVANCE but assumes the current buffer is
multibyte. */
#define FETCH_BUFFER_CHAR_ADVANCE_NO_CHECK(OUTPUT, CHARIDX, BYTEIDX)
/* Increase the buffer byte position POS_BYTE of the current buffer to
the next character boundary. No range checking of POS_BYTE. */
#define INC_POS(pos_byte)
/* Decrease the buffer byte position POS_BYTE of the current buffer to
the previous character boundary. No range checking of POS_BYTE. */
#define DEC_POS(pos_byte)
/* Increment both CHARPOS and BYTEPOS, each in the appropriate way. */
#define INC_BOTH(charpos, bytepos)
/* Decrement both CHARPOS and BYTEPOS, each in the appropriate way. */
#define DEC_BOTH(charpos, bytepos)
/* Increase the buffer byte position POS_BYTE of the buffer BUF to
the next character boundary. This macro relies on the fact that
*GPT_ADDR and *Z_ADDR are always accessible and the values are
'\0'. No range checking of POS_BYTE. */
#define BUF_INC_POS(buf, pos_byte)
/* Decrease the buffer byte position POS_BYTE of the buffer BUF to
the previous character boundary. No range checking of POS_BYTE. */
#define BUF_DEC_POS(buf, pos_byte)
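To show what CHAR_HEAD_P, BYTES_BY_CHAR_HEAD, INC_POS and DEC_POS are meant to do, here is a minimal sketch written as plain functions over a byte array instead of macros over a buffer. All the lowercase names are made up for this example, and it assumes the text is valid UTF-8 of at most 4 bytes per character. Unlike the macros, inc_pos and dec_pos check the range.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero iff BYTE starts a character, i.e. it is not a 10xxxxxx
   trailing byte.  */
int
char_head_p (unsigned char byte)
{
  return (byte & 0xC0) != 0x80;
}

/* Length implied by the leading byte BYTE.  Returns 1 for a
   trailing or invalid byte so that a scanner always makes
   progress (an assumption of this sketch).  */
size_t
bytes_by_char_head (unsigned char byte)
{
  if ((byte & 0x80) == 0x00) return 1;
  if ((byte & 0xE0) == 0xC0) return 2;
  if ((byte & 0xF0) == 0xE0) return 3;
  if ((byte & 0xF8) == 0xF0) return 4;
  return 1;
}

/* Return the next character boundary after POS_BYTE in TEXT,
   which is NBYTES long.  */
size_t
inc_pos (const unsigned char *text, size_t nbytes, size_t pos_byte)
{
  assert (pos_byte < nbytes);
  pos_byte++;
  while (pos_byte < nbytes && !char_head_p (text[pos_byte]))
    pos_byte++;
  return pos_byte;
}

/* Return the previous character boundary before POS_BYTE in TEXT.  */
size_t
dec_pos (const unsigned char *text, size_t pos_byte)
{
  assert (pos_byte > 0);
  pos_byte--;
  while (pos_byte > 0 && !char_head_p (text[pos_byte]))
    pos_byte--;
  return pos_byte;
}

/* Count the characters in TEXT by counting leading bytes.  */
size_t
count_chars (const unsigned char *text, size_t nbytes)
{
  size_t i, n = 0;

  for (i = 0; i < nbytes; i++)
    if (char_head_p (text[i]))
      n++;
  return n;
}
```

Because every trailing byte has the form 10xxxxxx, moving backward needs no extra state; that is the main reason the UTF-8 format is convenient for DEC_POS.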
#### C level APIs in character.c
/* Vector of all translation tables ever defined.
The ID of a translation table is used to index this vector. */
Lisp_Object Vtranslation_table_vector;
/* A char-table for characters which may invoke auto-filling. */
Lisp_Object Vauto_fill_chars;
/* A char-table. An element is non-nil iff the corresponding
character has a printable glyph. */
Lisp_Object Vprintable_char_table;
/* A char-table. An element is a column-width of the corresponding
character. */
Lisp_Object Vchar_width_table;
/* Translate character C by translation table TABLE. If C is
negative, translate a character specified by CHARSET and CODE. If
no translation is found in TABLE, return the untranslated
character. */
int
translate_char (table, c, charset, code)
Lisp_Object table;
int c, charset, code;
/* Convert the unibyte character C to the corresponding multibyte
character based on the current value of charset_primary. If C
can't be converted, return C. */
int
unibyte_char_to_multibyte (c)
int c;
/* Convert the multibyte character C to the corresponding unibyte
character based on the current value of charset_primary. If
dimension of charset_primary is more than one, return (C &
0xFF). */
int
multibyte_char_to_unibyte (c)
int c;
DEFUN ("characterp", Fcharacterp, Scharacterp, 1, 1, 0,
"Return non-nil if OBJECT is a character.")
DEFUN ("max-char", Fmax_char, Smax_char, 0, 0, 0,
"Return the character of the maximum code.")
DEFUN ("unibyte-char-to-multibyte", Funibyte_char_to_multibyte,
Sunibyte_char_to_multibyte, 1, 1, 0,
"Convert the unibyte character CH to multibyte character.\n\
The multibyte character is a result of decoding CH by\n\
the current primary charset (value of `charset-primary').")
DEFUN ("multibyte-char-to-unibyte", Fmultibyte_char_to_unibyte,
Smultibyte_char_to_unibyte, 1, 1, 0,
"Convert the multibyte character CH to unibyte character.\n\
The unibyte character is a result of encoding CH by\n\
the current primary charset (value of `charset-primary').")
/* same as 21.1 */
DEFUN ("char-bytes", ...)
/* same as 21.1 */
DEFUN ("char-width", ...)
/* same as 21.1 */
int strwidth (str, len)
/* same as 21.1 */
int lisp_string_width (str)
/* same as 21.1 */
DEFUN ("string-width", ...)
/* same as 21.1 */
DEFUN ("chars-in-region", ...)
/* same as 21.1 */
int chars_in_text (ptr, nbytes)
/* same as 21.1 */
int multibyte_chars_in_text (ptr, nbytes)
/* same as 21.1 */
void parse_str_as_multibyte (str, len, nchars, nbytes)
/* same as 21.1 */
int str_as_multibyte (str, len, nbytes, nchars)
/* same as 21.1 */
int str_to_multibyte (str, len, bytes)
/* same as 21.1 */
int str_as_unibyte (str, bytes)
/* same as 21.1 */
DEFUN ("string", ...)
## CHARSET ##
### Definition
A charset is an object that defines a mapping of
"code-point"<->"character" for a group of characters. Most charsets
correspond to an external CCS (coded character set, e.g. Unicode, ISO/IEC
8859-1, JIS X 0208-1983). Some exist only in Emacs
(e.g. eight-bit-control, eight-bit-graphic, emacs (this contains all
characters)).
Each language environment reorders the charset list according to its own
priorities. The ordered charset list is used in these cases:
o selecting a font
o selecting a proper coding-system for encoding
o unibyte<->multibyte conversion
### Implementation detail
A charset has these attributes:
o name
o docstring
o dimension -- 0, 1, or 2.
o chars -- 94, 96, 128, or 256.
o short_name
o long_name
o iso_final_char
o iso_graphic_plane -- `gl' or `gr'.
o iso_revision_number
o emacs_mule_id
The id number of the current mule charsets.
o ascii-compatible-p
Non-nil iff the charset is a superset of `ascii' charset.
o plist
o min-code -- minimum code-point
o max-code -- maximum code-point
o min-char -- minimum character code
o max-char -- maximum character code
o code-offset -- integer or nil
If integer, code-point + `code-offset' == character
(if `chars' is 94 or 96, we use a little bit more
complicated calculation to make the char space compact).
If nil, encode/decode-char-table is used.
o encode-char-table -- char-table or nil
If char-table, (aref ENCODE-CHAR-TABLE C) is a code-point of
C, or -1 if C doesn't belong to the charset.
If nil, `code-offset' is used.
o decode-char-table -- char-table or nil
If char-table, (aref DECODE-CHAR-TABLE CODE) is a character for
CODE, or -1 if CODE is invalid.
If nil, `code-offset' is used.
o charset-map -- vector, string, or nil
If both `code-offset' and `encode/decode-char-table' are nil,
`charset-map' is used to generate `encode/decode-char-table'.
If it is vector, the format is [CODE1 CHAR1 CODE2 CHAR2 ...].
If it is a string, it's the name of a file that contains
mapping data.
In both cases, once it is handled, this value is set to nil.
o id-number
We store a vector of these attribute values in the internal hash table
Vcharset_hash_table (key is a charset symbol) that is not directly
accessed by Elisp. The id-number is an index in the hash table.
In Elisp, a charset is identified by a symbol. In C, a charset is
identified by an id-number.
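The simplest case of the `code-offset' attribute can be sketched as below. This is only an illustration under assumptions: struct charset_sketch and both function names are made up, the charset is treated as a plain contiguous range from min-code to max-code, and the more complicated calculation for 94/96 charsets is not attempted. The sketch returns -1 for a value outside the charset, following the convention described for the char-tables.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced attribute set for a charset that uses
   `code-offset' (not the real attribute vector).  */
struct charset_sketch
{
  int min_code;     /* minimum code-point */
  int max_code;     /* maximum code-point */
  int code_offset;  /* code-point + code_offset == character */
};

/* Return the character for code-point CODE of CS, or -1 if CODE is
   outside the charset.  */
int
decode_char_by_offset (const struct charset_sketch *cs, int code)
{
  assert (cs != NULL);
  if (code < cs->min_code || code > cs->max_code)
    return -1;
  return code + cs->code_offset;
}

/* Return the code-point of character C in CS, or -1 if C does not
   belong to the charset.  */
int
encode_char_by_offset (const struct charset_sketch *cs, int c)
{
  int code;

  assert (cs != NULL);
  code = c - cs->code_offset;
  if (code < cs->min_code || code > cs->max_code)
    return -1;
  return code;
}
```

With a made-up charset whose code-points #xA1..#xFB carry an offset of #x0D60, code-point #xA1 decodes to character #x0E01 and encodes back to #xA1.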
It may be good to have the following enum in charset.h (xxx stands for
an attribute name).
enum charset_attribute_idx
{
charset_xxx,
...
};
#### C level APIs in charset.h
/* Return the value of attribute XXX, YYY, ... of charset whose
attribute vector is ATTR. */
#define CHARSET_ATTR_XXX(attr)
#define CHARSET_ATTR_YYY(attr)
...
/* Return the attribute vector of charset whose symbol is SYMBOL. */
#define CHARSET_SYMBOL_ATTRIBUTE(symbol)
/* Return the value of attribute XXX, YYY, ... of charset whose symbol
is SYMBOL. Use the macro CHARSET_SYMBOL_ATTRIBUTE. */
#define CHARSET_SYMBOL_XXX(symbol)
#define CHARSET_SYMBOL_YYY(symbol)
...
/* Return the attribute vector of CHARSET. */
#define CHARSET_ATTRIBUTE(charset)
/* Return the value of attribute XXX, YYY, ... of CHARSET. Use the
macro CHARSET_ATTRIBUTE. */
#define CHARSET_XXX(charset)
#define CHARSET_YYY(charset)
...
/* Return an index to Vcharset_hash_table of the charset whose
symbol is SYMBOL. */
#define CHARSET_SYMBOL_HASH_IDX(symbol)
/* Nonzero iff OBJ is a valid charset symbol. */
#define CHARSETP(obj)
/* Check if X is a valid charset symbol. If not, signal an error. */
#define CHECK_CHARSET(x, i)
/* Check if X is a valid charset symbol. If valid, set ID to the id
number of the charset. Otherwise, signal an error. */
#define CHECK_CHARSET_GET_ID(x, i, id)
/* Check if X is a valid charset symbol. If valid, set ATTR to the
attr vector of the charset. Otherwise, signal an error. */
#define CHECK_CHARSET_GET_ATTR(x, i, attr)
/* Look up Vcharset_ordered_list and return the first charset that
contains the character C. */
#define CHAR_CHARSET(c)
/* Return a character corresponding to the code-point CODE of CHARSET.
This is somewhat more optimized than calling decode_char directly. */
#define DECODE_CHAR(charset, code)
/* Return the code-point of C in CHARSET.
This is somewhat more optimized than calling encode_char directly. */
#define ENCODE_CHAR(charset, c)
/* Set CHARSET to the highest-priority charset that contains C, and
CODE to the code-point of C in CHARSET. */
#define SPLIT_CHAR(c, charset, code)
#define ISO_MAX_DIMENSION 3
#define ISO_MAX_CHARS 2
#define ISO_MAX_FINAL 0x80 /* only 0x30..0x7F are used */
/* Mapping table from ISO2022's charset (specified by DIMENSION,
CHARS, and FINAL_CHAR) to Emacs' charset ID. Should be accessed by
macro ISO_CHARSET_TABLE (DIMENSION, CHARS, FINAL_CHAR). */
extern int iso_charset_table[ISO_MAX_DIMENSION][ISO_MAX_CHARS][ISO_MAX_FINAL];
/* The charset of type iso2022 that has DIMENSION, CHARS, and FINAL
(final character). */
#define ISO_CHARSET_TABLE(dimension, chars, final) \
iso_charset_table[(dimension) - 1][(chars) == 96][(final)]
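A small sketch of how this table might be filled and queried follows. The table and the macro are copied from above; init_iso_charset_table, register_iso_charset, the use of -1 for "no charset", and the id numbers in the usage note are assumptions made for this example only.

```c
#include <assert.h>

#define ISO_MAX_DIMENSION 3
#define ISO_MAX_CHARS 2
#define ISO_MAX_FINAL 0x80

int iso_charset_table[ISO_MAX_DIMENSION][ISO_MAX_CHARS][ISO_MAX_FINAL];

#define ISO_CHARSET_TABLE(dimension, chars, final) \
  iso_charset_table[(dimension) - 1][(chars) == 96][(final)]

/* Hypothetical helper: mark every slot as "no charset" (-1).  */
void
init_iso_charset_table (void)
{
  int d, c, f;

  for (d = 0; d < ISO_MAX_DIMENSION; d++)
    for (c = 0; c < ISO_MAX_CHARS; c++)
      for (f = 0; f < ISO_MAX_FINAL; f++)
        iso_charset_table[d][c][f] = -1;
}

/* Hypothetical helper: record that the ISO 2022 charset specified
   by DIMENSION, CHARS (94 or 96) and FINAL has the id number ID.  */
void
register_iso_charset (int dimension, int chars, int final, int id)
{
  assert (dimension >= 1 && dimension <= ISO_MAX_DIMENSION);
  assert (chars == 94 || chars == 96);
  assert (final >= 0x30 && final < ISO_MAX_FINAL);
  ISO_CHARSET_TABLE (dimension, chars, final) = id;
}
```

For instance, after registering id 0 for (1, 94, 'B') and id 5 for (2, 94, 'B'), the same final character 'B' yields two different charsets depending on the dimension.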
#### C level APIs in charset.c
/* The primary charset. It is a charset of unibyte characters. */
int charset_primary;
/* Hash table that contains attributes of each charset. Keys are
charset symbols, and values are vectors of charset attributes. */
Lisp_Object Vcharset_hash_table;
/* List of charsets ordered by the priority. */
Lisp_Object Vcharset_ordered_list;
/* List of iso-2022 charsets. */
Lisp_Object Viso2022_charset_list;
/* List of emacs-mule charsets. */
Lisp_Object Vemacs_mule_charset_list;
/* Mapping table from ISO2022's charset (specified by DIMENSION,
CHARS, and FINAL-CHAR) to Emacs' charset. */
int iso_charset_table[ISO_MAX_DIMENSION][ISO_MAX_CHARS][ISO_MAX_FINAL];
DEFUN ("charsetp", Fcharsetp, Scharsetp, 1, 1, 0,
"Return non-nil if and only if OBJECT is a charset.")
/* This function should not return the attribute vector itself, but the
copy. In addition, the copy should contain only `name' to `max-char'. */
DEFUN ("charset-attributes", Fcharset_attributes, Scharset_attributes, 1, 1, 0,
"Return an attribute vector of CHARSET.")
/* This function returns the attribute `plist' of CHARSET. Here, we
don't have to copy it because `plist' is never used by C code. */
DEFUN ("charset-plist", Fcharset_plist, Scharset_plist, 1, 1, 0,
"Return a property list of CHARSET.")
DEFUN ("set-charset-plist", Fset_charset_plist, Sset_charset_plist, 2, 2, 0,
"Set CHARSET's property list to PLIST, and return PLIST.")
/* Parse the code mapping vector VEC and setup char-tables for
encoding and decoding in the charset attribute vector ATTRS. Set
`charset_map' attribute of the charset to Qnil to indicate that the
vector is already processed. Return 0 if char-tables are
successfully parsed, otherwise return -1. */
static int
parse_charset_map_vector (vec, attrs)
Lisp_Object vec, attrs;
/* Parse the contents of code mapping file FILENAME and setup
char-tables for encoding and decoding in the charset attribute
vector ATTRS. Set `charset_map' attribute of the charset to Qnil
to indicate that the file is already read. Return 0 if char-tables are
successfully parsed, otherwise return -1. */
static int
read_charset_map_file (filename, attrs)
Lisp_Object filename, attrs;
/* Define a charset according to the arguments. The Nth argument is
the Nth attribute of the charset (the last attribute `id-number'
is not included). See the docstring of `define-charset' for
details. */
DEFUN ("define-charset-internal", Fdefine_charset_internal,
Sdefine_charset_internal, charset_encode_table, MANY, 0,
"For internal use only.")
/* same as 21.1 */
DEFUN ("get-unused-iso-final-char", ...)
/* same as 21.1 */
DEFUN ("declare-equiv-charset", ...)
/* same as 21.1 */
int string_xstring_p (string)
/* I'm not sure how we can utilize these functions. Also, it is
difficult to implement them efficiently. If they are used only for
finding a proper coding system, we may need different APIs. */
DEFUN ("find-charset-region", ...)
DEFUN ("find-charset-string", ...)
/* Set ATTRS to the attribute vector of CHARSET. If both
`code-offset' and `encode-char-table' attributes are nil and
`charset-map' attribute is non-nil, process `charset-map'. If it
fails, set ATTRS to Qnil. */
#define CHARSET_GET_ATTRIBUTE(charset, attrs)
/* Return a character corresponding to the code-point CODE of
CHARSET. */
int
decode_char (charset, code)
int charset, code;
/* Return the code-point of C in CHARSET. */
int
encode_char (charset, c)
int charset;
int c;
/* defined in mule.el in 21.1 */
DEFUN ("decode-char", ...)
/* defined in mule.el in 21.1 */
DEFUN ("encode-char", ...)
/* same as 21.1. This exists just for backward compatibility. New
code should always use `decode-char'. */
DEFUN ("make-char", ...)
/* Return the first charset in CHARSET_LIST that contains C.
CHARSET_LIST is a list of charsets. If it is nil, use
Vcharset_ordered_list. */
int
char_charset (c, charset_list)
int c;
Lisp_Object charset_list;
/* almost same as 21.1. The difference is that the returned charset
is the first one in Vcharset_ordered_list that contains the
character. */
DEFUN ("split-char", ...)
/* same as above */
DEFUN ("char-charset", ...)
/* same as above */
DEFUN ("charset-after", ...)
/* same as 21.1 */
DEFUN ("iso-charset", ...)
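The lookup done by char_charset (and therefore by `split-char', `char-charset' and `charset-after') can be sketched as below. This is a simplification under assumptions: struct charset_range and first_charset_containing are made-up names, charsets are held in a C array instead of a Lisp list, and a charset is taken to contain every character from min-char to max-char, whereas a real test would have to go through the charset's encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced view of a charset: a name and the range
   of character codes it covers.  */
struct charset_range
{
  const char *name;
  int min_char;
  int max_char;
};

/* ORDER is an array of N indexes into CHARSETS, highest priority
   first.  Return the first index in ORDER whose charset contains
   the character C, or -1 if none does.  */
int
first_charset_containing (const struct charset_range *charsets,
                          const int *order, size_t n, int c)
{
  size_t i;

  assert (charsets != NULL && order != NULL);
  for (i = 0; i < n; i++)
    {
      const struct charset_range *cs = &charsets[order[i]];

      if (c >= cs->min_char && c <= cs->max_char)
        return order[i];
    }
  return -1;
}
```

The point of the sketch is that the answer depends on the order: with the same three charsets, character #xE9 is found in a different charset once the all-inclusive charset is moved to the front of the list.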
## CODING-SYSTEM ##
### Definition
A coding-system is an object that defines a mapping of
"byte-sequence"<->"sequence of charset/code-point pairs". The decoder
of a coding-system first converts a byte-sequence to a sequence of
charset/code-point pairs, then converts the pair sequence to a
character sequence by using the mapping defined for each charset. The
encoder does the opposite.
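The two stages of decoding can be sketched with a toy single-byte coding-system. Everything here is hypothetical and chosen only to keep the example short: the two charset ids, the rule that bytes below #x80 belong to the first charset and all others to the second, and the offset #x0D60 used as the second charset's mapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical charset ids for this sketch only.  */
enum { SKETCH_ASCII, SKETCH_HIGH };

/* Hypothetical mapping of SKETCH_HIGH:
   code-point + SKETCH_HIGH_OFFSET == character.  */
#define SKETCH_HIGH_OFFSET 0x0D60

struct code_pair
{
  int charset;
  int code;
};

/* Stage 1: convert one byte to a charset/code-point pair.  */
struct code_pair
byte_to_pair (unsigned char byte)
{
  struct code_pair pair;

  pair.charset = byte < 0x80 ? SKETCH_ASCII : SKETCH_HIGH;
  pair.code = byte;
  return pair;
}

/* Stage 2: convert a pair to a character code by using the
   mapping of the pair's charset.  */
int
pair_to_char (struct code_pair pair)
{
  if (pair.charset == SKETCH_ASCII)
    return pair.code;
  return pair.code + SKETCH_HIGH_OFFSET;
}

/* Decode NBYTES bytes at SRC into character codes at DST, which
   must have room for NBYTES elements.  Return the number of
   characters produced.  */
size_t
decode_bytes (const unsigned char *src, size_t nbytes, int *dst)
{
  size_t i;

  assert (src != NULL && dst != NULL);
  for (i = 0; i < nbytes; i++)
    dst[i] = pair_to_char (byte_to_pair (src[i]));
  return nbytes;
}
```

Keeping the stages apart means the byte-level format knows nothing about character codes; only the charset mapping in stage 2 does.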
### Implementation detail
A coding-system has these attributes:
o name
o docstring
o mnemonic
o coding-type
o eol-type
o charset-list
-- list of charsets supported in the coding-system
o ascii-compatible-p
-- non-nil if the coding-system encodes ASCII chars as is
o decode-translation-table
o encode-translation-table
o post-read-function
o pre-write-function
o default-char
-- character used on encoding in place of a character that the
coding-system does not support
o composition-p
-- non-nil if the coding-system produces ESC sequence for
compositions
o direction-p
-- non-nil if the coding-system produces ESC sequences for
text direction
o flushing-p
-- non-nil if the coding-system requires flushing out some
bytes after encoding a text.
o plist -- property list containing only informative data not used by
C code
o category
o aliases
o charset-valid-codes
-- used only if coding-type is `charset'.
unibyte string of length 256. If Nth element is nonzero,
the byte code N is valid.
o ccl-decoder
-- used only if coding-type is `ccl'.
o ccl-encoder
-- used only if coding-type is `ccl'.
o ccl-valids-codes
-- used only if coding-type is `ccl'.
o iso-initial-designation
-- used only if coding-type is `iso2022'.
o iso-requested-designation
-- used only if coding-type is `iso2022'.
o iso-flags
-- used only if coding-type is `iso2022'.
We store a vector of these attribute values in an internal hash table
that is not directly accessed by Elisp. In Elisp and C, a coding
system is identified by a symbol.
It may be good to have the following enum in coding.h (xxx stands for
an attribute name).
enum coding_attribute_idx
{
coding_xxx,
...
};
For code conversion, we first initialize a struct coding_context
(in 21.1, this is struct coding_system) from the specified
coding-system.
I'll write the details later.
## CHAR-TABLE ##
### Definition
A char-table is an object that defines a specific property of
characters. A char-table can be looked up by any character. A
char-table can have a parent. In that case, if a value for a
character is nil, the parent char-table is looked up recursively.
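The parent rule can be sketched as below. To keep it short, the sketch uses a flat table of 128 entries instead of the real structure, and uses 0 in place of nil; struct char_table_sketch and char_table_lookup are names made up for this example. The loop does the same thing as the recursive lookup.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_TABLE_SIZE 128  /* tiny on purpose */
#define SKETCH_NIL 0           /* stands in for nil */

/* Hypothetical flat char-table with an optional parent.  */
struct char_table_sketch
{
  int value[SKETCH_TABLE_SIZE];
  const struct char_table_sketch *parent;  /* NULL if no parent */
};

/* Return the value for character C in TABLE.  If the value is nil,
   try the parent, then the parent's parent, and so on.  */
int
char_table_lookup (const struct char_table_sketch *table, int c)
{
  assert (c >= 0 && c < SKETCH_TABLE_SIZE);
  for (; table != NULL; table = table->parent)
    if (table->value[c] != SKETCH_NIL)
      return table->value[c];
  return SKETCH_NIL;
}
```

A child table therefore only needs to store the values that differ from its parent.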
### Implementation detail
A char-table is implemented by a nested vector.
The first level has at most 64 slots. Each slot is for #x10000 characters.
The second level has 16 slots. Each slot is for #x1000 characters.
The third level has 32 slots. Each slot is for #x80 characters.
The fourth level has 128 slots. Each slot is for a specific character.
I'll write more details later.
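Given the slot counts above (64, 16, 32 and 128, i.e. 6 + 4 + 5 + 7 = 22 bits), the index at each level can be derived from a character code by shifting and masking. The following is a sketch of that arithmetic only; struct char_table_index and split_char_code are made-up names, and each third-level slot is taken to cover 128 (#x80) characters, which is what 32 slots per #x1000 characters implies.

```c
#include <assert.h>

/* Hypothetical holder for the four per-level indexes.  */
struct char_table_index
{
  int level1;  /* 0..63,  each slot covers #x10000 characters */
  int level2;  /* 0..15,  each slot covers #x1000 characters */
  int level3;  /* 0..31,  each slot covers #x80 characters */
  int level4;  /* 0..127, each slot is one character */
};

/* Split the 22-bit character code C into the four indexes.  */
struct char_table_index
split_char_code (int c)
{
  struct char_table_index idx;

  assert (c >= 0 && c < (1 << 22));
  idx.level1 = c >> 16;
  idx.level2 = (c >> 12) & 0x0F;
  idx.level3 = (c >> 7) & 0x1F;
  idx.level4 = c & 0x7F;
  return idx;
}
```

For example, the code #x12345 is found at indexes 1, 2, 6 and #x45.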
## FONTSPEC and FONTSET ##
### Definition
A fontspec is an object that defines attributes of a font. For the
moment, we consider these attributes:
FOUNDRY FAMILY WEIGHT SLANT SWIDTH ADSTYLE POINTSIZE REGISTRY
In Emacs, a font is always requested by a fontspec rather than by a
fontname. For convenience, however, a fontspec can be created from a
fontname in a window-system dependent manner.
If a specific attribute is nil, that means that any value is
acceptable.
A fontset is an alist of charset vs. fontspec. To display a character
C, Emacs looks up `charset-ordered-list' to find the first charset
that contains C and also has an entry in a fontset of the selected
face. Then, the fontspec corresponding to the found charset is merged
with font-related attributes of the selected face. Then the merged
fontspec is used to find an actual font.
Attributes of fontspec in a fontset are usually all nil except for
REGISTRY.
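The merge step can be sketched as below, with NULL standing for nil ("any value is acceptable"). The memo does not say which side wins when both the fontspec and the face specify an attribute; this sketch assumes the fontspec from the fontset wins. The enum, struct fontspec_sketch and merge_fontspec are names made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* The fontspec attributes listed above, used as array indexes.  */
enum fontspec_attr
{
  FS_FOUNDRY, FS_FAMILY, FS_WEIGHT, FS_SLANT,
  FS_SWIDTH, FS_ADSTYLE, FS_POINTSIZE, FS_REGISTRY,
  FS_NATTRS
};

/* Hypothetical fontspec: one string per attribute, NULL == nil.  */
struct fontspec_sketch
{
  const char *attr[FS_NATTRS];
};

/* Merge SPEC (from the fontset) with FACE (the font-related
   attributes of the selected face).  Assumption of this sketch:
   a non-nil attribute of SPEC takes precedence over FACE.  */
struct fontspec_sketch
merge_fontspec (const struct fontspec_sketch *spec,
                const struct fontspec_sketch *face)
{
  struct fontspec_sketch merged;
  int i;

  assert (spec != NULL && face != NULL);
  for (i = 0; i < FS_NATTRS; i++)
    merged.attr[i] = spec->attr[i] != NULL ? spec->attr[i] : face->attr[i];
  return merged;
}
```

Since a fontspec in a fontset usually has only REGISTRY set, the merged result normally takes everything else from the face and leaves the remaining attributes nil.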
### Implementation detail
I'll write the details later.
## Local Variables:
## outline-regexp: "##+"
## eval: (hide-sublevels 1)
## End: