current idea

Kenichi Handa Wed, 31 Oct 2001 21:46:34 -0800

I'm sorry for not responding for so long.

For the past few months, I have been working on the design of
a Unicode-based Emacs, and to verify that design, I've also
been working on sample implementation code.  But this
is going at a very slow pace.  :-(


Here, I'll post a memo about my current idea.  It will be
convenient to save this file and read it in outline mode.

---
Ken'ichi HANDA
[EMAIL PROTECTED]

## Design of Unicode-based Emacs  -*- outline -*-
### Ver.1.0  2001.11.1

## CHARACTER ##

### Definition

A character. :-p

#### Implementation detail

A character is represented by a number (a character code), both in
Elisp and C.

The code space is 22-bit.  Thus, Emacs can handle 2^22 different
characters.  Each character belongs to one or more charsets.

A character in a buffer/string is represented by a multibyte sequence
(UTF-8 format) of at most 4 bytes.
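
As a minimal sketch of the multibyte form described above, the
following C code encodes a character code with the standard UTF-8 bit
layout.  The names sketch_char_bytes and sketch_char_string are
hypothetical; they only mirror what CHAR_BYTES and CHAR_STRING might
do.  Standard 4-byte UTF-8 covers codes up to #x1FFFFF, and since this
memo doesn't say how codes beyond that are laid out, the sketch simply
rejects them.

```c
/* Hypothetical sketch, not the actual Emacs macros.  */

/* Return how many bytes code C occupies in standard UTF-8,
   or 0 if C is beyond the 4-byte range (#x1FFFFF).  */
int
sketch_char_bytes (unsigned int c)
{
  if (c < 0x80) return 1;
  if (c < 0x800) return 2;
  if (c < 0x10000) return 3;
  if (c < 0x200000) return 4;
  return 0;
}

/* Store the UTF-8 form of C at P (at least 4 bytes must be
   available) and return its length, or 0 if C is out of range.  */
int
sketch_char_string (unsigned int c, unsigned char *p)
{
  int len = sketch_char_bytes (c);

  switch (len)
    {
    case 1:
      p[0] = (unsigned char) c;
      break;
    case 2:
      p[0] = (unsigned char) (0xC0 | (c >> 6));
      p[1] = (unsigned char) (0x80 | (c & 0x3F));
      break;
    case 3:
      p[0] = (unsigned char) (0xE0 | (c >> 12));
      p[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      p[2] = (unsigned char) (0x80 | (c & 0x3F));
      break;
    case 4:
      p[0] = (unsigned char) (0xF0 | (c >> 18));
      p[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
      p[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      p[3] = (unsigned char) (0x80 | (c & 0x3F));
      break;
    default:
      break;                    /* out of range: nothing stored */
    }
  return len;
}
```

For example, the character code #x3042 is stored as the three bytes
E3 81 82, and an ASCII code is stored as itself in one byte.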

#### C level APIs in character.h

/* Return a Lisp character whose code is C. */
#define make_char(c)

/* Nonzero iff C is an ASCII byte.  */
#define ASCII_BYTE_P(c)

/* Check if Lisp object X is a character or not.  */
#define CHECK_CHARACTER(x, i)

/* Nonzero iff C is a valid character code.  */
#define CHAR_VALID_P(c)

/* Nonzero iff C is an ASCII character.  */
#define ASCII_CHAR_P(c)

/* Nonzero iff C is an ASCII character or a control character.  */
#define ASCII_OR_CONTROL_P(c)

/* Nonzero iff C is a character of code less than 0x100.  */
#define SINGLE_BYTE_CHAR_P(c)

/* Nonzero iff character C has a valid printable glyph.  */
#define CHAR_PRINTABLE_P(c)

/* How many bytes C occupies in a multibyte buffer.  */
#define CHAR_BYTES(c)

/* How many columns C occupies on a screen.  */
#define CHAR_WIDTH(c)

/* Store the multibyte form of the character C at P.  The caller should
   allocate an area of at least MAX_MULTIBYTE_LENGTH bytes at P in
   advance.  Returns the length of the multibyte form.  */
#define CHAR_STRING(c, p)

/* Like CHAR_STRING, but advance P to the end of the multibyte
   form.  */
#define CHAR_STRING_ADVANCE(c, p)

/* Like CHAR_STRING_ADVANCE but it is assured that C is 0..255.  */
#define BYTE_STRING_ADVANCE(c, p)

/* Nonzero iff BYTE starts a character in a multibyte form.  */
#define CHAR_HEAD_P(byte)

/* Nonzero iff BYTE starts a non-ASCII character in a multibyte
   form.  */
#define NON_ASCII_CHAR_HEAD_P(byte)

/* How many bytes a character that starts with BYTE occupies in a
   multibyte form.  */
#define BYTES_BY_CHAR_HEAD(byte)

/* The byte length of the multibyte form at unibyte string P ending at
   PEND.  If P doesn't point to a valid multibyte form, return 0.  */
#define MULTIBYTE_LENGTH(p, pend)

/* Like MULTIBYTE_LENGTH but don't check the ending address.  */
#define MULTIBYTE_LENGTH_NO_CHECK(p)

/* Return the character code of the character whose multibyte form is
   at P.  */
#define STRING_CHAR(p)

/* Like STRING_CHAR but set LEN to the length of multibyte form.  */
#define STRING_CHAR_AND_LENGTH(str, len)

/* Like STRING_CHAR but advance STR to the end of multibyte form.  */
#define STRING_CHAR_ADVANCE(str)

/* Fetch the "next" character from Lisp string STRING at byte position
   BYTEIDX, character position CHARIDX.  Store it into OUTPUT.

   All the args must be side-effect-free.
   BYTEIDX and CHARIDX must be lvalues;
   we increment them past the character fetched.  */
#define FETCH_STRING_CHAR_ADVANCE(OUTPUT, STRING, CHARIDX, BYTEIDX)

/* Like FETCH_STRING_CHAR_ADVANCE but assumes STRING is multibyte.  */
#define FETCH_STRING_CHAR_ADVANCE_NO_CHECK(OUTPUT, STRING, CHARIDX, BYTEIDX)

/* Like FETCH_STRING_CHAR_ADVANCE but fetch character from the current
   buffer.  */
#define FETCH_BUFFER_CHAR_ADVANCE(OUTPUT, CHARIDX, BYTEIDX)

/* Like FETCH_BUFFER_CHAR_ADVANCE but assumes the current buffer is
   multibyte.  */
#define FETCH_BUFFER_CHAR_ADVANCE_NO_CHECK(OUTPUT, CHARIDX, BYTEIDX)

/* Increase the buffer byte position POS_BYTE of the current buffer to
   the next character boundary.  No range checking of POS_BYTE.  */
#define INC_POS(pos_byte)

/* Decrease the buffer byte position POS_BYTE of the current buffer to
   the previous character boundary.  No range checking of POS_BYTE.  */
#define DEC_POS(pos_byte)

/* Increment both CHARPOS and BYTEPOS, each in the appropriate way.  */
#define INC_BOTH(charpos, bytepos)

/* Decrement both CHARPOS and BYTEPOS, each in the appropriate way.  */
#define DEC_BOTH(charpos, bytepos)

/* Increase the buffer byte position POS_BYTE of the buffer BUF to
   the next character boundary.  This macro relies on the fact that
   *GPT_ADDR and *Z_ADDR are always accessible and the values are
   '\0'.  No range checking of POS_BYTE.  */
#define BUF_INC_POS(buf, pos_byte)

/* Decrease the buffer byte position POS_BYTE of the buffer BUF to
   the previous character boundary.  No range checking of POS_BYTE.  */
#define BUF_DEC_POS(buf, pos_byte)
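
The byte-level macros above (CHAR_HEAD_P, BYTES_BY_CHAR_HEAD,
STRING_CHAR_ADVANCE) could be sketched as plain C functions like this.
This is only an illustration assuming standard UTF-8 of at most 4
bytes and well-formed input; the sketch_ names are hypothetical and
the real macros may differ.

```c
/* Hypothetical sketch, assuming well-formed UTF-8 of at most 4 bytes.  */

/* Nonzero iff BYTE starts a character, i.e. it is not a trailing
   byte of the form 10xxxxxx.  */
int
sketch_char_head_p (unsigned char byte)
{
  return (byte & 0xC0) != 0x80;
}

/* How many bytes a character that starts with BYTE occupies.  */
int
sketch_bytes_by_char_head (unsigned char byte)
{
  if (byte < 0x80) return 1;
  if ((byte & 0xE0) == 0xC0) return 2;
  if ((byte & 0xF0) == 0xE0) return 3;
  if ((byte & 0xF8) == 0xF0) return 4;
  return 1;                     /* not a valid head: step over one byte */
}

/* Return the character code whose multibyte form is at *PP, and
   advance *PP to the end of that form.  */
unsigned int
sketch_string_char_advance (const unsigned char **pp)
{
  /* Payload bits of the head byte, indexed by sequence length.  */
  static const unsigned char head_mask[] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
  const unsigned char *p = *pp;
  int len = sketch_bytes_by_char_head (p[0]);
  unsigned int c = p[0] & head_mask[len];
  int i;

  for (i = 1; i < len; i++)
    c = (c << 6) | (p[i] & 0x3F);   /* append 6 bits per trailing byte */
  *pp = p + len;
  return c;
}
```

Calling sketch_string_char_advance repeatedly walks a string one
character at a time, which is the kind of loop that INC_POS and
FETCH_STRING_CHAR_ADVANCE are meant to support.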

#### C level APIs in character.c

/* Vector of translation tables ever defined.
   ID of a translation table is used to index this vector.  */
Lisp_Object Vtranslation_table_vector;

/* A char-table for characters which may invoke auto-filling.  */
Lisp_Object Vauto_fill_chars;

/* A char-table.  An element is non-nil iff the corresponding
   character has a printable glyph.  */
Lisp_Object Vprintable_char_table;

/* A char-table.  An element is a column-width of the corresponding
   character.  */
Lisp_Object Vchar_width_table;


/* Translate character C by translation table TABLE.  If C is
   negative, translate a character specified by CHARSET and CODE.  If
   no translation is found in TABLE, return the untranslated
   character.  */

int
translate_char (table, c, charset, code)
     Lisp_Object table;
     int c, charset, code;

/* Convert the unibyte character C to the corresponding multibyte
   character based on the current value of charset_primary.  If C
   can't be converted, return C.  */

int
unibyte_char_to_multibyte (c)
     int c;

/* Convert the multibyte character C to the corresponding unibyte
   character based on the current value of charset_primary.  If
   dimension of charset_primary is more than one, return (C &
   0xFF).  */

int
multibyte_char_to_unibyte (c)
     int c;

DEFUN ("characterp", Fcharacterp, Scharacterp, 1, 1, 0,
  "Return non-nil if OBJECT is a character.")

DEFUN ("max-char", Fmax_char, Smax_char, 0, 0, 0,
  "Return the character of the maximum code.")

DEFUN ("unibyte-char-to-multibyte", Funibyte_char_to_multibyte,
       Sunibyte_char_to_multibyte, 1, 1, 0,
  "Convert the unibyte character CH to multibyte character.\n\
The multibyte character is a result of decoding CH by\n\
the current primary charset (value of `charset-primary').")

DEFUN ("multibyte-char-to-unibyte", Fmultibyte_char_to_unibyte,
       Smultibyte_char_to_unibyte, 1, 1, 0,
  "Convert the multibyte character CH to unibyte character.\n\
The unibyte character is a result of encoding CH by\n\
the current primary charset (value of `charset-primary').")

/* same as 21.1 */
DEFUN ("char-bytes", ...)

/* same as 21.1 */
DEFUN ("char-width", ...)

/* same as 21.1 */
int strwidth (str, len)

/* same as 21.1 */
int lisp_string_width (str)

/* same as 21.1 */
DEFUN ("string-width", ...)

/* same as 21.1 */
DEFUN ("chars-in-region", ...)

/* same as 21.1 */
int chars_in_text (ptr, nbytes)

/* same as 21.1 */
int multibyte_chars_in_text (ptr, nbytes)

/* same as 21.1 */
void parse_str_as_multibyte (str, len, nchars, nbytes)

/* same as 21.1 */
int str_as_multibyte (str, len, nbytes, nchars)

/* same as 21.1 */
int str_to_multibyte (str, len, bytes)

/* same as 21.1 */
int str_as_unibyte (str, bytes)

/* same as 21.1 */
DEFUN ("string", ...)


## CHARSET ##

### Definition

A charset is an object that defines a mapping of
"code-point"<->"character" for a group of characters.  Most charsets
correspond to an external CCS (coded character set, e.g. Unicode, ISO/IEC
8859-1, JIS X 0208:1983).  Some exist only in Emacs
(e.g. eight-bit-control, eight-bit-graphic, emacs (this contains all
characters)).

Each language environment reorders the charset list according to its own
priorities.  The ordered charset list is used in these cases:
  o selecting a font
  o selecting a proper coding-system for encoding
  o unibyte<->multibyte conversion

### Implementation detail

A charset has these attributes:

o name
o docstring
o dimension -- 0, 1, or 2.
o chars -- 94, 96, 128, or 256.
o short_name
o long_name
o iso_final_char
o iso_graphic_plane -- `gl' or `gr'.
o iso_revision_number
o emacs_mule_id
        The id number of the current mule charsets.
o ascii-compatible-p
        Non-nil iff the charset is a superset of `ascii' charset.
o plist
o min-code -- minimum code-point
o max-code -- maximum code-point
o min-char -- minimum character code
o max-char -- maximum character code
o code-offset -- integer or nil
        If integer, code-point + `code-offset' == character
           (if `chars' is 94 or 96, we use a little bit more
            complicated calculation to make the char space compact).
        If nil, encode/decode-char-table is used.
o encode-char-table -- char-table or nil
        If char-table, (aref ENCODE-CHAR-TABLE C) is a code-point of
        C, or -1 if C doesn't belong to the charset.
        If nil, `code-offset' is used.
o decode-char-table -- char-table or nil
        If char-table, (aref DECODE-CHAR-TABLE CODE) is a character for
        CODE, or -1 if CODE is invalid.
        If nil, `code-offset' is used.
o charset-map -- vector, string, or nil
        If both `code-offset' and `encode/decode-char-table' are nil,
        `charset-map' is used to generate `encode/decode-char-table'.
        If it is a vector, the format is [CODE1 CHAR1 CODE2 CHAR2 ...].
        If it is a string, it's the name of a file that contains
        mapping data.
        In both cases, once it is handled, this value is set to nil.
o id-number

We store a vector of these attribute values in the internal hash table
Vcharset_hash_table (key is a charset symbol) that is not directly
accessed by Elisp.  The id-number is an index in the hash table.

In Elisp, a charset is identified by a symbol.  In C, a charset is
identified by an id-number.
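
To illustrate how `code-offset' and the decode table could work
together, here is a minimal sketch in C.  The struct and function
names are hypothetical, a plain C array stands in for the char-tables,
and the more complicated calculation for 94/96 charsets is not
modeled.  For brevity the encoder scans the decode table instead of
keeping a separate encode-char-table.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a charset attribute vector.  */
struct sketch_charset
{
  int min_code, max_code;       /* valid code-point range */
  int use_offset;               /* nonzero: `code_offset' applies */
  int code_offset;              /* character == code-point + code_offset */
  const int *decode_table;      /* else indexed by CODE - min_code;
                                   -1 marks an invalid code-point */
};

/* Return the character for code-point CODE, or -1 if CODE is invalid.  */
int
sketch_decode_char (const struct sketch_charset *cs, int code)
{
  if (code < cs->min_code || code > cs->max_code)
    return -1;
  if (cs->use_offset)
    return code + cs->code_offset;
  return cs->decode_table[code - cs->min_code];
}

/* Return the code-point of character C, or -1 if C doesn't belong to
   the charset.  */
int
sketch_encode_char (const struct sketch_charset *cs, int c)
{
  int code;

  if (c < 0)
    return -1;
  if (cs->use_offset)
    {
      code = c - cs->code_offset;
      return (code >= cs->min_code && code <= cs->max_code) ? code : -1;
    }
  for (code = cs->min_code; code <= cs->max_code; code++)
    if (cs->decode_table[code - cs->min_code] == c)
      return code;
  return -1;
}
```

A charset whose code-points map linearly onto character codes needs
only the offset; a charset with an irregular mapping needs the table.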

It may be good to have the following enum in charset.h (xxx stands for
attribute name).

enum charset_attribute_idx
{
  charset_xxx,
  ...
};

#### C level APIs in charset.h

/* Return the value of attribute XXX, YYY, ... of charset whose
   attribute vector is ATTR.  */
#define CHARSET_ATTR_XXX(attr)
#define CHARSET_ATTR_YYY(attr)
...

/* Return the attribute vector of charset whose symbol is SYMBOL.  */
#define CHARSET_SYMBOL_ATTRIBUTE(symbol)

/* Return the value of attribute XXX, YYY, ... of charset whose symbol
   is SYMBOL.  Use the macro CHARSET_SYMBOL_ATTRIBUTE.  */
#define CHARSET_SYMBOL_XXX(symbol)
#define CHARSET_SYMBOL_YYY(symbol)
...

/* Return the attribute vector of CHARSET.  */
#define CHARSET_ATTRIBUTE(charset)

/* Return the value of attribute XXX, YYY, ... of CHARSET.  Use the
   macro CHARSET_ATTRIBUTE.  */
#define CHARSET_XXX(charset)
#define CHARSET_YYY(charset)
...

/* Return an index to Vcharset_hash_table of the charset whose
   symbol is SYMBOL.  */
#define CHARSET_SYMBOL_HASH_IDX(symbol)

/* Nonzero iff OBJ is a valid charset symbol.  */
#define CHARSETP(obj)

/* Check if X is a valid charset symbol.  If not, signal an error.  */
#define CHECK_CHARSET(x, i)

/* Check if X is a valid charset symbol.  If valid, set ID to the id
   number of the charset.  Otherwise, signal an error. */
#define CHECK_CHARSET_GET_ID(x, i, id)

/* Check if X is a valid charset symbol.  If valid, set ATTR to the
   attr vector of the charset.  Otherwise, signal an error. */
#define CHECK_CHARSET_GET_ATTR(x, i, attr)

/* Look up Vcharset_ordered_list and return the first charset that
   contains the character C.  */
#define CHAR_CHARSET(c)

/* Return a character corresponding to the code-point CODE of CHARSET.
   This is more optimized than calling decode_char directly.  */
#define DECODE_CHAR(charset, code)

/* Return a code-point of C in CHARSET.
   This is more optimized than calling encode_char directly.  */
#define ENCODE_CHAR(charset, c)

/* Set CHARSET to the highest-priority charset of C, and CODE to the
   code-point of C in CHARSET.  */
#define SPLIT_CHAR(c, charset, code)

#define ISO_MAX_DIMENSION 3
#define ISO_MAX_CHARS 2
#define ISO_MAX_FINAL 0x80      /* only 0x30..0x7F are used */

/* Mapping table from ISO2022's charset (specified by DIMENSION,
   CHARS, and FINAL_CHAR) to Emacs' charset ID.  Should be accessed by
   macro ISO_CHARSET_TABLE (DIMENSION, CHARS, FINAL_CHAR).  */
extern int iso_charset_table[ISO_MAX_DIMENSION][ISO_MAX_CHARS][ISO_MAX_FINAL];

/* A charset of type iso2022 that has DIMENSION, CHARS, and FINAL
   (final character).  */
#define ISO_CHARSET_TABLE(dimension, chars, final)      \
  iso_charset_table[(dimension) - 1][(chars) == 96][(final)]
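
As a small illustration of how this table might be filled and
queried, the following sketch registers three well-known ISO 2022
sets.  The charset IDs, the function name, and the use of -1 for an
unused slot are assumptions made for this example only.

```c
#define ISO_MAX_DIMENSION 3
#define ISO_MAX_CHARS 2
#define ISO_MAX_FINAL 0x80

int iso_charset_table[ISO_MAX_DIMENSION][ISO_MAX_CHARS][ISO_MAX_FINAL];

#define ISO_CHARSET_TABLE(dimension, chars, final)      \
  iso_charset_table[(dimension) - 1][(chars) == 96][(final)]

/* Hypothetical charset IDs, for illustration only.  */
enum
{
  SKETCH_ASCII = 0,
  SKETCH_LATIN1_RIGHT = 1,
  SKETCH_JISX0208 = 2
};

/* Mark every slot as unused (-1, an assumption of this sketch), then
   register three sets by dimension, chars, and final character.  */
void
sketch_init_iso_charset_table (void)
{
  int d, c, f;

  for (d = 0; d < ISO_MAX_DIMENSION; d++)
    for (c = 0; c < ISO_MAX_CHARS; c++)
      for (f = 0; f < ISO_MAX_FINAL; f++)
        iso_charset_table[d][c][f] = -1;

  ISO_CHARSET_TABLE (1, 94, 'B') = SKETCH_ASCII;         /* ASCII */
  ISO_CHARSET_TABLE (1, 96, 'A') = SKETCH_LATIN1_RIGHT;  /* ISO 8859-1 right half */
  ISO_CHARSET_TABLE (2, 94, 'B') = SKETCH_JISX0208;      /* JIS X 0208-1983 */
}
```

Note that the same final character 'B' selects different charsets
depending on the dimension, which is why all three values are needed
as indexes.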


#### C level APIs in charset.c

/* The primary charset.  It is a charset of unibyte characters.  */
int charset_primary;

/* Hash table that contains attributes of each charset.  Keys are
   charset symbols, and values are vectors of charset attributes.  */
Lisp_Object Vcharset_hash_table;

/* List of charsets ordered by the priority.  */
Lisp_Object Vcharset_ordered_list;

/* List of iso-2022 charsets.  */
Lisp_Object Viso2022_charset_list;

/* List of emacs-mule charsets.  */
Lisp_Object Vemacs_mule_charset_list;

/* Mapping table from ISO2022's charset (specified by DIMENSION,
   CHARS, and FINAL-CHAR) to Emacs' charset.  */
int iso_charset_table[ISO_MAX_DIMENSION][ISO_MAX_CHARS][ISO_MAX_FINAL];


DEFUN ("charsetp", Fcharsetp, Scharsetp, 1, 1, 0,
       "Return non-nil if and only if OBJECT is a charset.")

/* This function should not return the attribute vector itself, but the
   copy.   In addition, the copy should contain only `name' to `max-char'.  */
DEFUN ("charset-attributes", Fcharset_attributes, Scharset_attributes, 1, 1, 0,
       "Return an attribute vector of CHARSET.")

/* This function returns the attribute `plist' of CHARSET.  Here, we
   don't have to copy it because `plist' is never used by C code.  */
DEFUN ("charset-plist", Fcharset_plist, Scharset_plist, 1, 1, 0,
       "Return a property list of CHARSET.")

DEFUN ("set-charset-plist", Fset_charset_plist, Sset_charset_plist, 2, 2, 0,
       "Set CHARSET's property list to PLIST, and return PLIST.")

/* Parse the code mapping vector VEC and setup char-tables for
   encoding and decoding in the charset attribute vector ATTRS.  Set
   `charset_map' attribute of the charset to Qnil to indicate that the
   vector is already processed.  Return 0 if char-tables are
   successfully parsed, otherwise return -1.  */

static int
parse_charset_map_vector (vec, attrs)
     Lisp_Object vec, attrs;

/* Parse the contents of code mapping file FILENAME and setup
   char-tables for encoding and decoding in the charset attribute
   vector ATTRS.  Set `charset_map' attribute of the charset to Qnil
   to indicate that the file is already read.  Return 0 if char-tables are
   successfully parsed, otherwise return -1.  */

static int
read_charset_map_file (filename, attrs)
     Lisp_Object filename, attrs;

/* Define a charset according to the arguments.  The Nth argument is
   the Nth attribute of the charset (the last attribute `charset-id'
   is not included).  See the docstring of `define-charset' for the
   detail.  */

DEFUN ("define-charset-internal", Fdefine_charset_internal,
       Sdefine_charset_internal, charset_encode_table, MANY, 0,
  "For internal use only.")

/* same as 21.1 */ 
DEFUN ("get-unused-iso-final-char", ...)

/* same as 21.1 */ 
DEFUN ("declare-equiv-charset", ...)

/* same as 21.1 */ 
int string_xstring_p (string)

/* I'm not sure how we can utilize these functions.  And, it is
   difficult to implement them efficiently.  If these are used only for
   finding a proper coding system, we may need different APIs.  */
DEFUN ("find-charset-region", ...)
DEFUN ("find-charset-string", ...)


/* Set ATTRS to the attribute vector of CHARSET.  If both
   `code-offset' and `encode-char-table' attributes are nil and
   `charset-map' attribute is non-nil, process `charset-map'.  If it
   fails, set ATTRS to Qnil.  */
#define CHARSET_GET_ATTRIBUTE(charset, attrs)

/* Return a character corresponding to the code-point CODE of
   CHARSET.  */
int
decode_char (charset, code)
     int charset, code;

/* Return a code-point of C in CHARSET.  */

int
encode_char (charset, c)
     int charset;
     int c;

/* defined in mule.el in 21.1 */
DEFUN ("decode-char", ...)

/* defined in mule.el in 21.1 */
DEFUN ("encode-char", ...)

/* same as 21.1.  This exists just for backward compatibility.  New
   code should always use `decode-char'.  */
DEFUN ("make-char", ...)

/* Return the first charset in CHARSET_LIST that contains C.
   CHARSET_LIST is a list of charsets.  If it is nil, use
   Vcharset_ordered_list.  */
int
char_charset (c, charset_list)
     int c;
     Lisp_Object charset_list;
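
The priority lookup done by char_charset could be sketched as below.
This is only an illustration: the names are hypothetical, an array
replaces the Lisp list, and membership is approximated by the
`min-char'..`max-char' range (a real charset may not contain every
character in that range).

```c
#include <stddef.h>

/* Hypothetical, simplified charset entry: an id plus the range of
   character codes it covers.  */
struct sketch_charset_range
{
  int id;
  int min_char, max_char;
};

/* Return the id of the first charset in LIST (N entries, ordered by
   priority) whose range contains C, or -1 if there is none.  */
int
sketch_char_charset (int c, const struct sketch_charset_range *list,
                     size_t n)
{
  size_t i;

  for (i = 0; i < n; i++)
    if (c >= list[i].min_char && c <= list[i].max_char)
      return list[i].id;
  return -1;
}
```

Because the first match wins, reordering the list changes the answer
for characters that belong to more than one charset; this is how a
language environment can influence the result.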

/* almost same as 21.1.  The difference is that the returned charset
   is the first one in Vcharset_ordered_list that contains the
   character.  */
DEFUN ("split-char", ...)

/* same as above */
DEFUN ("char-charset", ...)

/* same as above */
DEFUN ("charset-after", ...)

/* same as 21.1 */
DEFUN ("iso-charset", ...)



## CODING-SYSTEM ##

### Definition

A coding-system is an object that defines a mapping of
"byte-sequence"<->"sequence of charset/code-point pairs".  The decoder
of a coding-system first converts a byte-sequence to a sequence of
charset/code-point pairs, then converts the pair sequence to a
character sequence by using the mapping defined for each charset.  An
encoder does the opposite.
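
The two decoding stages could be sketched as follows for a
hypothetical single-byte coding-system with two charsets (ASCII and
the right half of Latin-1).  All names here are assumptions made for
illustration; this is not the planned decoder.

```c
#include <stddef.h>

/* Hypothetical charset ids for this sketch.  */
enum { SKETCH_CS_ASCII, SKETCH_CS_LATIN1_RIGHT };

struct sketch_pair
{
  int charset;                  /* which charset the code belongs to */
  int code;                     /* code-point within that charset */
};

/* Stage 1: convert N bytes at SRC to charset/code-point pairs in OUT
   (which must have room for N pairs).  Return the number of pairs.  */
size_t
sketch_bytes_to_pairs (const unsigned char *src, size_t n,
                       struct sketch_pair *out)
{
  size_t i;

  for (i = 0; i < n; i++)
    {
      if (src[i] < 0x80)
        {
          out[i].charset = SKETCH_CS_ASCII;
          out[i].code = src[i];
        }
      else
        {
          out[i].charset = SKETCH_CS_LATIN1_RIGHT;
          out[i].code = src[i] & 0x7F;  /* 7-bit code-point in the set */
        }
    }
  return n;
}

/* Stage 2: convert one pair to a character code using the mapping
   of its charset (here a simple offset for each charset).  */
int
sketch_pair_to_char (struct sketch_pair pair)
{
  return pair.charset == SKETCH_CS_ASCII ? pair.code : pair.code + 0x80;
}
```

An encoder would run the same two stages in reverse: character to
charset/code-point pair, then pair to bytes.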

### Implementation detail

A coding-system has these attributes:

o name
o docstring
o mnemonic
o coding-type
o eol-type
o charset-list
        -- list of charsets supported in the coding-system
o ascii-compatible-p
        -- non-nil if the coding-system encodes ASCII chars as is
o decode-translation-table
o encode-translation-table
o post-read-function
o pre-write-function
o default-char
        -- a character to be used in place of characters that the
           coding-system cannot encode
o composition-p
        -- non-nil if the coding-system produces ESC sequences for
           compositions
o direction-p
        -- non-nil if the coding-system produces ESC sequences for
           text direction
o flushing-p
        -- non-nil if the coding-system requires flushing out some
           bytes after encoding a text.
o plist -- property list containing only informative data not used by
           C code
o category
o aliases
o charset-valid-codes
        -- used only if coding-type is `charset'.
           unibyte string of length 256.  If Nth element is nonzero,
           the byte code N is valid.
o ccl-decoder
        -- used only if coding-type is `ccl'.
o ccl-encoder
        -- used only if coding-type is `ccl'.
o ccl-valid-codes
        -- used only if coding-type is `ccl'.
o iso-initial-designation
        -- used only if coding-type is `iso2022'.
o iso-requested-designation
        -- used only if coding-type is `iso2022'.
o iso-flags
        -- used only if coding-type is `iso2022'.

We store a vector of these attribute values in an internal hash table
that is not directly accessed by Elisp.  In Elisp and C, a coding
system is identified by a symbol.

It may be good to have the following enum in coding.h (xxx stands for
attribute name).

enum coding_attribute_idx
{
  coding_xxx,
  ...
};

For code conversion, we first initialize a struct coding_context
(in 21.1, this is struct coding_system) from the specified
coding-system.

I'll write the detail later.


## CHAR-TABLE ##

### Definition

A char-table is an object that defines a specific property of
characters.  A char-table can be looked up by any character.  A
char-table can have a parent.  In that case, if a value for a
character is nil, the parent char-table is looked up recursively.
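
The parent fallback could be sketched like this.  The sketch uses a
small flat array instead of the nested vector described below, a NULL
pointer plays the role of nil, and all names are hypothetical.

```c
#include <stddef.h>

#define SKETCH_TABLE_SIZE 128   /* small flat table, for illustration */

struct sketch_char_table
{
  const struct sketch_char_table *parent;   /* NULL if there is no parent */
  const char *value[SKETCH_TABLE_SIZE];     /* NULL plays the role of nil */
};

/* Return the value for character C in TABLE.  If it is NULL there,
   look it up in the parent, the parent's parent, and so on.  */
const char *
sketch_char_table_ref (const struct sketch_char_table *table, int c)
{
  if (c < 0 || c >= SKETCH_TABLE_SIZE)
    return NULL;
  for (; table != NULL; table = table->parent)
    if (table->value[c] != NULL)
      return table->value[c];
  return NULL;
}
```

A child table therefore needs to store only the entries that differ
from its parent.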

### Implementation detail

A char-table is implemented by a nested vector.
The first level has at most 64 slots.  Each slot is for #x10000 characters.
The second level has 16 slots.  Each slot is for #x1000 characters.
The third level has 32 slots.  Each slot is for #x80 characters.
The fourth level has 128 slots.  Each slot is for a specific character.
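
Since 64 x 16 x 32 x 128 = 2^22, the four slot indexes can be taken
directly from bit fields of the 22-bit character code.  A minimal
sketch, with hypothetical names:

```c
/* Hypothetical sketch: the slot index at each of the four levels.  */
struct sketch_char_table_index
{
  int i1;                       /* first level,  0..63  */
  int i2;                       /* second level, 0..15  */
  int i3;                       /* third level,  0..31  */
  int i4;                       /* fourth level, 0..127 */
};

/* Split the 22-bit character code C into the four slot indexes.  */
struct sketch_char_table_index
sketch_split_char (unsigned int c)
{
  struct sketch_char_table_index idx;

  idx.i1 = (int) ((c >> 16) & 0x3F);    /* bits 16..21 */
  idx.i2 = (int) ((c >> 12) & 0x0F);    /* bits 12..15 */
  idx.i3 = (int) ((c >> 7) & 0x1F);     /* bits 7..11 */
  idx.i4 = (int) (c & 0x7F);            /* bits 0..6 */
  return idx;
}
```

For example, the code #x3042 gives the indexes 0, 3, 0, and #x42.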

I'll write more detail later.


## FONTSPEC and FONTSET ##

### Definition

A fontspec is an object that defines attributes of a font.  For the
moment, we consider these attributes:
    FOUNDRY FAMILY WEIGHT SLANT SWIDTH ADSTYLE POINTSIZE REGISTRY 

Emacs always uses a fontspec, not a fontname, when asking for a
font.  But, for convenience, a fontspec can be created from a fontname
in a window-system dependent manner.

If a specific attribute is nil, that means that any value is
acceptable.

A fontset is an alist of charset vs. fontspec.  To display a character
C, Emacs looks up `charset-ordered-list' to find the first charset
that contains C and also has an entry in a fontset of the selected
face.  Then, the fontspec corresponding to the found charset is merged
with font-related attributes of the selected face.   Then the merged
fontspec is used to find an actual font.

Attributes of fontspec in a fontset are usually all nil except for
REGISTRY.
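
The merging step could look like the following sketch.  The memo does
not say which side wins when both have a value, so this sketch assumes
that a non-nil attribute of the fontset's fontspec takes precedence
and the face fills in the rest.  The struct and function names are
hypothetical; NULL (or 0 for POINTSIZE) stands for nil.

```c
#include <stddef.h>

/* Hypothetical fontspec: NULL (or 0 for pointsize) means "any value
   is acceptable".  */
struct sketch_fontspec
{
  const char *foundry, *family, *weight, *slant;
  const char *swidth, *adstyle, *registry;
  int pointsize;
};

/* Use A if it is specified, else fall back to B.  */
#define SKETCH_PICK(a, b) ((a) != NULL ? (a) : (b))

/* Merge SPEC (from the fontset) with FACE (font-related attributes
   of the selected face).  Assumption: SPEC wins where specified.  */
struct sketch_fontspec
sketch_merge_fontspec (const struct sketch_fontspec *spec,
                       const struct sketch_fontspec *face)
{
  struct sketch_fontspec merged;

  merged.foundry = SKETCH_PICK (spec->foundry, face->foundry);
  merged.family = SKETCH_PICK (spec->family, face->family);
  merged.weight = SKETCH_PICK (spec->weight, face->weight);
  merged.slant = SKETCH_PICK (spec->slant, face->slant);
  merged.swidth = SKETCH_PICK (spec->swidth, face->swidth);
  merged.adstyle = SKETCH_PICK (spec->adstyle, face->adstyle);
  merged.registry = SKETCH_PICK (spec->registry, face->registry);
  merged.pointsize = spec->pointsize != 0 ? spec->pointsize
                                          : face->pointsize;
  return merged;
}
```

With a fontset entry that sets only REGISTRY, the merged fontspec
takes its registry from the fontset and everything else from the face.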

### Implementation detail

I'll write the detail later.


## Local Variables:
## outline-regexp: "##+"
## eval: (hide-sublevels 1)
## End:
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
