Greetings, and thanks so much!  I think we are converging...

1) The proposal under consideration is due to Carl: that gcl's lisp character still be governed by char-code-limit==256, i.e. equivalent to a uint8_t, with aref/aset working the same way for all types of arrays.  This lisp character has no correspondence to a unicode character other than the overlap in the ascii range.  On top of these primitives, gcl would then provide, in some fashion, (unichar s i), etc., to get unicode codepoints out of utf8 encoded strings.  These are not random access, but can be cached.  So (code-char #xa0) != no-break-space.
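Purely as an illustration of what is meant (the name and interface are just placeholders, not actual gcl code), such an accessor could look roughly like this in portable CL, assuming the string holds raw utf8 octets as characters with char-code < 256 and the encoding is well formed:

  ;; Return the codepoint whose utf8 encoding starts at octet index I of S,
  ;; and the index of the octet following it, as two values.
  (defun unichar (s i)
    (flet ((cont (j) (ldb (byte 6 0) (char-code (aref s j)))))  ; low 6 bits of a continuation octet
      (let ((b0 (char-code (aref s i))))
        (cond ((< b0 #x80) (values b0 (1+ i)))                          ; 1 octet (ascii)
              ((< b0 #xe0) (values (logior (ash (ldb (byte 5 0) b0) 6)
                                           (cont (+ i 1)))
                                   (+ i 2)))                            ; 2 octets
              ((< b0 #xf0) (values (logior (ash (ldb (byte 4 0) b0) 12)
                                           (ash (cont (+ i 1)) 6)
                                           (cont (+ i 2)))
                                   (+ i 3)))                            ; 3 octets
              (t           (values (logior (ash (ldb (byte 3 0) b0) 18)
                                           (ash (cont (+ i 1)) 12)
                                           (ash (cont (+ i 2)) 6)
                                           (cont (+ i 3)))
                                   (+ i 4)))))))                        ; 4 octets

With a string containing the two octets #xc2 #xa0, (unichar s 0) would return #xa0 (no-break-space), while (aref s 0) remains just the octet character with code #xc2.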
2) To the extent that anyone constructs unicode strings from individual codepoints, the use of these routines will ensure that the utf8 output is correct. An improperly formatted utf8-encoding will serve as valid input into aref, but not #'unichar, or whatever we call it. 3) There appear to be two meanings of alphabetic in the Hyperspec, which are not the same. The first is a graphic character with case, the second is a designator for constituent characters which do not separate tokens. alphabetic n., adj. 1. adj. (of a character) being one of the standard characters A through Z or a through z, or being any implementation-defined character that has case, or being some other graphic character defined by the implementation to be alphabetic[1]. 2. a. n. one of several possible constituent traits of a character. For details, see Section 2.1.4.1 (Constituent Characters) and Section 2.2 (Reader Algorithm). b. adj. (of a character) being a character that has syntax type constituent in the current readtable and that has the constituent trait alphabetic[2a]. See Figure 2-8. Defining octets >=128 as alpha-char-p means they are not used as token separators, at least in gcl's default reader. This makes sense for the input of a pair of octets representing no-break-space, as presumably this is a constituent character too. It does not make sense if one assumes that there must be distinct octets for the pair in no-break-space that correspond to the opposite case. So there appears to be a bit of an ambiguity here. Are there any non-constituent unicode codepoints in the non-ascii range? (Assuming yes, but probably not important.) 4) I think a dominant consideration here are the forms of most probable input and output. Files, terminals, even cut-paste from emacs buffers, all transfer valid utf8 encoded byte sequences into GCL which then intern, print, and string-compare correctly. Asking the unusual user who might want to set strings directly via their unicode codepoints to use a setf on unichar instead of aref, or better yet a unicode-char which outputs a string for concatenation, seems a small price to pay. Just thoughts... Take care, Raymond Toy <[email protected]> writes: >>>>>> "Matt" == Matt Kaufmann <[email protected]> writes: > > Matt> I saw your question and was curious, so I looked into it a bit: > >>> To your knowledge, is there any objection to defining alpha-char-p as > >>> including code-char's >= 128? > > Matt> I see that SBCL 1.2.2 is OK with that, for example: > > Matt> * (code-char 232) > > Matt> #\LATIN_SMALL_LETTER_E_WITH_GRAVE > Matt> * (alpha-char-p (code-char 232)) > > Matt> T > Matt> * > > Matt> In fact, that alpha-char-p call also returns T in (versions of) > Matt> Allegro CL, CCL, CLISP, CMU CL, LispWorks, and SBCL. > > Try (code-char #xa0). This is the unicode character > no-break-space. This has no case and would presumably not be > alpha-char-p. I think there are quite a few characters that would not > be (from cmucl): > > (count nil (loop for k from 128 upto 255 collect (alpha-char-p (code-char > k)))) > 63 > > > I think there is some confusion here, at least for me. If gcl uses > 8-bit code-units and utf-8 strings, what exactly is (coode-char 232)? > You can store that into a utf-8 string but it won't be > #\latin_small_letter_e_with_grave because that would be encoded as two > octets in a utf-8 string: 195 168. > > I think it's perfectly legal for gcl to say everything above 128 is > alpha-char-p. 
Just thoughts...

Take care,

Raymond Toy <[email protected]> writes:

>>>>>> "Matt" == Matt Kaufmann <[email protected]> writes:
>
>     Matt> I saw your question and was curious, so I looked into it a bit:
>
>     >> To your knowledge, is there any objection to defining alpha-char-p as
>     >> including code-char's >= 128?
>
>     Matt> I see that SBCL 1.2.2 is OK with that, for example:
>
>     Matt> * (code-char 232)
>     Matt> #\LATIN_SMALL_LETTER_E_WITH_GRAVE
>     Matt> * (alpha-char-p (code-char 232))
>     Matt> T
>     Matt> *
>
>     Matt> In fact, that alpha-char-p call also returns T in (versions of)
>     Matt> Allegro CL, CCL, CLISP, CMU CL, LispWorks, and SBCL.
>
> Try (code-char #xa0).  This is the unicode character no-break-space.
> This has no case and would presumably not be alpha-char-p.  I think
> there are quite a few characters that would not be (from cmucl):
>
> (count nil (loop for k from 128 upto 255 collect (alpha-char-p (code-char k))))
> 63
>
> I think there is some confusion here, at least for me.  If gcl uses
> 8-bit code-units and utf-8 strings, what exactly is (code-char 232)?
> You can store that into a utf-8 string but it won't be
> #\latin_small_letter_e_with_grave because that would be encoded as two
> octets in a utf-8 string: 195 168.
>
> I think it's perfectly legal for gcl to say everything above 128 is
> alpha-char-p.  I think, however, that people will just get confused
> that no such characters can be stored into a string and processed
> correctly as utf-8 without a bit of work.
>
> But perhaps this is just how 8-bit chars and utf-8 strings have
> to work.
>
> I think 16-bit chars with utf-16 or 32-bit chars with utf-32 are far
> easier to explain.
>
> K.I.S.S?
>
> --
> Ray

-- 
Camm Maguire                                            [email protected]
==========================================================================
"The earth is but one country, and mankind its citizens."  -- Baha'u'llah

_______________________________________________
Gcl-devel mailing list
[email protected]
https://lists.gnu.org/mailman/listinfo/gcl-devel
