Greetings, and thanks so much!  I think we are converging...

1) The proposal under consideration is due to Carl: that gcl's lisp character still be governed by char-code-limit==256, i.e. equivalent to a uint8_t, with aref/aset working the same way for all types of arrays.  This lisp character has no correspondence to a unicode character other than the overlap in the ascii range.  On top of these primitives, gcl would then provide, in some fashion, (unichar s i), etc., to get unicode codepoints out of utf8 encoded strings.  These are not random access, but can be cached.  So (code-char #xa0) != no-break-space.
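Purely as an illustration of what is meant (the name and interface are just placeholders, not actual gcl code), such an accessor could look roughly like this in portable CL, assuming the string holds raw utf8 octets as characters with char-code < 256 and the encoding is well formed:

  ;; Return the codepoint whose utf8 encoding starts at octet index I of S,
  ;; and the index of the octet following it, as two values.
  (defun unichar (s i)
    (flet ((cont (j) (ldb (byte 6 0) (char-code (aref s j)))))  ; low 6 bits of a continuation octet
      (let ((b0 (char-code (aref s i))))
        (cond ((< b0 #x80) (values b0 (1+ i)))                          ; 1 octet (ascii)
              ((< b0 #xe0) (values (logior (ash (ldb (byte 5 0) b0) 6)
                                           (cont (+ i 1)))
                                   (+ i 2)))                            ; 2 octets
              ((< b0 #xf0) (values (logior (ash (ldb (byte 4 0) b0) 12)
                                           (ash (cont (+ i 1)) 6)
                                           (cont (+ i 2)))
                                   (+ i 3)))                            ; 3 octets
              (t           (values (logior (ash (ldb (byte 3 0) b0) 18)
                                           (ash (cont (+ i 1)) 12)
                                           (ash (cont (+ i 2)) 6)
                                           (cont (+ i 3)))
                                   (+ i 4)))))))                        ; 4 octets

With a string containing the two octets #xc2 #xa0, (unichar s 0) would return #xa0 (no-break-space), while (aref s 0) remains just the octet character with code #xc2.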
2) To the extent that anyone constructs unicode strings from individual codepoints, the use of these routines will ensure that the utf8 output is correct. An improperly formatted utf8-encoding will serve as valid input into aref, but not #'unichar, or whatever we call it. 3) There appear to be two meanings of alphabetic in the Hyperspec, which are not the same. The first is a graphic character with case, the second is a designator for constituent characters which do not separate tokens. alphabetic n., adj. 1. adj. (of a character) being one of the standard characters A through Z or a through z, or being any implementation-defined character that has case, or being some other graphic character defined by the implementation to be alphabetic[1]. 2. a. n. one of several possible constituent traits of a character. For details, see Section 2.1.4.1 (Constituent Characters) and Section 2.2 (Reader Algorithm). b. adj. (of a character) being a character that has syntax type constituent in the current readtable and that has the constituent trait alphabetic[2a]. See Figure 2-8. Defining octets >=128 as alpha-char-p means they are not used as token separators, at least in gcl's default reader. This makes sense for the input of a pair of octets representing no-break-space, as presumably this is a constituent character too. It does not make sense if one assumes that there must be distinct octets for the pair in no-break-space that correspond to the opposite case. So there appears to be a bit of an ambiguity here. Are there any non-constituent unicode codepoints in the non-ascii range? (Assuming yes, but probably not important.) 4) I think a dominant consideration here are the forms of most probable input and output. Files, terminals, even cut-paste from emacs buffers, all transfer valid utf8 encoded byte sequences into GCL which then intern, print, and string-compare correctly. Asking the unusual user who might want to set strings directly via their unicode codepoints to use a setf on unichar instead of aref, or better yet a unicode-char which outputs a string for concatenation, seems a small price to pay. Just thoughts... Take care, Raymond Toy <[email protected]> writes: >>>>>> "Matt" == Matt Kaufmann <[email protected]> writes: > > Matt> I saw your question and was curious, so I looked into it a bit: > >>> To your knowledge, is there any objection to defining alpha-char-p as > >>> including code-char's >= 128? > > Matt> I see that SBCL 1.2.2 is OK with that, for example: > > Matt> * (code-char 232) > > Matt> #\LATIN_SMALL_LETTER_E_WITH_GRAVE > Matt> * (alpha-char-p (code-char 232)) > > Matt> T > Matt> * > > Matt> In fact, that alpha-char-p call also returns T in (versions of) > Matt> Allegro CL, CCL, CLISP, CMU CL, LispWorks, and SBCL. > > Try (code-char #xa0). This is the unicode character > no-break-space. This has no case and would presumably not be > alpha-char-p. I think there are quite a few characters that would not > be (from cmucl): > > (count nil (loop for k from 128 upto 255 collect (alpha-char-p (code-char > k)))) > 63 > > > I think there is some confusion here, at least for me. If gcl uses > 8-bit code-units and utf-8 strings, what exactly is (coode-char 232)? > You can store that into a utf-8 string but it won't be > #\latin_small_letter_e_with_grave because that would be encoded as two > octets in a utf-8 string: 195 168. > > I think it's perfectly legal for gcl to say everything above 128 is > alpha-char-p. 
Just thoughts...

Take care,

Raymond Toy <[email protected]> writes:

>>>>>> "Matt" == Matt Kaufmann <[email protected]> writes:
>
>     Matt> I saw your question and was curious, so I looked into it a bit:
>
>     >> To your knowledge, is there any objection to defining alpha-char-p as
>     >> including code-char's >= 128?
>
>     Matt> I see that SBCL 1.2.2 is OK with that, for example:
>
>     Matt> * (code-char 232)
>     Matt> #\LATIN_SMALL_LETTER_E_WITH_GRAVE
>     Matt> * (alpha-char-p (code-char 232))
>     Matt> T
>     Matt> *
>
>     Matt> In fact, that alpha-char-p call also returns T in (versions of)
>     Matt> Allegro CL, CCL, CLISP, CMU CL, LispWorks, and SBCL.
>
> Try (code-char #xa0).  This is the unicode character no-break-space.
> This has no case and would presumably not be alpha-char-p.  I think
> there are quite a few characters that would not be (from cmucl):
>
> (count nil (loop for k from 128 upto 255 collect (alpha-char-p (code-char k))))
> 63
>
> I think there is some confusion here, at least for me.  If gcl uses
> 8-bit code-units and utf-8 strings, what exactly is (code-char 232)?
> You can store that into a utf-8 string but it won't be
> #\latin_small_letter_e_with_grave because that would be encoded as two
> octets in a utf-8 string: 195 168.
>
> I think it's perfectly legal for gcl to say everything above 128 is
> alpha-char-p.  I think, however, that people will just get confused
> that no such characters can be stored into a string and processed
> correctly as utf-8 without a bit of work.
>
> But perhaps this is just how 8-bit chars and utf-8 strings have
> to work.
>
> I think 16-bit chars with utf-16 or 32-bit chars with utf-32 are far
> easier to explain.
>
> K.I.S.S?
>
> --
> Ray

-- 
Camm Maguire                                            [email protected]
==========================================================================
"The earth is but one country, and mankind its citizens."  -- Baha'u'llah

_______________________________________________
Gcl-devel mailing list
[email protected]
https://lists.gnu.org/mailman/listinfo/gcl-devel
