Re: [pro] CDR for UTF* and UNICODE handling in :element-type (Re: write-char vs. 8-bit bytes)

Antoniotti Marco Sun, 13 Apr 2014 03:35:33 -0700

Thank you Steve for your very good summary.

IMHO, what is needed is at least the kind of summary document saying what 
“sub-standard” external formats are available; you say that this would be 
possible and is the only sensible thing to do at this time.  I agree, and that 
would already be a good thing to have.


Whether, and how, specifying the mapping to code points would break backward 
compatibility is not clear to me, but then, my character-coding fu, and my 
grasp of its ramifications, is very weak.

Cheers
—
MA

On Apr 12, 2014, at 06:12 , Steve Haflich <[email protected]> wrote:

> 
> There is very little for a substandard to specify without overreaching (or 
> unnecessarily duplicating) other, more universal specifications.
> 
> Back when X3J13 spent a lot of time considering I18N, Unicode didn't yet 
> exist.  Unicode has been a big success in dealing with a very difficult 
> problem, and had it existed, X3J13 probably would have specified it (if an 
> implementation chooses to support more than the set of basic chars) just as 
> Java did years later.  Further, UTF-8 didn't yet exist, but if it had, 
> implementations and perhaps even X3J13 would have adopted it as a default.  
> But that's the history of a different universe, and there exist some 
> historical implementations that have character code points different from 
> Unicode.
> 
> Every Common Lisp implementation implements a character set and maps those 
> characters onto nonnegative integer code points.  That mapping is not 
> specified, although Unicode (or perhaps just its intersection with ASCII) 
> would be the sane choice in modern times.  But this has nothing to do with 
> external formats.  Unicode does not define externalization formats -- it 
> defines _only_ the mapping of zillions of characters (most not yet 
> existent) onto nonnegative integer code points in a 21-bit range.  It 
> can and does do this without the blessing of the Lisp community.
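
To illustrate the distinction drawn above between code points and external
formats, here is a minimal sketch in Python (chosen only for illustration; it
is not Common Lisp and not part of the original message).  The sample
character and the two encodings are arbitrary choices.  It shows that a
character has one integer code point, while the octets that get written out
depend on the external format.

```python
# A code point is just a nonnegative integer assigned to a character.
ch = "é"
code_point = ord(ch)                 # U+00E9

# The same code point externalizes to different octet sequences
# depending on which external format (encoding) is chosen.
as_utf8 = ch.encode("utf-8")         # two octets
as_latin1 = ch.encode("iso-8859-1")  # one octet

print(hex(code_point), as_utf8, as_latin1)
```
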
> 
> UTF-8 defines a mapping of Unicode code points onto a sequence of octets.  It 
> was originally defined to support the encoding of arbitrary 31-bit 
> nonnegative integers onto sequences of 1 to 6 octets, but it was subsequently 
> tied closer to Unicode in that it is now defined to support only the 21-bit 
> Unicode range, and also in that certain code points (e.g. the surrogates) 
> are defined to be errors.  (Much of this is explained understandably on the 
> Wikipedia UTF-8 page.)  So, UTF-8 is well defined and can work without the 
> blessing of the Lisp community.
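
As a rough illustration of the mapping described above, the following Python
sketch encodes a single code point into UTF-8 octets.  The function name
`utf8_encode` is hypothetical and the code is a sketch, not a reference
implementation.  It assumes the modern definition: only code points up to
U+10FFFF (which fits in 21 bits) are accepted, the result is 1 to 4 octets
long, and surrogate code points are rejected as errors.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if cp < 0x80:
        # 7 bits: a single octet, identical to ASCII.
        return bytes([cp])
    if cp < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x20AC))  # EURO SIGN -> b'\xe2\x82\xac'
```

The result can be checked against Python's built-in codec, for example by
comparing `utf8_encode(cp)` with `chr(cp).encode("utf-8")`.
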
> 
> So, if an implementation supports UTF-8 as an external format, it ought 
> translate whatever it uses for its internal code points into UTF-8 (which 
> represents, of course, Unicode code points).  Those internal code points are 
> not the business of any specification, and the UTF-8 translation is already 
> well defined by the Unicode and UTF-8 standards.
> 
> What's left?  Well, there is a little that could still be productively 
> substandardificated.  Specifically, the ANS punts nearly completely on what 
> can be used as the value of an :external-format argument.  So quasi-portable 
> code can't know what to specify if it wants to join the modern computing 
> community and read/write UTF-8.  I think the obvious answer is to draft a 
> substandard for a convention of :keyword names which an implementation ought 
> support for portability.  (Allegro does this, and I'd be happy to provide a 
> list of the many encodings and ef names that have been supported for 
> decades.)  The most important one is of course :UTF-8, but it would be worth 
> semistandardizing this along with the many ISO8859-nn encodings plus the 
> several traditionally popular Japanese and Chinese encodings.  All these 
> encodings are rapidly 
> falling out of usage, but there are historical web pages and other sources 
> that Common Lisp ought be able to internalize (for those implementations that 
> think this is important).
> 
> Other than external format naming, I can't think of anything that Common Lisp 
> needs to standardize.  Yes, the language would have been a better 
> programming-ecology citizen if code points had been defined as Unicode, but 
> that would be backward incompatible.
> 
> _______________________________________________
> pro mailing list
> [email protected]
> http://common-lisp.net/cgi-bin/mailman/listinfo/pro

--
Marco Antoniotti, Associate Professor       tel. +39 - 02 64 48 79 01
DISCo, Università Milano Bicocca U14 2043   http://bimib.disco.unimib.it
Viale Sarca 336
I-20126 Milan (MI) ITALY

Please note that I am not checking my Spam-box anymore.
Please do not forward this email without asking me first.





