Re: [asdf-devel] source file encoding

Douglas Crosher Mon, 09 Apr 2012 08:34:05 -0700

Attached is my suggestion for adding external-format support.

* A table of translations is included, based on asdf-encodings; an
external-format not found in the table is passed through unchanged.  The
intention is to increase the range of aliases supported, making it easier to
write portable system definitions.  It is not the intention that the list be
exhaustive or that any attempt be made to verify the encoding.


* It uses :external-format.  Users will be working with external-formats,
perhaps for a foreign CL implementation, but still external-formats.
Introducing the new terminology of 'encoding' seems a mistake.

* No attempt is made to verify the external format.  This does not seem
necessary, and is not even possible.

* A declarative system definition can be used for both portable :utf-8 and
implementation-dependent (non-portable) external-formats.  There is no need
to add code or methods, or to extend asdf-encodings, in order to use
user-defined or implementation-dependent external formats.  Supporting
declarative definitions has many advantages over the alternative of requiring
asdf-encodings code or asdf methods to support such external-formats.

* The default is :default.  External-format support in ASDF would seem to be
needed to write 'portable' libraries with UTF-8 source files, so that will not
be possible until users have upgraded anyway.  Portability is not gained now
by making :utf-8 the default, so I just don't see the advantage of making
:utf-8 the default when this would break backward compatibility, make
migration problematic, and run contrary to the ANSI CL standard.

* At less than 200 lines of code it is just included in asdf.lisp.
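
To illustrate the declarative style described in the points above, here is a
minimal sketch of a system definition.  It assumes the attached patch is
applied; the :external-format option is not part of released ASDF.  The system
and file names are hypothetical, and :my-custom-format stands in for a
user-defined format that is not in the translation table.

```lisp
;; Sketch only: requires an ASDF with the attached patch applied.
;; All system and file names here are made up for illustration.
(asdf:defsystem :example-system
  ;; Set on the system; components without their own :external-format
  ;; inherit it through their parent.
  :external-format :utf-8
  :components
  ((:file "package")
   ;; :latin1 is an alias in the translation table, so it is translated
   ;; to the name at the head of its entry before being passed on.
   (:file "legacy" :external-format :latin1)
   ;; A format not found in the table is passed through unchanged to
   ;; COMPILE-FILE and LOAD; whether it works is up to the implementation.
   (:file "local" :external-format :my-custom-format)))
```

A definition with no :external-format anywhere falls back to :default, which
is the behaviour existing systems get today.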

Regards
Douglas Crosher


On 04/09/2012 12:36 AM, Faré wrote:
> Abstract:
> I think requiring a few marginal hackers doing weird things
> to specify :encoding :default
> is a small price to pay for everyone to be able to specify
> their encoding in a portable way, with a sane default
> that is already almost universally accepted (i.e. :utf-8).
> 
> On Sun, Apr 8, 2012 at 07:31, Douglas Crosher <[email protected]> wrote:
>> The portable-hemlock is still maintained and was updated a few months ago to 
>> avoid the use of non-ascii characters in the source so
>> it builds cleanly with UTF-8 as the input external-format.  The code is not 
>> in great shape, but is being improved.  See:
>> http://gitorious.org/hemlock/pages/Home
>>
> Oh, I hadn't noticed this new page for hemlock.
> Is CMUCL using the portable hemlock these days, or still including its own?
> 
>> Even if you get all the Quicklisp projects converted to be UTF-8 clean,
>> this still represents a subset of ASDF users.  I wish you
>> would reconsider these changes to ASDF because I fear it is divisive.
>>
> Well, I recognize that not all code is in Quicklisp and that
> there is a need for a backward compatibility mode.
> Putting :encoding :default in your defsystem will achieve just that.
> 
> At the same time, if :encoding :default
> rather than :encoding :utf-8 were the default,
> then we'd gain nothing, and it would still be a horrible mess
> to ascertain which system has been compiled with which encoding.
> 
>> It is not reasonable to expect users of ASDF to hack on
>> external support code just to use non-UTF-8 external-formats,
>> and the external library you plan for can never be complete because
>> the external-format is user extensible.
>>
> Well, on the one hand, for portability's sake,
> one should probably recode one's Lisp files to a universally supported
> external format.
> On the other hand, where portability is not a problem,
> one can either use :encoding :default and be back to the current semantics,
> or extend asdf-encodings as one extends external formats.
> 
>> ASDF could easily be flexible regarding the external-format
>> and not a limited bastion of portable open source code.
>>
> Agreed. Currently, ASDF is not flexible at all -- rather it is uncontrolled.
> 
>>  It would be very easy and workable to just name this :external-format,
>> and to pass through encodings not recognised - all the Quicklisp projects
>> would work just fine using :utf-8
>> and other CL users could use encodings as needed.
> Unhappily, passing through external formats is not portable,
> if only for CLISP.
> But if you're doing non-portable things,
> you can keep doing whatever you were previously doing
> with :encoding :default,
> or you can now define methods on asdf::component-external-format
> to do whatever you want, to override the default behavior of
> checking *encoding-external-format-hook*.
> Or then again, you can extend asdf-encodings to make it smarter.
> 
> In practice, how many people do you know who use a non-UTF-8 encoding,
> and how many of them will be majorly annoyed by having to either
> recode their source, explicitly specify their encoding,
> or add :encoding :default to preserve backwards compatibility?
> 
> —♯ƒ • François-René ÐVB Rideau •Reflection&Cybernethics• http://fare.tunes.org
> "I've finally learned what `upward compatible' means.  It means we
>  get to keep all our old mistakes."
>  — Dennie van Tassel
>

diff --git a/asdf.lisp b/asdf.lisp
index 946423c..9279c29 100644
--- a/asdf.lisp
+++ b/asdf.lisp
@@ -270,9 +270,7 @@
             #:component-parent
             #:component-property
             #:component-system
-            #:*utf-8-external-format*
-            #:component-encoding
-            #:*encoding-external-format-hook*
+            #:component-external-format
 
             #:component-depends-on
 
@@ -960,8 +958,6 @@ another pathname in a degenerate way."))
 
 (defgeneric* component-external-format (component))
 
-(defgeneric* component-encoding (component))
-
 (eval-when (#-gcl :compile-toplevel :load-toplevel :execute)
   (defgeneric* (setf module-components-by-name) (new-value module)))
 
@@ -1184,7 +1180,7 @@ processed in order by OPERATE."))
    (operation-times :initform (make-hash-table)
                     :accessor component-operation-times)
    (around-compile :initarg :around-compile)
-   (%encoding :accessor %component-encoding :initform nil :initarg :encoding)
+   (external-format :initform nil :initarg :external-format)
    ;; XXX we should provide some atomic interface for updating the
    ;; component properties
    (properties :accessor component-properties :initarg :properties
@@ -1295,41 +1291,205 @@ processed in order by OPERATE."))
               (acons property new-value (slot-value c 'properties)))))
   new-value)
 
-(defparameter *utf-8-external-format*
-  #+(and asdf-unicode (not clisp)) :utf-8
-  #+(and asdf-unicode clisp) charset:utf-8
-  #-asdf-unicode :default
-  "Default :external-format argument to pass to CL:OPEN and also
-CL:LOAD or CL:COMPILE-FILE to best process a UTF-8 encoded file.
-On modern implementations, this will decode UTF-8 code points as CL characters.
-On legacy implementations, we may fall back on some 8-bit encoding,
-with non-ASCII code points being read as several CL characters;
-hopefully, if done consistently, it won't affect program behavior too much.")
-
-(defmethod component-encoding ((c component))
-  (or (%component-encoding c)
-      (aif (component-parent c)
-           (component-encoding it)
-           :utf-8)))
-
-(defun default-encoding-external-format (encoding)
-  (case encoding
-    (:utf-8 *utf-8-external-format*)
-    (:default :default) ;; for backwards compatibility only. Usage discouraged.
-    (otherwise
-     (cerror "Continue using :external-format :default" (compatfmt "~@<Your ASDF component is using encoding ~S but it isn't recognized. Your system should :defsystem-depends-on (:asdf-encodings).~:>") encoding)
-     :default)))
 
-(defvar *encoding-external-format-hook*
-  #'default-encoding-external-format
-  "Hook for an extension to define a mapping between non-default encodings
-and implementation-defined external-format's")
-
-(defun encoding-external-format (encoding)
-  (funcall *encoding-external-format-hook* encoding))
+;;; Translations of external formats to those recognised by the current CL
+;;; implementation.  This is intended to extend the range of aliases recognised
+;;; to help writing portable system definitions.  Often the same keyword is
+;;; recognised by many CL implementations and this makes an obvious choice.
+;;; This will often be the IANA registered name with dashes rather than
+;;; underscores, see: http://www.iana.org/assignments/character-sets, or names
+;;; known to iconv or libiconv.  To help simplify the table a resulting
+;;; translation symbol is searched for in the :charset package on CLISP, and a
+;;; symbol named 'cp-nnn' is translated to a win32:code-page on Lispworks
+;;; windows.
+(defparameter *external-format-translations*
+  '((:utf-8 :utf8 :u8) ; our preferred default, environment-independent.
+    (:ascii :us-ascii :iso-646-us :ansi_x3.4-1968) ; in practice the lowest common denominator
+    (:iso-646 :|646|) ; even lower common denominator for old international encodings
+    ;;; ISO/IEC 8859
+    #-lispworks
+    (:iso-8859-1 :iso8859-1 :latin1 :latin-1 :l1 ; direct mapping to first 256 unicode characters
+     :iso_8859-1 :iso-ir-100 :csisolatin1 :ibm819 :cp819 :windows-28591)
+    #+lispworks
+    (:latin-1 :iso-8859-1 :iso8859-1 :latin1 :l1 ; direct mapping to first 256 unicode characters
+     :iso_8859-1 :iso-ir-100 :csisolatin1 :ibm819 :cp819 :windows-28591)
+    (:iso-8859-2 :iso8859-2 :latin2 :latin-2) ; eastern european; not to be confused with dos-cp852
+    (:iso-8859-3 :iso8859-3 :latin3 :latin-3) ; esperanto, maltese, (turkish)
+    (:iso-8859-4 :iso8859-4 :latin4 :latin-4) ; prefer latin6, utf-8.
+    (:iso-8859-5 :iso8859-5) ; cyrillic; prefer koi8-r, or utf-8
+    (:iso-8859-6 :iso8859-6) ; arabic
+    (:iso-8859-7 :iso8859-7 :ecma-118) ; greek
+    (:iso-8859-8 :iso8859-8) ; hebrew
+    (:iso-8859-9 :iso8859-9 :latin5 :latin-5) ; turkish
+    (:iso-8859-10 :iso8859-10 :latin6 :latin-6) ; nordic languages
+    (:iso-8859-11 :iso8859-11) ; almost same as TIS 620 which is preferred for thai
+    ;;(:iso-8859-12 :iso8859-12) ; abandoned. Was meant for devanagari. Use
+    (:iso-8859-13 :iso8859-13 :latin7 :latin-7) ; baltic rim
+    (:iso-8859-14 :iso8859-14 :latin8 :latin-8) ; celtic
+    (:iso-8859-15 :iso8859-15 :latin9 :latin-9) ; formerly also latin0. Tweak of latin1.
+    (:iso-8859-16 :iso8859-16 :latin10 :latin-10) ; south-eastern european
+    ;;; Windows code pages
+    (:cp1250 :windows-1250 :windows-cp1250 :cp-1250 :ms-ee) ; eastern european
+    (:cp1251 :windows-1251 :windows-cp1251 :cp-1251 :ms-cyrl) ; russian
+    (:cp1252 :windows-1252 :windows-cp1252 :cp-1252 :ms-ansi :windows-latin1) ; superset of latin1
+    (:cp1253 :windows-1253 :windows-cp1253 :cp-1253 :ms-greek) ; not quite iso-8859-7; prefer UTF-8
+    (:cp1254 :windows-1254 :windows-cp1254 :cp-1254 :ms-turk) ; superset of iso-8859-9; prefer UTF-8
+    (:cp1255 :windows-1255 :windows-cp1255 :cp-1255 :ms-hebr) ; mostly iso-8859-8; prefer UTF-8
+    (:cp1256 :windows-1256 :windows-cp1256 :cp-1256 :ms-arab) ; incompatible w/ iso-8859-6; use UTF-8
+    (:cp1257 :windows-1257 :windows-cp1257 :cp-1257 :ms-baltic :winbaltrim) ; prefer UTF-8
+    (:cp1258 :windows-1258 :windows-cp1258 :cp-1258 :ms-viet) ; vietnamese combining. Prefer UTF-8.
+    (:cp932 :windows-cp932 :cp-932 :windows-31j) ; Microsoft extension of Shift-JIS
+    (:cp936 :windows-cp936 :cp-936) ; Simplified Chinese. Extends GB2312 with most of GBK. Use 54936.
+    (:cp949 :windows-cp949 :cp-949) ; variant of EUC-KR. Use UTF-8
+    (:cp950 :windows-cp950 :cp-950) ; Microsoft variant of Big5
+    ;;; DOS code pages
+    ;;; For some of these CLISP has both "ms" and "ibm" variants
+    ;;; as in cp437 and cp437-ibm. We write CLISP!? next to them.
+    (:cp437 :dos-cp437 :cp-437) ; Original IBM PC character set. CLISP!?
+    (:cp737 :dos-cp737 :cp-737) ; Greek.
+    (:cp775 :dos-cp775 :cp-775) ; Estonian, Lithuanian and Latvian
+    (:cp850 :dos-cp850 :cp-850) ; Western Europe. Default MS-DOS code page in Windows 95.
+    (:cp852 :dos-cp852 :cp-852) ; Central Europe. CLISP!?
+    (:cp855 :dos-cp855 :cp-855) ; Russian. Not used much.
+    (:cp856 :dos-cp856 :cp-856) ; Yet another russian code page.
+    (:cp857 :dos-cp857 :cp-857) ; Turkish.
+    (:cp860 :dos-cp860 :cp-860) ; Portuguese. CLISP!?
+    (:cp861 :dos-cp861 :cp-861) ; Icelandic. CLISP!?
+    (:cp862 :dos-cp862 :cp-862) ; Hebrew. CLISP!?
+    (:cp863 :dos-cp863 :cp-863) ; French (mainly used in Quebec). CLISP!?
+    (:cp864 :dos-cp864 :cp-864) ; Arabic. CLISP!?
+    (:cp865 :dos-cp865 :cp-865) ; Nordic except Icelandic. CLISP!?
+    (:cp866 :dos-cp866 :cp-866) ; Russian. Once popular.
+    (:cp869 :dos-cp869 :cp-869 :dos-greek-2) ; failed before 737. CLISP!?
+    (:cp874 :dos-cp874 :cp-874) ; Thai. Extension of TIS-620. CLISP!?
+    (:cp1133 :cp-1133) ; IBM code page for lao
+    ;;; Mac code pages
+    #-lispworks
+    (:macintosh :mac-roman :macos-roman)
+    #+lispworks
+    (:macos-roman :macintosh :mac-roman)
+    (:mac-arabic)
+    #+clisp (:mac-central-europe :mac-centraleurope :mac-latin2) ; mac-latin2 name used by cmucl
+    #-clisp (:mac-centraleurope :mac-central-europe :mac-latin2) ; mac-latin2 name used by cmucl
+    (:mac-croatian)
+    (:mac-cyrillic :x-mac-cyrillic) ; sbcl calls it x-mac-cyrillic
+    (:mac-greek)
+    (:mac-hebrew)
+    (:mac-icelandic :mac-iceland)
+    (:mac-romania)
+    (:mac-thai) ; extension of TIS-620.
+    (:mac-turkish)
+    (:mac-ukraine)
+    (:mac-dingbat)
+    (:mac-symbol)
+    ;;; Implementation-specific hacks
+    ;;(:beta-gk) ; CMUCL: ASCII encoding of ancient Greek
+    ;;(:final-sigma) ; CMUCL: tweak final sigmas in greek (composable)
+    ;;(:base64) ;; CLISP: base64-encoded latin1? composable?
+    ;;(:cr :mac) (:crlf :dos) ; CMUCL: line ending tweaks (composable)
+    ;; CJK character sets
+    (:big5)
+    (:big5-hkscs)
+    (:euc-cn :euccn)
+    (:euc-jp :eucjp)
+    (:euc-kr :euckr)
+    (:euc-tw :euctw)
+    (:gbk) ; de facto standard of communist china, extends gb2312
+    (:gb18030 :cp54936) ; official character set of communist china, extends gb2312 and gbk
+    ;; ECMA-35 is same as ISO-2022, and free.
+    (:iso-2022-jp) ; rfc 1468
+    (:iso-2022-jp-1) ; rfc 2237
+    (:iso-2022-jp-2) ; rfc 1554
+    (:iso-2022-jp-3)
+    (:iso-2022-jp-2004)
+    (:iso-2022-kr) ; rfc 1557
+    (:iso-2022-cn) (:iso-2022-cn-ext) ; both rfc-1922
+    (:jis-x0201 :jisx0201 :jis_x0201) ; phonetic japanese katakana
+    (:jis-x0208 :jisx0208 :jis_x0208)
+    (:jis-x0212 :jisx0212 :jis_x0212)
+    #-lispworks
+    (:shift-jis :sjis)
+    #+lispworks
+    (:sjis :shift-jis)
+    (:johab :ksc-5601 :ksc5601) ; korean
+    ;;; Various National character sets
+    (:armscii-8) ; armenian
+    (:georgian-academy)
+    (:georgian-ps)
+    (:koi8-r :koi8r :cp-1866 :cp1866) ; russian
+    (:koi8-u :koi8u :cp-21866 :cp21866) ; ukrainian
+    (:tis-620) ; thai
+    (:tcvn) ; viet
+    (:viscii) ; viet
+    ;;; Other computer-specific sets
+    (:atascii :atarist) ; still supported by ECL
+    (:hp-roman8) ; used on HP-UX
+    ;;; EBCDIC
+    (:cp037 :cp-037 :ibm037 :ibm-037 :ebcdic-us) ; latin1
+    (:utf-ebcdic) ; utf-8
+    ;;; Unicode encodings beside utf-8
+    (:utf-7 :utf7) ; seldom used magic format for email
+    (:cesu-8) ; kind of utf-8 encoded utf-16 (ugh). What oracle miscalls utf-8.
+    (:java) ; java's modified UTF-8, like CESU-8 but with special encoding of U+0000.
+    #+lispworks
+    (:ucs-2 :ucs2) ; only BMP, may be either of the below
+    #+lispworks
+    (:unicode :ucs-2)
+    #-lispworks
+    (:ucs-2le :ucs-2-le :ucs2le)
+    #+lispworks
+    ((:unicode :little-endian t) :ucs-2le :ucs-2-le :ucs2le)
+    #-lispworks
+    (:ucs-2be :ucs-2-be :ucs2be)
+    #+lispworks
+    ((:unicode :little-endian nil) :ucs-2be :ucs-2-be :ucs2be)
+    (:utf-16 :utf16) ; may be either of the below
+    #-clisp
+    (:utf-16be :utf16be :utf16-be :unicode-16-big-endian)
+    #+clisp
+    (:unicode-16-big-endian :utf-16be :utf16be :utf16-be)
+    #-clisp
+    (:utf-16le :utf16le :utf16-le :unicode-16-little-endian)
+    #+clisp
+    (:unicode-16-little-endian :utf-16le :utf16le :utf16-le)
+    #-clisp
+    (:utf-32 :utf32 :ucs-4 :ucs4 :unicode-32)
+    #+clisp
+    (:unicode-32 :utf-32 :utf32 :ucs-4 :ucs4)
+    #-clisp
+    (:utf-32be :utf32be :utf-32-be :utf32-be :ucs-4-be :ucs-4be :unicode-32-big-endian)
+    #+clisp
+    (:unicode-32-big-endian :utf-32be :utf32be :utf-32-be :utf32-be :ucs-4-be :ucs-4be)
+    #-clisp
+    (:utf-32le :utf32le :utf-32-le :utf32-le :ucs-4-le :ucs-4le :unicode-32-little-endian)
+    #+clisp
+    (:unicode-32-little-endian :utf-32le :utf32le :utf-32-le :utf32-le :ucs-4-le :ucs-4le)
+    ))
+
+(defun translate-external-format (external-format)
+  (dolist (translation-list *external-format-translations* external-format)
+    (when (find external-format translation-list :test 'equalp)
+      (let ((translation (first translation-list)))
+	#+clisp
+	(when (symbolp translation)
+	  (setf translation (find-symbol* translation :charset)))
+	#+(and lispworks windows)
+	(when (symbolp translation)
+	  (let* ((s (string translation))
+		 (i (and (< 3 (length s)) (string-equal "cp-" s :end2 3)
+			 (multiple-value-bind (i l)
+			     (parse-integer s :start 3 :junk-allowed t)
+			   (and i (= l (length s)) i)))))
+	    (when i
+	      (setf translation `(win32:code-page :id ,i)))))
+	(when translation
+	  (return translation))))))
 
 (defmethod component-external-format ((c component))
-  (encoding-external-format (component-encoding c)))
+  (or (slot-value c 'external-format)
+      (aif (component-parent c)
+           (component-external-format it)
+           :default)))
 
 (defclass proto-system () ; slots to keep when resetting a system
   ;; To preserve identity for all objects, we'd need keep the components slots
@@ -2407,7 +2567,8 @@ recursive calls to traverse.")
          c #'(lambda ()
                (apply *compile-op-compile-file-function* source-file
                       :output-file output-file
-                      :external-format (component-external-format c)
+                      :external-format (translate-external-format
+					(component-external-format c))
                       (compile-op-flags operation))))
       (unless output
         (error 'compile-error :component c :operation operation))
@@ -2515,7 +2676,9 @@ recursive calls to traverse.")
   (let ((source (component-pathname c)))
     (setf (component-property c 'last-loaded-as-source)
           (and (call-with-around-compile-hook
-                c #'(lambda () (load source :external-format (component-external-format c))))
+                c #'(lambda ()
+		      (load source :external-format (translate-external-format
+						     (component-external-format c)))))
                (get-universal-time)))))
 
 (defmethod perform ((operation load-source-op) (c static-file))

_______________________________________________
asdf-devel mailing list
[email protected]
http://lists.common-lisp.net/cgi-bin/mailman/listinfo/asdf-devel
