Re: even more about character encoding names

Ben Pfaff Sat, 05 Feb 2011 13:20:13 -0800

I finally pushed this change to "master", along with a few other
minor patches.


I deleted the clause that called windows-1252 a superset of
ISO-8859-1.  Thanks for that comment.
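
For anyone who wants to see the difference concretely, here is a
quick Python sketch (not part of the patch).  It decodes every byte
value with both encodings and sorts the values into three buckets.
It relies on Python's own codec tables, which is an assumption about
how those tables are defined rather than anything SPSS specifies, and
the variable names are made up for illustration.

```python
# Compare how windows-1252 and ISO 8859-1 decode each possible byte value.
same, different, undefined = [], [], []

for value in range(256):
    raw = bytes([value])
    latin1 = raw.decode("iso-8859-1")  # defined for all 256 byte values
    try:
        cp1252 = raw.decode("windows-1252")
    except UnicodeDecodeError:
        # Python's windows-1252 table leaves a few byte values unassigned.
        undefined.append(value)
        continue
    if cp1252 == latin1:
        same.append(value)
    else:
        different.append(value)

# Every disagreement falls in the range 0x80..0x9F.
print("differ:   ", [hex(v) for v in different])
print("undefined:", [hex(v) for v in undefined])
```

Bytes 0x80 through 0x9F are C1 control characters in ISO 8859-1 but
mostly printable characters (such as the euro sign at 0x80) in
windows-1252, so neither encoding contains the other.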

John Darrington <[email protected]> writes:

> This seems to cover everything.  
>
> A purist might object to calling windows-1252 a "superset" of iso-8859-1 ... 
> they are just two different encodings, which happen to have large parts of 
> their mappings identical.
>
> J'
>
> On Mon, Jan 03, 2011 at 10:45:12AM -0800, Ben Pfaff wrote:
>      
>      I think you've told me all of this before.  It's time to write it
>      down.  Here's what I have as an update to
>      system-file-format.texi.  Can you look it over and verify that it
>      looks accurate?  Also, if you have any system files locally that
>      have other codepage numbers not already mentioned, please let me
>      know which ones and I'll add them to the list.
>      
>      --8<--------------------------cut here-------------------------->8--
>      
>      From: Ben Pfaff <[email protected]>
>      Date: Mon, 3 Jan 2011 10:43:21 -0800
>      Subject: [PATCH] doc: Update description of character encoding information in system files.
>      
>      Based on information provided by John Darrington and on system files
>      obtained freely from the Internet.
>      ---
>       doc/dev/system-file-format.texi |   66 +++++++++++++++++++++++++++++++++------
>       1 files changed, 56 insertions(+), 10 deletions(-)
>      
>      diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi
>      index 972b133..bf376b5 100644
>      --- a/doc/dev/system-file-format.texi
>      +++ b/doc/dev/system-file-format.texi
>      @@ -549,14 +549,46 @@ Compression code.  Always set to 1.
>       Machine endianness.  1 indicates big-endian, 2 indicates little-endian.
>       
>       @item int32 character_code;
>      -@anchor{character-code}
>      -Character code.  1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3
>      -indicates 8-bit ASCII, 4 indicates DEC Kanji.
>      -Windows code page numbers are also valid.
>      -
>      -Experience has shown that in many files, this field is ignored or incorrect.
>      -For a more reliable indication of the file's character encoding
>      -see @ref{Character Encoding Record}.
>      +@anchor{character-code} Character code.  The following values have
>      +been actually observed in system files:
>      +
>      +@table @asis
>      +@item 2
>      +7-bit ASCII.
>      +
>      +@item 1250
>      +The @code{windows-1250} code page for Central European and Eastern
>      +European languages.
>      +
>      +@item 1252
>      +The @code{windows-1252} code page for Western European languages, a
>      +superset of ISO 8859-1.
>      +
>      +@item 28591
>      +ISO 8859-1.
>      +
>      +@item 65001
>      +UTF-8.
>      +@end table
>      +
>      +The following additional values are known to be defined:
>      +
>      +@table @asis
>      +@item 1
>      +EBCDIC.
>      +
>      +@item 3
>      +8-bit ``ASCII''.
>      +
>      +@item 4
>      +DEC Kanji.
>      +@end table
>      +
>      +Other Windows code page numbers are known to be generally valid.
>      +
>      +Old versions of SPSS always wrote value 2 in this field, regardless of
>      +the encoding in use.  Newer versions also write the character encoding
>      +as a string (see @ref{Character Encoding Record}).
>       @end table
>       
>       @node Machine Floating-Point Info Record
>      @@ -959,8 +991,22 @@ The name of the character encoding.  Normally this will be an official IANA char
>       See @url{http://www.iana.org/assignments/character-sets}.
>       @end table
>       
>      -This record is not present in files generated by older software.
>      -See also @ref{character-code}.
>      +This record is not present in files generated by older software.  See
>      +also the @code{character_code} field in the machine integer info
>      +record (@pxref{character-code}).
>      +
>      +When the character encoding record and the machine integer info record
>      +are both present, all system files observed in practice indicate the
>      +same character encoding, e.g.@: 1252 as @code{character_code} and
>      +@code{windows-1252} as @code{encoding}, 65001 and @code{UTF-8}, etc.
>      +
>      +If, for testing purposes, a file is crafted with different
>      +@code{character_code} and @code{encoding}, it seems that
>      +@code{character_code} controls the encoding for all strings in the
>      +system file before the dictionary termination record, including
>      +strings in data (e.g.@: string missing values), and @code{encoding}
>      +controls the encoding for strings following the dictionary termination
>      +record.
>       
>       @node Long String Value Labels Record
>       @section Long String Value Labels Record
>      -- 
>      1.7.1
>      
>      
>      -- 
>      Ben Pfaff 
>      http://benpfaff.org
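
As an illustration of how a reader might use the two fields described
in the patch above, here is a minimal Python sketch.  It is not what
PSPP does.  The function name, the table name, the policy of
preferring the encoding record, and the decision to treat codes 1
through 4 as "unknown" are all assumptions made for the example.  The
code-to-name table contains only the values listed in the patch, and
the names on the right are simply ones Python's codec registry
accepts.

```python
import codecs

# character_code values observed in system files, per the patch text.
# Hypothetical table name; the codec names are Python's, not SPSS's.
OBSERVED_CODES = {
    1250: "windows-1250",
    1252: "windows-1252",
    28591: "iso-8859-1",
    65001: "utf-8",
}


def guess_encoding(character_code, encoding_name=None):
    """Return a codec name for a system file, or None if unknown.

    Assumed policy: trust the character encoding record when it is
    present, otherwise fall back to the character_code field.
    """
    if encoding_name:
        return encoding_name
    # Old versions of SPSS wrote 2 regardless of the real encoding, and
    # 1, 3 and 4 do not name a single well-defined codec, so give up.
    if character_code in (1, 2, 3, 4):
        return None
    if character_code in OBSERVED_CODES:
        return OBSERVED_CODES[character_code]
    # Other Windows code page numbers: see whether Python knows "cpNNNN".
    try:
        return codecs.lookup("cp%d" % character_code).name
    except LookupError:
        return None


print(guess_encoding(1252))            # table lookup
print(guess_encoding(2))               # unreliable legacy value
print(guess_encoding(65001, "UTF-8"))  # encoding record wins
```

A caller that gets None back would need its own default, for example
a locale-based guess.  The sketch does not attempt to model the case
where the two fields disagree.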

-- 
Ben Pfaff 
http://benpfaff.org

_______________________________________________
pspp-dev mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/pspp-dev
