I finally pushed this change to "master", along with a few other minor patches.
I deleted the clause that called windows-1252 a superset of
ISO-8859-1. Thanks for that comment.

John Darrington <[email protected]> writes:

> This seems to cover everything.
>
> A purist might object to calling windows-1252 a "superset" of iso-8859-1 ...
> they are just two different encodings, which happen to have large parts of
> they're mappings identical.
>
> J'
>
> On Mon, Jan 03, 2011 at 10:45:12AM -0800, Ben Pfaff wrote:
>
> I think you've told me all of this before. It's time to write it
> down. Here's what I have as an update to
> system-file-format.texi. Can you look it over and verify that it
> looks accurate? Also, if you have any system files locally that
> have other codepage numbers not already mentioned, please let me
> know which ones and I'll add them to the list.
>
> --8<--------------------------cut here-------------------------->8--
>
> From: Ben Pfaff <[email protected]>
> Date: Mon, 3 Jan 2011 10:43:21 -0800
> Subject: [PATCH] doc: Update description of character encoding information in system files.
>
> Based on information provided by John Darrington and on system files
> obtained freely from the Internet.
> ---
>  doc/dev/system-file-format.texi | 66 +++++++++++++++++++++++++++++++++------
>  1 files changed, 56 insertions(+), 10 deletions(-)
>
> diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi
> index 972b133..bf376b5 100644
> --- a/doc/dev/system-file-format.texi
> +++ b/doc/dev/system-file-format.texi
> @@ -549,14 +549,46 @@ Compression code. Always set to 1.
>  Machine endianness. 1 indicates big-endian, 2 indicates little-endian.
>
>  @item int32 character_code;
> -@anchor{character-code}
> -Character code. 1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3
> -indicates 8-bit ASCII, 4 indicates DEC Kanji.
> -Windows code page numbers are also valid.
> -
> -Experience has shown that in many files, this field is ignored or incorrect.
> -For a more reliable indication of the file's character encoding
> -see @ref{Character Encoding Record}.
> +@anchor{character-code} Character code. The following values have
> +been actually observed in system files:
> +
> +@table @asis
> +@item 2
> +7-bit ASCII.
> +
> +@item 1250
> +The @code{windows-1250} code page for Central European and Eastern
> +European languages.
> +
> +@item 1252
> +The @code{windows-1252} code page for Western European languages, a
> +superset of ISO 8859-1.
> +
> +@item 28591
> +ISO 8859-1.
> +
> +@item 65001
> +UTF-8.
> +@end table
> +
> +The following additional values are known to be defined:
> +
> +@table @asis
> +@item 1
> +EBCDIC.
> +
> +@item 3
> +8-bit ``ASCII''.
> +
> +@item 4
> +DEC Kanji.
> +@end table
> +
> +Other Windows code page numbers are known to be generally valid.
> +
> +Old versions of SPSS always wrote value 2 in this field, regardless of
> +the encoding in use. Newer versions also write the character encoding
> +as a string (see @ref{Character Encoding Record}).
>  @end table
>
>  @node Machine Floating-Point Info Record
> @@ -959,8 +991,22 @@ The name of the character encoding. Normally this will be an official IANA char
>  See @url{http://www.iana.org/assignments/character-sets}.
>  @end table
>
> -This record is not present in files generated by older software.
> -See also @ref{character-code}.
> +This record is not present in files generated by older software. See
> +also the @code{character_code} field in the machine integer info
> +record (@pxref{character-code}).
> +
> +When the character encoding record and the machine integer info record
> +are both present, all system files observed in practice indicate the
> +same character encoding, e.g.@: 1252 as @code{character_code} and
> +@code{windows-1252} as @code{encoding}, 65001 and @code{UTF-8}, etc.
> +
> +If, for testing purposes, a file is crafted with different
> +@code{character_code} and @code{encoding}, it seems that
> +@code{character_code} controls the encoding for all strings in the
> +system file before the dictionary termination record, including
> +strings in data (e.g.@: string missing values), and @code{encoding}
> +controls the encoding for strings following the dictionary termination
> +record.
>
>  @node Long String Value Labels Record
>  @section Long String Value Labels Record
> --
> 1.7.1
>
>
> --
> Ben Pfaff
> http://benpfaff.org

--
Ben Pfaff
http://benpfaff.org

_______________________________________________
pspp-dev mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/pspp-dev
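
As a footnote to the quoted patch, here is a rough sketch of how a reader of system files might turn the two fields into a usable codec. It is only an illustration in Python, not code from PSPP: the function names `codec_for_character_code` and `choose_codec`, the table of Python codec names, and the policy of preferring the `encoding` string over `character_code` are all assumptions made for this example. The preference is motivated by the note that old versions of SPSS always wrote 2 regardless of the real encoding. The sketch also ignores the split behavior seen in crafted test files, since files observed in practice agree on both fields.

```python
import codecs

# character_code values listed in the patch.  The codec names on the right
# are Python's and were chosen for this sketch (an assumption, not part of
# the patch).
KNOWN_CODES = {
    2: "ascii",
    1250: "cp1250",
    1252: "cp1252",
    28591: "iso8859-1",
    65001: "utf-8",
}

# Defined by the format, but too vague to map to a single codec here:
# 1 = EBCDIC, 3 = 8-bit "ASCII", 4 = DEC Kanji.
AMBIGUOUS_CODES = {1, 3, 4}


def codec_for_character_code(code):
    """Return a Python codec name for character_code, or None if unknown."""
    if code in AMBIGUOUS_CODES:
        return None
    # Other values are treated as Windows code page numbers ("cpNNNN").
    name = KNOWN_CODES.get(code, "cp%d" % code)
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def choose_codec(character_code, encoding=None):
    """Prefer the encoding record's string; fall back to character_code."""
    if encoding:
        try:
            return codecs.lookup(encoding).name
        except LookupError:
            pass  # not a name Python recognizes; fall back to the number
    return codec_for_character_code(character_code)
```

For example, `choose_codec(2, "windows-1252")` picks the string from the character encoding record over the possibly stale value 2, while `choose_codec(1250)` falls back to the code page number when no encoding record is present.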
