I'm working on a ZSAV implementation. Since users seem eager for this, here's a first draft. It reads all the ZSAV files I've encountered so far. It still needs tests and probably a writer implementation, which will take a few days.
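For orientation, here is a minimal sketch of the consistency checks that the 24-byte ZLIB data header documented in the patch allows. It is not code from the patch: the names `struct zheader`, `get_le64`, `parse_zheader`, and `check_zheader` are hypothetical, the sketch assumes a little-endian file (the real reader uses the integer format detected from the file header), and it treats a file-size mismatch as a failure, whereas the patch only warns about it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of the 24-byte ZLIB data header. */
struct zheader
  {
    int64_t zheader_ofs;   /* Offset of this header within the file. */
    int64_t ztrailer_ofs;  /* Offset of the ZLIB data trailer. */
    int64_t ztrailer_len;  /* Length of the ZLIB data trailer. */
  };

/* Decodes a little-endian 64-bit integer at P (assumption: the file
   is little-endian). */
static int64_t
get_le64 (const uint8_t *p)
{
  uint64_t x = 0;
  for (int i = 7; i >= 0; i--)
    x = (x << 8) | p[i];
  return (int64_t) x;
}

/* Splits the 24 raw header bytes in RAW into the three fields of *ZH. */
static void
parse_zheader (const uint8_t raw[24], struct zheader *zh)
{
  assert (raw != NULL && zh != NULL);
  zh->zheader_ofs = get_le64 (raw);
  zh->ztrailer_ofs = get_le64 (raw + 8);
  zh->ztrailer_len = get_le64 (raw + 16);
}

/* Returns true if ZH, read at offset POS of a file FILE_SIZE bytes
   long, is self-consistent.  On success stores the number of
   compressed data blocks in *N_BLOCKS. */
static bool
check_zheader (const struct zheader *zh, int64_t pos, int64_t file_size,
               int64_t *n_blocks)
{
  /* The header records its own offset. */
  if (zh->zheader_ofs != pos)
    return false;

  /* The trailer cannot start before the end of the header. */
  if (zh->ztrailer_ofs < pos + 24)
    return false;

  /* Fixed 24-byte trailer header plus 24 bytes per block descriptor. */
  if (zh->ztrailer_len < 24 || zh->ztrailer_len % 24 != 0)
    return false;

  /* Trailer offset plus trailer length should equal the file size.
     Written as a subtraction to avoid signed overflow. */
  if (zh->ztrailer_ofs != file_size - zh->ztrailer_len)
    return false;

  *n_blocks = zh->ztrailer_len / 24 - 1;
  return true;
}
```

For example, a header at offset 0x200 with `ztrailer_ofs` 0x318 and `ztrailer_len` 72 in a file of 0x318 + 72 bytes passes these checks and implies two compressed data blocks.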
--8<--------------------------cut here-------------------------->8-- From: Ben Pfaff <b...@cs.stanford.edu> Date: Tue, 15 Oct 2013 00:14:01 -0700 Subject: [PATCH] Work on ZSAV implementation. --- doc/dev/system-file-format.texi | 182 ++++++++++++++++++--- src/data/sys-file-private.h | 14 +- src/data/sys-file-reader.c | 265 ++++++++++++++++++++++++++++--- src/data/sys-file-reader.h | 10 +- src/language/dictionary/sys-file-info.c | 7 +- utilities/pspp-dump-sav.c | 139 ++++++++++++++-- 6 files changed, 558 insertions(+), 59 deletions(-) diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi index f408ff2..fc9a455 100644 --- a/doc/dev/system-file-format.texi +++ b/doc/dev/system-file-format.texi @@ -56,6 +56,18 @@ appears in system files only in missing value ranges, which never contain SYSMIS. @end table +System files may use most character encodings based on an 8-bit unit. +UTF-16 and UTF-32, based on wider units, appear to be unacceptable. +@code{rec_type} in the file header record is sufficient to distinguish +between ASCII and EBCDIC based encodings. The best way to determine +the specific encoding in use is to consult the character encoding +record (@pxref{Character Encoding Record}), if present, and failing +that the @code{character_code} in the machine integer info record +(@pxref{Machine Integer Info Record}). The same encoding should be +used for the dictionary and the data in the file, although it is +possible to artificially synthesize files that use different encodings +(@pxref{Character Encoding Record}). + System files are divided into records, each of which begins with a 4-byte record type, usually regarded as an @code{int32}. 
@@ -121,7 +133,7 @@ char rec_type[4]; char prod_name[60]; int32 layout_code; int32 nominal_case_size; -int32 compressed; +int32 compression; int32 weight_index; int32 ncases; flt64 bias; @@ -133,9 +145,17 @@ char padding[3]; @table @code @item char rec_type[4]; -Record type code, set to @samp{$FL2}, that is, either @code{24 46 4c -32} if the file uses an ASCII-based character encoding, or @code{5b c6 -d3 f2} if the file uses an EBCDIC-based character encoding. +Record type code, either @samp{$FL2} for system files with +uncompressed data or data compressed with simple bytecode compression, +or @samp{$FL3} for system files with ZLIB compressed data. + +This is truly a character field that uses the same character encoding as +other strings. Thus, in a file with an ASCII-based character encoding +this field contains @code{24 46 4c 32} or @code{24 46 4c 33}, and in a +file with an EBCDIC-based encoding this field contains @code{5b c6 d3 +f2}. (SPSS documentation states that ZLIB-compressed files must be +encoded in UTF-8, so EBCDIC-based ZLIB-compressed files presumably do +not exist.) @item char prod_name[60]; Product identification string. This always begins with the characters @@ -160,7 +180,10 @@ files written by some systems set this value to -1. In general, it is unsafe for systems reading system files to rely upon this value. @item int32 compressed; -Set to 1 if the data in the file is compressed, 0 otherwise. +Set to 0 if the data in the file is not compressed, 1 if the data is +compressed with simple bytecode compression, 2 if the data is ZLIB +compressed. This field has value 2 if and only if @code{rec_type} is +@samp{$FL3}. @item int32 weight_index; If one of the variables in the data set is used as a weighting @@ -577,7 +600,8 @@ Floating point representation code. For IEEE 754 systems this is 1. IBM 370 sets this to 2, and DEC VAX E to 3. @item int32 compression_code; -Compression code. 
Always set to 1, regardless of whether or how the +file is compressed. @item int32 endianness; Machine endianness. 1 indicates big-endian, 2 indicates little-endian. @@ -1434,22 +1458,23 @@ Ignored padding. Should be set to 0. @node Data Record @section Data Record -Data records must follow all other records in the system file. There must -be at least one data record in every system file. - -The format of data records varies depending on whether the data is -compressed. Regardless, the data is arranged in a series of 8-byte -elements. +The data record must follow all other records in the system file. +Every system file must have a data record that specifies data for at +least one case. The format of the data record varies depending on the +value of @code{compression} in the file header record: -When data is not compressed, -each element corresponds to +@table @asis +@item 0: no compression +Data is arranged as a series of 8-byte elements. +Each element corresponds to the variable declared in the respective variable record (@pxref{Variable Record}). Numeric values are given in @code{flt64} format; string values are literal characters string, padded on the right when necessary to fill out 8-byte units. -Compressed data is arranged in the following manner: the first 8 bytes -in the data section is divided into a series of 1-byte command +@item 1: bytecode compression +The first 8 bytes +of the data record is divided into a series of 1-byte command codes. These codes have meanings as described below: @table @asis @@ -1487,8 +1512,125 @@ An 8-byte string value that is all spaces. The system-missing value. @end table -When the end of the an 8-byte group of command bytes is reached, any -blocks of non-compressible values indicated by code 253 are skipped, -and the next element of command bytes is read and interpreted, until -the end of the file or a code with value 252 is reached. 
+The end of the 8-byte group of bytecodes is followed by any 8-byte +blocks of non-compressible values indicated by code 253. After that +follows another 8-byte group of bytecodes, then those bytecodes' +non-compressible values. The pattern repeats to the end of the file +or a code with value 252. + +@item 2: ZLIB compression +The data record consists of the following, in order: + +@itemize @bullet +@item +ZLIB data header, 24 bytes long. + +@item +One or more variable-length blocks of ZLIB compressed data. + +@item +ZLIB data trailer, with a 24-byte fixed header plus an additional 24 +bytes for each preceding ZLIB compressed data block. +@end itemize + +The ZLIB data header has the following format: + +@example +int64 zheader_ofs; +int64 ztrailer_ofs; +int64 ztrailer_len; +@end example + +@table @code +@item int64 zheader_ofs; +The offset, in bytes, of the beginning of this structure within the +system file. + +@item int64 ztrailer_ofs; +The offset, in bytes, of the first byte of the ZLIB data trailer. + +@item int64 ztrailer_len; +The number of bytes in the ZLIB data trailer. This and the previous +field sum to the size of the system file in bytes. +@end table + +The data header is followed by @code{(ztrailer_len - 24) / 24} ZLIB +compressed data blocks. Each ZLIB compressed data block begins with a +ZLIB header as specified in RFC@tie{}1950, e.g.@: hex bytes @code{78 +01} (the only header yet observed in practice). Each block +decompresses to a fixed number of bytes (in practice only +@code{0x3ff000}-byte blocks have been observed), except that the last +block of data may be shorter. The last ZLIB compressed data block +ends just before offset @code{ztrailer_ofs}. + +The result of ZLIB decompression is bytecode compressed data as +described above for compression format 1. 
+ +The ZLIB data trailer begins with the following 24-byte fixed header: + +@example +int64 int_bias; +int64 zero; +int32 block_size; +int32 n_blocks; +@end example + +@table @code +@item int64 int_bias; +The compression bias as a negative integer, e.g.@: if @code{bias} in +the file header record is 100.0, then @code{int_bias} is @minus{}100 +(this is the only value yet observed in practice). + +@item int64 zero; +Always observed to be zero. + +@item int32 block_size; +The number of bytes in each ZLIB compressed data block, except +possibly the last, following decompression. Only @code{0x3ff000} has +been observed so far. + +@item int32 n_blocks; +The number of ZLIB compressed data blocks, always exactly +@code{(ztrailer_len - 24) / 24}. +@end table + +The fixed header is followed by @code{n_blocks} 24-byte ZLIB data +block descriptors, each of which describes the compressed data block +corresponding to its offset. Each block descriptor has the following +format: + +@example +int64 uncompressed_ofs; +int64 compressed_ofs; +int32 uncompressed_size; +int32 compressed_size; +@end example + +@table @code +@item int64 uncompressed_ofs; +The offset, in bytes, that this block of data would have in a similar +system file that uses compression format 1. This is +@code{zheader_ofs} in the first block descriptor, and in each +succeeding block descriptor it is the sum of the previous descriptor's +@code{uncompressed_ofs} and @code{uncompressed_size}. + +@item int64 compressed_ofs; +The offset, in bytes, of the actual beginning of this compressed data +block. This is @code{zheader_ofs + 24} in the first block descriptor, +and in each succeeding block descriptor it is the sum of the previous +descriptor's @code{compressed_ofs} and @code{compressed_size}. The +final block descriptor's @code{compressed_ofs} and +@code{compressed_size} sum to @code{ztrailer_ofs}. + +@item int32 uncompressed_size; +The number of bytes in this data block, after decompression. 
This is +@code{block_size} in every data block except the last, which may be +smaller. + +@item int32 compressed_size; +The number of bytes in this data block, as stored compressed in this +system file. +@end table +@end table + @setfilename ignored diff --git a/src/data/sys-file-private.h b/src/data/sys-file-private.h index 21ff8ad..72f1ae3 100644 --- a/src/data/sys-file-private.h +++ b/src/data/sys-file-private.h @@ -1,5 +1,5 @@ /* PSPP - a program for statistical analysis. - Copyright (C) 2006-2007, 2009-2012 Free Software Foundation, Inc. + Copyright (C) 2006-2007, 2009-2013 Free Software Foundation, Inc. This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by @@ -35,12 +35,14 @@ struct dictionary; -/* Magic numbers. +/* ASCII magic numbers. */ +#define ASCII_MAGIC "$FL2" /* For regular files. */ +#define ASCII_ZMAGIC "$FL3" /* For ZLIB compressed files. */ - Both of these are actually $FL2 in the respective character set. The "FL2" - part is invariant among national variants of each character set, but "$" has - different encodings, so it is safer to write them as hexadecimal. */ -#define ASCII_MAGIC "\x24\x46\x4c\x32" +/* EBCDIC magic number, the same as ASCII_MAGIC but encoded in EBCDIC. + + No EBCDIC ZLIB compressed files have been observed, so we do not define + EBCDIC_ZMAGIC even though the value is obvious. */ #define EBCDIC_MAGIC "\x5b\xc6\xd3\xf2" /* A variable in a system file. 
*/ diff --git a/src/data/sys-file-reader.c b/src/data/sys-file-reader.c index d553b3a..b6f5acf 100644 --- a/src/data/sys-file-reader.c +++ b/src/data/sys-file-reader.c @@ -24,6 +24,8 @@ #include <inttypes.h> #include <setjmp.h> #include <stdlib.h> +#include <sys/stat.h> +#include <zlib.h> #include "data/attributes.h" #include "data/case.h" @@ -57,6 +59,7 @@ #include "gl/minmax.h" #include "gl/unlocked-io.h" #include "gl/xalloc.h" +#include "gl/xalloc-oversized.h" #include "gl/xsize.h" #include "gettext.h" @@ -173,11 +176,21 @@ struct sfm_reader const char *encoding; /* String encoding. */ /* Decompression. */ - bool compressed; /* File is compressed? */ + enum sfm_compression compression; double bias; /* Compression bias, usually 100.0. */ uint8_t opcodes[8]; /* Current block of opcodes. */ size_t opcode_idx; /* Next opcode to interpret, 8 if none left. */ bool corruption_warning; /* Warned about possible corruption? */ + + /* ZLIB decompression. */ + long long int ztrailer_ofs; /* Offset of ZLIB trailer at end of file. */ +#define ZIN_BUF_SIZE 4096 + uint8_t *zin_buf; /* Inflation input buffer. */ +#define ZOUT_BUF_SIZE 16384 + uint8_t *zout_buf; /* Inflation output buffer. */ + unsigned int zout_end; /* Number of bytes of data in zout_buf. */ + unsigned int zout_pos; /* First unconsumed byte in zout_buf. */ + z_stream zstream; /* ZLIB inflater. */ }; static const struct casereader_class sys_file_casereader_class; @@ -200,10 +213,19 @@ static void sys_error (struct sfm_reader *, off_t, const char *, ...) static void read_bytes (struct sfm_reader *, void *, size_t); static bool try_read_bytes (struct sfm_reader *, void *, size_t); static int read_int (struct sfm_reader *); -static double read_float (struct sfm_reader *); +static long long int read_int64 (struct sfm_reader *); static void read_string (struct sfm_reader *, char *, size_t); static void skip_bytes (struct sfm_reader *, size_t); +/* ZLIB compressed data handling. 
*/ +static void read_zheader (struct sfm_reader *); +static void open_zstream (struct sfm_reader *); +static void close_zstream (struct sfm_reader *); +static bool read_bytes_zlib (struct sfm_reader *, void *, size_t); +static void read_compressed_bytes (struct sfm_reader *, void *, size_t); +static bool try_read_compressed_bytes (struct sfm_reader *, void *, size_t); +static double read_compressed_float (struct sfm_reader *); + static char *fix_line_ends (const char *); static int parse_int (struct sfm_reader *, const void *data, size_t ofs); @@ -367,6 +389,7 @@ sfm_open_reader (struct file_handle *fh, const char *volatile encoding, r->error = false; r->opcode_idx = sizeof r->opcodes; r->corruption_warning = false; + r->zin_buf = r->zout_buf = NULL; info = infop ? infop : xmalloc (sizeof *info); memset (info, 0, sizeof *info); @@ -472,6 +495,9 @@ sfm_open_reader (struct file_handle *fh, const char *volatile encoding, } } + if (r->compression == SFM_COMP_ZLIB) + read_zheader (r); + /* Now actually parse what we read. First, figure out the correct character encoding, because this determines @@ -646,7 +672,9 @@ sfm_detect (FILE *file) return false; magic[4] = '\0'; - return !strcmp (ASCII_MAGIC, magic) || !strcmp (EBCDIC_MAGIC, magic); + return (!strcmp (ASCII_MAGIC, magic) + || !strcmp (ASCII_ZMAGIC, magic) + || !strcmp (EBCDIC_MAGIC, magic)); } /* Reads the global header of the system file. 
Initializes *HEADER and *INFO, @@ -658,12 +686,18 @@ read_header (struct sfm_reader *r, struct sfm_read_info *info, { uint8_t raw_layout_code[4]; uint8_t raw_bias[8]; + int compressed; + bool zmagic; read_string (r, header->magic, sizeof header->magic); read_string (r, header->eye_catcher, sizeof header->eye_catcher); - if (strcmp (ASCII_MAGIC, header->magic) - && strcmp (EBCDIC_MAGIC, header->magic)) + if (!strcmp (ASCII_MAGIC, header->magic) + || !strcmp (EBCDIC_MAGIC, header->magic)) + zmagic = false; + else if (!strcmp (ASCII_ZMAGIC, header->magic)) + zmagic = true; + else sys_error (r, 0, _("This is not an SPSS system file.")); /* Identify integer format. */ @@ -681,7 +715,25 @@ read_header (struct sfm_reader *r, struct sfm_read_info *info, || header->nominal_case_size > INT_MAX / 16) header->nominal_case_size = -1; - r->compressed = read_int (r) != 0; + compressed = read_int (r); + if (!zmagic) + { + if (compressed == 0) + r->compression = SFM_COMP_NONE; + else if (compressed == 1) + r->compression = SFM_COMP_SIMPLE; + else if (compressed != 0) + sys_error (r, 0, "System file header has invalid compression " + "value %d.", compressed); + } + else + { + if (compressed == 2) + r->compression = SFM_COMP_ZLIB; + else + sys_error (r, 0, "ZLIB-compressed system file header has invalid " + "compression value %d.", compressed); + } header->weight_idx = read_int (r); @@ -723,7 +775,7 @@ read_header (struct sfm_reader *r, struct sfm_read_info *info, info->integer_format = r->integer_format; info->float_format = r->float_format; - info->compressed = r->compressed; + info->compression = r->compression; info->case_cnt = r->case_cnt; } @@ -2289,7 +2341,7 @@ read_error (struct casereader *r, const struct sfm_reader *sfm) static bool read_case_number (struct sfm_reader *r, double *d) { - if (!r->compressed) + if (r->compression == SFM_COMP_NONE) { uint8_t number[8]; if (!try_read_bytes (r, number, sizeof number)) @@ -2339,13 +2391,13 @@ read_case_string (struct sfm_reader 
*r, uint8_t *s, size_t length) static int read_opcode (struct sfm_reader *r) { - assert (r->compressed); + assert (r->compression != SFM_COMP_NONE); for (;;) { int opcode; if (r->opcode_idx >= sizeof r->opcodes) { - if (!try_read_bytes (r, r->opcodes, sizeof r->opcodes)) + if (!try_read_compressed_bytes (r, r->opcodes, sizeof r->opcodes)) return -1; r->opcode_idx = 0; } @@ -2370,7 +2422,7 @@ read_compressed_number (struct sfm_reader *r, double *d) return false; case 253: - *d = read_float (r); + *d = read_compressed_float (r); break; case 254: @@ -2411,7 +2463,7 @@ read_compressed_string (struct sfm_reader *r, uint8_t *dst) return false; case 253: - read_bytes (r, dst, 8); + read_compressed_bytes (r, dst, 8); break; case 254: @@ -2453,7 +2505,7 @@ static bool read_whole_strings (struct sfm_reader *r, uint8_t *s, size_t length) { assert (length % 8 == 0); - if (!r->compressed) + if (r->compression == SFM_COMP_NONE) return try_read_bytes (r, s, length); else { @@ -2820,14 +2872,14 @@ read_int (struct sfm_reader *r) return integer_get (r->integer_format, integer, sizeof integer); } -/* Reads a 64-bit floating-point number from R and returns its - value in host format. */ -static double -read_float (struct sfm_reader *r) +/* Reads a 64-bit signed integer from R and returns its value in + host format. */ +static long long int +read_int64 (struct sfm_reader *r) { - uint8_t number[8]; - read_bytes (r, number, sizeof number); - return float_get_double (r->float_format, number); + uint8_t integer[8]; + read_bytes (r, integer, sizeof integer); + return integer_get (r->integer_format, integer, sizeof integer); } static int @@ -2894,6 +2946,179 @@ fix_line_ends (const char *s) return dst; } +static void * +zalloc (voidpf pool_, uInt items, uInt size) +{ + struct pool *pool = pool_; + + return (!size || xalloc_oversized (items, size) + ? 
Z_NULL + : pool_malloc (pool, items * size)); +} + +static void +zfree (voidpf pool_, voidpf address) +{ + struct pool *pool = pool_; + + pool_free (pool, address); +} + +static void +read_zheader (struct sfm_reader *r) +{ + off_t pos = r->pos; + long long int zheader_ofs = read_int64 (r); + long long int ztrailer_ofs = read_int64 (r); + long long int ztrailer_len = read_int64 (r); + struct stat s; + + if (zheader_ofs != pos) + sys_error (r, pos, _("Wrong ZLIB data header offset 0x%llx."), + zheader_ofs); + + if (ztrailer_ofs < r->pos) + sys_error (r, pos, _("Impossible ZLIB trailer offset 0x%llx."), + ztrailer_ofs); + + if (ztrailer_len < 24 || ztrailer_len % 24) + sys_error (r, pos, _("Invalid ZLIB trailer length %lld."), ztrailer_len); + + if (!fstat(fileno(r->file), &s) + && ztrailer_ofs + ztrailer_len != s.st_size) + sys_warn (r, pos, + _("End of ZLIB trailer (0x%llx) is not file size (0x%llx)."), + ztrailer_ofs + ztrailer_len, (long long int) s.st_size); + + r->ztrailer_ofs = ztrailer_ofs; + + if (r->zin_buf == NULL) + { + r->zin_buf = pool_malloc (r->pool, ZIN_BUF_SIZE); + r->zout_buf = pool_malloc (r->pool, ZOUT_BUF_SIZE); + r->zstream.next_in = NULL; + r->zstream.avail_in = 0; + } + + r->zstream.zalloc = zalloc; + r->zstream.zfree = zfree; + r->zstream.opaque = r->pool; + + open_zstream (r); +} + +static void +open_zstream (struct sfm_reader *r) +{ + int error; + + r->zout_pos = r->zout_end = 0; + error = inflateInit (&r->zstream); + if (error != Z_OK) + sys_error (r, r->pos, _("ZLIB initialization failed (%s)."), + r->zstream.msg); +} + +static void +close_zstream (struct sfm_reader *r) +{ + int error; + + error = inflateEnd (&r->zstream); + if (error != Z_OK) + sys_error (r, r->pos, _("Inconsistency at end of ZLIB stream (%s)."), + r->zstream.msg); +} + +static bool +read_bytes_zlib (struct sfm_reader *r, void *buf_, size_t byte_cnt) +{ + uint8_t *buf = buf_; + + if (byte_cnt == 0) + return true; + + for (;;) + { + int error; + + /* Use already inflated 
data if there is any. */ + if (r->zout_pos < r->zout_end) + { + unsigned int n = MIN (byte_cnt, r->zout_end - r->zout_pos); + memcpy (buf, &r->zout_buf[r->zout_pos], n); + r->zout_pos += n; + byte_cnt -= n; + buf += n; + + if (byte_cnt == 0) + return true; + } + + /* We need to inflate some more data. + Get some more input data if we don't have any. */ + if (r->zstream.avail_in == 0) + { + unsigned int n = MIN (ZIN_BUF_SIZE, r->ztrailer_ofs - r->pos); + if (n == 0 || !try_read_bytes (r, r->zin_buf, n)) + return false; + r->zstream.avail_in = n; + r->zstream.next_in = r->zin_buf; + } + + /* Inflate the (remaining) input data. */ + r->zstream.avail_out = ZOUT_BUF_SIZE; + r->zstream.next_out = r->zout_buf; + error = inflate (&r->zstream, Z_SYNC_FLUSH); + r->zout_pos = 0; + r->zout_end = r->zstream.next_out - r->zout_buf; + if (r->zout_end == 0) + { + if (error == Z_STREAM_END) + { + close_zstream (r); + open_zstream (r); + } + else + sys_error (r, r->pos, _("ZLIB stream inconsistency (%s)."), + r->zstream.msg); + } + else + { + /* Process the output data and ignore 'error' for now. ZLIB will + present it to us again on the next inflate() call. */ + } + } +} + +static void +read_compressed_bytes (struct sfm_reader *r, void *buf, size_t byte_cnt) +{ + if (r->compression == SFM_COMP_SIMPLE) + return read_bytes (r, buf, byte_cnt); + else if (!read_bytes_zlib (r, buf, byte_cnt)) + sys_error (r, r->pos, _("Unexpected end of ZLIB compressed data.")); +} + +static bool +try_read_compressed_bytes (struct sfm_reader *r, void *buf, size_t byte_cnt) +{ + if (r->compression == SFM_COMP_SIMPLE) + return try_read_bytes (r, buf, byte_cnt); + else + return read_bytes_zlib (r, buf, byte_cnt); +} + +/* Reads a 64-bit floating-point number from R and returns its + value in host format. 
*/ +static double +read_compressed_float (struct sfm_reader *r) +{ + uint8_t number[8]; + read_compressed_bytes (r, number, sizeof number); + return float_get_double (r->float_format, number); +} + static const struct casereader_class sys_file_casereader_class = { sys_file_casereader_read, diff --git a/src/data/sys-file-reader.h b/src/data/sys-file-reader.h index 037d33a..52457a0 100644 --- a/src/data/sys-file-reader.h +++ b/src/data/sys-file-reader.h @@ -26,6 +26,14 @@ /* Reading system files. */ +/* System file compression format. */ +enum sfm_compression + { + SFM_COMP_NONE, /* No compression. */ + SFM_COMP_SIMPLE, /* Bytecode compression of integer values. */ + SFM_COMP_ZLIB /* ZLIB "deflate" compression. */ + }; + /* System file info that doesn't fit in struct dictionary. The strings in this structure are encoded in UTF-8. (They are normally in @@ -36,7 +44,7 @@ struct sfm_read_info char *creation_time; /* "hh:mm:ss". */ enum integer_format integer_format; enum float_format float_format; - bool compressed; /* 0=no, 1=yes. */ + enum sfm_compression compression; casenumber case_cnt; /* -1 if unknown. */ char *product; /* Product name. */ char *product_ext; /* Extra product info. */ diff --git a/src/language/dictionary/sys-file-info.c b/src/language/dictionary/sys-file-info.c index 3327a2c..c7f326f 100644 --- a/src/language/dictionary/sys-file-info.c +++ b/src/language/dictionary/sys-file-info.c @@ -150,10 +150,11 @@ cmd_sysfile_info (struct lexer *lexer, struct dataset *ds UNUSED) ? var_get_name (weight_var) : _("Not weighted."))); } - tab_text (t, 0, r, TAB_LEFT, _("Mode:")); + tab_text (t, 0, r, TAB_LEFT, _("Compression:")); tab_text_format (t, 1, r++, TAB_LEFT, - _("Compression %s."), info.compressed ? _("on") : _("off")); - + info.compression == SFM_COMP_NONE ? _("None") + : info.compression == SFM_COMP_SIMPLE ? 
"SAV" + : "ZSAV"); tab_text (t, 0, r, TAB_LEFT, _("Charset:")); tab_text (t, 1, r++, TAB_LEFT, dict_get_encoding (d)); diff --git a/utilities/pspp-dump-sav.c b/utilities/pspp-dump-sav.c index c6b5823..8eaf836 100644 --- a/utilities/pspp-dump-sav.c +++ b/utilities/pspp-dump-sav.c @@ -39,6 +39,13 @@ #define ID_MAX_LEN 64 +enum compression + { + COMP_NONE, + COMP_SIMPLE, + COMP_ZLIB + }; + struct sfm_reader { const char *file_name; @@ -52,7 +59,7 @@ struct sfm_reader enum integer_format integer_format; enum float_format float_format; - bool compressed; + enum compression compression; double bias; }; @@ -87,7 +94,8 @@ static void read_long_string_missing_values (struct sfm_reader *r, size_t size, size_t count); static void read_unknown_extension (struct sfm_reader *, size_t size, size_t count); -static void read_compressed_data (struct sfm_reader *, int max_cases); +static void read_simple_compressed_data (struct sfm_reader *, int max_cases); +static void read_zlib_compressed_data (struct sfm_reader *); static struct text_record *open_text_record ( struct sfm_reader *, size_t size); @@ -180,7 +188,7 @@ main (int argc, char *argv[]) r.n_var_widths = 0; r.allocated_var_widths = 0; r.var_widths = 0; - r.compressed = false; + r.compression = COMP_NONE; if (argc - optind > 1) printf ("Reading \"%s\":\n", r.file_name); @@ -218,8 +226,13 @@ main (int argc, char *argv[]) (long long int) ftello (r.file), (long long int) ftello (r.file) + 4); - if (r.compressed && max_cases > 0) - read_compressed_data (&r, max_cases); + if (r.compression == COMP_SIMPLE) + { + if (max_cases > 0) + read_simple_compressed_data (&r, max_cases); + } + else if (r.compression == COMP_ZLIB) + read_zlib_compressed_data (&r); fclose (r.file); } @@ -241,11 +254,16 @@ read_header (struct sfm_reader *r) char creation_date[10]; char creation_time[9]; char file_label[65]; + bool zmagic; read_string (r, rec_type, sizeof rec_type); read_string (r, eye_catcher, sizeof eye_catcher); - if (strcmp ("$FL2", rec_type) 
!= 0) + if (!strcmp ("$FL2", rec_type)) + zmagic = false; + else if (!strcmp ("$FL3", rec_type)) + zmagic = true; + else sys_error (r, "This is not an SPSS system file."); /* Identify integer format. */ @@ -265,7 +283,24 @@ read_header (struct sfm_reader *r) weight_index = read_int (r); ncases = read_int (r); - r->compressed = compressed != 0; + if (!zmagic) + { + if (compressed == 0) + r->compression = COMP_NONE; + else if (compressed == 1) + r->compression = COMP_SIMPLE; + else if (compressed != 0) + sys_error (r, "SAV file header has invalid compression value " + "%"PRId32".", compressed); + } + else + { + if (compressed == 2) + r->compression = COMP_ZLIB; + else + sys_error (r, "ZSAV file header has invalid compression value " + "%"PRId32".", compressed); + } /* Identify floating-point format and obtain compression bias. */ read_bytes (r, raw_bias, sizeof raw_bias); @@ -289,7 +324,12 @@ read_header (struct sfm_reader *r) printf ("File header record:\n"); printf ("\t%17s: %s\n", "Product name", eye_catcher); printf ("\t%17s: %"PRId32"\n", "Layout code", layout_code); - printf ("\t%17s: %"PRId32"\n", "Compressed", compressed); + printf ("\t%17s: %"PRId32" (%s)\n", "Compressed", + compressed, + r->compression == COMP_NONE ? "no compression" + : r->compression == COMP_SIMPLE ? "simple compression" + : r->compression == COMP_ZLIB ? 
"ZLIB compression" + : "<error>"); printf ("\t%17s: %"PRId32"\n", "Weight index", weight_index); printf ("\t%17s: %"PRId32"\n", "Number of cases", ncases); printf ("\t%17s: %g\n", "Compression bias", r->bias); @@ -1170,7 +1210,7 @@ read_variable_attributes (struct sfm_reader *r, size_t size, size_t count) } static void -read_compressed_data (struct sfm_reader *r, int max_cases) +read_simple_compressed_data (struct sfm_reader *r, int max_cases) { enum { N_OPCODES = 8 }; uint8_t opcodes[N_OPCODES]; @@ -1258,6 +1298,87 @@ read_compressed_data (struct sfm_reader *r, int max_cases) } } } + +static void +read_zlib_compressed_data (struct sfm_reader *r) +{ + long long int ofs; + long long int this_ofs, next_ofs, next_len; + long long int bias, zero; + long long int running_uncmp_ofs, running_cmp_ofs; + unsigned int block_size, n_blocks; + unsigned int i; + + read_int (r); + ofs = ftello (r->file); + printf ("\n%08llx: ZLIB compressed data header:\n", ofs); + + this_ofs = read_int64 (r); + next_ofs = read_int64 (r); + next_len = read_int64 (r); + + printf ("\tzheader_ofs: 0x%llx\n", this_ofs); + if (this_ofs != ofs) + printf ("\t\t(Expected 0x%llx.)\n", ofs); + printf ("\tztrailer_ofs: 0x%llx\n", next_ofs); + printf ("\tztrailer_len: %lld\n", next_len); + if (next_len < 24 || next_len % 24) + printf ("\t\t(Trailer length is not a positive multiple of 24.)\n"); + + printf ("\n%08llx: 0x%llx bytes of ZLIB compressed data\n", + ofs + 8 * 3, next_ofs - (ofs + 8 * 3)); + + skip_bytes (r, next_ofs - (ofs + 8 * 3)); + + printf ("\n%08llx: ZLIB trailer fixed header:\n", next_ofs); + bias = read_int64 (r); + zero = read_int64 (r); + block_size = read_int (r); + n_blocks = read_int (r); + printf ("\tbias: %lld\n", bias); + printf ("\tzero: 0x%llx\n", zero); + if (zero != 0) + printf ("\t\t(Expected 0.)\n"); + printf ("\tblock_size: 0x%x\n", block_size); + if (block_size != 0x3ff000) + printf ("\t\t(Expected 0x3ff000.)\n"); + printf ("\tn_blocks: %u\n", n_blocks); + if (n_blocks != 
next_len / 24 - 1) + printf ("\t\t(Expected %llu.)\n", next_len / 24 - 1); + + running_uncmp_ofs = ofs; + running_cmp_ofs = ofs + 24; + for (i = 0; i < n_blocks; i++) + { + long long int blockinfo_ofs = ftello (r->file); + unsigned long long int uncompressed_ofs = read_int64 (r); + unsigned long long int compressed_ofs = read_int64 (r); + unsigned int uncompressed_size = read_int (r); + unsigned int compressed_size = read_int (r); + + printf ("\n%08llx: ZLIB block descriptor %d\n", blockinfo_ofs, i + 1); + + printf ("\tuncompressed_ofs: 0x%llx\n", uncompressed_ofs); + if (i == 0 && uncompressed_ofs != running_uncmp_ofs) + printf ("\t\t(Expected 0x%llx.)\n", ofs); + + printf ("\tcompressed_ofs: 0x%llx\n", compressed_ofs); + if (i == 0 && compressed_ofs != running_cmp_ofs) + printf ("\t\t(Expected 0x%llx.)\n", ofs + 24); + + printf ("\tuncompressed_size: 0x%x\n", uncompressed_size); + if (i < n_blocks - 1 && uncompressed_size != block_size) + printf ("\t\t(Expected 0x%x.)\n", block_size); + + printf ("\tcompressed_size: 0x%x\n", compressed_size); + if (i == n_blocks - 1 && compressed_ofs + compressed_size != next_ofs) + printf ("\t\t(This was expected to be 0x%llx.)\n", + next_ofs - compressed_size); + + running_uncmp_ofs += uncompressed_size; + running_cmp_ofs += compressed_size; + } +} /* Helpers for reading records that consist of structured text strings. */ -- 1.7.10.4 _______________________________________________ pspp-dev mailing list pspp-dev@gnu.org https://lists.gnu.org/mailman/listinfo/pspp-dev