first draft ZSAV implementation

Ben Pfaff Tue, 15 Oct 2013 00:17:31 -0700

I'm working on a ZSAV implementation.  Since users seem eager for this,
here's a first draft.  It reads all the ZSAV files I've encountered so
far.  It needs some tests and probably a writer implementation.  Those
will take a few days.
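
For anyone who wants to sanity-check a ZSAV file by hand before the reader
is finished, the arithmetic on the 24-byte ZLIB data header described in
the documentation part of the patch can be sketched roughly as follows.
This is only an illustration, not code from the patch: the names
zsav_header, zsav_n_blocks, and zsav_header_ok are hypothetical, and the
sketch assumes the three fields have already been converted to host byte
order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory copy of the 24-byte ZLIB data header, already
   converted to host byte order. */
struct zsav_header
  {
    int64_t zheader_ofs;        /* Offset of this header in the file. */
    int64_t ztrailer_ofs;       /* Offset of the ZLIB data trailer. */
    int64_t ztrailer_len;       /* Length of the ZLIB data trailer. */
  };

/* Returns the number of 24-byte block descriptors implied by the trailer
   length (a 24-byte fixed header plus 24 bytes per block), or -1 if the
   length is not a positive multiple of 24. */
static int64_t
zsav_n_blocks (const struct zsav_header *h)
{
  if (h->ztrailer_len < 24 || h->ztrailer_len % 24)
    return -1;
  return h->ztrailer_len / 24 - 1;
}

/* Returns true if H is consistent with the file position POS at which it
   was read and with FILE_SIZE.  (Assumption: a size mismatch is treated
   as an error here; a more lenient reader might only warn.) */
static bool
zsav_header_ok (const struct zsav_header *h, int64_t pos, int64_t file_size)
{
  assert (pos >= 0 && file_size >= 0);
  return (h->zheader_ofs == pos
          && h->ztrailer_ofs >= pos + 24
          && h->ztrailer_ofs <= file_size
          && zsav_n_blocks (h) >= 0
          && file_size - h->ztrailer_ofs == h->ztrailer_len);
}
```

The subtraction in the last test is written that way so that a hostile
ztrailer_len cannot overflow a signed addition.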


--8<--------------------------cut here-------------------------->8--

From: Ben Pfaff <[email protected]>
Date: Tue, 15 Oct 2013 00:14:01 -0700
Subject: [PATCH] Work on ZSAV implementation.

---
 doc/dev/system-file-format.texi         |  182 ++++++++++++++++++---
 src/data/sys-file-private.h             |   14 +-
 src/data/sys-file-reader.c              |  265 ++++++++++++++++++++++++++++---
 src/data/sys-file-reader.h              |   10 +-
 src/language/dictionary/sys-file-info.c |    7 +-
 utilities/pspp-dump-sav.c               |  139 ++++++++++++++--
 6 files changed, 558 insertions(+), 59 deletions(-)

diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi
index f408ff2..fc9a455 100644
--- a/doc/dev/system-file-format.texi
+++ b/doc/dev/system-file-format.texi
@@ -56,6 +56,18 @@ appears in system files only in missing value ranges, which never
 contain SYSMIS.
 @end table
 
+System files may use most character encodings based on an 8-bit unit.
+UTF-16 and UTF-32, based on wider units, appear to be unacceptable.
+@code{rec_type} in the file header record is sufficient to distinguish
+between ASCII and EBCDIC based encodings.  The best way to determine
+the specific encoding in use is to consult the character encoding
+record (@pxref{Character Encoding Record}), if present, and failing
+that the @code{character_code} in the machine integer info record
+(@pxref{Machine Integer Info Record}).  The same encoding should be
+used for the dictionary and the data in the file, although it is
+possible to artificially synthesize files that use different encodings
+(@pxref{Character Encoding Record}).
+
 System files are divided into records, each of which begins with a
 4-byte record type, usually regarded as an @code{int32}.
 
@@ -121,7 +133,7 @@ char                rec_type[4];
 char                prod_name[60];
 int32               layout_code;
 int32               nominal_case_size;
-int32               compressed;
+int32               compression;
 int32               weight_index;
 int32               ncases;
 flt64               bias;
@@ -133,9 +145,17 @@ char                padding[3];
 
 @table @code
 @item char rec_type[4];
-Record type code, set to @samp{$FL2}, that is, either @code{24 46 4c
-32} if the file uses an ASCII-based character encoding, or @code{5b c6
-d3 f2} if the file uses an EBCDIC-based character encoding.
+Record type code, either @samp{$FL2} for system files with
+uncompressed data or data compressed with simple bytecode compression,
+or @samp{$FL3} for system files with ZLIB compressed data.
+
+This is truly a character field that uses the same character encoding as
+other strings.  Thus, in a file with an ASCII-based character encoding
+this field contains @code{24 46 4c 32} or @code{24 46 4c 33}, and in a
+file with an EBCDIC-based encoding this field contains @code{5b c6 d3
+f2}.  (SPSS documentation states that ZLIB-compressed files must be
+encoded in UTF-8, so EBCDIC-based ZLIB-compressed files presumably do
+not exist.)
 
 @item char prod_name[60];
 Product identification string.  This always begins with the characters
@@ -160,7 +180,10 @@ files written by some systems set this value to -1.  In general, it is
 unsafe for systems reading system files to rely upon this value.
 
 @item int32 compressed;
-Set to 1 if the data in the file is compressed, 0 otherwise.
+Set to 0 if the data in the file is not compressed, 1 if the data is
+compressed with simple bytecode compression, 2 if the data is ZLIB
+compressed.  This field has value 2 if and only if @code{rec_type} is
+@samp{$FL3}.
 
 @item int32 weight_index;
 If one of the variables in the data set is used as a weighting
@@ -577,7 +600,8 @@ Floating point representation code.  For IEEE 754 systems this is 1.
 IBM 370 sets this to 2, and DEC VAX E to 3.
 
 @item int32 compression_code;
-Compression code.  Always set to 1.
+Compression code.  Always set to 1, regardless of whether or how the
+file is compressed.
 
 @item int32 endianness;
 Machine endianness.  1 indicates big-endian, 2 indicates little-endian.
@@ -1434,22 +1458,23 @@ Ignored padding.  Should be set to 0.
 @node Data Record
 @section Data Record
 
-Data records must follow all other records in the system file.  There must
-be at least one data record in every system file.
-
-The format of data records varies depending on whether the data is
-compressed.  Regardless, the data is arranged in a series of 8-byte
-elements.
+The data record must follow all other records in the system file.
+Every system file must have a data record that specifies data for at
+least one case.  The format of the data record varies depending on the
+value of @code{compression} in the file header record:
 
-When data is not compressed,
-each element corresponds to
+@table @asis
+@item 0: no compression
+Data is arranged as a series of 8-byte elements.
+Each element corresponds to
 the variable declared in the respective variable record (@pxref{Variable
 Record}).  Numeric values are given in @code{flt64} format; string
 values are literal characters string, padded on the right when
 necessary to fill out 8-byte units.
 
-Compressed data is arranged in the following manner: the first 8 bytes
-in the data section is divided into a series of 1-byte command
+@item 1: bytecode compression
+The first 8 bytes
+of the data record are divided into a series of 1-byte command
 codes.  These codes have meanings as described below:
 
 @table @asis
@@ -1487,8 +1512,125 @@ An 8-byte string value that is all spaces.
 The system-missing value.
 @end table
 
-When the end of the an 8-byte group of command bytes is reached, any
-blocks of non-compressible values indicated by code 253 are skipped,
-and the next element of command bytes is read and interpreted, until
-the end of the file or a code with value 252 is reached.
+The end of the 8-byte group of bytecodes is followed by any 8-byte
+blocks of non-compressible values indicated by code 253.  After that
+follows another 8-byte group of bytecodes, then those bytecodes'
+non-compressible values.  The pattern repeats to the end of the file
+or a code with value 252.
+
+@item 2: ZLIB compression
+The data record consists of the following, in order:
+
+@itemize @bullet
+@item
+ZLIB data header, 24 bytes long.
+
+@item
+One or more variable-length blocks of ZLIB compressed data.
+
+@item
+ZLIB data trailer, with a 24-byte fixed header plus an additional 24
+bytes for each preceding ZLIB compressed data block.
+@end itemize
+
+The ZLIB data header has the following format:
+
+@example
+int64               zheader_ofs;
+int64               ztrailer_ofs;
+int64               ztrailer_len;
+@end example
+
+@table @code
+@item int64 zheader_ofs;
+The offset, in bytes, of the beginning of this structure within the
+system file.
+
+@item int64 ztrailer_ofs;
+The offset, in bytes, of the first byte of the ZLIB data trailer.
+
+@item int64 ztrailer_len;
+The number of bytes in the ZLIB data trailer.  This and the previous
+field sum to the size of the system file in bytes.
+@end table
+
+The data header is followed by @code{(ztrailer_len - 24) / 24} ZLIB
+compressed data blocks.  Each ZLIB compressed data block begins with a
+ZLIB header as specified in RFC@tie{}1950, e.g.@: hex bytes @code{78
+01} (the only header yet observed in practice).  Each block
+decompresses to a fixed number of bytes (in practice only
+@code{0x3ff000}-byte blocks have been observed), except that the last
+block of data may be shorter.  The last ZLIB compressed data block
+ends just before offset @code{ztrailer_ofs}.
+
+The result of ZLIB decompression is bytecode compressed data as
+described above for compression format 1.
+
+The ZLIB data trailer begins with the following 24-byte fixed header:
+
+@example
+int64               int_bias;
+int64               zero;
+int32               block_size;
+int32               n_blocks;
+@end example
+
+@table @code
+@item int64 int_bias;
+The compression bias as a negative integer, e.g.@: if @code{bias} in
+the file header record is 100.0, then @code{int_bias} is @minus{}100
+(this is the only value yet observed in practice).
+
+@item int64 zero;
+Always observed to be zero.
+
+@item int32 block_size;
+The number of bytes in each ZLIB compressed data block, except
+possibly the last, following decompression.  Only @code{0x3ff000} has
+been observed so far.
+
+@item int32 n_blocks;
+The number of ZLIB compressed data blocks, always exactly
+@code{(ztrailer_len - 24) / 24}.
+@end table
+
+The fixed header is followed by @code{n_blocks} 24-byte ZLIB data
+block descriptors, each of which describes the compressed data block
+corresponding to its offset.  Each block descriptor has the following
+format:
+
+@example
+int64               uncompressed_ofs;
+int64               compressed_ofs;
+int32               uncompressed_size;
+int32               compressed_size;
+@end example
+
+@table @code
+@item int64 uncompressed_ofs;
+The offset, in bytes, that this block of data would have in a similar
+system file that uses compression format 1.  This is
+@code{zheader_ofs} in the first block descriptor, and in each
+succeeding block descriptor it is the sum of the previous descriptor's
+@code{uncompressed_ofs} and @code{uncompressed_size}.
+
+@item int64 compressed_ofs;
+The offset, in bytes, of the actual beginning of this compressed data
+block.  This is @code{zheader_ofs + 24} in the first block descriptor,
+and in each succeeding block descriptor it is the sum of the previous
+descriptor's @code{compressed_ofs} and @code{compressed_size}.  The
+final block descriptor's @code{compressed_ofs} and
+@code{compressed_size} sum to @code{ztrailer_ofs}.
+
+@item int32 uncompressed_size;
+The number of bytes in this data block, after decompression.  This is
+@code{block_size} in every data block except the last, which may be
+smaller.
+
+@item int32 compressed_size;
+The number of bytes in this data block, as stored compressed in this
+system file.
+@end table
+@end table
+
 @setfilename ignored
diff --git a/src/data/sys-file-private.h b/src/data/sys-file-private.h
index 21ff8ad..72f1ae3 100644
--- a/src/data/sys-file-private.h
+++ b/src/data/sys-file-private.h
@@ -1,5 +1,5 @@
 /* PSPP - a program for statistical analysis.
-   Copyright (C) 2006-2007, 2009-2012 Free Software Foundation, Inc.
+   Copyright (C) 2006-2007, 2009-2013 Free Software Foundation, Inc.
 
    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
@@ -35,12 +35,14 @@
 
 struct dictionary;
 
-/* Magic numbers.
+/* ASCII magic numbers. */
+#define ASCII_MAGIC  "$FL2"     /* For regular files. */
+#define ASCII_ZMAGIC "$FL3"     /* For ZLIB compressed files. */
 
-   Both of these are actually $FL2 in the respective character set.  The "FL2"
-   part is invariant among national variants of each character set, but "$" has
-   different encodings, so it is safer to write them as hexadecimal. */
-#define ASCII_MAGIC  "\x24\x46\x4c\x32"
+/* EBCDIC magic number, the same as ASCII_MAGIC but encoded in EBCDIC.
+
+   No EBCDIC ZLIB compressed files have been observed, so we do not define
+   EBCDIC_ZMAGIC even though the value is obvious. */
 #define EBCDIC_MAGIC "\x5b\xc6\xd3\xf2"
 
 /* A variable in a system file. */
diff --git a/src/data/sys-file-reader.c b/src/data/sys-file-reader.c
index d553b3a..b6f5acf 100644
--- a/src/data/sys-file-reader.c
+++ b/src/data/sys-file-reader.c
@@ -24,6 +24,8 @@
 #include <inttypes.h>
 #include <setjmp.h>
 #include <stdlib.h>
+#include <sys/stat.h>
+#include <zlib.h>
 
 #include "data/attributes.h"
 #include "data/case.h"
@@ -57,6 +59,7 @@
 #include "gl/minmax.h"
 #include "gl/unlocked-io.h"
 #include "gl/xalloc.h"
+#include "gl/xalloc-oversized.h"
 #include "gl/xsize.h"
 
 #include "gettext.h"
@@ -173,11 +176,21 @@ struct sfm_reader
     const char *encoding;       /* String encoding. */
 
     /* Decompression. */
-    bool compressed;           /* File is compressed? */
+    enum sfm_compression compression;
     double bias;               /* Compression bias, usually 100.0. */
     uint8_t opcodes[8];         /* Current block of opcodes. */
     size_t opcode_idx;          /* Next opcode to interpret, 8 if none left. */
     bool corruption_warning;    /* Warned about possible corruption? */
+
+    /* ZLIB decompression. */
+    long long int ztrailer_ofs; /* Offset of ZLIB trailer at end of file. */
+#define ZIN_BUF_SIZE  4096
+    uint8_t *zin_buf;           /* Inflation input buffer. */
+#define ZOUT_BUF_SIZE 16384
+    uint8_t *zout_buf;          /* Inflation output buffer. */
+    unsigned int zout_end;      /* Number of bytes of data in zout_buf. */
+    unsigned int zout_pos;      /* First unconsumed byte in zout_buf. */
+    z_stream zstream;           /* ZLIB inflater. */
   };
 
 static const struct casereader_class sys_file_casereader_class;
@@ -200,10 +213,19 @@ static void sys_error (struct sfm_reader *, off_t, const char *, ...)
 static void read_bytes (struct sfm_reader *, void *, size_t);
 static bool try_read_bytes (struct sfm_reader *, void *, size_t);
 static int read_int (struct sfm_reader *);
-static double read_float (struct sfm_reader *);
+static long long int read_int64 (struct sfm_reader *);
 static void read_string (struct sfm_reader *, char *, size_t);
 static void skip_bytes (struct sfm_reader *, size_t);
 
+/* ZLIB compressed data handling. */
+static void read_zheader (struct sfm_reader *);
+static void open_zstream (struct sfm_reader *);
+static void close_zstream (struct sfm_reader *);
+static bool read_bytes_zlib (struct sfm_reader *, void *, size_t);
+static void read_compressed_bytes (struct sfm_reader *, void *, size_t);
+static bool try_read_compressed_bytes (struct sfm_reader *, void *, size_t);
+static double read_compressed_float (struct sfm_reader *);
+
 static char *fix_line_ends (const char *);
 
 static int parse_int (struct sfm_reader *, const void *data, size_t ofs);
@@ -367,6 +389,7 @@ sfm_open_reader (struct file_handle *fh, const char *volatile encoding,
   r->error = false;
   r->opcode_idx = sizeof r->opcodes;
   r->corruption_warning = false;
+  r->zin_buf = r->zout_buf = NULL;
 
   info = infop ? infop : xmalloc (sizeof *info);
   memset (info, 0, sizeof *info);
@@ -472,6 +495,9 @@ sfm_open_reader (struct file_handle *fh, const char *volatile encoding,
         }
     }
 
+  if (r->compression == SFM_COMP_ZLIB)
+    read_zheader (r);
+
   /* Now actually parse what we read.
 
      First, figure out the correct character encoding, because this determines
@@ -646,7 +672,9 @@ sfm_detect (FILE *file)
     return false;
   magic[4] = '\0';
 
-  return !strcmp (ASCII_MAGIC, magic) || !strcmp (EBCDIC_MAGIC, magic);
+  return (!strcmp (ASCII_MAGIC, magic)
+          || !strcmp (ASCII_ZMAGIC, magic)
+          || !strcmp (EBCDIC_MAGIC, magic));
 }
 
 /* Reads the global header of the system file.  Initializes *HEADER and *INFO,
@@ -658,12 +686,18 @@ read_header (struct sfm_reader *r, struct sfm_read_info *info,
 {
   uint8_t raw_layout_code[4];
   uint8_t raw_bias[8];
+  int compressed;
+  bool zmagic;
 
   read_string (r, header->magic, sizeof header->magic);
   read_string (r, header->eye_catcher, sizeof header->eye_catcher);
 
-  if (strcmp (ASCII_MAGIC, header->magic)
-      && strcmp (EBCDIC_MAGIC, header->magic))
+  if (!strcmp (ASCII_MAGIC, header->magic)
+      || !strcmp (EBCDIC_MAGIC, header->magic))
+    zmagic = false;
+  else if (!strcmp (ASCII_ZMAGIC, header->magic))
+    zmagic = true;
+  else
     sys_error (r, 0, _("This is not an SPSS system file."));
 
   /* Identify integer format. */
@@ -681,7 +715,25 @@ read_header (struct sfm_reader *r, struct sfm_read_info *info,
       || header->nominal_case_size > INT_MAX / 16)
     header->nominal_case_size = -1;
 
-  r->compressed = read_int (r) != 0;
+  compressed = read_int (r);
+  if (!zmagic)
+    {
+      if (compressed == 0)
+        r->compression = SFM_COMP_NONE;
+      else if (compressed == 1)
+        r->compression = SFM_COMP_SIMPLE;
+      else
+        sys_error (r, 0, _("System file header has invalid compression "
+                           "value %d."), compressed);
+    }
+  else
+    {
+      if (compressed == 2)
+        r->compression = SFM_COMP_ZLIB;
+      else
+        sys_error (r, 0, _("ZLIB-compressed system file header has invalid "
+                           "compression value %d."), compressed);
+    }
 
   header->weight_idx = read_int (r);
 
@@ -723,7 +775,7 @@ read_header (struct sfm_reader *r, struct sfm_read_info *info,
 
   info->integer_format = r->integer_format;
   info->float_format = r->float_format;
-  info->compressed = r->compressed;
+  info->compression = r->compression;
   info->case_cnt = r->case_cnt;
 }
 
@@ -2289,7 +2341,7 @@ read_error (struct casereader *r, const struct sfm_reader *sfm)
 static bool
 read_case_number (struct sfm_reader *r, double *d)
 {
-  if (!r->compressed)
+  if (r->compression == SFM_COMP_NONE)
     {
       uint8_t number[8];
       if (!try_read_bytes (r, number, sizeof number))
@@ -2339,13 +2391,13 @@ read_case_string (struct sfm_reader *r, uint8_t *s, size_t length)
 static int
 read_opcode (struct sfm_reader *r)
 {
-  assert (r->compressed);
+  assert (r->compression != SFM_COMP_NONE);
   for (;;)
     {
       int opcode;
       if (r->opcode_idx >= sizeof r->opcodes)
         {
-          if (!try_read_bytes (r, r->opcodes, sizeof r->opcodes))
+          if (!try_read_compressed_bytes (r, r->opcodes, sizeof r->opcodes))
             return -1;
           r->opcode_idx = 0;
         }
@@ -2370,7 +2422,7 @@ read_compressed_number (struct sfm_reader *r, double *d)
       return false;
 
     case 253:
-      *d = read_float (r);
+      *d = read_compressed_float (r);
       break;
 
     case 254:
@@ -2411,7 +2463,7 @@ read_compressed_string (struct sfm_reader *r, uint8_t *dst)
       return false;
 
     case 253:
-      read_bytes (r, dst, 8);
+      read_compressed_bytes (r, dst, 8);
       break;
 
     case 254:
@@ -2453,7 +2505,7 @@ static bool
 read_whole_strings (struct sfm_reader *r, uint8_t *s, size_t length)
 {
   assert (length % 8 == 0);
-  if (!r->compressed)
+  if (r->compression == SFM_COMP_NONE)
     return try_read_bytes (r, s, length);
   else
     {
@@ -2820,14 +2872,14 @@ read_int (struct sfm_reader *r)
   return integer_get (r->integer_format, integer, sizeof integer);
 }
 
-/* Reads a 64-bit floating-point number from R and returns its
-   value in host format. */
-static double
-read_float (struct sfm_reader *r)
+/* Reads a 64-bit signed integer from R and returns its value in
+   host format. */
+static long long int
+read_int64 (struct sfm_reader *r)
 {
-  uint8_t number[8];
-  read_bytes (r, number, sizeof number);
-  return float_get_double (r->float_format, number);
+  uint8_t integer[8];
+  read_bytes (r, integer, sizeof integer);
+  return integer_get (r->integer_format, integer, sizeof integer);
 }
 
 static int
@@ -2894,6 +2946,179 @@ fix_line_ends (const char *s)
   return dst;
 }
 
+static void *
+zalloc (voidpf pool_, uInt items, uInt size)
+{
+  struct pool *pool = pool_;
+
+  return (!size || xalloc_oversized (items, size)
+          ? Z_NULL
+          : pool_malloc (pool, items * size));
+}
+
+static void
+zfree (voidpf pool_, voidpf address)
+{
+  struct pool *pool = pool_;
+
+  pool_free (pool, address);
+}
+
+static void
+read_zheader (struct sfm_reader *r)
+{
+  off_t pos = r->pos;
+  long long int zheader_ofs = read_int64 (r);
+  long long int ztrailer_ofs = read_int64 (r);
+  long long int ztrailer_len = read_int64 (r);
+  struct stat s;
+
+  if (zheader_ofs != pos)
+    sys_error (r, pos, _("Wrong ZLIB data header offset 0x%llx."),
+               zheader_ofs);
+
+  if (ztrailer_ofs < r->pos)
+    sys_error (r, pos, _("Impossible ZLIB trailer offset 0x%llx."),
+               ztrailer_ofs);
+
+  if (ztrailer_len < 24 || ztrailer_len % 24)
+    sys_error (r, pos, _("Invalid ZLIB trailer length %lld."), ztrailer_len);
+
+  if (!fstat (fileno (r->file), &s)
+      && ztrailer_ofs + ztrailer_len != s.st_size)
+    sys_warn (r, pos,
+              _("End of ZLIB trailer (0x%llx) is not file size (0x%llx)."),
+              ztrailer_ofs + ztrailer_len, (long long int) s.st_size);
+
+  r->ztrailer_ofs = ztrailer_ofs;
+
+  if (r->zin_buf == NULL)
+    {
+      r->zin_buf = pool_malloc (r->pool, ZIN_BUF_SIZE);
+      r->zout_buf = pool_malloc (r->pool, ZOUT_BUF_SIZE);
+      r->zstream.next_in = NULL;
+      r->zstream.avail_in = 0;
+    }
+
+  r->zstream.zalloc = zalloc;
+  r->zstream.zfree = zfree;
+  r->zstream.opaque = r->pool;
+
+  open_zstream (r);
+}
+
+static void
+open_zstream (struct sfm_reader *r)
+{
+  int error;
+
+  r->zout_pos = r->zout_end = 0;
+  error = inflateInit (&r->zstream);
+  if (error != Z_OK)
+    sys_error (r, r->pos, _("ZLIB initialization failed (%s)."),
+               r->zstream.msg);
+}
+
+static void
+close_zstream (struct sfm_reader *r)
+{
+  int error;
+
+  error = inflateEnd (&r->zstream);
+  if (error != Z_OK)
+    sys_error (r, r->pos, _("Inconsistency at end of ZLIB stream (%s)."),
+               r->zstream.msg);
+}
+
+static bool
+read_bytes_zlib (struct sfm_reader *r, void *buf_, size_t byte_cnt)
+{
+  uint8_t *buf = buf_;
+
+  if (byte_cnt == 0)
+    return true;
+
+  for (;;)
+    {
+      int error;
+
+      /* Use already inflated data if there is any. */
+      if (r->zout_pos < r->zout_end)
+        {
+          unsigned int n = MIN (byte_cnt, r->zout_end - r->zout_pos);
+          memcpy (buf, &r->zout_buf[r->zout_pos], n);
+          r->zout_pos += n;
+          byte_cnt -= n;
+          buf += n;
+
+          if (byte_cnt == 0)
+            return true;
+        }
+
+      /* We need to inflate some more data.
+         Get some more input data if we don't have any. */
+      if (r->zstream.avail_in == 0)
+        {
+          unsigned int n = MIN (ZIN_BUF_SIZE, r->ztrailer_ofs - r->pos);
+          if (n == 0 || !try_read_bytes (r, r->zin_buf, n))
+            return false;
+          r->zstream.avail_in = n;
+          r->zstream.next_in = r->zin_buf;
+        }
+
+      /* Inflate the (remaining) input data. */
+      r->zstream.avail_out = ZOUT_BUF_SIZE;
+      r->zstream.next_out = r->zout_buf;
+      error = inflate (&r->zstream, Z_SYNC_FLUSH);
+      r->zout_pos = 0;
+      r->zout_end = r->zstream.next_out - r->zout_buf;
+      if (r->zout_end == 0)
+        {
+          if (error == Z_STREAM_END)
+            {
+              close_zstream (r);
+              open_zstream (r);
+            }
+          else
+            sys_error (r, r->pos, _("ZLIB stream inconsistency (%s)."),
+                       r->zstream.msg);
+        }
+      else
+        {
+          /* Process the output data and ignore 'error' for now.  ZLIB will
+             present it to us again on the next inflate() call. */
+        }
+    }
+}
+
+static void
+read_compressed_bytes (struct sfm_reader *r, void *buf, size_t byte_cnt)
+{
+  if (r->compression == SFM_COMP_SIMPLE)
+    read_bytes (r, buf, byte_cnt);
+  else if (!read_bytes_zlib (r, buf, byte_cnt))
+    sys_error (r, r->pos, _("Unexpected end of ZLIB compressed data."));
+}
+
+static bool
+try_read_compressed_bytes (struct sfm_reader *r, void *buf, size_t byte_cnt)
+{
+  if (r->compression == SFM_COMP_SIMPLE)
+    return try_read_bytes (r, buf, byte_cnt);
+  else
+    return read_bytes_zlib (r, buf, byte_cnt);
+}
+
+/* Reads a 64-bit floating-point number from R and returns its
+   value in host format. */
+static double
+read_compressed_float (struct sfm_reader *r)
+{
+  uint8_t number[8];
+  read_compressed_bytes (r, number, sizeof number);
+  return float_get_double (r->float_format, number);
+}
+
 static const struct casereader_class sys_file_casereader_class =
   {
     sys_file_casereader_read,
diff --git a/src/data/sys-file-reader.h b/src/data/sys-file-reader.h
index 037d33a..52457a0 100644
--- a/src/data/sys-file-reader.h
+++ b/src/data/sys-file-reader.h
@@ -26,6 +26,14 @@
 
 /* Reading system files. */
 
+/* System file compression format. */
+enum sfm_compression
+  {
+    SFM_COMP_NONE,              /* No compression. */
+    SFM_COMP_SIMPLE,            /* Bytecode compression of integer values. */
+    SFM_COMP_ZLIB               /* ZLIB "deflate" compression. */
+  };
+
 /* System file info that doesn't fit in struct dictionary.
 
    The strings in this structure are encoded in UTF-8.  (They are normally in
@@ -36,7 +44,7 @@ struct sfm_read_info
     char *creation_time;       /* "hh:mm:ss". */
     enum integer_format integer_format;
     enum float_format float_format;
-    bool compressed;           /* 0=no, 1=yes. */
+    enum sfm_compression compression;
     casenumber case_cnt;        /* -1 if unknown. */
     char *product;             /* Product name. */
     char *product_ext;          /* Extra product info. */
diff --git a/src/language/dictionary/sys-file-info.c b/src/language/dictionary/sys-file-info.c
index 3327a2c..c7f326f 100644
--- a/src/language/dictionary/sys-file-info.c
+++ b/src/language/dictionary/sys-file-info.c
@@ -150,10 +150,11 @@ cmd_sysfile_info (struct lexer *lexer, struct dataset *ds UNUSED)
                ? var_get_name (weight_var) : _("Not weighted.")));
   }
 
-  tab_text (t, 0, r, TAB_LEFT, _("Mode:"));
+  tab_text (t, 0, r, TAB_LEFT, _("Compression:"));
   tab_text_format (t, 1, r++, TAB_LEFT,
-                   _("Compression %s."), info.compressed ? _("on") : _("off"));
-
+                   info.compression == SFM_COMP_NONE ? _("None")
+                   : info.compression == SFM_COMP_SIMPLE ? "SAV"
+                   : "ZSAV");
 
   tab_text (t, 0, r, TAB_LEFT, _("Charset:"));
   tab_text (t, 1, r++, TAB_LEFT, dict_get_encoding (d));
diff --git a/utilities/pspp-dump-sav.c b/utilities/pspp-dump-sav.c
index c6b5823..8eaf836 100644
--- a/utilities/pspp-dump-sav.c
+++ b/utilities/pspp-dump-sav.c
@@ -39,6 +39,13 @@
 
 #define ID_MAX_LEN 64
 
+enum compression
+  {
+    COMP_NONE,
+    COMP_SIMPLE,
+    COMP_ZLIB
+  };
+
 struct sfm_reader
   {
     const char *file_name;
@@ -52,7 +59,7 @@ struct sfm_reader
     enum integer_format integer_format;
     enum float_format float_format;
 
-    bool compressed;
+    enum compression compression;
     double bias;
   };
 
@@ -87,7 +94,8 @@ static void read_long_string_missing_values (struct sfm_reader *r,
                                              size_t size, size_t count);
 static void read_unknown_extension (struct sfm_reader *,
                                     size_t size, size_t count);
-static void read_compressed_data (struct sfm_reader *, int max_cases);
+static void read_simple_compressed_data (struct sfm_reader *, int max_cases);
+static void read_zlib_compressed_data (struct sfm_reader *);
 
 static struct text_record *open_text_record (
   struct sfm_reader *, size_t size);
@@ -180,7 +188,7 @@ main (int argc, char *argv[])
       r.n_var_widths = 0;
       r.allocated_var_widths = 0;
       r.var_widths = 0;
-      r.compressed = false;
+      r.compression = COMP_NONE;
 
       if (argc - optind > 1)
         printf ("Reading \"%s\":\n", r.file_name);
@@ -218,8 +226,13 @@ main (int argc, char *argv[])
               (long long int) ftello (r.file),
               (long long int) ftello (r.file) + 4);
 
-      if (r.compressed && max_cases > 0)
-        read_compressed_data (&r, max_cases);
+      if (r.compression == COMP_SIMPLE)
+        {
+          if (max_cases > 0)
+            read_simple_compressed_data (&r, max_cases);
+        }
+      else if (r.compression == COMP_ZLIB)
+        read_zlib_compressed_data (&r);
 
       fclose (r.file);
     }
@@ -241,11 +254,16 @@ read_header (struct sfm_reader *r)
   char creation_date[10];
   char creation_time[9];
   char file_label[65];
+  bool zmagic;
 
   read_string (r, rec_type, sizeof rec_type);
   read_string (r, eye_catcher, sizeof eye_catcher);
 
-  if (strcmp ("$FL2", rec_type) != 0)
+  if (!strcmp ("$FL2", rec_type))
+    zmagic = false;
+  else if (!strcmp ("$FL3", rec_type))
+    zmagic = true;
+  else
     sys_error (r, "This is not an SPSS system file.");
 
   /* Identify integer format. */
@@ -265,7 +283,24 @@ read_header (struct sfm_reader *r)
   weight_index = read_int (r);
   ncases = read_int (r);
 
-  r->compressed = compressed != 0;
+  if (!zmagic)
+    {
+      if (compressed == 0)
+        r->compression = COMP_NONE;
+      else if (compressed == 1)
+        r->compression = COMP_SIMPLE;
+      else
+        sys_error (r, "SAV file header has invalid compression value "
+                   "%"PRId32".", compressed);
+    }
+  else
+    {
+      if (compressed == 2)
+        r->compression = COMP_ZLIB;
+      else
+        sys_error (r, "ZSAV file header has invalid compression value "
+                   "%"PRId32".", compressed);
+    }
 
   /* Identify floating-point format and obtain compression bias. */
   read_bytes (r, raw_bias, sizeof raw_bias);
@@ -289,7 +324,12 @@ read_header (struct sfm_reader *r)
   printf ("File header record:\n");
   printf ("\t%17s: %s\n", "Product name", eye_catcher);
   printf ("\t%17s: %"PRId32"\n", "Layout code", layout_code);
-  printf ("\t%17s: %"PRId32"\n", "Compressed", compressed);
+  printf ("\t%17s: %"PRId32" (%s)\n", "Compressed",
+          compressed,
+          r->compression == COMP_NONE ? "no compression"
+          : r->compression == COMP_SIMPLE ? "simple compression"
+          : r->compression == COMP_ZLIB ? "ZLIB compression"
+          : "<error>");
   printf ("\t%17s: %"PRId32"\n", "Weight index", weight_index);
   printf ("\t%17s: %"PRId32"\n", "Number of cases", ncases);
   printf ("\t%17s: %g\n", "Compression bias", r->bias);
@@ -1170,7 +1210,7 @@ read_variable_attributes (struct sfm_reader *r, size_t size, size_t count)
 }
 
 static void
-read_compressed_data (struct sfm_reader *r, int max_cases)
+read_simple_compressed_data (struct sfm_reader *r, int max_cases)
 {
   enum { N_OPCODES = 8 };
   uint8_t opcodes[N_OPCODES];
@@ -1258,6 +1298,87 @@ read_compressed_data (struct sfm_reader *r, int max_cases)
         }
     }
 }
+
+static void
+read_zlib_compressed_data (struct sfm_reader *r)
+{
+  long long int ofs;
+  long long int this_ofs, next_ofs, next_len;
+  long long int bias, zero;
+  long long int running_uncmp_ofs, running_cmp_ofs;
+  unsigned int block_size, n_blocks;
+  unsigned int i;
+
+  read_int (r);
+  ofs = ftello (r->file);
+  printf ("\n%08llx: ZLIB compressed data header:\n", ofs);
+
+  this_ofs = read_int64 (r);
+  next_ofs = read_int64 (r);
+  next_len = read_int64 (r);
+
+  printf ("\tzheader_ofs: 0x%llx\n", this_ofs);
+  if (this_ofs != ofs)
+    printf ("\t\t(Expected 0x%llx.)\n", ofs);
+  printf ("\tztrailer_ofs: 0x%llx\n", next_ofs);
+  printf ("\tztrailer_len: %lld\n", next_len);
+  if (next_len < 24 || next_len % 24)
+    printf ("\t\t(Trailer length is not a positive multiple of 24.)\n");
+
+  printf ("\n%08llx: 0x%llx bytes of ZLIB compressed data\n",
+          ofs + 8 * 3, next_ofs - (ofs + 8 * 3));
+
+  skip_bytes (r, next_ofs - (ofs + 8 * 3));
+
+  printf ("\n%08llx: ZLIB trailer fixed header:\n", next_ofs);
+  bias = read_int64 (r);
+  zero = read_int64 (r);
+  block_size = read_int (r);
+  n_blocks = read_int (r);
+  printf ("\tbias: %lld\n", bias);
+  printf ("\tzero: 0x%llx\n", zero);
+  if (zero != 0)
+    printf ("\t\t(Expected 0.)\n");
+  printf ("\tblock_size: 0x%x\n", block_size);
+  if (block_size != 0x3ff000)
+    printf ("\t\t(Expected 0x3ff000.)\n");
+  printf ("\tn_blocks: %u\n", n_blocks);
+  if (n_blocks != next_len / 24 - 1)
+    printf ("\t\t(Expected %llu.)\n", next_len / 24 - 1);
+
+  running_uncmp_ofs = ofs;
+  running_cmp_ofs = ofs + 24;
+  for (i = 0; i < n_blocks; i++)
+    {
+      long long int blockinfo_ofs = ftello (r->file);
+      unsigned long long int uncompressed_ofs = read_int64 (r);
+      unsigned long long int compressed_ofs = read_int64 (r);
+      unsigned int uncompressed_size = read_int (r);
+      unsigned int compressed_size = read_int (r);
+
+      printf ("\n%08llx: ZLIB block descriptor %u\n", blockinfo_ofs, i + 1);
+
+      printf ("\tuncompressed_ofs: 0x%llx\n", uncompressed_ofs);
+      if (i == 0 && uncompressed_ofs != running_uncmp_ofs)
+        printf ("\t\t(Expected 0x%llx.)\n", ofs);
+
+      printf ("\tcompressed_ofs: 0x%llx\n", compressed_ofs);
+      if (i == 0 && compressed_ofs != running_cmp_ofs)
+        printf ("\t\t(Expected 0x%llx.)\n", ofs + 24);
+
+      printf ("\tuncompressed_size: 0x%x\n", uncompressed_size);
+      if (i < n_blocks - 1 && uncompressed_size != block_size)
+        printf ("\t\t(Expected 0x%x.)\n", block_size);
+
+      printf ("\tcompressed_size: 0x%x\n", compressed_size);
+      if (i == n_blocks - 1 && compressed_ofs + compressed_size != next_ofs)
+        printf ("\t\t(This was expected to be 0x%llx.)\n",
+                next_ofs - compressed_size);
+
+      running_uncmp_ofs += uncompressed_size;
+      running_cmp_ofs += compressed_size;
+    }
+}
 
 /* Helpers for reading records that consist of structured text
    strings. */
-- 
1.7.10.4
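
The block descriptor rules in the documentation hunk (each descriptor's
offsets are the running sums of the sizes before it, every block but the
last decompresses to block_size bytes, and the last compressed block ends
at ztrailer_ofs) can be checked for every descriptor with something like
the sketch below. It is illustrative only and not part of the patch: the
names zsav_block_desc and zsav_blocks_ok are hypothetical, the fields are
assumed to be in host byte order already, and rejecting non-positive sizes
is an extra assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory copy of one 24-byte ZLIB data block descriptor. */
struct zsav_block_desc
  {
    int64_t uncompressed_ofs;
    int64_t compressed_ofs;
    int32_t uncompressed_size;
    int32_t compressed_size;
  };

/* Returns true if the N descriptors in D chain together as documented,
   starting from ZHEADER_OFS and ending at ZTRAILER_OFS. */
static bool
zsav_blocks_ok (const struct zsav_block_desc *d, size_t n,
                int64_t zheader_ofs, int64_t ztrailer_ofs,
                int32_t block_size)
{
  int64_t uncmp_ofs = zheader_ofs;      /* Expected uncompressed_ofs. */
  int64_t cmp_ofs = zheader_ofs + 24;   /* Expected compressed_ofs. */
  size_t i;

  assert (d != NULL || n == 0);
  for (i = 0; i < n; i++)
    {
      if (d[i].uncompressed_ofs != uncmp_ofs
          || d[i].compressed_ofs != cmp_ofs)
        return false;

      /* Assumption: empty or negative-sized blocks are invalid. */
      if (d[i].uncompressed_size <= 0 || d[i].compressed_size <= 0)
        return false;

      /* Only the last block may be shorter than BLOCK_SIZE. */
      if (i + 1 < n
          ? d[i].uncompressed_size != block_size
          : d[i].uncompressed_size > block_size)
        return false;

      uncmp_ofs += d[i].uncompressed_size;
      cmp_ofs += d[i].compressed_size;
    }

  /* At least one block is required, and the last compressed block must
     end exactly where the trailer begins. */
  return n > 0 && cmp_ofs == ztrailer_ofs;
}
```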


_______________________________________________
pspp-dev mailing list
[email protected]
https://lists.gnu.org/mailman/listinfo/pspp-dev
