Re: [Rd] inflate zlib compressed data using base R or CRAN package?

Simon Urbanek Fri, 29 Nov 2013 07:20:07 -0800

On Nov 29, 2013, at 4:37 AM, Henrik Bengtsson <h...@biostat.ucsf.edu> wrote:


> On Thu, Nov 28, 2013 at 4:48 PM, Simon Urbanek
> <simon.urba...@r-project.org> wrote:
>> On Nov 27, 2013, at 8:30 PM, Murray Stokely <mur...@stokely.org> wrote:
>> 
>>> I think none of these examples describe a zlib compressed data block inside 
>>> a binary file that the OP asked about, as all of your examples are e.g. 
>>> prepending gzip or zip headers.
>>> 
>>> Greg, is memDecompress what you are looking for?
>>> 
>> 
>> I think so.
>> 
>> But this is interesting: I think the documentation of 
>> memCompress/memDecompress is not quite correct and the parameters are 
>> misleading. Although it does mention the gzip headers, it is incorrect since 
>> zlib format is not a subset of the gzip format (albeit they use the same 
>> compression method), so you cannot extract gzip content using zlib 
>> decompression - you’ll get "internal error -3 in memDecompress(2)" if you try 
>> it, since it expects the zlib header, which is different from the gzip one.
> 
> Interesting.  Just to make sure: are you 100% certain about this?

Yes, see below.


> From http://svn.r-project.org/R/trunk/src/main/connections.c:
> 
>    case 2: /* gzip */
>    {
>       uLong inlen = LENGTH(from), outlen = 3*inlen;
>       int res;
>       Bytef *buf, *p = (Bytef *)RAW(from);
>       /* we check for a file header */
>       if (p[0] == 0x1f && p[1] == 0x8b) { p += 2; inlen -= 2; }
>       while(1) {
>           buf = (Bytef *) R_alloc(outlen, sizeof(Bytef));
>           res = uncompress(buf, &outlen, p, inlen);
>           if(res == Z_BUF_ERROR) { outlen *= 2; continue; }
>           if(res == Z_OK) break;
>           error("internal error %d in memDecompress(%d)", res, type);
>       }
>       ans = allocVector(RAWSXP, outlen);
>       memcpy(RAW(ans), buf, outlen);
>       break;
>    }
> 
> That code looks for the 0x1F 0x8B magic number, which is the one for
> gzip [http://www.gzip.org/zlib/rfc-gzip.html#header-trailer].  Or are
> you saying that that if statement is incorrect?  (Disclaimer: I don't
> know much about gzip/zlib, but I happen to recognize that gzip magic
> number.)
> 

The above assumes that zlib is a subset of gzip, which is *not* true - that was 
the point I was making. zlib has *different* headers than gzip, not just fewer 
bytes. gzip has lots of other things in the header, and the two even use 
different checksum methods (CRC-32 for gzip, Adler-32 for zlib).
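
As a rough sketch of the header difference, the first two bytes are enough to 
tell the two framings apart. The helper name guess_framing is made up for this 
example (it is not part of base R), and the zlib test is my reading of RFC 
1950: the low nibble of the first byte is 8 (deflate), and the two header 
bytes, read as a big-endian 16-bit number, are a multiple of 31.

```r
# Hypothetical helper (not in base R): guess the framing of a raw vector
# from its first two bytes.
guess_framing <- function(x) {
  stopifnot(is.raw(x), length(x) >= 2)
  b <- as.integer(x[1:2])
  # gzip (RFC 1952): fixed magic number 0x1f 0x8b
  if (b[1] == 0x1f && b[2] == 0x8b) return("gzip")
  # zlib (RFC 1950): method nibble 8 (deflate) plus the mod-31 header check
  if (b[1] %% 16 == 8 && (b[1] * 256 + b[2]) %% 31 == 0) return("zlib")
  "unknown"
}

guess_framing(as.raw(c(0x1f, 0x8b, 0x08)))                   # gzip magic
guess_framing(memCompress(charToRaw("1234"), type = "gzip")) # zlib framing
```

This only inspects the header; it does not validate the rest of the stream.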

To illustrate:

> writeBin(charToRaw("1234"), f<-gzfile("test.gz","wb"))
> close(f)
> readBin("test.gz",raw(),100)
 [1] 1f 8b 08 00 00 00 00 00 00 03 33 34 32 36 01
[16] 00 a3 e0 e3 9b 04 00 00 00
> memCompress("1234")
 [1] 78 9c 33 34 32 36 01 00 01 f8 00 cb

As you can see, gzip uses a different header: it starts with the magic bytes 
0x1f 0x8b, followed by other fields such as the modification time, so the 
compressed payload starts at byte 11. Its trailer is 8 bytes wide: a 32-bit 
CRC followed by the 32-bit uncompressed length. In contrast, zlib has no magic 
number, just a two-byte header followed by the payload (starting at byte 3) 
and a 32-bit Adler-32 checksum. So the two are entirely incompatible - you 
cannot decompress the gzip format with a zlib parser or vice versa. The payload 
is the same, but the headers and trailers are entirely different. That's why 
Greg was specifically asking about zlib, which does *not* mean gzip.
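
For anyone who wants to check this by hand, here is a small sketch that takes 
the two byte dumps from the session output above and pulls them apart. The 
offsets assume a gzip header with no optional fields (10 bytes), which is what 
the dump shows; adler32 is a helper written for this example, not a base R 
function, and follows the Adler-32 definition in RFC 1950.

```r
# Bytes copied from the session output above.
gz <- as.raw(c(0x1f, 0x8b, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03,
               0x33, 0x34, 0x32, 0x36, 0x01, 0x00,
               0xa3, 0xe0, 0xe3, 0x9b, 0x04, 0x00, 0x00, 0x00))
z  <- as.raw(c(0x78, 0x9c,
               0x33, 0x34, 0x32, 0x36, 0x01, 0x00,
               0x01, 0xf8, 0x00, 0xcb))

# Strip the 10-byte header / 8-byte trailer (gzip) and the
# 2-byte header / 4-byte trailer (zlib): the deflate payload is the same.
gz_payload <- gz[11:(length(gz) - 8)]
z_payload  <- z[3:(length(z) - 4)]
identical(gz_payload, z_payload)

# Illustrative Adler-32, returned as 4 big-endian bytes.
adler32 <- function(x) {
  a <- 1; b <- 0
  for (v in as.integer(x)) {
    a <- (a + v) %% 65521
    b <- (b + a) %% 65521
  }
  as.raw(c(b %/% 256, b %% 256, a %/% 256, a %% 256))
}

# The zlib trailer is the Adler-32 of the uncompressed data.
identical(z[9:12], adler32(charToRaw("1234")))

# The last 4 gzip bytes are the uncompressed length, little-endian.
isize <- sum(as.integer(gz[21:24]) * 256^(0:3))

# The zlib stream is what memDecompress(type = "gzip") expects.
memDecompress(z, type = "gzip", asChar = TRUE)
```

The four gzip bytes before the length field (a3 e0 e3 9b) are the CRC-32, 
which this sketch does not recompute.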

Cheers,
Simon




> /Henrik
> 
>> So "gzip" in type is a misnomer - it should say "zlib" since it can neither 
>> read nor write the gzip format. Also the documentation should make it clear 
>> since it’s pointless to try to use this on gzip contents. The better 
>> alternative would be to support both gzip and zlib since R can deal with 
>> both - the issue is that it will break code that used type="gzip" explicitly 
>> to mean "zlib", so I’m not sure there is a good way out.
>> 
>> Cheers,
>> Simon
>> 
>> 
>>> 
>>> On Wed, Nov 27, 2013 at 5:22 PM, Dirk Eddelbuettel <e...@debian.org> wrote:
>>> 
>>>> 
>>>> On 27 November 2013 at 18:38, Dirk Eddelbuettel wrote:
>>>> |
>>>> | On 27 November 2013 at 23:49, Dr Gregory Jefferis wrote:
>>>> | | I have a binary file type that includes a zlib compressed data block (ie
>>>> | | not gzip). Is anyone aware of a way using base R or a CRAN package to
>>>> | | decompress this kind of data (from disk or memory). So far I have found
>>>> | | Rcompression::decompress on omegahat, but I would prefer to keep
>>>> | | dependencies on CRAN (or bioconductor). I am also trying to avoid
>>>> | | writing yet another C level interface to part of zlib.
>>>> |
>>>> | Unless I am missing something, this is in base R; see help(connections).
>>>> |
>>>> | Here is a quick demo:
>>>> |
>>>> | R> write.csv(trees, file="/tmp/trees.csv")    # data we all have
>>>> | R> system("gzip -v /tmp/trees.csv")           # as I am lazy here
>>>> | /tmp/trees.csv:        50.5% -- replaced with /tmp/trees.csv.gz
>>>> | R> read.csv(gzfile("/tmp/trees.csv.gz"))      # works out of the box
>>>> 
>>>> Oh, and in case you meant zip file containing a data file, that also works.
>>>> 
>>>> First converting what I did last
>>>> 
>>>> edd@max:/tmp$ gunzip trees.csv.gz
>>>> edd@max:/tmp$ zip trees.zip trees.csv
>>>> adding: trees.csv (deflated 50%)
>>>> edd@max:/tmp$
>>>> 
>>>> Then reading the csv from inside the zip file:
>>>> 
>>>> R> read.csv(unz("/tmp/trees.zip", "trees.csv"))
>>>>   X Girth Height Volume
>>>> 1   1   8.3     70   10.3
>>>> 2   2   8.6     65   10.3
>>>> 3   3   8.8     63   10.2
>>>> 4   4  10.5     72   16.4
>>>> 5   5  10.7     81   18.8
>>>> 6   6  10.8     83   19.7
>>>> 7   7  11.0     66   15.6
>>>> 8   8  11.0     75   18.2
>>>> 9   9  11.1     80   22.6
>>>> 10 10  11.2     75   19.9
>>>> 11 11  11.3     79   24.2
>>>> 12 12  11.4     76   21.0
>>>> 13 13  11.4     76   21.4
>>>> 14 14  11.7     69   21.3
>>>> 15 15  12.0     75   19.1
>>>> 16 16  12.9     74   22.2
>>>> 17 17  12.9     85   33.8
>>>> 18 18  13.3     86   27.4
>>>> 19 19  13.7     71   25.7
>>>> 20 20  13.8     64   24.9
>>>> 21 21  14.0     78   34.5
>>>> 22 22  14.2     80   31.7
>>>> 23 23  14.5     74   36.3
>>>> 24 24  16.0     72   38.3
>>>> 25 25  16.3     77   42.6
>>>> 26 26  17.3     81   55.4
>>>> 27 27  17.5     82   55.7
>>>> 28 28  17.9     80   58.3
>>>> 29 29  18.0     80   51.5
>>>> 30 30  18.0     80   51.0
>>>> 31 31  20.6     87   77.0
>>>> R>
>>>> 
>>>> Regards, Dirk
>>>> 
>>>> --
>>>> Dirk Eddelbuettel | e...@debian.org | http://dirk.eddelbuettel.com
>>>> 
>>>> ______________________________________________
>>>> R-devel@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>> 
>>> 
