Re: [gentoo-user] Re: How to copy gzip data from bytestream?

2022-02-22 Thread Felix Kuperjans

On 2022-02-22, Grant Edwards wrote:

That doesn't work. It shows the size of the drive as the
"uncompressed" size and 0 as compressed:

# gzip -clt  foo
 $ ls -l foo
 -rw-r--r-- 1 grante users 12923 Feb 22 07:51 foo
 
 $ gzip foo

 $ ls -l foo.gz
 -rw-r--r-- 1 grante users 6083 Feb 22 07:51 foo.gz
 
 $ gzip -clt < foo.gz
  compressed  uncompressed  ratio uncompressed_name
        6083         12923  53.1% stdout
 
 $ echo asdf >> foo.gz
 
 $ gzip -clt < foo.gz
  compressed  uncompressed  ratio uncompressed_name
        6088     174482547 100.0% stdout
 
 $ cat foo.gz | gzip -clt

  compressed  uncompressed  ratio uncompressed_name
          -1            -1   0.0% stdout
 
 
 
Here's the relevant portion of the strace for the 'gzip -clt < foo.gz'
run (after the garbage was appended), where it seeks to end-8 and reads
what it thinks is the uncompressed length and the CRC:

 lseek(0, -8, SEEK_END)  = 6080
 read(0, "2\0\0asdf\n", 8)   = 8
 write(1, "   6088   17"..., 54) = 54
 close(0)= 0
 close(1)= 0
 exit_group(0)   = ?
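The trailer read seen in that strace can be sketched in Python; read_trailer_naive below is a hypothetical helper that mimics gzip -l's shortcut, not gzip's actual code:

```python
import io
import struct

def read_trailer_naive(f):
    """Mimic gzip -l's shortcut: seek to 8 bytes before EOF and treat
    them as CRC32 and uncompressed size (ISIZE), both little-endian.
    Trailing garbage after the gzip member corrupts both values."""
    f.seek(-8, io.SEEK_END)            # same as lseek(0, -8, SEEK_END)
    crc, isize = struct.unpack("<II", f.read(8))
    return crc, isize
```

Appending even a few bytes of garbage shifts that 8-byte window, which is exactly why the 174482547 figure above is nonsense.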


Hi Grant,

you're right, it doesn't work with the trailing garbage. I wasn't aware
that it actually seeks on the input like that.


By coincidence it seems the next release will even change this behavior:

https://git.savannah.gnu.org/cgit/gzip.git/commit/?id=cf26200380585019e927fe3cf5c0ecb7c8b3ef14

But this still doesn't solve your problem: it only adjusts the
calculation of the uncompressed size, while the compressed size is
still derived from stat().
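A stream-safe way to get the compressed size is to actually inflate the member and count the bytes consumed. A sketch using Python's zlib (compressed_size_of_stream is a hypothetical helper, not something gzip provides):

```python
import zlib

def compressed_size_of_stream(fileobj, chunk=65536):
    """Return the byte length of the leading gzip member in a
    (possibly non-seekable) stream, independent of what stat() says.
    wbits=31 tells zlib to expect the gzip wrapper and to verify the
    trailer CRC/length as it goes."""
    d = zlib.decompressobj(wbits=31)
    consumed = 0
    while not d.eof:
        buf = fileobj.read(chunk)
        if not buf:
            raise ValueError("truncated gzip stream")
        d.decompress(buf)
        consumed += len(buf)
    # Bytes read past the end of the member land in unused_data.
    return consumed - len(d.unused_data)
```

The returned count is exactly what you would hand to dd to copy just the gzip data.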





[gentoo-user] Re: How to copy gzip data from bytestream?

2022-02-22 Thread Grant Edwards
On 2022-02-22, Felix Kuperjans  wrote:

> you could use gzip to tell you the compressed size of the file and then 
> use another method to copy just those bytes (dd for example):
>
> gzip -clt < file.gz
> Should print the compressed size in bytes, though it has to read
> through the entire stream once to do so.

That doesn't work. It shows the size of the drive as the
"uncompressed" size and 0 as compressed:

# gzip -clt  foo
$ ls -l foo
-rw-r--r-- 1 grante users 12923 Feb 22 07:51 foo

$ gzip foo
$ ls -l foo.gz
-rw-r--r-- 1 grante users 6083 Feb 22 07:51 foo.gz

$ gzip -clt < foo.gz
 compressed  uncompressed  ratio uncompressed_name
       6083         12923  53.1% stdout

[gentoo-user] Re: How to copy gzip data from bytestream?

2022-02-21 Thread Grant Edwards
On 2022-02-22, Rich Freeman  wrote:
> On Mon, Feb 21, 2022 at 8:29 PM Grant Edwards  
> wrote:
>>
>> But I was trying to figure out a way to do it without uncompressing
>> and recompressing the data. I had hoped that the gzip header would
>> contain a "length" field (so I would know how many bytes to copy using
>> dd), but it does not. Apparently, the only way to find the end of the
>> compressed data is to parse it using the proper algorithm (deflate, in
>> this case).
>
> I'm guessing that the reason it lacks such a header is precisely so
> that you can use it in a stream in just this manner.  In order to
> have a length in the header, it would need to be able to seek back to
> the start of the file to modify the header, which isn't always
> possible.

Indeed. It's clearly designed to be used on non-seekable media/devices
like pipes and tapes. I should have realized that would be the case
and would preclude a length field in the header.
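The fixed part of that header is easy to inspect; a sketch of its layout per RFC 1952 (parse_gzip_header is an illustrative helper):

```python
import struct

def parse_gzip_header(gz_bytes):
    """Decode the fixed 10-byte gzip header (RFC 1952). Note there is
    no length field anywhere in it -- only magic, compression method,
    flags, mtime, extra flags and OS."""
    magic, cm, flg, mtime, xfl, os_ = struct.unpack("<HBBIBB", gz_bytes[:10])
    assert magic == 0x8B1F          # \x1f\x8b, read little-endian
    assert cm == 8                  # deflate, the only defined method
    return {"flags": flg, "mtime": mtime, "xfl": xfl, "os": os_}
```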

> I wouldn't be surprised if it stores some kind of metadata at the end
> of the file, but of course you can only find that if the end of the
> file is marked in some way.

The gzip file format has a length and CRC field in a trailer at the
end (after the compressed data). But, the only way to locate the end
is to parse the data using the appropriate decompression algorithm.
The header allows for multiple algorithms, but only one (deflate) is
actually defined.
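For a single member with no trailing garbage, the trailer can be checked against the inflated payload; a sketch (check_trailer is an illustrative helper, and assumes gz_bytes holds exactly one gzip member):

```python
import gzip
import struct
import zlib

def check_trailer(gz_bytes):
    """Verify the 8-byte gzip trailer: CRC32 of the uncompressed data,
    then the uncompressed length mod 2**32, both little-endian
    (RFC 1952). Only valid if nothing follows the member."""
    payload = gzip.decompress(gz_bytes)
    crc, isize = struct.unpack("<II", gz_bytes[-8:])
    assert crc == zlib.crc32(payload) & 0xFFFFFFFF
    assert isize == len(payload) % 2**32
    return crc, isize
```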

> If you google the details of the gzip file format

I did -- link is below.

> you might be able to figure out how to identify the end of the file,
> scan the image to find this marker,

I'm pretty sure the only way to find the end of the file is to parse
the compressed data payload itself. There isn't a marker.

> and then use dd to extract just the desired range.  Unless the file
> is VERY large I suspect that is going to take you longer than just
> recompressing it all.

Definitely. It's purely an academic question at this point.

> I can't imagine that there is any way around sequentially reading
> the entire file to find the end,

I believe you're right.

> unless you have some mechanism that can read a random block and
> determine if it is valid gzip data and if so you can do a binary
> search assuming the data on the drive past the end of the file isn't
> valid gzip.

I don't think that determining whether something is valid deflate data
is easy (and it may be impossible in the general case). I implemented
the deflate algorithm from scratch a few years ago, and I vaguely
recall that you can usually inflate almost anything for a while before
hitting an error. It turns out that the flash drive I used was pretty
new and almost all 0x00 bytes, so once I knew where to look it was
pretty obvious where the gzip data ended.
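That observation checks out: a run of 0x00 bytes is rejected as raw deflate almost immediately, because a leading 0x00 byte selects a "stored" block whose LEN/NLEN fields must be one's complements of each other, and 0x0000/0x0000 are not. A sketch (looks_like_deflate is a crude illustrative probe, not a reliable validity test):

```python
import zlib

def looks_like_deflate(block):
    """Try to start inflating a block as raw deflate (wbits=-15) and
    report whether zlib accepts it. All-zero data fails at once on the
    stored-block LEN/NLEN consistency check."""
    d = zlib.decompressobj(wbits=-15)
    try:
        d.decompress(block)
        return True
    except zlib.error:
        return False
```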

I've copied it the easy way (zcat | gzip -c), and verified that the
copy matches byte-for-byte except for the MTIME field in the gzip
header. It appears that gzipping stdin produces a zero ("no timestamp")
MTIME field. No surprise there.
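That matches the format: MTIME is bytes 4-8 of the header, little-endian seconds since the epoch, and 0 means "no timestamp available", which is what gzip writes when the input is stdin. A sketch (gzip_mtime is an illustrative helper):

```python
import struct

def gzip_mtime(gz_bytes):
    """Extract the MTIME field from a gzip header: bytes 4-8,
    little-endian; 0 means no timestamp was recorded."""
    return struct.unpack("<I", gz_bytes[4:8])[0]
```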

gzip file format:

   https://datatracker.ietf.org/doc/html/rfc1952