On 2022-02-22, Rich Freeman wrote:
> On Mon, Feb 21, 2022 at 8:29 PM Grant Edwards wrote:
>>
>> But I was trying to figure out a way to do it without uncompressing
>> and recompressing the data. I had hoped that the gzip header would
>> contain a "length" field (so I would know how many bytes to copy using
>> dd), but it does not. Apparently, the only way to find the end of the
>> compressed data is to parse it using the proper algorithm (deflate, in
>> this case).
>
> I'm guessing that the reason it lacks such a field is precisely so
> that you can use it in a stream in just this manner. To have a
> length in the header, gzip would need to be able to seek back to
> the start of the file to modify the header, which isn't always
> possible.
Indeed. It's clearly designed to be used on non-seekable media/devices
like pipes and tapes. I should have realized that would be the case
and would preclude a length field in the header.
> I wouldn't be surprised if it stores some kind of metadata at the end
> of the file, but of course you can only find that if the end of the
> file is marked in some way.
The gzip file format has a CRC-32 field and an uncompressed-length
field (ISIZE, modulo 2^32) in a trailer at the end (after the
compressed data). But the only way to locate the end is to parse the
data using the appropriate decompression algorithm. The header allows
for multiple algorithms, but only one (deflate) is actually defined.
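For illustration, here is a minimal Python sketch of reading that
trailer. It assumes you already have a complete, single-member gzip
file in memory, so the trailer is simply the last 8 bytes. The helper
name read_gzip_trailer is made up for this example; per RFC 1952 both
fields are little-endian 32-bit values.

```python
import gzip
import struct
import zlib


def read_gzip_trailer(blob: bytes) -> tuple[int, int]:
    """Return (crc32, isize) from the last 8 bytes of a gzip blob.

    Assumes blob is one complete gzip member with nothing after it.
    """
    crc32, isize = struct.unpack("<II", blob[-8:])
    return crc32, isize


# Build a small gzip blob so the example is self-contained.
payload = b"hello, gzip trailer\n" * 100
blob = gzip.compress(payload)
crc32, isize = read_gzip_trailer(blob)
print(hex(crc32), isize)
```

Note that ISIZE is the length of the uncompressed data, so it does not
tell you how many compressed bytes to copy.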
> If you google the details of the gzip file format
I did -- link is below.
> you might be able to figure out how to identify the end of the file,
> scan the image to find this marker,
I'm pretty sure the only way to find the end of the file is to parse
the compressed data payload itself. There isn't a marker.
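A rough Python sketch of "parse it to find the end", assuming the gzip
data starts at offset 0 of the stream and that only the first member
matters. It leans on the standard zlib module rather than a
hand-written inflate; gzip_member_length is a hypothetical helper name,
and the decompressed output is simply thrown away.

```python
import gzip
import io
import zlib


def gzip_member_length(stream, chunk_size=64 * 1024):
    """Return how many bytes the first gzip member in stream occupies."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    consumed = 0
    while not d.eof:
        chunk = stream.read(chunk_size)
        if not chunk:
            raise EOFError("input ended before the gzip trailer")
        d.decompress(chunk)  # decompressed output is discarded
        consumed += len(chunk)
    # Bytes read past the trailer end up in unused_data.
    return consumed - len(d.unused_data)


# Simulate a drive image: gzip data followed by 0x00 padding.
member = gzip.compress(b"example data " * 1000)
image = member + b"\x00" * 4096
print(gzip_member_length(io.BytesIO(image)), len(member))
```

The returned number could then be used as the byte count for dd or
head -c. It still reads and inflates the whole member, so it saves the
recompression but not the sequential read.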
> and then use dd to extract just the desired range. Unless the file
> is VERY large I suspect that is going to take you longer than just
> recompressing it all.
Definitely. It's purely an academic question at this point.
> I can't imagine that there is any way around sequentially reading
> the entire file to find the end,
I believe you're right.
> unless you have some mechanism that can read a random block and
> determine if it is valid gzip data and if so you can do a binary
> search assuming the data on the drive past the end of the file isn't
> valid gzip.
I don't think that determining whether something is valid deflate data
is easy (and it may be impossible in the general case). I implemented
the deflate algorithm from scratch once a few years ago, and vaguely
recall that you can usually inflate almost anything. It turns out
that the flash drive I used was pretty new, and almost all 0x00
bytes. Once I knew where to look, it was pretty obvious where the gzip
data ended.
I've copied it the easy way (zcat | gzip -c), and verified that the
copy matches byte-for-byte except for the MTIME field in the gzip
header. It appears that gzipping stdin produces a zero MTIME field,
which the format defines as "no time stamp is available". No surprise
there.
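For anyone who wants to repeat that comparison, a small Python sketch.
It assumes Python 3.8 or later (for the mtime argument to
gzip.compress) and that both files have the plain 10-byte header layout
with MTIME in bytes 4 through 7, little-endian. The helper names
gzip_mtime and same_except_mtime are invented for this example, and
the timestamps are arbitrary.

```python
import gzip
import struct


def gzip_mtime(blob: bytes) -> int:
    """Return the MTIME header field (bytes 4..7, little-endian)."""
    if blob[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip header")
    return struct.unpack_from("<I", blob, 4)[0]


def same_except_mtime(a: bytes, b: bytes) -> bool:
    """Compare two gzip blobs while ignoring the 4-byte MTIME field."""
    return a[:4] == b[:4] and a[8:] == b[8:]


payload = b"same payload, different timestamps\n" * 50
first = gzip.compress(payload, mtime=1)            # arbitrary timestamp
second = gzip.compress(payload, mtime=1645488000)  # arbitrary timestamp
no_stamp = gzip.compress(payload, mtime=0)         # 0 = no time stamp

print(gzip_mtime(first), gzip_mtime(second), gzip_mtime(no_stamp))
print(same_except_mtime(first, second))
```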
gzip file format:
https://datatracker.ietf.org/doc/html/rfc1952