On Tue, Nov 17, 2009 at 9:28 AM, Dave Angel <da...@ieee.org> wrote:
> I'm pretty sure that the ZIP format uses independent compression for each
> contained file (member). You can add and remove members from an existing
> ZIP, and use several different compression methods within the same file.
> So the adaptive tables start over for each new member.

This is correct. It doesn't do solid compression, which is what you get
with .tar.gz (and RARs, optionally).
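For example, mixing compression methods within one archive already works
with the current zipfile module; a rough, untested sketch (the member names
and data are made up):

    # Each member is compressed independently, so one archive can mix
    # ZIP_DEFLATED and ZIP_STORED members freely.
    import zipfile

    z = zipfile.ZipFile("mixed.zip", "w")

    deflated = zipfile.ZipInfo("deflated.txt")
    deflated.compress_type = zipfile.ZIP_DEFLATED
    z.writestr(deflated, "this member is deflated\n" * 100)

    stored = zipfile.ZipInfo("stored.txt")
    stored.compress_type = zipfile.ZIP_STORED
    z.writestr(stored, "this member is stored uncompressed\n")

    z.close()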
> What isn't so convenient is that the sizes are apparently at the end. So
> if you're trying to unzip "over the wire" you can't readily do it without
> somehow seeking to the end. That same feature is a good thing when it
> comes to spanning zip files across multiple disks.

Actually, there are two copies of the headers: one immediately before the
file data (the local file header), and one at the end (the central
directory); both contain copies of the compressed and uncompressed file
size. Very few programs actually use the local file headers, but it's very
nice to have the option. It also helps make ZIPs very recoverable. If
you've ever run a ZIP recovery tool, it's usually just reconstructing the
central directory from the local file headers (and probably recomputing the
CRCs).

(This is no longer true if bit 3 of the bitflags is set, which puts the CRC
and filesizes after the data. In that case, it's not possible to stream the
data--largely defeating the benefit of the local headers.)

> Define a call to read _portions_ of the raw (compressed, encrypted,
> whatever) data.

I think the clean way is to return a file-like object for a specified file,
e.g.:

    # Read raw bytes 1024-1152 from each file in the ZIP:
    zip = ZipFile("file.zip", "r")
    for info in zip.infolist():
        f = zip.rawopen(info)  # or a filename
        f.seek(1024)
        f.read(128)

> Define a call that locks the ZipFile object and returns a write handle
> for a single new file.

I'd use a file-like object here, too, for probably obvious reasons--you can
pass it to anything expecting a file object to write data to (e.g.
shutil.copyfileobj).

> Only on successful close of the "write handle" is the new directory
> written.

Rather, when the new file is closed, its directory entry is saved to
ZipFile.filelist. The new directory on disk should be written when the
zip's own close() method is called, just as when writing files with the
other methods. Otherwise, writing lots of files this way would write and
overwrite the central directory repeatedly.

Any thoughts about this rough API outline:

ZipFile.rawopen(zinfo_or_arcname)
    Same definition as open(), but returns the raw data. No mode (no
    newline translation for raw files); no pwd (raw files aren't
    decrypted).

ZipFile.writefile(zinfo[, raw])
    Definition like ZipFile.writestr. Relax writestr()'s "at least the
    filename, date, and time must be given" rule: if not specified, use
    the current date and time. Returns a file-like object (ZipWriteFile)
    to which the file data is written. If raw is True, no actual
    compression is performed, and the file data must already be
    compressed with the specified compression type (no checking is
    performed). If raw is False (the default), the data will be
    compressed before being written. When finished writing data, the
    file must be closed. Only one ZipWriteFile may be open per ZipFile
    at a time; calling ZipFile.writefile while a ZipWriteFile is already
    open raises ValueError[1].

Another detail: is the CRC recomputed when writing in raw mode? No. If I
delete a file from a ZIP (causing me to rewrite the ZIP) and another file
in the ZIP is corrupt, it should just move the file as-is, invalid CRC and
all; it should not rewrite the file with a new CRC (masking the corruption)
or throw an error (I shouldn't get errors about file X being corrupt when
I'm deleting file Y). When writing in raw mode, if zinfo.CRC is already
specified (not None), it should be used as-is.
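To make the intent concrete, here's roughly how I'd expect the proposed
calls to be used to drop a member from an archive without recompressing
anything. None of this exists yet--rawopen(), writefile() and the raw
behavior are exactly the proposal above, and the filenames are made up:

    # Copy every member except one from old.zip to new.zip in raw mode;
    # the compressed data, CRC and sizes are carried over untouched.
    import shutil
    from zipfile import ZipFile

    src = ZipFile("old.zip", "r")
    dst = ZipFile("new.zip", "w")

    for info in src.infolist():
        if info.filename == "unwanted.txt":
            continue                       # "delete" by not copying
        raw = src.rawopen(info)            # raw, still-compressed data
        out = dst.writefile(info, raw=True)
        shutil.copyfileobj(raw, out)       # no decompression, no recompression
        out.close()                        # finalizes this member's entry

    dst.close()                            # central directory written here
    src.close()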
I don't like how this results in three different APIs for adding data
(write, writestr, writefile), but trying to squeeze the APIs together feels
unnatural--the parameters don't really line up too well. I'd expect the
other two to become thin wrappers around ZipFile.writefile(). Unlike
ZipFile.write, this never opens files directly, so it only takes a zinfo
and not a filename (set the filename through the ZipInfo).

Now you can stream data into a ZIP, specify all metadata for the file, and
stream in compressed data from another ZIP (for deleting files and other
cases) without recompressing. This also means you can do all of these
things to encrypted files without the password, and to files compressed
with unknown methods, which is currently impossible.

> and I realize that the big flaw in this design is that from the moment
> you start overwriting the existing master directory until you write a new
> master at the end, you do not have a valid zip file.

The same is true when appending to a ZIP with ZipFile.write(); until it
finishes, the file on disk isn't a valid ZIP. That's unavoidable. Files in
the ZIP can still be opened by the existing ZipFile object, since it keeps
the central directory in memory.

For what it's worth, I've written ZIP parsing code several times over the
years
(https://svn.stepmania.com/svn/trunk/stepmania/src/RageFileDriverZip.cpp),
so I'm familiar with the more widely-used parts of the file format, but I
haven't dealt with ZIP writing very much. I'm not sure if I'll have time to
get to this soon, but I'll keep thinking about it.

[1] Seems odd, but mimicking
http://docs.python.org/library/stdtypes.html#file.close

--
Glenn Maynard
--
http://mail.python.org/mailman/listinfo/python-list