Hi,

Following a discussion with 'velix' on IRC, who pointed me to the pigz utility ( https://zlib.net/pigz/ ), which does multi-threaded compression of gzip files, I've committed a similar mechanism to GDAL master in the /vsigzip/ and /vsizip/ virtual file systems. If you set GDAL_NUM_THREADS to a value greater than 1 or to ALL_CPUS, DEFLATE compression will be multi-threaded.

This uses the equivalent of the pigz independent mode, where uncompressed chunks (of size 1 megabyte by default) are compressed independently of each other [1], and the compressed chunks are simply appended. The resulting codestream is perfectly standard. If reading the input data is not the limiting factor, this scales quite well with the number of threads.
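To illustrate the independent mode described above, here is a minimal Python sketch that uses only the standard zlib module. It is not GDAL's actual code: the names parallel_gzip, _deflate_chunk and roundtrip_ok, the default thread count and the hand-written minimal gzip header are all assumptions made for the example. Each chunk gets its own compressor, non-final chunks end with a full flush, and the compressed pieces are simply concatenated.

```python
import gzip
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1024 * 1024  # 1 MB, the default chunk size mentioned above


def _deflate_chunk(chunk: bytes, is_last: bool) -> bytes:
    """Compress one chunk on its own (hypothetical helper)."""
    # A fresh raw-DEFLATE compressor per chunk: no history is shared,
    # so chunks can be compressed in any order and in parallel.
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    body = comp.compress(chunk)
    # Z_FULL_FLUSH ends the chunk on a byte boundary without closing the
    # stream; only the last chunk uses Z_FINISH to emit the final block.
    return body + comp.flush(zlib.Z_FINISH if is_last else zlib.Z_FULL_FLUSH)


def parallel_gzip(data: bytes, num_threads: int = 4,
                  chunk_size: int = CHUNK_SIZE) -> bytes:
    """Return a gzip stream built from independently compressed chunks."""
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)] or [b""]
    is_last = [i == len(chunks) - 1 for i in range(len(chunks))]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves input order, so parts can be appended as-is.
        parts = list(pool.map(_deflate_chunk, chunks, is_last))
    # Minimal gzip header: magic, CM=8 (deflate), no flags, mtime=0,
    # XFL=0, OS=255 (unknown).
    header = b"\x1f\x8b\x08\x00" + b"\x00\x00\x00\x00" + b"\x00\xff"
    # gzip trailer: CRC-32 and length (mod 2^32) of the uncompressed data.
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF,
                          len(data) & 0xFFFFFFFF)
    return header + b"".join(parts) + trailer


def roundtrip_ok(data: bytes) -> bool:
    """Check that an ordinary gzip reader recovers the input."""
    return gzip.decompress(parallel_gzip(data)) == data
```

The round trip through gzip.decompress() is the point of the exercise: a regular reader does not need to know that the stream was produced in chunks.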
You can use the following small Python script to test the creation of zip files (it enables GDAL_NUM_THREADS=ALL_CPUS by default):
https://raw.githubusercontent.com/OSGeo/gdal/master/gdal/swig/python/samples/gdal_zip.py

$ python gdal_zip.py my.zip srcfile1 srcfile2 ...

(Note that the multi-threaded compression is per file; several files are not compressed in parallel.)

Given that I've opted for the equivalent of the pigz independent mode, one drawback is a slight decrease in the compression ratio, due to the clearing of the dictionary at each chunk boundary, but with a chunk size this large, it is normally barely noticeable.

The main advantage of independent mode is that independent decompression could potentially be implemented. If we serialized the offset of each independent chunk, we could implement efficient seeking in the file. Currently, if you want to read a byte at the end of a deflate stream, you need to decompress the whole stream; with such an index, you would need to decompress at most 1 MB. This could be done by writing a special file with the offsets of the independent chunks inside the .zip archive, potentially hidden from other applications (you can have holes in a zip).

It would also be possible to do multi-threaded decompression (if enough uncompressed data is requested at once, or if we detect a read pattern that seems to imply the whole file will be read).

If people are interested in funding the implementation of such functionality, feel free to contact me.

Another fix/improvement I did is ZIP64 ([2]) creation. ZIP64 reading was already supported, but up to now, when using /vsizip/ in write mode, if the uncompressed or compressed size of a file was greater than 4 GB, or the whole .zip was greater than 4 GB, the resulting .zip would be corrupted (file sizes and internal offsets truncated to their lower 32 bits), because ZIP64 was not used. I've thus resynchronized with zlib's minizip to fix that.
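To make the truncation concrete, here is a small Python sketch of what happens when a size or offset is written into a 32-bit field, as in a non-ZIP64 header. The function names are made up for the illustration, and this is not the minizip code.

```python
import struct

UINT32_MAX = 0xFFFFFFFF


def truncated_to_32_bits(value: int) -> int:
    """Value read back if only the low 4 bytes of a size/offset are stored."""
    low_bytes = struct.pack("<Q", value)[:4]  # little-endian, keep low 32 bits
    return struct.unpack("<I", low_bytes)[0]


def fits_in_32_bits(value: int) -> bool:
    """True if the value can be represented in an unsigned 32-bit field."""
    return 0 <= value <= UINT32_MAX


five_gib = 5 * 1024 ** 3
print(fits_in_32_bits(five_gib))       # False
print(truncated_to_32_bits(five_gib))  # 1073741824, i.e. 1 GiB
```

So a 5 GiB file whose size is squeezed into a 32-bit field would be recorded as 1 GiB, which is the kind of corruption the ZIP64 fix avoids.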
One potential downside is that, given how GDAL creates zip files and the structural constraints of the zip format, GDAL must add a ZIP64 extra field in the "local file header", even if it ends up being unused. The unzip utility on Linux and the Windows 7 file manager are happy with that, but I'd appreciate it if some testing could be done by users with other zip readers (particularly on Mac, which apparently may have issues with ZIP64). You can try opening this smallish zip archive:
https://github.com/OSGeo/gdal/blob/master/autotest/gcore/data/byte_zip64_local_header_zeroed.zip?raw=true

Even

[1] pigz in standard mode can also compress in a multi-threaded way, with non-initial chunks depending on the history of the last 32 KB of the preceding uncompressed chunk. The resulting stream is thus nearly as small as with classical compression, but independent decompression of the chunks is not possible. In independent mode, a full synchronization marker terminates each compressed chunk and the decompressor clears its dictionary.

[2] Not to be confused with Deflate64, a proprietary variant of Deflate that some Windows versions unfortunately and unnecessarily use for files > 4 GB, and which is unsupported by zlib.

--
Spatialys - Geospatial professional services
http://www.spatialys.com
_______________________________________________
gdal-dev mailing list
https://lists.osgeo.org/mailman/listinfo/gdal-dev
