Hi,

Currently, tar seems to perform quite sub-optimally when archiving sparse files. I compared the performance of GNU tar and star when archiving a large (~2TB) sparse file; all but about 180MB of the file was holes.
The archives created by star and GNU tar were of identical size, and both programs could extract each correctly. Extraction times were similar (both under 4 seconds). However, GNU tar took about 2.8 times as long as star to create the archive.

Using star:

  $ star -no-fifo -v -c f=output_xustar.tar artype=xustar -numeric -sparse image.bin
  image.bin is sparse
  a image.bin 2000398934016 bytes, 361539 tape blocks
  star: 18078 blocks + 0 bytes (total of 185118720 bytes = 180780.00k).

Using GNU tar:

  $ tar --verbose --sparse --format=posix --create --file=output_posix.tar image.bin
  image.bin

star took 42:07 (2527 seconds), an effective throughput of 754.9MB/sec. GNU tar took 116:18 (6978 seconds), an effective throughput of 273.4MB/sec.

What causes the large difference? Does tar read the file once to see where the all-zero regions are, then, when writing the non-hole data to the archive, read through the entire file again? (On the second pass it could use the table of hole locations built on the first pass to read only the non-hole parts, rather than reading through all the all-zero data and discarding it.)

Next, a suggestion. star has a -force-hole option, which tells it to write all files sparsely on extraction, so even non-sparse archive members that contain all-zero regions occupy less disk space when extracted. It would be useful if GNU tar offered a similar option.

However, such an option could apply to archive creation as well. When --sparse is used, tar only checks sparse files (those which occupy less space on disk than their apparent size) for holes. Non-sparse files, even if they contain all-zero regions, are copied to the archive as-is. The GNU tar manual (page 133) seems to be incorrect/unclear about that:

  "However, be aware that --sparse option presents a serious drawback. Namely, in order to determine if the file is sparse tar has to read it before trying to archive it, so in total the file is read twice. So, always bear in mind that the time needed to process all files with this option is roughly twice the time needed to archive them without it."

Files with large amounts of all-zero data are quite common. For example, some DVD-ROM image files contain hundreds of MB or several GB of all-zero data. Typically, when creating an image from such a disc, the file is not created sparsely; dd does not have that capability. (As a workaround, the file can be copied sparsely before archiving, e.g. cp --sparse=always file.bin file.bin_sparse.) It would be much better if tar accepted a --force-sparse/--force-hole option which, on archive creation, would make tar scan *every* input file for all-zero regions.

In future, the tar file format could be updated to allow sparse files to be archived in a single pass. That would require the archive to be seekable, or alternatively tar would need a buffer at least as big as the largest non-hole region. (Extraction wouldn't need a seekable archive.) It would work by interleaving hole entries and file data. In the archive you would have something like this (a rough C sketch of the layout follows the list):

- header giving the apparent file size
- a flag indicating that this is an "interleaved" sparse file
- (optional) total length of the file data as stored in the archive (i.e. the offset to the next file)
- length of the 1st non-hole region
- 1st non-hole region data
- length of the 1st hole
- length of the 2nd non-hole region
- 2nd non-hole region data
- length of the 2nd hole
- length of the 3rd non-hole region
- 3rd non-hole region data
- and so on, until an EOF marker (a region length field of -1, say)
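To make the proposed layout a little more concrete, here is a rough, purely illustrative sketch in C. Nothing below corresponds to an existing tar structure; the field names and widths are invented for the example, and the actual on-disk encoding (byte order, how the records relate to tar's 512-byte blocking, and so on) would still need to be defined:

    /* Hypothetical "interleaved" sparse member layout -- illustrative only.
     * On disk, the variable-length region data would sit between the two
     * length fields of each region record, so the second struct describes
     * the record conceptually rather than as a literal memory layout. */
    #include <stdint.h>

    struct interleaved_sparse_header {
        uint64_t apparent_size;   /* logical (apparent) size of the file       */
        uint8_t  interleaved;     /* flag: member uses interleaved sparse data */
        uint64_t stored_length;   /* optional: total bytes this member occupies
                                     in the archive, i.e. the offset to the
                                     next file; filled in by seeking back once
                                     the member has been written               */
    };

    /* Repeated until data_length equals (uint64_t)-1, the EOF marker: */
    struct interleaved_sparse_region {
        uint64_t data_length;     /* length of this non-hole region            */
        /* ... data_length bytes of file data would follow here ...            */
        uint64_t hole_length;     /* length of the hole that follows (may be 0) */
    };

A fixed-width data_length field is what would make the "write a placeholder, then seek back and fill it in" approach described next workable.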
When creating an archive, tar would leave space for the length entry before each non-hole region's data. Once the length is known (i.e. once tar encounters the next all-zero run or EOF), it would seek back and fill in the entry. There could also be an entry for the total length of the file as stored in the archive (again, tar would seek back to fill it in). Having that would allow seeking quickly within the archive to extract arbitrary files, rather than having to do multiple seeks to reach the next file. Would there be any disadvantages to that approach?

Finally, another suggestion for a future update to the tar archive format. I mentioned above that a --force-hole/--force-sparse option could be useful on archive creation. Sometimes it is possible to determine the exact hole layout of a file: apparently that is quite easy on Solaris (SEEK_HOLE, SEEK_DATA) and Windows, less so on Linux (FIEMAP). Supposing it becomes possible in future to portably determine where the actual holes in a file are, being able to preserve exact sparseness could be useful, but doing that by storing all non-hole all-zero regions in the archive as-is wastes space. Instead, there could be two classes of hole recorded in the archive: actual holes and "written all-zero" holes. (A minimal sketch of hole enumeration with SEEK_HOLE/SEEK_DATA is at the end of this message.)

An example of where preserving a file's exact sparseness could in theory be useful: a virtual machine disk image could be created consisting of 20GB of written all-zero data followed by a 100GB hole. The initial written area could be used for the VM boot partition, the rest for other storage. (Hopefully VM disk performance would be better as the boot partition fills up, and the VM boot partition would be unfragmented on the host.)
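For what it's worth, here is a minimal sketch of enumerating the data (non-hole) regions of a file with the lseek() extensions mentioned above. It assumes a platform that actually provides SEEK_DATA/SEEK_HOLE; error handling and a read-and-check-for-zeros fallback for other platforms are omitted:

    /* Minimal sketch: list the data regions of a file using
     * lseek(SEEK_DATA)/lseek(SEEK_HOLE), on platforms that provide them. */
    #define _GNU_SOURCE             /* needed for SEEK_DATA/SEEK_HOLE with glibc */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static void list_data_regions(int fd)
    {
        off_t data = 0, hole;

        for (;;) {
            data = lseek(fd, data, SEEK_DATA);  /* start of next data region   */
            if (data == (off_t)-1)
                break;                          /* no more data, or the call is
                                                   not supported here           */
            hole = lseek(fd, data, SEEK_HOLE);  /* first hole after that data
                                                   (there is always an implicit
                                                   hole at end of file)         */
            if (hole == (off_t)-1)
                break;
            printf("data at %lld, length %lld\n",
                   (long long)data, (long long)(hole - data));
            data = hole;                        /* continue past this region    */
        }
    }

    int main(int argc, char **argv)
    {
        if (argc != 2)
            return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd == -1) {
            perror("open");
            return 1;
        }
        list_data_regions(fd);
        close(fd);
        return 0;
    }

With something like that available, the "written all-zero hole" class suggested above could be derived by additionally checking whether the bytes inside each reported data region are themselves all zero.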
