Hi,

Currently, tar seems to perform quite sub-optimally when archiving sparse files. I compared the performance of GNU tar and star when archiving a large (~2TB) sparse file; all but about 180MB of the file was holes.
The archives created by star and GNU tar were of identical size, and both programs could extract each correctly. Extraction times were similar (both under 4 seconds). However, GNU tar took about 2.8 times as long as star to create the archive.

Using star:

  $ star -no-fifo -v -c f=output_xustar.tar artype=xustar -numeric -sparse image.bin
  image.bin is sparse
  a image.bin 2000398934016 bytes, 361539 tape blocks
  star: 18078 blocks + 0 bytes (total of 185118720 bytes = 180780.00k).

Using GNU tar:

  $ tar --verbose --sparse --format=posix --create --file=output_posix.tar image.bin
  image.bin

star took 42:07 (2527 seconds), an effective throughput of 754.9MB/sec. GNU tar took 116:18 (6978 seconds), an effective throughput of 273.4MB/sec.

What causes the large difference? Does tar read the file once to see where the all-zero regions are, then, when writing the non-hole data to the archive, read through the entire file again? (On the second pass it could use the table of hole locations built on the first pass to read only the non-hole parts, rather than reading through all the all-zero data and discarding it.)

Next, a suggestion. star has a -force-hole option, which tells it to write all files sparsely on extraction, so even non-sparse archive members that contain all-zero regions occupy less disk space when extracted. It would be useful if GNU tar offered a similar option.

However, such an option could apply to archive creation as well. When --sparse is used, tar only checks sparse files (those which occupy less space on disk than their apparent size) for holes. Non-sparse files, even if they contain all-zero regions, are copied to the archive as-is. The GNU tar manual (page 133) seems to be incorrect/unclear about that:

  "However, be aware that --sparse option presents a serious drawback. Namely, in order to determine if the file is sparse tar has to read it before trying to archive it, so in total the file is read twice. So, always bear in mind that the time needed to process all files with this option is roughly twice the time needed to archive them without it."

Files with large amounts of all-zero data are quite common. For example, some DVD-ROM image files contain hundreds of MB or several GB of all-zero data. Typically, when creating an image from such a disc, the file is not created sparsely; dd does not have that capability. (As a workaround, the file can be copied sparsely before archiving, e.g. cp --sparse=always file.bin file.bin_sparse.) It would be much better if tar accepted a --force-sparse/--force-hole option which, on archive creation, would make tar scan *every* input file for all-zero regions.

In future, the tar file format could be updated to allow sparse files to be archived in a single pass. That would require the archive to be seekable, or alternatively tar would need a buffer at least as big as the largest non-hole region. (Extraction wouldn't need a seekable archive.) It would work by interleaving hole entries and file data. In the archive you would have something like this (a rough C sketch of the layout follows the list):

- header giving the apparent file size
- a flag indicating that this is an "interleaved" sparse file
- (optional) total length of the file data as stored in the archive (i.e. the offset to the next file)
- length of the 1st non-hole region
- 1st non-hole region data
- length of the 1st hole
- length of the 2nd non-hole region
- 2nd non-hole region data
- length of the 2nd hole
- length of the 3rd non-hole region
- 3rd non-hole region data
- and so on, until an EOF marker (a region length field of -1, say)
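To make the proposed layout a little more concrete, here is a rough, purely illustrative sketch in C. Nothing below corresponds to an existing tar structure; the field names and widths are invented for the example, and the actual on-disk encoding (byte order, how the records relate to tar's 512-byte blocking, and so on) would still need to be defined:

    /* Hypothetical "interleaved" sparse member layout -- illustrative only.
     * On disk, the variable-length region data would sit between the two
     * length fields of each region record, so the second struct describes
     * the record conceptually rather than as a literal memory layout. */
    #include <stdint.h>

    struct interleaved_sparse_header {
        uint64_t apparent_size;   /* logical (apparent) size of the file       */
        uint8_t  interleaved;     /* flag: member uses interleaved sparse data */
        uint64_t stored_length;   /* optional: total bytes this member occupies
                                     in the archive, i.e. the offset to the
                                     next file; filled in by seeking back once
                                     the member has been written               */
    };

    /* Repeated until data_length equals (uint64_t)-1, the EOF marker: */
    struct interleaved_sparse_region {
        uint64_t data_length;     /* length of this non-hole region            */
        /* ... data_length bytes of file data would follow here ...            */
        uint64_t hole_length;     /* length of the hole that follows (may be 0) */
    };

A fixed-width data_length field is what would make the "write a placeholder, then seek back and fill it in" approach described next workable.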
When creating an archive, tar would leave space for the length entry before each non-hole region's data. Once the length is known (i.e. once tar encounters the next all-zero run or EOF), it would seek back and fill in the entry. There could also be an entry for the total length of the file as stored in the archive (again, tar would seek back to fill it in). Having that would allow seeking quickly within the archive to extract arbitrary files, rather than having to do multiple seeks to reach the next file. Would there be any disadvantages to that approach?

Finally, another suggestion for a future update to the tar archive format. I mentioned above that a --force-hole/--force-sparse option could be useful on archive creation. Sometimes it is possible to determine the exact hole layout of a file: apparently that is quite easy on Solaris (SEEK_HOLE, SEEK_DATA) and Windows, less so on Linux (FIEMAP). Supposing it becomes possible in future to portably determine where the actual holes in a file are, being able to preserve exact sparseness could be useful, but doing that by storing all non-hole all-zero regions in the archive as-is wastes space. Instead, there could be two classes of hole recorded in the archive: actual holes and "written all-zero" holes. (A minimal sketch of hole enumeration with SEEK_HOLE/SEEK_DATA is at the end of this message.)

An example of where preserving a file's exact sparseness could in theory be useful: a virtual machine disk image could be created consisting of 20GB of written all-zero data followed by a 100GB hole. The initial written area could be used for the VM boot partition, the rest for other storage. (Hopefully VM disk performance would be better as the boot partition fills up, and the VM boot partition would be unfragmented on the host.)
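For what it's worth, here is a minimal sketch of enumerating the data (non-hole) regions of a file with the lseek() extensions mentioned above. It assumes a platform that actually provides SEEK_DATA/SEEK_HOLE; error handling and a read-and-check-for-zeros fallback for other platforms are omitted:

    /* Minimal sketch: list the data regions of a file using
     * lseek(SEEK_DATA)/lseek(SEEK_HOLE), on platforms that provide them. */
    #define _GNU_SOURCE             /* needed for SEEK_DATA/SEEK_HOLE with glibc */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static void list_data_regions(int fd)
    {
        off_t data = 0, hole;

        for (;;) {
            data = lseek(fd, data, SEEK_DATA);  /* start of next data region   */
            if (data == (off_t)-1)
                break;                          /* no more data, or the call is
                                                   not supported here           */
            hole = lseek(fd, data, SEEK_HOLE);  /* first hole after that data
                                                   (there is always an implicit
                                                   hole at end of file)         */
            if (hole == (off_t)-1)
                break;
            printf("data at %lld, length %lld\n",
                   (long long)data, (long long)(hole - data));
            data = hole;                        /* continue past this region    */
        }
    }

    int main(int argc, char **argv)
    {
        if (argc != 2)
            return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd == -1) {
            perror("open");
            return 1;
        }
        list_data_regions(fd);
        close(fd);
        return 0;
    }

With something like that available, the "written all-zero hole" class suggested above could be derived by additionally checking whether the bytes inside each reported data region are themselves all zero.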
