On Fri, Apr 12, 2013 at 12:59:39 -0400, Chris Hoogendyk wrote:
> As a followup, in case anyone cares to discuss technicalities and
> examples, has anyone run into this before? It seems any site doing
> lots of sizable scanned images, or GIS systems with tiff maps, would
> have run into it. I don't know how often sparse file treatment is an
> important thing. Database files can be sparse, but proper procedure
> is to use the database tools (e.g. mysqldump) for backups and not to
> just backup the data directory. It's not clear to me exactly what
> gnutar is doing with sparse or why it is so inefficient (timewise).
> I don't think these tif files are sparse. They are just large. And
> gnutar is not just doubling the time as described in
> http://www.gnu.org/software/tar/manual/html_node/sparse.html. I was
> experiencing on the order of 400 times as much time for the sparse
> option compared to when I removed the sparse option.
I have been meaning to reply to your earlier messages but haven't had a
chance to finish the background research I wanted to do first;
meanwhile, a few quick comments and questions:
* When you did your manual test runs with and without --sparse, did the
estimated sizes shown at the end of the run change any?
* I'll have to go back and see if things were any different with GNU tar
v1.23, but when I was looking at the latest version's source code, it
was clear that at least the intention was for --sparse to change
behavior only when the input files are sparse -- so I am
curious to know for sure whether your tiff files are actually sparse.
The check that tar uses to decide this is to see if the inode's block
count times the block size for the filesystem is less than the inode's
listed file size. (That is, does the file have less space allocated
than its listed size?)
Here are a few ways I've used in the past to search for sparse files:
- if you have GNU "ls" installed on this system:
ls -sl --block-size=1
and then check to see if the number in the first column is smaller
than the number in the "file size" column.
- if you have GNU "stat" installed you can run
stat -c "%n: alloc: %b * %B size: %s"
, and then check to see if the %b times %B value is less than the %s
value.
- using the standard Sun "ls", you can do
ls -sl
, and then multiply the value in the first column by 512. (I assume
the "block size" used is a constant 512 in that case, regardless of
file system.)
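To tie those methods together, here's a rough sketch of tar's sparseness
test as a script (assuming GNU truncate/stat are available and the
filesystem supports holes, e.g. ext4 -- the file names are just for
illustration):

```shell
# Create one genuinely sparse file (all hole, no data blocks allocated)
# and one fully-allocated file with the same listed size, then apply
# tar's test: is (allocated blocks * block size) < listed file size?
truncate -s 10M sparse.img                      # hole only
dd if=/dev/zero of=full.img bs=1M count=10 2>/dev/null

for f in sparse.img full.img; do
    alloc=$(( $(stat -c %b "$f") * $(stat -c %B "$f") ))
    size=$(stat -c %s "$f")
    if [ "$alloc" -lt "$size" ]; then
        echo "$f: sparse (allocated $alloc < size $size)"
    else
        echo "$f: not sparse"
    fi
done
```

(Exact allocated-block counts vary by filesystem, but on a typical ext-style
filesystem the truncate'd file shows far fewer blocks than its listed size.)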
* The doubling of the time mentioned in the man page is in the context
of making an actual dump, but the slowdown is much worse for the
estimate phase. That's because normally when tar notices that the
output file is /dev/null, it realizes that you don't actually want the
data from the input files, and thus doesn't actually read through
their contents, but simply looks at the file size (from the inode
information) and adds that to the dump-size tally before moving on to
the next file. So the time spent during the estimate is almost
entirely due to reading through the directory tree, and won't depend
on the size of the files in question.
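In other words, the normal estimate boils down to summing the
inode-reported sizes over the tree without opening any files. A rough
model of that (assuming GNU find's -printf; the demo files here are
made up for illustration, not tar's actual code):

```shell
# Model of the non---sparse estimate: sum st_size for every regular
# file in the tree, without ever read()ing any file contents.
mkdir -p estdemo
truncate -s 5M estdemo/a.img
truncate -s 3M estdemo/b.img

find estdemo -type f -printf '%s\n' | awk '{total += $1} END {print total}'
# prints 8388608  (5 MiB + 3 MiB of listed size, regardless of allocation)
```

Note the total is the same whether those files are sparse or not, which is
why the estimate normally runs so fast.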
In the case of a file that's actually sparse, though, if the --sparse
option is enabled then tar has to actually read in the entire file to
see how much of it is zero blocks. So, if many of your files are
indeed actually sparse, then what will happen is that the estimate
time will be about the same as the actual dump time, rather than the
usual much-shorter estimate time.
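One way to see both the payoff and the cost of --sparse is to archive a
file that is entirely a hole with and without the option and compare the
archive sizes (a sketch assuming GNU tar, where -S is short for --sparse):

```shell
# A 100 MiB file that is all hole -- no data blocks allocated.
truncate -s 100M hole.img

# Without --sparse, tar stores all 100 MiB of zeros in the archive.
tar -cf plain.tar hole.img

# With --sparse, tar reads through the whole file, finds only zero
# blocks, and records just a hole map -- a tiny archive, but at the
# cost of that full read pass (the pass that inflates your times).
tar -Scf sparse.tar hole.img

ls -l plain.tar sparse.tar    # sparse.tar is a few KiB at most
```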
Nathan
----------------------------------------------------------------------------
Nathan Stratton Treadway - [email protected] - Mid-Atlantic region
Ray Ontko & Co. - Software consulting services - http://www.ontko.com/
GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt ID: 1023D/ECFB6239
Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239