Thank you, Nathan. Informative.
The "Total bytes written:" was identical with and without the --sparse option (right down to the
last byte ;-) ). It was the time taken to arrive at that estimate that was so very different:
Total bytes written: 2086440960 (2.0GiB, 11MiB/s)
real 3m14.91s
Total bytes written: 2086440960 (2.0GiB, 17GiB/s)
real 0m0.57s
However, if I do an `ls -sl` on the directory and multiply the first column by 512, that does not
quite match the length-in-bytes column. It is the same order of magnitude, but they are slightly
different. I'm not sure what causes that, but I don't think the tif files are really sparse in the
usual sense of the term. Any imaginable gain in space efficiency would be minimal, and the cost
in time is ridiculous.
Here is an example of one directory:
marlin:/export/herbarium/mellon/Masstypes_Scans_Server/ACANTHACEAE# ls -sl
total 4072318
410608 -rw-rw---- 1 ariehtal herbarum 210246048 Dec 10 11:04 AC00312847.tif
402936 -rw-rw---- 1 ariehtal herbarum 206423224 Dec 5 16:09 AC00312848.tif
412398 -rw-rw---- 1 ariehtal herbarum 211246700 Dec 5 16:16 AC00312849.tif
405493 -rw-rw---- 1 ariehtal herbarum 207676904 Dec 12 11:52 AC00312850.tif
408052 -rw-rw---- 1 ariehtal herbarum 209052412 Dec 5 15:13 AC00312937.tif
412909 -rw-rw---- 1 ariehtal herbarum 211451884 Dec 5 15:35 AC00312939.tif
415468 -rw-rw---- 1 ariehtal herbarum 212788668 Dec 12 11:46 AC00312940.tif
390142 -rw-rw---- 1 ariehtal herbarum 199753780 Nov 13 11:28 AC00312941-sj0.tif
406004 -rw-rw---- 1 ariehtal herbarum 207925584 Dec 10 11:17 AC00312942.tif
408308 -rw-rw---- 1 ariehtal herbarum 209102728 Dec 10 11:28 AC00312943.tif
marlin:/export/herbarium/mellon/Masstypes_Scans_Server/ACANTHACEAE#
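For what it's worth, the by-hand comparison above (first column of `ls -sl` times 512 versus the
byte-length column) can be sketched in Python, since os.stat exposes the same inode fields. This
assumes st_blocks is reported in 512-byte units, which is the POSIX convention but which a given
filesystem may deviate from:

```python
import os

def allocated_vs_size(path):
    """Compare allocated bytes (st_blocks * 512) to the listed file size.

    Assumes st_blocks is in 512-byte units (the POSIX convention);
    some filesystems may report differently.
    """
    st = os.stat(path)
    allocated = st.st_blocks * 512
    return allocated, st.st_size

# Example: report the allocation/size difference for each file in a
# directory, mirroring the manual ls -sl arithmetic above.
directory = "."
for name in sorted(os.listdir(directory)):
    path = os.path.join(directory, name)
    if os.path.isfile(path):
        alloc, size = allocated_vs_size(path)
        print(f"{name}: allocated={alloc} size={size} diff={alloc - size}")
```

A small positive diff is normal (allocation is rounded up to whole blocks, and on some filesystems
indirect blocks are counted too); a negative diff is what would indicate a sparse file.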
On 4/12/13 3:41 PM, Nathan Stratton Treadway wrote:
On Fri, Apr 12, 2013 at 12:59:39 -0400, Chris Hoogendyk wrote:
As a followup, in case anyone cares to discuss technicalities and
examples, has anyone run into this before? It seems any site doing
lots of sizable scanned images, or GIS systems with tiff maps, would
have run into it. I don't know how often sparse file treatment is an
important thing. Database files can be sparse, but proper procedure
is to use the database tools (e.g. mysqldump) for backups and not to
just backup the data directory. It's not clear to me exactly what
gnutar is doing with sparse or why it is so inefficient (timewise).
I don't think these tif files are sparse. They are just large. And
gnutar is not just doubling the time as described in
http://www.gnu.org/software/tar/manual/html_node/sparse.html. I was
experiencing on the order of 400 times as much time for the sparse
option compared to when I removed the sparse option.
I have been meaning to reply to your earlier messages but haven't had a
chance to finish the background research I wanted to do first;
meanwhile, a few quick comments and questions:
* When you did your manual test runs with and without --sparse, did the
estimated sizes shown at the end of the run change any?
* I'll have to go back and see if things were any different with GNU tar
v1.23, but when I was looking at the latest version's source code,
it was clear that at least the intention was that using --sparse would
only change behavior when the input files are sparse -- so I am
curious to know for sure if your tiff files are actually sparse.
The check that tar uses to decide this is to see if the inode's block
count times the block size for the filesystem is less than the inode's
listed file size. (That is, does the file have less space allocated
than its listed size?)
Here are a few ways I've used in the past to search for sparse files:
- if you have GNU "ls" installed on this system:
ls -sl --block-size=1
and then check to see if the number in the first column is smaller
than the number in the "file size" column.
- if you have GNU "stat" installed you can run
stat -c "%n: alloc: %b * %B size: %s" *
, and then check to see if the %b times %B value is less than the %s
value.
- using the standard Sun "ls", you can do
ls -sl
, and then multiply the value in the first column by 512. (I assume
the "block size" used is a constant 512 in that case, regardless of
file system.)
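As a cross-check on those recipes, the same heuristic can be sketched in Python (again assuming,
as with the Sun "ls" case, that st_blocks counts 512-byte units):

```python
import os

def looks_sparse(path):
    """Return True if the file has less space allocated than its listed
    size -- the heuristic described above for deciding a file is sparse.

    Assumes st_blocks is in 512-byte units (the POSIX convention).
    """
    st = os.stat(path)
    return st.st_blocks * 512 < st.st_size

def find_sparse(top):
    """Yield the paths of any sparse-looking files under a directory tree."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if looks_sparse(path):
                yield path
```

Running find_sparse() over a dump directory would show quickly whether --sparse has anything
to work with at all.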
* The doubling of the time mentioned in the man page is in the context
of making an actual dump, but the slowdown is much worse for the
estimate phase. That's because normally when tar notices that the
output file is /dev/null, it realizes that you don't actually want the
data from the input files, and thus doesn't actually read through
their contents, but simply looks at the file size (from the inode
information) and adds that to the dump-size tally before moving on to
the next file. So the time spent during the estimate is almost
entirely due to reading through the directory tree, and won't depend
on the size of the files in question.
In the case of a file that's actually sparse, though, if the --sparse
option is enabled then tar has to actually read in the entire file to
see how much of it is zero blocks. So, if many of your files are
indeed actually sparse, then what will happen is that the estimate
time will be about the same as the actual dump time, rather than the
usual much-shorter estimate time.
Nathan
----------------------------------------------------------------------------
Nathan Stratton Treadway - [email protected] - Mid-Atlantic region
Ray Ontko & Co. - Software consulting services - http://www.ontko.com/
GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt ID: 1023D/ECFB6239
Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
--
---------------
Chris Hoogendyk
-
O__ ---- Systems Administrator
c/ /'_ --- Biology & Geology Departments
(*) \(*) -- 140 Morrill Science Center
~~~~~~~~~~ - University of Massachusetts, Amherst
<[email protected]>
---------------
Erdös 4