Thank you, Nathan. Informative.

The "Total bytes written:" was identical with and without the --sparse option (right down to the last byte ;-) ). It was the time taken to arrive at that estimate that was so very different:

Total bytes written: 2086440960 (2.0GiB, 11MiB/s)
real    3m14.91s

Total bytes written: 2086440960 (2.0GiB, 17GiB/s)
real    0m0.57s


However, if I do an `ls -sl` on the directory and multiply the first column by 512, the result does not quite match the length-in-bytes column. The two are the same order of magnitude, but slightly different. I'm not sure what causes that, but I don't think the tif files are really sparse in the usual sense of the word. Any conceivable gain in space would be minimal, and the cost in time is ridiculous.
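For reference, the comparison I did amounts to something like this rough one-liner (it assumes `ls -s` reports 512-byte blocks, as Sun's ls does, and that filenames contain no spaces):

```shell
# Compare allocated bytes (blocks * 512) against the listed size for
# every file in the current directory.  Assumes 512-byte block units
# and filenames without embedded spaces.
ls -sl | awk 'NF >= 6 { printf "%s: alloc=%d size=%d diff=%d\n", $NF, $1 * 512, $6, $1 * 512 - $6 }'
```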

Here is an example of one directory:

marlin:/export/herbarium/mellon/Masstypes_Scans_Server/ACANTHACEAE# ls -sl

total 4072318
410608 -rw-rw----   1 ariehtal herbarum 210246048 Dec 10 11:04 AC00312847.tif
402936 -rw-rw----   1 ariehtal herbarum 206423224 Dec  5 16:09 AC00312848.tif
412398 -rw-rw----   1 ariehtal herbarum 211246700 Dec  5 16:16 AC00312849.tif
405493 -rw-rw----   1 ariehtal herbarum 207676904 Dec 12 11:52 AC00312850.tif
408052 -rw-rw----   1 ariehtal herbarum 209052412 Dec  5 15:13 AC00312937.tif
412909 -rw-rw----   1 ariehtal herbarum 211451884 Dec  5 15:35 AC00312939.tif
415468 -rw-rw----   1 ariehtal herbarum 212788668 Dec 12 11:46 AC00312940.tif
390142 -rw-rw----   1 ariehtal herbarum 199753780 Nov 13 11:28 AC00312941-sj0.tif
406004 -rw-rw----   1 ariehtal herbarum 207925584 Dec 10 11:17 AC00312942.tif
408308 -rw-rw----   1 ariehtal herbarum 209102728 Dec 10 11:28 AC00312943.tif

marlin:/export/herbarium/mellon/Masstypes_Scans_Server/ACANTHACEAE#



On 4/12/13 3:41 PM, Nathan Stratton Treadway wrote:
On Fri, Apr 12, 2013 at 12:59:39 -0400, Chris Hoogendyk wrote:
As a followup, in case anyone cares to discuss technicalities and
examples, has anyone run into this before? It seems any site doing
lots of sizable scanned images, or GIS systems with tiff maps, would
have run into it. I don't know how often sparse file treatment is an
important thing. Database files can be sparse, but proper procedure
is to use the database tools (e.g. mysqldump) for backups and not to
just backup the data directory. It's not clear to me exactly what
gnutar is doing with sparse or why it is so inefficient (timewise).
I don't think these tif files are sparse. They are just large. And
gnutar is not just doubling the time as described in
http://www.gnu.org/software/tar/manual/html_node/sparse.html. I was
experiencing on the order of 400 times as much time for the sparse
option compared to when I removed the sparse option.
I have been meaning to reply to your earlier messages but haven't had a
chance to finish the background research I wanted to do first;
meanwhile, a few quick comments and questions:

* When you did your manual test runs with and without --sparse, did the
   estimated sizes shown at the end of the run change any?

* I'll have to go back and see if things were any different with GNU tar
   v1.23, but when I was looking at the latest version's source code,
   it was clear that at least the intention was that using --sparse would
   only change behavior when the input files are sparse -- so I am
   curious to know for sure if your tiff files are actually sparse.

   The check that tar uses to decide this is to see if the inode's block
   count times the block size for the filesystem is less than the inode's
   listed file size.   (That is, does the file have less space allocated
   than its listed size?)

   Here are a few ways I've used in the past to search for sparse files:

   - if you have GNU "ls" installed on this system:
       ls -sl --block-size=1
     and then check to see if the number in the first column is smaller
     than the number in the "file size" column.

   - if you have GNU "stat" installed you can run
       stat -c "%n: alloc: %b * %B  size: %s" *
     and then check to see if the %b times %B value is less than the %s
     value.

   - using the standard Sun "ls", you can do
       ls -sl
     and then multiply the value in the first column by 512.  (I assume
     the "block size" used is a constant 512 in that case, regardless of
     file system.)
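   That inode check can also be put into a tiny helper, something like
   this sketch (it assumes GNU stat; `is_sparse` is just an
   illustrative name, not anything tar itself provides):

```shell
# is_sparse FILE: report FILE if its allocated bytes are less than its
# listed size -- the same comparison GNU tar makes before treating a
# file as sparse.  (Illustrative helper; assumes GNU stat.)
is_sparse() {
  alloc=$(( $(stat -c '%b' "$1") * $(stat -c '%B' "$1") ))
  size=$(stat -c '%s' "$1")
  if [ "$alloc" -lt "$size" ]; then
    echo "$1: alloc $alloc < size $size"
  fi
}
```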


* The doubling of the time mentioned in the man page is in the context
   of making an actual dump, but the slowdown is much worse for the
   estimate phase.  That's because normally when tar notices that the
   output file is /dev/null, it realizes that you don't actually want the
   data from the input files, and thus doesn't actually read through
   their contents, but simply looks at the file size (from the inode
   information) and adds that to the dump-size tally before moving on to
   the next file.  So the time spent during the estimate is almost
   entirely due to reading through the directory tree, and won't depend
   on the size of the files in question.

   In the case of a file that's actually sparse, though, if the --sparse
   option is enabled then tar has to actually read in the entire file to
   see how much of it is zero blocks.  So, if many of your files are
   indeed sparse, then the estimate time will be about the same as the
   actual dump time, rather than the usual much-shorter estimate time.
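   One way to see this directly is to time an estimate-style run
   (archiving to /dev/null, the way the Amanda estimate does) with and
   without --sparse -- a rough sketch, here against a throwaway
   directory holding one large sparse file:

```shell
# Sketch: time the estimate phase with and without --sparse.
# Writing to /dev/null lets tar skip reading file contents -- unless
# --sparse forces it to scan each (apparently sparse) file for holes.
demo=$(mktemp -d)
truncate -s 100M "$demo/sparse.dat"   # 100MB of holes, ~0 bytes allocated
time tar -cf /dev/null --sparse "$demo"
time tar -cf /dev/null "$demo"
rm -rf "$demo"
```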

                                                Nathan





----------------------------------------------------------------------------
Nathan Stratton Treadway  -  [email protected]  -  Mid-Atlantic region
Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
  GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
  Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239


--
---------------

Chris Hoogendyk

-
   O__  ---- Systems Administrator
  c/ /'_ --- Biology & Geology Departments
 (*) \(*) -- 140 Morrill Science Center
~~~~~~~~~~ - University of Massachusetts, Amherst

<[email protected]>

---------------

Erdös 4
