Re: all estimate timed out
Just to follow up on this. Amanda backups have been running smoothly for a week now. For this one DLE, I set up amgtar and disabled the sparse option. It ran, but took most of Saturday to complete. Then, having a full backup of that, I broke it up into 6 DLEs using excludes and includes, adding one a day back into the disklist. It now has them all and can spread the fulls over the week. Backups for the last couple of days have completed around 4am. (A sketch of the amgtar and split-disklist setup follows at the end of this message.)

As a followup, in case anyone cares to discuss technicalities and examples, has anyone run into this before? It seems any site doing lots of sizable scanned images, or GIS systems with tiff maps, would have run into it. I don't know how often sparse file treatment is an important thing. Database files can be sparse, but proper procedure is to use the database tools (e.g. mysqldump) for backups and not to just back up the data directory.

It's not clear to me exactly what gnutar is doing with sparse or why it is so inefficient (timewise). I don't think these tif files are sparse. They are just large. And gnutar is not just doubling the time as described in http://www.gnu.org/software/tar/manual/html_node/sparse.html. I was experiencing on the order of 400 times as much time with the sparse option compared to when I removed it.

[ Recalling details from earlier messages -- Amanda 3.3.2 with gtar 1.23 (/usr/sfw/bin/gtar) on Solaris 10 on a T5220 (UltraSPARC, 8 core, 32G memory) with multipath SAS interface to a J4500 for storage using zfs raidz with 2TB drives. Nightly backups go out to an AIT5 tape library on an Ultra320 LVD SCSI interface. Backing up on the order of 100 DLEs from 5 machines over GigE on this Amanda server. Problem DLE was on localhost on the J4500. ]

On 4/5/13 3:16 PM, Jean-Louis Martineau wrote:
> On 04/05/2013 12:09 PM, Chris Hoogendyk wrote:
>> OK, folks, it is the --sparse option that Amanda is putting on the gtar. [...]
>> Now what? Can I tell Amanda not to do that? What difference will it make? Is this a bug in gtar?
>
> Use the amgtar application instead of the GNUTAR program; it allows you to disable the sparse option.
>
> tar can't know where the holes are; it must read them. You WANT the sparse option, otherwise your backup will be large because tar fills the holes with 0.
>
> Your best option is to use the calcsize or server estimate.
>
> Jean-Louis

--
Chris Hoogendyk
Systems Administrator
Biology & Geology Departments
140 Morrill Science Center
University of Massachusetts, Amherst
hoogen...@bio.umass.edu
Erdös 4
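For concreteness, a minimal sketch of such a setup. The application-tool and dumptype go in amanda.conf and the split DLEs in the disklist; the names here (app_amgtar, herbarium-amgtar, the second family directory) are made up, and only the gtar path, the SPARSE property, and the ACANTHACEAE directory come from this thread:

   # amanda.conf: drive gtar through amgtar with sparse handling off
   define application-tool app_amgtar {
       plugin   "amgtar"
       property "GNUTAR-PATH" "/usr/sfw/bin/gtar"
       property "SPARSE" "NO"
   }

   define dumptype herbarium-amgtar {
       user-tar                        # assumes the stock user-tar dumptype
       program     "APPLICATION"
       application "app_amgtar"
   }

   # disklist: one of the six DLEs carved out by includes
   marlin /herbarium-1 /export/herbarium {
       herbarium-amgtar
       include "./ACANTHACEAE"
       include append "./ASTERACEAE"
   }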
Re: all estimate timed out
On Fri, Apr 12, 2013 at 12:59:39 -0400, Chris Hoogendyk wrote:
> As a followup, in case anyone cares to discuss technicalities and examples, has anyone run into this before? [...] I don't think these tif files are sparse. They are just large. And gnutar is not just doubling the time as described in http://www.gnu.org/software/tar/manual/html_node/sparse.html. I was experiencing on the order of 400 times as much time for the sparse option compared to when I removed the sparse option.

I have been meaning to reply to your earlier messages but haven't had a chance to finish the background research I wanted to do first; meanwhile, a few quick comments and questions:

* When you did your manual test runs with and without --sparse, did the estimated sizes shown at the end of the run change any?

* I'll have to go back and see if things were any different with GNU tar v1.23, but when I was looking at the latest version's source code, it was clear that at least the intention was that using --sparse would only change behavior when the input files are sparse -- so I am curious to know for sure if your tiff files are actually sparse. The check that tar uses to decide this is to see if the inode's block count times the block size for the filesystem is less than the inode's listed file size. (That is, does the file have less space allocated than its listed size?)

  Here are a few ways I've used in the past to search for sparse files:

  - if you have GNU ls installed on this system: run "ls -sl --block-size=1" and check whether the number in the first column is smaller than the number in the file size column.

  - if you have GNU stat installed, you can run stat -c "%n: alloc: %b * %B size: %s" and check whether the %b times %B value is less than the %s value.

  - using the standard Sun ls, you can do "ls -sl" and then multiply the value in the first column by 512. (I assume the block size used is a constant 512 in that case, regardless of file system. This check is automated in the sketch after this message.)

* The doubling of the time mentioned in the man page is in the context of making an actual dump, but the slowdown is much worse for the estimate phase. That's because normally, when tar notices that the output file is /dev/null, it realizes that you don't actually want the data from the input files, and thus doesn't read through their contents, but simply looks at the file size (from the inode information) and adds that to the dump-size tally before moving on to the next file. So the time spent during the estimate is almost entirely due to reading through the directory tree, and won't depend on the size of the files in question. In the case of a file that's actually sparse, though, if the --sparse option is enabled, then tar has to read in the entire file to see how much of it is zero blocks. So, if many of your files are indeed sparse, the estimate time will be about the same as the actual dump time, rather than the usual much-shorter estimate time.

							Nathan

Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
Ray Ontko Co.  -  Software consulting services  -  http://www.ontko.com/
GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
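On Solaris, that last check is easy to automate with awk (a sketch; it assumes 512-byte allocation units, the ls -sl column layout shown later in the thread, and filenames without spaces; the directory path is a placeholder):

   # flag any file whose allocated space is smaller than its length
   ls -sl /some/directory |
       awk 'NR > 1 && $1 * 512 < $6 { print $NF ": alloc " ($1 * 512) " < size " $6 }'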
Re: all estimate timed out
Thank you, Nathan. Informative.

The "Total bytes written:" was identical with and without the --sparse option (right down to the last byte ;-) ). It was the time taken to arrive at that estimate that was so very different:

   Total bytes written: 2086440960 (2.0GiB, 11MiB/s)
   real    3m14.91s

   Total bytes written: 2086440960 (2.0GiB, 17GiB/s)
   real    0m0.57s

However, if I do an `ls -sl` on the directory and multiply the first column by 512, that does not quite match the length-in-bytes column. It is the same order of magnitude, but they are slightly different. I'm not sure what causes that, but I don't think the tif files are really sparse in the usual sense of the word. Any imaginable gain in efficiency with regard to space would be minimal, and the cost in time is ridiculous. Here is an example of one directory (a small awk sketch for this check follows the message):

   marlin:/export/herbarium/mellon/Masstypes_Scans_Server/ACANTHACEAE# ls -sl
   total 4072318
   410608 -rw-rw   1 ariehtal herbarum 210246048 Dec 10 11:04 AC00312847.tif
   402936 -rw-rw   1 ariehtal herbarum 206423224 Dec  5 16:09 AC00312848.tif
   412398 -rw-rw   1 ariehtal herbarum 211246700 Dec  5 16:16 AC00312849.tif
   405493 -rw-rw   1 ariehtal herbarum 207676904 Dec 12 11:52 AC00312850.tif
   408052 -rw-rw   1 ariehtal herbarum 209052412 Dec  5 15:13 AC00312937.tif
   412909 -rw-rw   1 ariehtal herbarum 211451884 Dec  5 15:35 AC00312939.tif
   415468 -rw-rw   1 ariehtal herbarum 212788668 Dec 12 11:46 AC00312940.tif
   390142 -rw-rw   1 ariehtal herbarum 199753780 Nov 13 11:28 AC00312941-sj0.tif
   406004 -rw-rw   1 ariehtal herbarum 207925584 Dec 10 11:17 AC00312942.tif
   408308 -rw-rw   1 ariehtal herbarum 209102728 Dec 10 11:28 AC00312943.tif
   marlin:/export/herbarium/mellon/Masstypes_Scans_Server/ACANTHACEAE#

On 4/12/13 3:41 PM, Nathan Stratton Treadway wrote:
> On Fri, Apr 12, 2013 at 12:59:39 -0400, Chris Hoogendyk wrote:
>> As a followup, in case anyone cares to discuss technicalities and examples, has anyone run into this before? [...]
>
> * When you did your manual test runs with and without --sparse, did the estimated sizes shown at the end of the run change any?
>
> * [...] The check that tar uses to decide this is to see if the inode's block count times the block size for the filesystem is less than the inode's listed file size. (That is, does the file have less space allocated than its listed size?) [...]
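To quantify the mismatch over a whole directory rather than eyeballing it, awk can print the per-file difference (a sketch; same column assumptions as the listing above):

   ls -sl | awk 'NR > 1 {
       alloc = $1 * 512
       printf "%-22s alloc %12d size %12d diff %8d\n", $NF, alloc, $6, $6 - alloc
   }'

A positive diff means the file has less space allocated than its length, which is exactly tar's sparseness test.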
Re: all estimate timed out
On Fri, Apr 12, 2013 at 17:09:11 -0400, Chris Hoogendyk wrote:
> The "Total bytes written:" was identical with and without the --sparse option (right down to the last byte ;-) ). It was the time taken to arrive at that estimate that was so very different. [...]
>
> However, if I do an `ls -sl` on the directory and multiply the first column by 512, that does not quite match the length-in-bytes column. [...] Here is an example of one directory:
>
>    marlin:/export/herbarium/mellon/Masstypes_Scans_Server/ACANTHACEAE# ls -sl
>    total 4072318
>    410608 -rw-rw   1 ariehtal herbarum 210246048 Dec 10 11:04 AC00312847.tif
>    402936 -rw-rw   1 ariehtal herbarum 206423224 Dec  5 16:09 AC00312848.tif
>    [...]

Well, unless the length of the file is an exact multiple of the block size, you'll normally find that the figures are slightly different... but the allocated space is always larger for non-sparse files. In your case, though, it's slightly smaller -- which is why you are having this problem:

   410608 * 512 = 210231296, 14752 less than 210246048
   402936 * 512 = 206303232, 119992 less than 206423224
   etc.

However, when tar puts the files into the archive, it has its own blocking factor, and it would seem that the space savings from the sparseness in your files are so small that they're lost within that blocking factor. So yes, you are definitely in a lots-of-pain-and-no-gain situation :(

Do you know how these TIF files are getting written onto your system? You could avoid this problem if you were able to get that process altered so that it didn't create sparse files... If the files are static, you could consider doing a pass through to un-sparsify them somehow. For example, doing a simple cp seems to produce normal files:

   $ uname -a
   SunOS myhost 5.9 Generic_122300-66 sun4u sparc SUNW,Netra-210
   $ which cp
   /usr/bin/cp
   $ mkdir test1
   $ echo hi | dd of=test1/t.t seek=10000
   0+1 records in
   0+1 records out
   $ cp -Rp test1 test2
   $ ls -ls test1 test2
   test1:
   total 48
      48 -rw-r-   1 x474712  other    5120003 Apr 12 18:05 t.t
   test2:
   total 10032
   10032 -rw-r-   1 x474712  other    5120003 Apr 12 18:05 t.t

(Note that the copy of the file found in test2/ is fully allocated.)

However, it sounds like in your particular situation the workaround of using amgtar with --sparse turned off might be good enough (given that it's actually okay for the backup to ignore the fact that the original files are sparse).

							Nathan

Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
Ray Ontko Co.  -  Software consulting services  -  http://www.ontko.com/
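The un-sparsifying pass described above could be a loop along these lines (a sketch only: it assumes no one is writing the files while it runs, that there is room for one extra copy at a time, and that cp on this platform fills in the holes as the demo shows):

   find /export/herbarium -type f -name '*.tif' | while read f; do
       # copy to a temp name, then replace the original with the
       # fully-allocated copy
       cp -p "$f" "$f.unsparse" && mv "$f.unsparse" "$f"
   done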
Re: all estimate timed out
On 04/05/2013 12:09 PM, Chris Hoogendyk wrote:
> OK, folks, it is the --sparse option that Amanda is putting on the gtar. This is /usr/sfw/bin/tar version 1.23 on Solaris 10. I have a test script that runs the runtar and a test directory with just 10 of the tif files in it. Without the --sparse option, time tells me that it takes 0m0.57s to run the script. With the --sparse option, it takes 3m14.91s. Scale that from 10 to 1300 tif files, and I have serious issues.
>
> Now what? Can I tell Amanda not to do that? What difference will it make? Is this a bug in gtar?

Use the amgtar application instead of the GNUTAR program; it allows you to disable the sparse option.

tar can't know where the holes are; it must read them. You WANT the sparse option; otherwise your backup will be large, because tar fills the holes with 0.

Your best option is to use the calcsize or server estimate.

Jean-Louis
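For reference, the estimate method is a dumptype parameter; a minimal sketch (dumptype name hypothetical, inheriting the stock user-tar):

   define dumptype herbarium-fast-est {
       user-tar
       # try calcsize first (the client walks the tree without reading
       # file contents), then fall back to a server-side guess based on
       # dump history
       estimate calcsize server
   }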
Re: all estimate timed out
Chris,

I don't know what tif files look like internally, or how they compress. Just out of left field... does your zpool have compression enabled? I realize zfs will compress or not on a per-block basis, but I don't know what overhead, if any, is being incurred; if the tif files are not compressed, then there should be no additional overhead to decompress them on read. I would also probably hesitate to enable compression on a zfs file system used as an Amanda work area, since you are storing data that has already been zip'd. Though this also has no impact on the estimate phase.

Our site has tended to gzip --fast rather than --best, and on a few of our Amanda servers we have moved to pigz. Again, potential amdump issues but not amcheck issues.

Sanity check: the zpool itself is healthy? The drives are all of the same architecture and spindle speeds? (A couple of commands for those checks are sketched after this message.)

good luck,

	Brian

On Fri, Apr 05, 2013 at 11:09:16AM -0400, Chris Hoogendyk wrote:
> Thank you! Not sure why the debug file would list runtar in the form of a parameter, when it's not to be used as such. Anyway, that got it working. Which brings me back to my original problem. [...]
>
> So, why would these tif files only be going by at 10MB/s into /dev/null? No compression involved. My (real) tapes run much faster than that. [...] Any ideas how to speed that up? I think I may start out by breaking them down into sub-DLEs. There are 129 directories corresponding to taxonomic families.
>
> On 4/4/13 8:05 PM, Jean-Louis Martineau wrote:
>> [...]

---------------------
Brian R Cuttler                 brian.cutt...@wadsworth.org
Computer Systems Support        (v) 518 486-1697
Wadsworth Center                (f) 518 473-6384
NYS Department of Health        Help Desk 518 473-0773
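A couple of stock Solaris/ZFS commands that answer those sanity checks (pool and dataset names taken from earlier in the thread):

   zpool status -x J4500-pool1        # -x reports only unhealthy state
   zfs get compression,compressratio J4500-pool1/herbarium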
Re: all estimate timed out
OK, folks, it is the --sparse option that Amanda is putting on the gtar. This is /usr/sfw/bin/tar version 1.23 on Solaris 10.

I have a test script that runs the runtar and a test directory with just 10 of the tif files in it. Without the --sparse option, time tells me that it takes 0m0.57s to run the script. With the --sparse option, time tells me that it takes 3m14.91s to run the script. Scale that from 10 to 1300 tif files, and I have serious issues. (A sketch of this kind of timing test follows the message.)

Now what? Can I tell Amanda not to do that? What difference will it make? Is this a bug in gtar?

On 4/5/13 11:09 AM, Chris Hoogendyk wrote:
> Thank you! Not sure why the debug file would list runtar in the form of a parameter, when it's not to be used as such. Anyway, that got it working. Which brings me back to my original problem. [...]
>
> On 4/4/13 8:05 PM, Jean-Louis Martineau wrote:
>> [...]

--
Chris Hoogendyk
hoogen...@bio.umass.edu
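A sketch of the kind of test script being described, assuming a hypothetical test directory holding the 10 tif files:

   #!/bin/ksh
   # compare a bare gtar estimate pass with and without --sparse
   DIR=/export/herbarium/testdir        # hypothetical copy of 10 tif files

   for opt in "" "--sparse"; do
       echo "gtar $opt"
       time /usr/sfw/bin/gtar --create --file /dev/null \
           --directory "$DIR" --one-file-system --totals $opt .
   done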
Re: all estimate timed out
Thank you! Not sure why the debug file would list runtar in the form of a parameter, when it's not to be used as such. Anyway, that got it working. Which brings me back to my original problem.

As indicated previously, the filesystem in question only has 2806 files and 140 directories. As I watch the runtar in verbose mode, when it hits the tif files, it is taking 20 seconds on each tif file. The tif files are scans of herbarium type specimens and are pretty uniformly 200MB each. If I do a find on all the tif files, piped to `wc -l`, there are 1300 of them. Times 20 seconds each gives me the 26000 seconds that shows up in the sendsize debug file for this filesystem.

So, why would these tif files only be going by at 10MB/s into /dev/null? No compression involved. My (real) tapes run much faster than that. I also pointed out that I have more than a dozen other filesystems on the same zpool that are giving me no trouble (five 2TB drives in a raidz1 on a J4500 with multipath SAS).

Any ideas how to speed that up? I think I may start out by breaking them down into sub-DLEs. There are 129 directories corresponding to taxonomic families.

On 4/4/13 8:05 PM, Jean-Louis Martineau wrote:
> On 04/04/2013 02:48 PM, Chris Hoogendyk wrote:
>> I may just quietly go nuts. I'm trying to run the command directly. [...]
>
> remove the 'runtar' argument
> [...]

--
Chris Hoogendyk
hoogen...@bio.umass.edu
Re: all estimate timed out
Chris,

Sorry for the email trouble; this is a new phenomenon and I don't know what is causing it. If you can identify the bad header, please let me know. We updated our mailhost a few months ago, but my MUA (mutt) has not changed, nor has my editor (emacs).

My large directories are exceptions, even here, and I am educating the users to do things differently. However, I do have lots of files on zfs in general...

I don't believe that gzip is used in the estimate phase; I think the estimate produces a raw dump size for dump scheduling, and tape allocation is left for later in the process. If gzip were involved, you should see it in # ps or top (or prstat). You could always start a dump after disabling estimates and see if that phase runs any better. Since you can tell whether the estimate phase has finished by checking # amstatus, you can always abort the dump if you don't want a non-compressed backup. (A command sketch follows this message.) (Jean-Louis will know off-hand.)

How does the dump phase perform?

On Wed, Apr 03, 2013 at 05:42:12PM -0400, Chris Hoogendyk wrote:
> For some reason, the headers in the particular message from the list (from Brian) are causing my mail client or something to completely strip the message so that it is blank when I reply. [...]
>
> So, Brian, this is the puzzle. Your file systems have a reason for being difficult. They have several hundred thousand files PER directory. The filesystem that is causing me trouble, as I indicated, only has 2806 total files and 140 total directories. That's basically nothing.
>
> So, is this gzip choking on tif files? Is gzip even involved when sending estimates? If I remove compression, will it fix this? I could break it up into multiple DLEs, but Amanda will still need estimates of all the pieces. Or is it something entirely different? And, if so, how should I go about looking for it?
>
> On 4/3/13 1:14 PM, Brian Cuttler wrote:
>> [...]
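Concretely, with the config name from this thread, that would look something like the following (the amcleanup -k flag is from memory, so check the man page before relying on it):

   amdump daily &       # start the run, estimates disabled in the dumptype
   amstatus daily       # watch which phase each DLE is in
   amcleanup -k daily   # abort the running amdump if you change your mind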
Re: all estimate timed out
Still getting blank emails on a test reply (just to myself) to Brian's emails. So, I'm replying to my own email to the list and then pasting in the reply to Brian. It's clearly a weirdness in the headers coming from Brian, but it could also be some misbehavior in response to those by my mail client -- Thunderbird 17.0.5.

I changed the dump type to not use compression (a one-line change; see the sketch after this message). If tif files are not going to compress anyway, then I might as well not even ask Amanda to try. However, it never gets to the dump, because it gets "all estimate timed out".

I will try breaking it into multiple DLEs and also changing it to server estimate. But, until I know what is really causing the problem, I'm not optimistic about the possibility of a successful dump. As I said, everything else runs without trouble, including DLEs that are different zfs filesystems on the same zpool.

On 4/4/13 9:39 AM, Brian Cuttler wrote:
> Chris, sorry for the email trouble; this is a new phenomenon and I don't know what is causing it. [...]
>
> I don't believe that gzip is used in the estimate phase [...]
>
> How does the dump phase perform?
>
> On Wed, Apr 03, 2013 at 05:42:12PM -0400, Chris Hoogendyk wrote:
>> [...]
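Turning compression off is a one-line dumptype change; a minimal sketch with a hypothetical name:

   define dumptype herbarium-nocomp {
       user-tar
       compress none      # tif scans won't compress usefully anyway
   }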
Re: all estimate timed out
Reply using thunderbird rather than mutt.

Any way to vet the zfs file system? Make sure it's sane and doesn't contain some kind of bad link causing a loop?

If you were to run the command used by estimate, which I believe is shown in the debug file, can you run that successfully on the command line? If you run it verbose, can you see where it hangs or where it slows down? (A sketch of a timestamped verbose run follows this message.)

On 4/4/2013 12:34 PM, Chris Hoogendyk wrote:
> Still getting blank emails on a test reply (just to myself) to Brian's emails. [...]
>
> I changed the dump type to not use compression. [...] As I said, everything else runs without trouble, including DLEs that are different zfs filesystems on the same zpool.
>
> On 4/4/13 9:39 AM, Brian Cuttler wrote:
>> [...]
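One way to see where it slows down is to timestamp the verbose file list as it goes by (a sketch; run it as a user that can read the whole tree):

   cd /export/herbarium
   /usr/sfw/bin/gtar --create --file /dev/null --one-file-system \
       --sparse --totals --verbose . 2>&1 |
   while read f; do
       echo "`date '+%H:%M:%S'` $f"    # slow files show a visible gap
   done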
Re: all estimate timed out
I may just quietly go nuts.

I'm trying to run the command directly. In the debug file, one example is:

   Mon Apr  1 08:05:49 2013: thd-32a58: sendsize: Spawning /usr/local/libexec/amanda/runtar runtar daily /usr/local/etc/amanda/tools/gtar --create --file /dev/null --numeric-owner --directory /export/herbarium --one-file-system --listed-incremental /usr/local/var/amanda/gnutar-lists/localhost_export_herbarium_1.new --sparse --ignore-failed-read --totals . in pipeline

So, I created a script working off that and adding verbose:

   #!/bin/ksh

   OPTIONS="--create --file /dev/null --numeric-owner --directory /export/herbarium --one-file-system --listed-incremental"
   OPTIONS="${OPTIONS} /usr/local/var/amanda/gnutar-lists/localhost_export_herbarium_1.new --sparse --ignore-failed-read --totals --verbose ."

   COMMAND="/usr/local/libexec/amanda/runtar runtar daily /usr/local/etc/amanda/tools/gtar ${OPTIONS}"
   #COMMAND="/usr/sfw/bin/gtar ${OPTIONS}"

   exec ${COMMAND}

If I run that as user amanda, I get:

   runtar: Can only be used to create tar archives

If I exchange the two commands so that I'm using gtar directly rather than runtar, then I get:

   /usr/sfw/bin/gtar: Cowardly refusing to create an empty archive
   Try `/usr/sfw/bin/gtar --help' or `/usr/sfw/bin/gtar --usage' for more information.

On 4/4/13 1:22 PM, Brian Cuttler wrote:
> Reply using thunderbird rather than mutt.
>
> Any way to vet the zfs file system? Make sure it's sane and doesn't contain some kind of bad link causing a loop?
>
> If you were to run the command used by estimate, which I believe is shown in the debug file, can you run that successfully on the command line? If you run it verbose, can you see where it hangs or where it slows down?
>
> On 4/4/2013 12:34 PM, Chris Hoogendyk wrote:
>> [...]
Re: all estimate timed out
On 04/04/2013 02:48 PM, Chris Hoogendyk wrote:
> I may just quietly go nuts. I'm trying to run the command directly. [...] So, I created a script working off that and adding verbose:
>
>    #!/bin/ksh
>
>    OPTIONS="--create --file /dev/null --numeric-owner --directory /export/herbarium --one-file-system --listed-incremental"
>    OPTIONS="${OPTIONS} /usr/local/var/amanda/gnutar-lists/localhost_export_herbarium_1.new --sparse --ignore-failed-read --totals --verbose ."
>
>    COMMAND="/usr/local/libexec/amanda/runtar runtar daily /usr/local/etc/amanda/tools/gtar ${OPTIONS}"

remove the 'runtar' argument

>    #COMMAND="/usr/sfw/bin/gtar ${OPTIONS}"
>
>    exec ${COMMAND}
>
> If I run that as user amanda, I get:
>
>    runtar: Can only be used to create tar archives
>
> If I exchange the two commands so that I'm using gtar directly rather than runtar, then I get:
>
>    /usr/sfw/bin/gtar: Cowardly refusing to create an empty archive
>    Try `/usr/sfw/bin/gtar --help' or `/usr/sfw/bin/gtar --usage' for more information.
Re: all estimate timed out
On Thu, Apr 04, 2013 at 17:48:46 -0400, Chris Hoogendyk wrote:
> If I exchange the two commands so that I'm using gtar directly rather than runtar, then I get:
>
>    /usr/sfw/bin/gtar: Cowardly refusing to create an empty archive
>    Try `/usr/sfw/bin/gtar --help' or `/usr/sfw/bin/gtar --usage' for more information.

I can't see why this is happening offhand, but generally that means that either the trailing "." is missing from the command that was actually executed, or that the argument is getting eaten by some other option. You might try printing out ${COMMAND} immediately before running it, just to make sure nothing obvious is missing that way. (Also, any particular reason you are using exec here? I don't know why it would be eating the "." under ksh, but you might try without that and see if the problem goes away.)

Worst case, try adding the name of a file found in your /export/herbarium directory after the "." and see if that at least allows gtar to run.

							Nathan

Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
Ray Ontko Co.  -  Software consulting services  -  http://www.ontko.com/
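In the ksh script above, that debugging step is just (a sketch):

   print "running: ${COMMAND}"    # show exactly what will be executed
   ${COMMAND}                     # plain invocation, without exec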
Re: all estimate timed out
On Thu, Apr 04, 2013 at 17:48:46 -0400, Chris Hoogendyk wrote:
> So, I created a script working off that and adding verbose:
>
>    #!/bin/ksh
>
>    OPTIONS="--create --file /dev/null --numeric-owner --directory /export/herbarium --one-file-system --listed-incremental"
>    OPTIONS="${OPTIONS} /usr/local/var/amanda/gnutar-lists/localhost_export_herbarium_1.new --sparse --ignore-failed-read --totals --verbose ."
>
>    COMMAND="/usr/local/libexec/amanda/runtar runtar daily /usr/local/etc/amanda/tools/gtar ${OPTIONS}"
>    #COMMAND="/usr/sfw/bin/gtar ${OPTIONS}"
>
>    exec ${COMMAND}
>
> If I run that as user amanda, I get:
>
>    runtar: Can only be used to create tar archives

(Personally, I'd do my initial investigation using gtar directly, but I see that runtar prints that error message when it finds that argv[3] isn't "--create", and also that it expects argv[1] to be the config name. So I think it would work if you just left out the standalone "runtar" from the command:

   COMMAND="/usr/local/libexec/amanda/runtar daily /usr/local/etc/amanda/tools/gtar ${OPTIONS}"

)

							Nathan

Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
Ray Ontko Co.  -  Software consulting services  -  http://www.ontko.com/
Re: all estimate timed out
Hi Chris,

Am 03.04.2013 17:26, schrieb Chris Hoogendyk:
> This seems like an obvious "read the FAQ" situation, but . . .
>
> I'm running Amanda 3.3.2 on a Sun T5220 with Solaris 10 and a J4500 jbod disk array with multipath SAS. It all should be fast and is on the local server, so there isn't any network path outside localhost for the DLEs that are giving me trouble. They are zfs on raidz1 with five 2TB drives. Gnutar is v1.23. This server is successfully backing up several other servers as well as many more DLEs on the localhost. Output to an AIT5 tape library.
>
> I've upped the etimeout to 1800 and the dtimeout to 3600, which both seem outrageously long (jumped from the default 5 minutes to 30 minutes, and from the default 30 minutes to an hour).
>
> The filesystem (DLE) that is giving me trouble (hasn't backed up in a couple of weeks) is /export/herbarium, which looks like:
>
>    marlin:/export/herbarium# df -k .
>    Filesystem             kbytes      used       avail       capacity  Mounted on
>    J4500-pool1/herbarium  2040109465  262907572  1777201893  13%       /export/herbarium
>    marlin:/export/herbarium# find . -type f | wc -l
>        2806
>    marlin:/export/herbarium# find . -type d | wc -l
>        140
>    marlin:/export/herbarium#
>
> So, it is only 262G and only has 2806 files. Shouldn't be that big a deal. They are typically tif scans. One thought that hits me is: possibly, because it is over 200G of tif scans, compression is causing trouble? But this is just getting estimates, output going to /dev/null.
>
> Here is a segment from the very end of the sendsize debug file from April 1 (the debug file ends after these lines):
>
>    Mon Apr  1 08:05:49 2013: thd-32a58: sendsize: .
>    Mon Apr  1 08:05:49 2013: thd-32a58: sendsize: estimate time for /export/herbarium level 0: 26302.500

Nice, it took 7 hours, 18 minutes, and 22 seconds to get the level-0 estimate.

>    Mon Apr  1 08:05:49 2013: thd-32a58: sendsize: estimate size for /export/herbarium level 0: 262993150 KB
>    Mon Apr  1 08:05:49 2013: thd-32a58: sendsize: waiting for runtar /export/herbarium child
>    Mon Apr  1 08:05:49 2013: thd-32a58: sendsize: after runtar /export/herbarium wait
>    Mon Apr  1 08:05:49 2013: thd-32a58: sendsize: getting size via gnutar for /export/herbarium level 1
>    Mon Apr  1 08:05:49 2013: thd-32a58: sendsize: Spawning /usr/local/libexec/amanda/runtar runtar daily /usr/local/etc/amanda/tools/gtar --create --file /dev/null --numeric-owner --directory /export/herbarium --one-file-system --listed-incremental /usr/local/var/amanda/gnutar-lists/localhost_export_herbarium_1.new --sparse --ignore-failed-read --totals . in pipeline
>    Mon Apr  1 10:16:17 2013: thd-32a58: sendsize: Total bytes written: 77663795200 (73GiB, 9.5MiB/s)
>    Mon Apr  1 10:16:17 2013: thd-32a58: sendsize: .
>    Mon Apr  1 10:16:17 2013: thd-32a58: sendsize: estimate time for /export/herbarium level 1: 7827.571

and additionally it took 2 hours 11 minutes to get the level-1 estimate.

>    Mon Apr  1 10:16:17 2013: thd-32a58: sendsize: estimate size for /export/herbarium level 1: 75843550 KB
>    Mon Apr  1 10:16:17 2013: thd-32a58: sendsize: waiting for runtar /export/herbarium child
>    Mon Apr  1 10:16:17 2013: thd-32a58: sendsize: after runtar /export/herbarium wait
>    Mon Apr  1 10:16:17 2013: thd-32a58: sendsize: done with amname /export/herbarium dirname /export/herbarium spindle 45002

In sum, it took about nine and a half hours to get the estimates, so your etimeout of 30 minutes is a little bit low for this machine, isn't it?

You should consider using another method of getting estimates for that machine, or you should find out what makes the estimates on that machine so slow, as the backup itself will likely take longer than the estimates.

Christoph
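In amanda.conf terms: etimeout is the per-DLE allowance on a client, so covering a nine-plus-hour estimate by timeout alone would mean something like the (impractical) value below; switching this DLE's dumptype to a server or calcsize estimate avoids the long tree walk entirely.

   etimeout 36000    # hypothetical: 10 hours per DLE, shown only to
                     # illustrate the scale of the problem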
Re: all estimate timed out
On 4/3/13 12:15 PM, C.Scheeder wrote:
> Hi Chris,
>
> Am 03.04.2013 17:26, schrieb Chris Hoogendyk:
>> [...]
>
> Nice, it took 7 hours, 18 minutes, and 22 seconds to get the level-0 estimate. [...] and additionally it took 2 hours 11 minutes to get the level-1 estimate.
>
> In sum, it took about nine and a half hours to get the estimates, so your etimeout of 30 minutes is a little bit low for this machine, isn't it?
>
> You should consider using another method of getting estimates for that machine, or you should find out what makes the estimates on that machine so slow, as the backup itself will likely take longer than the estimates.

I should just note that when you say "that machine", it is really "that DLE". There are many other DLEs on that machine, on that disk array, and even on that same zpool, that return estimates and that successfully back up.

--
Chris Hoogendyk
Systems Administrator
Biology & Geology Departments
140 Morrill Science Center
University of Massachusetts, Amherst
hoogen...@bio.umass.edu
Re: all estimate timed out
Chris,

For larger file systems I've moved to server estimate: less accurate, but it takes the entire estimate phase out of the equation.

We have had a lot of success with pigz rather than regular gzip, as it'll take advantage of the multiple CPUs and give parallelization during compression, which is often our bottleneck during actual dumping. In one system I cut DLE dump time from 13 to 8 hours, a huge savings (I think those were the numbers; I can look them up...). (A sketch of the pigz dumptype change follows this message.)

ZFS will allow unlimited capacity, and enough files per directory to choke access; we have backups that run very badly here, with literally several hundred thousand files PER directory, and multiple such directories. For backups themselves, I do use snapshots where I can on my ZFS file systems.

On Wed, Apr 03, 2013 at 11:26:01AM -0400, Chris Hoogendyk wrote:
> This seems like an obvious "read the FAQ" situation, but . . .
>
> I'm running Amanda 3.3.2 on a Sun T5220 with Solaris 10 and a J4500 jbod disk array with multipath SAS. [...] This server is successfully backing up several other servers as well as many more DLEs on the localhost. Output to an AIT5 tape library.
>
> I've upped the etimeout to 1800 and the dtimeout to 3600, which both seem outrageously long. [...]
>
> So, it is only 262G and only has 2806 files. Shouldn't be that big a deal. They are typically tif scans. One thought that hits me is: possibly, because it is over 200G of tif scans, compression is causing trouble? But this is just getting estimates, output going to /dev/null.
>
> Here is a segment from the very end of the sendsize debug file from April 1 (the debug file ends after these lines):
> [...]

---------------------
Brian R Cuttler                 brian.cutt...@wadsworth.org
Computer Systems Support        (v) 518 486-1697
Wadsworth Center                (f) 518 473-6384
NYS Department of Health        Help Desk 518 473-0773
Re: all estimate timed out
For some reason, the headers in the particular message from the list (from Brian) are causing my mail client or something to completely strip the message so that it is blank when I reply. That is, I compose a message, it looks good, and I send it. But then I get a blank bcc, Brian gets a blank message, and the list gets a blank message. Weird. So I'm replying to Christoph Scheeder's message and pasting in the contents for replying to Brian. That will put the list thread somewhat out of order, but better than completely disconnecting from the thread. Here goes (for the third time):

---

So, Brian, this is the puzzle. Your file systems have a reason for being difficult: they have several hundred thousand files PER directory. The filesystem that is causing me trouble, as I indicated, only has 2806 total files and 140 total directories. That's basically nothing. So, is this gzip choking on tif files? Is gzip even involved when sending estimates? If I remove compression, will it fix this? I could break it up into multiple DLE's, but Amanda will still need estimates of all the pieces. Or is it something entirely different? And, if so, how should I go about looking for it?
Re: all estimate timed out
John Heim wrote:

marvin /var lev 0 FAILED [disk /var, all estimate timed out]
marvin /etc lev 0 FAILED [disk /etc, all estimate timed out]
marvin /backup/ulam/current/mail lev 0 FAILED [disk /backup/ulam/current/mail, all estimate timed out]
planner: ERROR Request to marvin failed: timeout waiting for REP

Have you checked the amanda.conf etimeout parameter? Maybe try increasing it.
-- Marc Muehlfeld (Head of IT) Zentrum fuer Humangenetik und Laboratoriumsmedizin Dr. Klein und Dr. Rost Lochhamer Str. 29 - D-82152 Martinsried Phone: +49(0)89/895578-0 - Fax: +49(0)89/895578-78 http://www.medizinische-genetik.de