seemingly random pattern of dump failures

2013-05-09 Thread Chris Hoogendyk
On one of my older setups that is running Amanda 2.5.1p3, I'm getting patterns of failures that I 
can't make sense of. Some days everything works just fine. Other days I get everything from a couple 
of DLEs failing to a half dozen. They are typically the same three servers, which are in another 
building. However, there is a server that has the largest DLEs in that same building that does not 
exhibit failures. There are also several servers in the same building as the Amanda backup server 
that don't show any failures. I even made a spreadsheet that shows a 0, 1 or 2 for backup levels of 
successes and a red x for failures. Can't see any pattern.


I've looked at a bunch of things and pored over the log files to no avail.

The errors show up in the Amanda reports as:

  anise.nsm.umass.edu/home lev 1  FAILED [cannot read header: got 0 
instead of 32768]
  metzi1.physics.umass.edu   /data lev 1  FAILED [cannot read header: got 0 
instead of 32768]
  anise.nsm.umass.edu/home lev 1  FAILED [cannot read header: got 0 
instead of 32768]
  anise.nsm.umass.edu/home lev 1  FAILED [too many dumper retry: 
[request failed: timeout waiting for ACK]]
  metzi1.physics.umass.edu   /data lev 1  FAILED [too many dumper retry: 
[request failed: timeout waiting for ACK]]
  metzi1.physics.umass.edu   /data lev 1  FAILED [cannot read header: got 0 
instead of 32768]

The interesting thing is that if I go to metzi1, into /tmp/amanda/client/daily/ and do an `ls 
*0508*` I can see the debug logs from last night. If I do a `grep '/data' *0508*` I can see any 
entries that mention the DLE /data. I see no instances of sendbackup. I see only those runtar debug 
files that correspond to the size estimates for 0, 1 and 2 level backups. There is no runtar that 
would correspond to an actual dump.


If I do the same thing with `grep '/home' *0508*` (the DLE /home was successfully backed up), then I 
see all the runtar debug files for the estimates as well as a runtar debug file for the actual 
backup. I also see several lines in the sendbackup debug file for /home.


I've also looked through /var/log/syslog, /var/log/auth.log, etc. on the client (which is Ubuntu 
12.04 LTS), and I've looked through Amanda debug logs, /var/adm/messages, /var/adm/authlog, etc. on 
the server (which is Solaris 10). I don't see any logged errors for dropped connections or failures 
of any sort. The Amanda logs just don't mention /data on metzi1. A couple of the other servers that 
are being backed up are Ubuntu 12.04 and several are Ubuntu 10.04. None of them have been tweeked 
for sshd_config. All have tcpkeepalive turned on.


I tried bumping up the timeouts in amanda.conf (by a factor of 5). That seems a bit much, and it 
didn't seem to make any difference.


What should I be looking for? Where would Amanda log what is going on? (Or, why would it not be 
logging it?)



Thank you,


--
---

Chris Hoogendyk

-
   O__   Systems Administrator
  c/ /'_ --- Biology  Geology Departments
 (*) \(*) -- 140 Morrill Science Center
~~ - University of Massachusetts, Amherst

hoogen...@bio.umass.edu

---

Erdös 4



Re: seemingly random pattern of dump failures

2013-05-09 Thread Jean-Louis Martineau

Chris,

look in the amandad debug files on the client and the dumper debug file 
on the server.


Jean-Louis

On 05/09/2013 03:06 PM, Chris Hoogendyk wrote:
On one of my older setups that is running Amanda 2.5.1p3, I'm getting 
patterns of failures that I can't make sense of. Some days everything 
works just fine. Other days I get everything from a couple of DLEs 
failing to a half dozen. They are typically the same three servers, 
which are in another building. However, there is a server that has the 
largest DLEs in that same building that does not exhibit failures. 
There are also several servers in the same building as the Amanda 
backup server that don't show any failures. I even made a spreadsheet 
that shows a 0, 1 or 2 for backup levels of successes and a red x for 
failures. Can't see any pattern.


I've looked at a bunch of things and pored over the log files to no 
avail.


The errors show up in the Amanda reports as:

  anise.nsm.umass.edu/home lev 1  FAILED [cannot read 
header: got 0 instead of 32768]
  metzi1.physics.umass.edu   /data lev 1  FAILED [cannot read 
header: got 0 instead of 32768]
  anise.nsm.umass.edu/home lev 1  FAILED [cannot read 
header: got 0 instead of 32768]
  anise.nsm.umass.edu/home lev 1  FAILED [too many dumper 
retry: [request failed: timeout waiting for ACK]]
  metzi1.physics.umass.edu   /data lev 1  FAILED [too many dumper 
retry: [request failed: timeout waiting for ACK]]
  metzi1.physics.umass.edu   /data lev 1  FAILED [cannot read 
header: got 0 instead of 32768]


The interesting thing is that if I go to metzi1, into 
/tmp/amanda/client/daily/ and do an `ls *0508*` I can see the debug 
logs from last night. If I do a `grep '/data' *0508*` I can see any 
entries that mention the DLE /data. I see no instances of sendbackup. 
I see only those runtar debug files that correspond to the size 
estimates for 0, 1 and 2 level backups. There is no runtar that would 
correspond to an actual dump.


If I do the same thing with `grep '/home' *0508*` (the DLE /home was 
successfully backed up), then I see all the runtar debug files for the 
estimates as well as a runtar debug file for the actual backup. I also 
see several lines in the sendbackup debug file for /home.


I've also looked through /var/log/syslog, /var/log/auth.log, etc. on 
the client (which is Ubuntu 12.04 LTS), and I've looked through Amanda 
debug logs, /var/adm/messages, /var/adm/authlog, etc. on the server 
(which is Solaris 10). I don't see any logged errors for dropped 
connections or failures of any sort. The Amanda logs just don't 
mention /data on metzi1. A couple of the other servers that are being 
backed up are Ubuntu 12.04 and several are Ubuntu 10.04. None of them 
have been tweeked for sshd_config. All have tcpkeepalive turned on.


I tried bumping up the timeouts in amanda.conf (by a factor of 5). 
That seems a bit much, and it didn't seem to make any difference.


What should I be looking for? Where would Amanda log what is going on? 
(Or, why would it not be logging it?)



Thank you,






Re: seemingly random pattern of dump failures

2013-05-09 Thread Debra S Baddorf
That's odd -- I'm running v3.3.3 and I just started having the same sort of 
failures.  There is no amandad  debug file.
There is one odd line in the  server/client/chunker  debug file:

 chunker: putresult: 10 FAILED

and a few  0 length files sitting around on my spool disk.

Did something OUTSIDE of amanda change recently?   Like,  an update to   gtar   
or something else that was automatically applied
to us?
Deb

On May 9, 2013, at 2:06 PM, Chris Hoogendyk wrote:

 On one of my older setups that is running Amanda 2.5.1p3, I'm getting 
 patterns of failures that I can't make sense of. Some days everything works 
 just fine. Other days I get everything from a couple of DLEs failing to a 
 half dozen. They are typically the same three servers, which are in another 
 building. However, there is a server that has the largest DLEs in that same 
 building that does not exhibit failures. There are also several servers in 
 the same building as the Amanda backup server that don't show any failures. I 
 even made a spreadsheet that shows a 0, 1 or 2 for backup levels of successes 
 and a red x for failures. Can't see any pattern.
 
 I've looked at a bunch of things and pored over the log files to no avail.
 
 The errors show up in the Amanda reports as:
 
  anise.nsm.umass.edu/home lev 1  FAILED [cannot read header: got 
 0 instead of 32768]
  metzi1.physics.umass.edu   /data lev 1  FAILED [cannot read header: got 
 0 instead of 32768]
  anise.nsm.umass.edu/home lev 1  FAILED [cannot read header: got 
 0 instead of 32768]
  anise.nsm.umass.edu/home lev 1  FAILED [too many dumper retry: 
 [request failed: timeout waiting for ACK]]
  metzi1.physics.umass.edu   /data lev 1  FAILED [too many dumper retry: 
 [request failed: timeout waiting for ACK]]
  metzi1.physics.umass.edu   /data lev 1  FAILED [cannot read header: got 
 0 instead of 32768]
 
 The interesting thing is that if I go to metzi1, into 
 /tmp/amanda/client/daily/ and do an `ls *0508*` I can see the debug logs from 
 last night. If I do a `grep '/data' *0508*` I can see any entries that 
 mention the DLE /data. I see no instances of sendbackup. I see only those 
 runtar debug files that correspond to the size estimates for 0, 1 and 2 level 
 backups. There is no runtar that would correspond to an actual dump.
 
 If I do the same thing with `grep '/home' *0508*` (the DLE /home was 
 successfully backed up), then I see all the runtar debug files for the 
 estimates as well as a runtar debug file for the actual backup. I also see 
 several lines in the sendbackup debug file for /home.
 
 I've also looked through /var/log/syslog, /var/log/auth.log, etc. on the 
 client (which is Ubuntu 12.04 LTS), and I've looked through Amanda debug 
 logs, /var/adm/messages, /var/adm/authlog, etc. on the server (which is 
 Solaris 10). I don't see any logged errors for dropped connections or 
 failures of any sort. The Amanda logs just don't mention /data on metzi1. A 
 couple of the other servers that are being backed up are Ubuntu 12.04 and 
 several are Ubuntu 10.04. None of them have been tweeked for sshd_config. All 
 have tcpkeepalive turned on.
 
 I tried bumping up the timeouts in amanda.conf (by a factor of 5). That seems 
 a bit much, and it didn't seem to make any difference.
 
 What should I be looking for? Where would Amanda log what is going on? (Or, 
 why would it not be logging it?)
 
 
 Thank you,
 
 
 -- 
 ---
 
 Chris Hoogendyk
 
 -
   O__   Systems Administrator
  c/ /'_ --- Biology  Geology Departments
 (*) \(*) -- 140 Morrill Science Center
 ~~ - University of Massachusetts, Amherst
 
 hoogen...@bio.umass.edu
 
 ---
 
 Erdös 4
 




Re: seemingly random pattern of dump failures

2013-05-09 Thread Chris Hoogendyk

So,

On the client, in /tmp/amanda/amandad/, I see the debug files for the estimates and for each of the 
DLEs except for the one that failed. No entry for it from 5/8 or 5/9 (after midnight). Did a more on 
`ls *0508* *0509* | sort` and scanned through that with find.


On the server, in /tmp/amanda/server/daily/, I also don't see any references to the DLE  /data  in 
any of the dumper debug files from 5/8 or 5/9. I do see some sections where the timeout waiting for 
ACK appear, but I don't know what to make of them. They don't seem to offer me much in the way of 
diagnostics or the assurance that they are talking about the /data DLE.


Here is one of those pieces with the successful items before and after:

dumper: Building type 4 (FILE) header of size 32768 using:
dumper: Contents of *(dumpfile_t *)28894:
dumper: type = 4 (FILE)
dumper: datestamp= '20130508224500'
dumper: dumplevel= 1
dumper: compressed   = 1
dumper: encrypted= 0
dumper: comp_suffix  = '.gz'
dumper: encrypt_suffix   = 'N'
dumper: name = 'metzi1.physics.umass.edu'
dumper: disk = '/usr/local'
dumper: program  = '/bin/tar'
dumper: srvcompprog  = ''
dumper: clntcompprog = ''
dumper: srv_encrypt  = ''
dumper: clnt_encrypt = ''
dumper: recover_cmd  = '/bin/gzip -dc |/bin/tar -f - ...'
dumper: uncompress_cmd   = ''
dumper: encrypt_cmd  = ''
dumper: decrypt_cmd  = ''
dumper: srv_decrypt_opt  = ''
dumper: clnt_decrypt_opt = ''
dumper: cont_filename= ''
dumper: is_partial   = 0
dumper: partnum  = 0
dumper: totalparts   = 0
dumper: blocksize= 32768
security_stream_seterr(57cd0, EOF)
security_stream_close(57cd0)
security_stream_seterr(47c50, EOF)
security_stream_close(47c50)
security_stream_seterr(4fc90, EOF)
security_stream_close(4fc90)
dumper: connect_port: Try  port 1025: Available   -
dumper: connected to 127.0.0.1.36800
dumper: our side is 0.0.0.0.1025
dumper: try_socksize: send buffer size is 65536
security_getdriver(name=ssh) returns ff316638
security_handleinit(handle=335a0, driver=ff316638 (SSH))
security_streaminit(stream=3fc10, driver=ff316638 (SSH))
security_seterror(handle=335a0, driver=ff316638 (SSH) error=timeout waiting for 
ACK)
security_close(handle=335a0, driver=ff316638 (SSH))
security_stream_close(3fc10)
dumper: connect_port: Try  port 1025: Available   -
dumper: connected to 127.0.0.1.36806
dumper: our side is 0.0.0.0.1025
dumper: try_socksize: send buffer size is 65536
security_getdriver(name=ssh) returns ff316638
security_handleinit(handle=335a0, driver=ff316638 (SSH))
security_streaminit(stream=3fc10, driver=ff316638 (SSH))
security_seterror(handle=335a0, driver=ff316638 (SSH) error=timeout waiting for 
ACK)
security_close(handle=335a0, driver=ff316638 (SSH))
security_stream_close(3fc10)
dumper: connect_port: Try  port 1025: Available   -
dumper: connected to 127.0.0.1.36810
dumper: our side is 0.0.0.0.1025
dumper: try_socksize: send buffer size is 65536
security_getdriver(name=ssh) returns ff316638
security_handleinit(handle=33398, driver=ff316638 (SSH))
security_streaminit(stream=3fc10, driver=ff316638 (SSH))
security_streaminit(stream=47c50, driver=ff316638 (SSH))
security_streaminit(stream=4fc90, driver=ff316638 (SSH))
security_streaminit(stream=57cd0, driver=ff316638 (SSH))
security_close(handle=33398, driver=ff316638 (SSH))
security_stream_close(3fc10)
dumper: Building type 4 (FILE) header of size 32768 using:
dumper: Contents of *(dumpfile_t *)28894:
dumper: type = 4 (FILE)
dumper: datestamp= '20130508224500'
dumper: dumplevel= 1
dumper: compressed   = 1
dumper: encrypted= 0
dumper: comp_suffix  = '.gz'
dumper: encrypt_suffix   = 'N'
dumper: name = 'rassilon.crc.isb.nsm'
dumper: disk = '/var/radmind'
dumper: program  = '/bin/tar'
dumper: srvcompprog  = ''
dumper: clntcompprog = ''
dumper: srv_encrypt  = ''
dumper: clnt_encrypt = ''
dumper: recover_cmd  = '/bin/gzip -dc |/bin/tar -f - ...'
dumper: uncompress_cmd   = ''
dumper: encrypt_cmd  = ''
dumper: decrypt_cmd  = ''
dumper: srv_decrypt_opt  = ''
dumper: clnt_decrypt_opt = ''
dumper: cont_filename= ''
dumper: is_partial   = 0
dumper: partnum  = 0
dumper: totalparts   = 0
dumper: blocksize= 32768




On 5/9/13 3:24 PM, Jean-Louis Martineau wrote:

Chris,

look in the amandad debug files on the client and the dumper debug file on the 
server.

Jean-Louis

On 05/09/2013 03:06 PM, Chris Hoogendyk wrote:
On one of my older setups that is running Amanda 2.5.1p3, I'm getting patterns of failures that I 
can't make sense of. Some days everything works just fine. Other days I get everything from a 
couple of DLEs