This sounds like a classic case of running out of file descriptors --
either on a per-process basis, or on a system-wide basis (more likely
per-process, as you seem to be able to reproduce it at will with the
same number of disklist entries on that "host").
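
A quick way to see how close a process is to its per-process limit is sketched below. This is only an illustration, not part of Amanda: the helper name fd_usage is made up, and the /proc/self/fd lookup assumes a Linux-style /proc filesystem.

```python
import os
import resource


def fd_usage():
    """Return (open_count, soft_limit, hard_limit) for the current process.

    Hypothetical helper for illustration; not an Amanda function.
    """
    # RLIMIT_NOFILE is the per-process cap on open file descriptors.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Assumption: Linux-style /proc with one entry per open descriptor.
        # The count includes the descriptor used to read this directory.
        open_count = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_count = None  # /proc is not available on this system
    return open_count, soft, hard


if __name__ == "__main__":
    used, soft, hard = fd_usage()
    print(f"open descriptors: {used}, soft limit: {soft}, hard limit: {hard}")
```

If the open count sits at the soft limit when the hang occurs, that would point at the per-process limit rather than the system-wide one.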
It seems to me that Amanda should specifically check whether the
open/socket/whatever system call has failed with errno set to EMFILE
(or, on some brain-damaged systems, EAGAIN). When that happens, Amanda
should wait for some of the existing connections to be taken down
(i.e., closed) and then retry the call.
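
The idea can be sketched roughly as follows. This is not Amanda's code (Amanda is written in C, where the same check would be made on errno after the failing call); the names open_with_backoff, open_fn and wait_for_close are invented for the example, and ENFILE is included on the assumption that the system-wide limit should be handled the same way.

```python
import errno

# errno values treated as "out of descriptors". EMFILE is the per-process
# limit, ENFILE the system-wide one; EAGAIN is included only because some
# systems reportedly return it instead.
FD_EXHAUSTED = {errno.EMFILE, errno.ENFILE, errno.EAGAIN}


def open_with_backoff(open_fn, wait_for_close, max_attempts=5):
    """Call open_fn(); on descriptor exhaustion, wait for a close and retry.

    open_fn        -- hypothetical callable that opens a file or socket
    wait_for_close -- hypothetical callable that blocks until an existing
                      connection has been closed; returns False if there
                      is nothing left to close
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return open_fn()
        except OSError as exc:
            # Any other failure is a real error; pass it straight up.
            if exc.errno not in FD_EXHAUSTED:
                raise
            # Give up on the last attempt or when no connection can be freed.
            if attempt == max_attempts or not wait_for_close():
                raise
```

The caller decides what "waiting for a connection to be taken down" means; the sketch only shows where the errno check and the retry would sit.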
Cheers,
Marty
"Bernhard R. Erdmann" wrote:
>
> Hi,
>
> I'm using Amanda 2.4.2p2 on/for a Linux Box (RH 6.2, 2.2.19, GNU tar
> 1.13.17) to backup home directories on a NetApp Filer mounted with NFS.
>
> Up to and including 171 disklist entries of type root-tar, everything is
> ok. amcheck complains about the home directories not being accessible
> (amanda has uid 37), but runtar gets them running with euid 0 (NFS
> export with no root squashing). It takes about 3 secs for amcheck to
> check these lines.
>
> If I add some more disklist entries of the same type, amcheck hangs for
> a minute (ctimeout 60) and then reports "selfcheck request timed out.
> Host down?"
>
> /tmp/amanda gets three more files: amanda.<datetime>.debug, amcheck...
> and selfcheck...
> With up to 171 entries, selfcheck.<datetime>.debug grows to 28387 Bytes
> containing 171 lines "could not access". Using 172 entries, it stops at
> 16427 Bytes and contains only 100 lines "could not access" (o.k. because
> of NFS permissions). The last line of the disklist is checked first.
> /tmp/amanda/selfcheck... ends with:
> selfcheck: checking disk /home/User/cb
> selfcheck: device /home/User/cb
> selfcheck: could not access /home/User/cb (/home/User/cb): Permission
> denied
> selfcheck: checking disk /home/User/ca
> selfcheck: device /home/User/ca
>
> After adding one or more lines to the disklist file, only the last 100
> lines get checked, then an amandad and a selfcheck process are left
> hanging around:
> $ ps x
> PID TTY STAT TIME COMMAND
> 28833 pts/2 S 0:00 -bash
> 28854 pts/2 S 0:00 emacs -nw disklist
> 29000 pts/1 S 0:00 -bash
> 29149 ? S 0:00 amandad
> 29151 ? S 0:00 /usr/libexec/amanda/selfcheck
> 29182 pts/3 S 0:00 -bash
> 29227 pts/3 S 0:00 less selfcheck.20010511233745.debug
> 29230 pts/1 R 0:00 ps x
>
> Killing selfcheck spawns another selfcheck process and this one's debug
> file stops after having checked the last 100 disklist lines, too.
> $ kill 29151
> $ ps x
> PID TTY STAT TIME COMMAND
> 28833 pts/2 S 0:00 -bash
> 28854 pts/2 S 0:00 emacs -nw disklist
> 29000 pts/1 S 0:00 -bash
> 29182 pts/3 S 0:00 -bash
> 29231 ? S 0:00 amandad
> 29233 ? S 0:00 /usr/libexec/amanda/selfcheck
> 29234 pts/1 R 0:00 ps x
> $ kill 29233
> $ ps x
> PID TTY STAT TIME COMMAND
> 28833 pts/2 S 0:00 -bash
> 28854 pts/2 S 0:00 emacs -nw disklist
> 29000 pts/1 S 0:00 -bash
> 29182 pts/3 S 0:00 -bash
> 29238 ? S 0:00 amandad
> 29240 ? S 0:00 /usr/libexec/amanda/selfcheck
> 29241 pts/1 R 0:00 ps x
> $ kill 29240
> $ ps x
> PID TTY STAT TIME COMMAND
> 28833 pts/2 S 0:00 -bash
> 28854 pts/2 S 0:00 emacs -nw disklist
> 29000 pts/1 S 0:00 -bash
> 29182 pts/3 S 0:00 -bash
> 29244 ? S 0:00 amandad
> 29246 ? D 0:00 /usr/libexec/amanda/selfcheck
> 29247 pts/1 R 0:00 ps x
> $ kill 29246
> $ ps x
> PID TTY STAT TIME COMMAND
> 28833 pts/2 S 0:00 -bash
> 28854 pts/2 S 0:00 emacs -nw disklist
> 29000 pts/1 S 0:00 -bash
> 29182 pts/3 S 0:00 -bash
> 29251 pts/1 R 0:00 ps x
>
> Now it's got killed...
>
> Any ideas?
--
Marty Shannon, RHCE, Independent Computing Consultant