I am having some interesting problems getting amanda 2.5 (or
2.4.2p2) to back up a NetApp using the dump wrapper
(http://groups.yahoo.com/group/amanda-hackers/message/2002). First, amanda
works with non-netapp clients, so I don't think there is anything wrong with
the tape device, holding disk, or client setups. There also does not seem to
be a problem with the dump wrapper itself: if I issue a dump like "dump 0f
- /filer/vol/vol1 > vol1.dump", the rsh wrapper causes the dump to be
executed on the netapp via rsh and everything works. However, when I add an
entry to amanda's disklist for the netapp dump, I get the following error in
the report from amanda:

FAILED AND STRANGE DUMP DETAILS:

/-- sancho     /filer/vol/vol1 lev 0 FAILED [no backup size line]
sendbackup: start [sancho:/filer/vol/vol1 level 0]
sendbackup: info BACKUP=/sbin/dump
sendbackup: info RECOVER_CMD=/sbin/restore -f... -
sendbackup: info end
|   DUMP: creating "/vol/vol1/../snapshot_for_backup.333" snapshot.
|   DUMP: Using Full Volume Dump 
|   DUMP: Date of this level 0 dump: Tue Aug 28 15:58:40 2001.
|   DUMP: Date of last level 0 dump: the epoch.
|   DUMP: Dumping /vol/vol1/ to standard output
|   DUMP: mapping (Pass I)[regular files] 
|   DUMP: mapping (Pass II)[directories]
|   DUMP: estimated 27163254 tape blocks.
|   DUMP: dumping (Pass III) [directories]
|   DUMP: dumping (Pass IV) [regular files]
?   DUMP: Error writing to standard output: Undefined error: 0
|   DUMP: DUMP IS ABORTED
|   DUMP: Deleting "/vol/vol1/../snapshot_for_backup.333" snapshot.
sendbackup: error [no backup size line]
\--------
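(For reference, the wrapper technique works roughly as sketched below. This
is my own minimal reconstruction, not the actual wrapper from the URL above:
the function name, the `filer` hostname, and the `/filer` path convention are
all assumptions. The idea is that amanda's dump invocation gets forwarded to
the netapp over rsh, with the locally-visible path rewritten to the
filer-local one.)

```shell
# Sketch of an rsh dump wrapper (hypothetical names throughout).
# Arguments are passed through unchanged, except that a path under
# the local /filer prefix is rewritten to the filer-local path; the
# dump then runs on the NetApp over rsh, streaming to stdout.
netapp_dump() {
    filer=${FILER:-filer}   # assumed NetApp hostname
    rsh_cmd=${RSH:-rsh}     # overridable, e.g. RSH=echo for a dry run
    args=
    for a in "$@"; do
        case "$a" in
            /filer/*) a=${a#/filer} ;;  # /filer/vol/vol1 -> /vol/vol1
        esac
        args="$args $a"
    done
    $rsh_cmd "$filer" dump $args
}

# usage:  netapp_dump 0f - /filer/vol/vol1 > vol1.dump
```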

        I've watched the creation of the dump file in the holding disk, and
I've watched the debug files for sendsize and sendbackup - all of the output
appears correct, but the dump consistently fails at Pass IV. I made an
attempt to strace the pid for the remote dump, and indeed it stalls,
complaining that it cannot write to stdout, and then eventually fails. It
dies at the same place every time (Pass IV). However, if I turn off
indexing, the backup works fine every time. This led me to explore how
amanda creates the index file using tee and restore. At this point, I
discovered that I could duplicate the failure without amanda:

# rsh filer dump 0f - /vol/vol1 | tee vol1.dump | restore -tvf - > /vol1.index
DUMP: creating "/vol/vol1/../snapshot_for_backup.1744" snapshot.
DUMP: Using Full Volume Dump
DUMP: Date of this level 0 dump: Thu Oct  4 15:37:59 2001.
DUMP: Date of last level 0 dump: the epoch.
DUMP: Dumping /vol/vol1/ to standard output
DUMP: mapping (Pass I)[regular files]
DUMP: mapping (Pass II)[directories]
DUMP: estimated 43903148 KB.
DUMP: dumping (Pass III) [directories]
DUMP: dumping (Pass IV) [regular files]
DUMP: Error writing to standard output: Undefined error: 0
DUMP: DUMP IS ABORTED
DUMP: Deleting "/vol/vol1/../snapshot_for_backup.1744" snapshot.

        As it turns out, one of the volumes on our netapp, vol2, was *not*
failing when I attempted a backup even with indexing. The difference was
that vol2 was essentially empty. I tried copying data from vol1 to vol2
until the backup started failing, but repeated experiments revealed that it
did not matter *what* files I copied so much as *how many* files I copied.
In order to simulate many files, I created 500 directories each with 1000
empty files (named numerically, that is [1-500]/[1-1000]), and indeed the
rsh-pipe-to-restore started failing. If I deleted some files, it would only
fail periodically. If I deleted enough files, it would never fail. If I
created more files, it would always fail. To make it more interesting, the
number of files required to cause failure would change with each run of
tests. Notice also that the failure is observed only when the output of dump
is piped to restore. That is, if I simply redirect the output of dump to a
file, there are no problems.
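For anyone who wants to reproduce the test set, the layout described above
can be recreated with a short shell loop (the `manyfiles` directory name is
my own; the counts are the ones from the experiment):

```shell
# Recreate the test layout: 500 directories, each holding 1000 empty
# files, all named numerically ([1-500]/[1-1000]).
# "manyfiles" is an illustrative location, not from the original setup.
base=manyfiles
mkdir -p "$base"
for d in $(seq 1 500); do
    mkdir -p "$base/$d"
    ( cd "$base/$d" && touch $(seq 1 1000) )
done
```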

        The backup host I had been using up to this point was a Pentium II
333 with 192 MB of RAM running RedHat 7.1. I tried the latest version of
restore, but I got the same results. Since the netapp dump is more similar
to ufsdump, I also tried the same rsh experiment from a Sun Enterprise 450
running Solaris 2.8 (of course the pipe was to ufsrestore). This also
failed. I then tried the same rsh test from a Dell OptiPlex GX110
(PIII-1GHz, 256MB) also running RedHat 7.1 - this time I could not get the
rsh/dump to fail. The two RedHat machines are identical in configuration -
in fact they were both installed from the same media at roughly the same
time. The only difference was the CPU speed/architecture and the quantity of
RAM. However, the rsh test succeeded for a dual Pentium-II 400 also running
RedHat 7.1, so I ruled out CPU architecture. I then tried starting several
background processes on the 1GHz machine to consume RAM and CPU. This time
the rsh test failed, consistently. I then tried backgrounding a few 'while [
"1" ]; do echo > /dev/null; done' loops to keep the CPU busy, and again the
1GHz machine failed the rsh test. In order to rule out network
bandwidth or connectivity problems, I forced the connection of the 1GHz
machine down to 10Mbit, but this did not affect the outcome of the testing.
Consequently, it seems that the pipe to restore is affected solely by
available CPU.
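The load generation can be repeated verbatim; here is a sketch that uses the
same loop body quoted above but also records the PIDs so the loops can be
cleaned up afterwards (the count of four loops is illustrative):

```shell
# Start a few busy loops to starve the CPU while the
# rsh-dump-pipe-restore test runs; the number of loops is illustrative.
pids=
for i in 1 2 3 4; do
    while [ "1" ]; do echo > /dev/null; done &
    pids="$pids $!"
done

# ... run the rsh/dump/restore pipe test here while the loops run ...

kill $pids   # stop the busy loops when the test is done
```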

        The final test I performed was to create 2.5 million empty files on
a Linux server and try the same rsh-dump-pipe-restore test in order to
compare against rsh to the netapp. This time, even the 333MHz machine had no
problems. In fact, even with the system load up to 8, the slower machine
never failed when executing an rsh-dump-pipe-restore of a Linux server
instead of the netapp.

        In summary then, the simple rsh-pipe-restore command fails whenever
the remote volume being backed up is on a netapp device containing many
files and the CPU of the client is sufficiently slow. This is really weird,
and appears to be a netapp problem rather than an amanda problem. However, I
imagine there are other people trying to use amanda to back up netapps, and
thus I am most curious to discover whether anyone else has observed this
problem, or maybe even solved it :)

Thanks,


-poul
