We're having problems getting amcheckdump (Amanda 3.1.1) to run
on Solaris 10 x86.
Amanda itself seems to run fine and tells us it's writing ~600Gb
a night to tape. We've checked this by repeatedly running
"dd if=/dev/rmt/0n bs=32k skip=1 | gtar -tf -" to step through
the tape contents and they look fine.
The problem is that amcheckdump simply "hangs" shortly after
invocation:
-> $ amcheckdump t1 --timestamp 20100731005710
-> You will need the following tape: HBK1_13
-> Press enter when ready
->
-> Validating image merlin:/etc datestamp 20100731005710 level 0 part 1/1 on ->
tape HBK1_13 file #1
-> (nothing else happens after this)
Using "ptree" to dump a process tree for the amcheckdump process
indicates that it has gotten as far as spawing gtar:
14924 /usr/bin/perl /home/amanda/THORBAK1/sbin/amcheckdump t1
14927 /usr/sfw/bin/gtar -tf -
Using "perl -d" to step through the amcheckdump perl code indicates
that the hang occurs at line 615 in a call to the read_to_fd() routine:
613: # send the datastream from the device straight to the application
614: my $queue_fd = Amanda::Device::queue_fd_t->new(fileno($pipeline));
615: my $read_ok = $device->read_to_fd($queue_fd);
616: $check->($device->status() == $DEVICE_STATUS_SUCCESS,
"Error reading device: " . $device->error_or_status());
Using "truss -f -wall -rall <command>" to see what's happening at
the system level shows that amanda successfully:
- opens the tape device
- reads in the tape header (AMANDA: TAPESTART DATE ...)
- starts the amgtar program in a pipe with the "validate" argument
- starts three light weight processes (threads)
- starts alternately reading data (from the tape?) using
thread 2 and writing using thread 3 (to /dev/null due to the
validate option?)
This goes on for a while but then stops abruptly at which point
truss shows the following output:
14868/2: read(7, 0x08810AF0, 32768) = 32768
14868/2: ... lots of reading ...
14868/3: write(8, 0x08820B00, 32768) (sleeping...)
14868/2: lwp_park(0x00000000, 0) (sleeping...)
To us, it appears like the thread #2 read from fd7 is okay, but the
thread #3 write to fd8 goes to sleep for some reason. At this
point thread 2 calls lwp_park (light weight process park?) and
everything stops.
The amcheckdump debug file doesn't seem to contain any
obvious clues:
Sun Aug 1 18:43:40 2010: amcheckdump: pid 14868 ruid 135 euid 135 version
3.1.1: start at Sun Aug 1 18:43:40 2010
Sun Aug 1 18:43:40 2010: amcheckdump: pid 14868 ruid 135 euid 135 version
3.1.1: rename at Sun Aug 1 18:43:40 2010
Sun Aug 1 18:43:44 2010: amcheckdump: Validating image merlin:/etc datestamp
20100731005710 level 0 part 1/1on tape HBK1_13 file #1Sun Aug 1 18:43:44 2010:
amcheckdump: chg-robot: using statefile
'/home/amanda/THORBAK1/var/t1/changer-state'
Sun Aug 1 18:43:44 2010: amcheckdump: find_volume labeled 'HBK1_13'
Sun Aug 1 18:43:44 2010: amcheckdump: invoking
/home/amanda/THORBAK1/MTX/sbin/mtx -f /dev/scsi/changer/c7t4d1 status
Sun Aug 1 18:43:44 2010: amcheckdump: myrobot: updating state
Sun Aug 1 18:43:44 2010: amcheckdump: parse_inventory: load slot 1 with label
'HBK1_13'
Sun Aug 1 18:43:44 2010: amcheckdump: myrobot: too early for another 'status'
invocation
Sun Aug 1 18:43:44 2010: amcheckdump: myrobot: loading slot '1'
Sun Aug 1 18:43:44 2010: amcheckdump: myrobot: requested volume is already in
drive 0
Sun Aug 1 18:43:44 2010: amcheckdump: myrobot: polling 'tape:/dev/rmt/0n' to
see if it's ready
Sun Aug 1 18:43:49 2010: amcheckdump: invoking
/home/amanda/THORBAK1/MTX/sbin/mtx -f /dev/scsi/changer/c7t4d1 status
Sun Aug 1 18:43:49 2010: amcheckdump: myrobot: updating state
Sun Aug 1 18:43:53 2010: amcheckdump: using
'/home/amanda/THORBAK1/libexec/amanda/application/amgtar validate'
I did find a recent post (http://forums.zmanda.com/showthread.php?t=2901)
describing a similar issue that suggested that updating our Perl
version might help. Unfortunately, updating from 5.8.4 to 5.10.1 didn't
make any difference.
If anyone has any ideas we'd appreciate hearing them!
Thanks,
Sean Walmsley / Glenn Cantello