I'm in the process of upgrading from 2.5.0p2 to 2.6.1p2 and from an SAIT drive to an LTO-4 drive on the backup server (CentOS 5.4, 64-bit). This is the second time I've seen amcheck knock my tape drive offline. Here's the scenario that led to the problem:

I successfully ran an amdump overnight onto the new drive using tape CISDAILY-100 which is the tape in slot 1 of our changer. This morning, with CISDAILY-100 still in the drive, I ran an amcheck, which I would have expected to cause the next tape to load, which is the tape in slot 2, CISDAILY-101.

amcheck immediately reported "Tape with label CISDAILY-100 is still active and cannot be overwritten." and then hung for 5-10 minutes. I eventually ran an mtx status, and I confirmed that the changer successfully swapped out CISDAILY-100 for CISDAILY-101 into the tape drive, but amcheck doesn't seem to have noticed.

After 5-10 minutes of hanging, amcheck starts to report errors like:

slot 2:Error reading label: Can't open tape device /dev/nst1: No such device or address. changer error: slot <error>: Malformed output from changer script -- too many lines changer error: slot <error>: Malformed output from changer script -- too many lines changer error: slot <error>: Malformed output from changer script -- too many lines
[repeat this set of messages again]

and mt -f /dev/nst1 during the hang reported:

# mt -f /dev/nst1 status
/dev/nst1: Device or resource busy

and then a little later after amcheck reported some of the above errors:

# mt -f /dev/nst1 status
/dev/nst1: No such device or address

dmesg shows a string of errors ending in tape device being offlined:

st 3:0:2:0: Attempting to queue an ABORT message:CDB: 0x8 0x1 0x0 0x0 0x0 0x0
scsi3: At time of recovery, card was not paused
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi3: Dumping Card State at program address 0x1b6 Mode 0x0
Card was paused
INTSTAT[0x0] SELOID[0x2] SELID[0x20] HS_MAILBOX[0x0]
INTCTL[0x80] SEQINTSTAT[0x0] SAVED_MODE[0x11] DFFSTAT[0x23]
SCSISIGI[0x0] SCSIPHASE[0x0] SCSIBUS[0x0] LASTPHASE[0x1]
SCSISEQ0[0x0] SCSISEQ1[0x12] SEQCTL0[0x0] SEQINTCTL[0x6]
SEQ_FLAGS[0x0] SEQ_FLAGS2[0x4] QFREEZE_COUNT[0x8]
KERNEL_QFREEZE_COUNT[0x7] MK_MESSAGE_SCB[0xff00]
MK_MESSAGE_SCSIID[0xff] SSTAT0[0x0] SSTAT1[0x8]
SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0] SIMODE1[0xa4]
LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0] LQOSTAT0[0x0]
LQOSTAT1[0x0] LQOSTAT2[0xe1]

SCB Count = 4 CMDS_PENDING = 1 LASTSCB 0xffff CURRSCB 0x3 NEXTSCB 0xffc0
qinstart = 53083 qinfifonext = 53083
QINFIFO:
WAITING_TID_QUEUES:
Pending list:
  3 FIFO_USE[0x0] SCB_CONTROL[0x68] SCB_SCSIID[0x27]
Total 1
Kernel Free SCB list: 2 1 0
Sequencer Complete DMA-inprog list:
Sequencer Complete list:
Sequencer DMA-Up and Complete list:
Sequencer On QFreeze and Complete list:


scsi3: FIFO0 Active, LONGJMP == 0x82a9, SCB 0x3
SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x88]
SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x2]
SOFFCNT[0x0] MDFFSTAT[0x44] SHADDR = 0x0608, SHCNT = 0xfff9f8
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]

scsi3: FIFO1 Free, LONGJMP == 0x8063, SCB 0x3
SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]
SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
LQIN: 0x8 0x0 0x0 0x3 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
scsi3: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE = 0x52
scsi3: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x1
scsi3: SAVED_SCSIID = 0x0 SAVED_LUN = 0x0
SIMODE0[0xc]
CCSCBCTL[0x4]
scsi3: REG0 == 0x7a60, SINDEX = 0x112, DINDEX = 0x106
scsi3: SCBPTR == 0x3, SCB_NEXT == 0xffc0, SCB_NEXT2 == 0xff6d
CDB 8 1 0 0 2 0
STACK: 0xe3 0x34 0x23 0x0 0x0 0x0 0x0 0x0
<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
st 3:0:2:0: Unable to deliver message
scsi3: Command abort returning 0x2003
st 3:0:2:0: Attempting to queue a TARGET RESET message:CDB: 0x8 0x1 0x0 0x0 0x0 0x0
scsi3: Device reset code sleeping
scsi3: Device reset timer expired (active 2)
scsi3: Device reset returning 0x2003
Recovery SCB completes
st 3:0:2:0: Attempting to queue an ABORT message:CDB: 0x0 0x0 0x0 0x0 0x0 0x0
scsi3: At time of recovery, card was not paused
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi3: Dumping Card State at program address 0x24 Mode 0x22
Card was paused
INTSTAT[0x0] SELOID[0x2] SELID[0x20] HS_MAILBOX[0x0]
INTCTL[0x80] SEQINTSTAT[0x0] SAVED_MODE[0x11] DFFSTAT[0x33]
SCSISIGI[0x0] SCSIPHASE[0x0] SCSIBUS[0x0] LASTPHASE[0x1]
SCSISEQ0[0x0] SCSISEQ1[0x12] SEQCTL0[0x0] SEQINTCTL[0x0]
SEQ_FLAGS[0x0] SEQ_FLAGS2[0x4] QFREEZE_COUNT[0x8]
KERNEL_QFREEZE_COUNT[0x7] MK_MESSAGE_SCB[0xff00]
MK_MESSAGE_SCSIID[0xff] SSTAT0[0x0] SSTAT1[0x0]
SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0] SIMODE1[0xa4]
LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0] LQOSTAT0[0x0]
LQOSTAT1[0x0] LQOSTAT2[0xe1]

SCB Count = 4 CMDS_PENDING = 0 LASTSCB 0xffff CURRSCB 0x3 NEXTSCB 0xffc0
qinstart = 53083 qinfifonext = 53084
QINFIFO: 0x3
WAITING_TID_QUEUES:
Pending list:
  3 FIFO_USE[0x0] SCB_CONTROL[0x68] SCB_SCSIID[0x27]
Total 1
Kernel Free SCB list: 2 1 0
Sequencer Complete DMA-inprog list:
Sequencer Complete list:
Sequencer DMA-Up and Complete list:
Sequencer On QFreeze and Complete list:


scsi3: FIFO0 Free, LONGJMP == 0x80a9, SCB 0x3
SEQIMODE[0x3f] SEQINTSRC[0x40] DFCNTRL[0x0] DFSTATUS[0x89]
SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]

scsi3: FIFO1 Free, LONGJMP == 0x8063, SCB 0x3
SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]
SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
LQIN: 0x8 0x0 0x0 0x3 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
scsi3: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE = 0x52
scsi3: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x1
scsi3: SAVED_SCSIID = 0x0 SAVED_LUN = 0x0
SIMODE0[0xc]
CCSCBCTL[0x4]
scsi3: REG0 == 0x3, SINDEX = 0x112, DINDEX = 0x106
scsi3: SCBPTR == 0xff03, SCB_NEXT == 0xff00, SCB_NEXT2 == 0x0
CDB 3 1 0 0 0 0
STACK: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
scsi3:0:2:0: Cmd aborted from QINFIFO
st 3:0:2:0: scsi: Device offlined - not ready after error recovery
st1: Error 70000 (sugg. bt 0x0, driver bt 0x0, host bt 0x7).

Any thoughts as to where the problem is? Then changer seems to have done the right thing by loading the next tape, but the tape device wasn't happy about something that amcheck asked it to do? Perhaps I have something mis-configured with the way I set up the changer and device in amanda.conf? Here are my entries:

tpchanger "chg-zd-mtx"          # the tape-changer glue script
tapedev "/dev/nst1"     # the new LTO-4 device
changerfile "/usr/local/etc/amanda/CISDAILY/chg-zd-mtx"
changerdev "/dev/changer"

define tapetype SONY-LTO4 {
        comment "Sony LTO4"
        length 780000 mbytes
        filemark 0 kbytes
        speed 120000 kps
        blocksize 512
}

I've done several manual mt / mtx tests of the changer and this new drive and have written plenty of data to it using dd and tar and it seems stable. I found I had to do this at server boot:

mt -f /dev/st1 compression 0
mt -f /dev/st1 defcompression 0
mt -f /dev/st1 defblksize 524288
mt -f /dev/st1 setblk 512k

In particular, keeping setblk at 0 led to errors when dd or tar tried to use any blocksize bigger than 512k (but as you see above, I've set amanda's tapetype to use 512k blocks anyhow).

Any ideas what's going wrong?

Thanks,
Fran


Reply via email to