4.11-RC3: SCSI+UFS+softupdates corruption (write cache DISABLED!)

2005-01-19 Thread Matthias Andree
Hi,

I had a FreeBSD 4.11-RC3 machine reboot without warning. The last thing
the network syslogd captured was an attempted ahc0 (Adaptec 2940 UW Pro)
recovery.

Syslog excerpt as captured by the remote machine, with the date, the
hostname, the "/kernel:" prefix and the card state dumps removed (can be
provided if necessary). I wonder whether the SCSI error recovery attempts
caused the reboot. I have no hints either way, but this machine is
otherwise stable.

13:28:35 ahc0: Recovery Initiated
13:28:53 (da0:ahc0:0:0:0): SCB 0x16 - timed out
13:28:53 sg[0] - Addr 0x6da3800 : Length 2048
13:28:53 (da0:ahc0:0:0:0): Other SCB Timeout
13:28:53 ahc0: Timedout SCBs already complete. Interrupts may not be functioning.
13:28:53 ahc0: Recovery Initiated
13:29:02 (da0:ahc0:0:0:0): SCB 0x1b - timed out

13:29:04 (da0:ahc0:0:0:0): BDR message in message buffer
13:29:04 ahc0: Timedout SCBs already complete. Interrupts may not be functioning.
13:29:04 ahc0: Recovery Initiated

13:29:16 Kernel Free SCB list: 9 4 15 20 
13:29:17 sg[7] - Addr 0x3bea000 : Length 4096
13:29:18 ahc0: Issued Channel A Bus Reset. 25 SCBs aborted
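
To put numbers on the excerpt above, here is a minimal Python sketch that
counts the "Recovery Initiated" messages and measures the time from the
first one to the bus reset. It only uses timestamped lines copied from the
excerpt; the function names are made up for illustration, and it assumes
all timestamps fall on the same day.

```python
from datetime import datetime

# Timestamped lines copied from the syslog excerpt above.
LOG = """\
13:28:35 ahc0: Recovery Initiated
13:28:53 (da0:ahc0:0:0:0): SCB 0x16 - timed out
13:28:53 ahc0: Recovery Initiated
13:29:02 (da0:ahc0:0:0:0): SCB 0x1b - timed out
13:29:04 ahc0: Recovery Initiated
13:29:18 ahc0: Issued Channel A Bus Reset. 25 SCBs aborted
"""


def parse(log):
    """Split each line into (time, message); assumes an HH:MM:SS prefix."""
    events = []
    for line in log.splitlines():
        stamp, _, msg = line.partition(" ")
        events.append((datetime.strptime(stamp, "%H:%M:%S"), msg))
    return events


def recovery_summary(log):
    """Return (number of recovery attempts, seconds from first attempt to last bus reset)."""
    events = parse(log)
    recoveries = [t for t, msg in events if msg.endswith("Recovery Initiated")]
    resets = [t for t, msg in events if "Bus Reset" in msg]
    span = None
    if recoveries and resets:
        span = (resets[-1] - recoveries[0]).total_seconds()
    return len(recoveries), span


count, span = recovery_summary(LOG)
print(f"{count} recovery attempts, {span:.0f} s from first attempt to bus reset")
# prints: 3 recovery attempts, 43 s from first attempt to bus reset
```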

When the machine came back up, it stayed in single-user mode because
fsck reported a softupdates inconsistency:

| # fsck -p /usr
| /dev/da0s1g: DIRECTORY CORRUPTED  I=175105  OWNER=root MODE=40755
| /dev/da0s1g: SIZE=512 MTIME=Jan 18 15:14 2005 
| /dev/da0s1g: DIR=?
| 
| /dev/da0s1g: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY.

I have not yet run fsck for interactive repair, because I want to know
what is going on here and allow debugging this.

At the time of the crash, these tasks were running:

1. amanda was running a dump(8)

2. I was installing manpages from /usr/src/share/man/man4

3. a cvsup for the ports tree was running (this is likely related to the
   problem)

| # fsdb -r /dev/da0s1g
| fsdb (inum: 2) inode 175105
| current inode: directory
| I=175105 MODE=40755 SIZE=512
| MTIME=Jan 18 15:14:48 2005 [0 nsec]
| CTIME=Jan 18 15:14:48 2005 [0 nsec]
| ATIME=Jun 19 03:05:43 2003 [0 nsec]
| OWNER=root GRP=wheel LINKCNT=2 FLAGS=0 BLKCNT=4 GEN=4e5151f9
| fsdb (inum: 175105) cd ..
| component `..': fsdb: name `..' not found in current inode directory

I checked with camcontrol, the write cache is off (see below), but the
queue algorithm modifier is on and cannot be switched off.

Digging through the old structures with find reveals:

| 175101    4 drwxr-xr-x    3 root wheel  512 Sep  1  2002 /usr/X11R6/lib/perl5/site_perl/5.005/i386-freebsd
| 175102    4 drwxr-xr-x    2 root wheel  512 Sep  1  2002 /usr/X11R6/lib/perl5/site_perl/5.005/i386-freebsd/auto
| 175103    4 drwxr-xr-x    5 root wheel  512 Aug 23  2002 /usr/sup
| 175104    4 drwxr-xr-x    2 root wheel  512 Jan 19 13:29 /usr/sup/src-all
| 175105    4 drwxr-xr-x    2 root wheel  512 Jan 18 15:14 /usr/sup/ports-all
| 175106    4 drwxr-xr-x    2 root wheel  512 Jan 18 15:14 /usr/sup/doc-all
| 175107    4 drwxr-xr-x   22 root wheel 1024 Sep 28 19:47 /usr/doc
| 175108    4 drwxr-xr-x    6 root wheel  512 Dec 19 13:26 /usr/doc/de_DE.ISO8859-1
| 175109    4 drwxr-xr-x    5 root wheel  512 Dec 27  2003 /usr/doc/de_DE.ISO8859-1/books

And, as expected:

| # ls -la /usr/sup/ports-all/
| #

Why can a softupdates filesystem, under such circumstances, become so
corrupt that fsck -p cannot fix it, and end up with directories without
. and ..? Is this a kernel/softupdates bug? How can this directory
become empty?

locate has this information recorded:
/usr/sup/ports-all
/usr/sup/ports-all/#cvs.cvsup-2279.0
/usr/sup/ports-all/checkouts.cvs:.

so apparently, three (checkouts.cvs:., . and ..) or four files (perhaps
the # file) have disappeared. I'm not sure if fsck will revive them, I
want to avoid destroying data useful for debugging.

Is the Queue Algorithm Modifier a problem? (see below) I cannot set this
to 0 on this drive (Micropolis 4345WS): camcontrol reports "error sending
mode select command" with both -P0 and -P3.

How do I go about providing the file system metadata so someone can take
a look at it? The file system is 3.5 G in size, so anything that goes
beyond metadata is not feasible. Providing SSH access to the failed
machine may work, though, if I'm sent your OpenSSH v2-format key.

# camcontrol inquiry da0
pass0: MICROP 4345WS x43h Fixed Direct Access SCSI-2 device 
pass0: Serial Number 77HT45
pass0: 40.000MB/s transfers (20.000MHz, offset 8, 16bit), Tagged Queueing Enabled
# camcontrol modepage da0 -m8
IC:  0
ABPF:  0
CAP:  0
DISC:  0
SIZE:  0
WCE:  0
MF:  0
RCD:  0
...
# camcontrol modepage da0 -m10
RLEC:  0
Queue Algorithm Modifier:  1
QErr:  0
DQue:  0
...
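
For anyone who wants to check the same two settings mechanically, here is
a minimal Python sketch that parses the "name: value" lines shown above
and reports the write cache and queue reordering state. The helper names
are hypothetical. The reading of Queue Algorithm Modifier 1 as
"unrestricted reordering allowed" follows my understanding of the SCSI-2
control mode page and should be treated as an assumption, not as
something camcontrol itself reports.

```python
# Sample text copied from the camcontrol modepage output above.
PAGE_8 = """\
IC:  0
ABPF:  0
CAP:  0
DISC:  0
SIZE:  0
WCE:  0
MF:  0
RCD:  0
"""

PAGE_10 = """\
RLEC:  0
Queue Algorithm Modifier:  1
QErr:  0
DQue:  0
"""


def parse_modepage(text):
    """Turn 'NAME:  value' lines into a dict; lines without a numeric value are skipped."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            fields[name.strip()] = int(value.strip())
    return fields


def ordering_report(caching, control):
    """Summarise the two settings relevant to write ordering."""
    return {
        # WCE = write cache enable bit of the caching page (page 8).
        "write_cache_enabled": caching.get("WCE") == 1,
        # Assumption: modifier value 1 means the drive may reorder freely.
        "unrestricted_reordering": control.get("Queue Algorithm Modifier") == 1,
    }


report = ordering_report(parse_modepage(PAGE_8), parse_modepage(PAGE_10))
print(report)
```

With the values shown above, this reports the write cache as disabled and
unrestricted reordering as allowed, which matches the situation described
in this message.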

-- 
Matthias Andree
___
freebsd-stable@freebsd.org mailing list

Re: 4.11-RC3: SCSI+UFS+softupdates corruption (write cache DISABLED!)

2005-01-19 Thread Matthias Andree
Matthias Andree writes:

> so apparently, three (checkouts.cvs:., . and ..) or four files (perhaps
> the # file) have disappeared. I'm not sure if fsck will revive them, I
> want to avoid destroying data useful for debugging.

OK, I dd'd the whole partition to an SLR tape and ran fsck for
interactive repairs.

| ** /dev/da0s1g
| ** Last Mounted on /usr
| ** Phase 1 - Check Blocks and Sizes
| ** Phase 2 - Check Pathnames
| DIRECTORY CORRUPTED  I=175105  OWNER=root MODE=40755
| SIZE=512 MTIME=Jan 18 15:14 2005 
| DIR=?
| 
| UNEXPECTED SOFT UPDATE INCONSISTENCY
| 
| SALVAGE? [yn] y
| 
| MISSING '.'  I=175105  OWNER=root MODE=40755
| SIZE=512 MTIME=Jan 18 15:14 2005 
| DIR=?
| 
| UNEXPECTED SOFT UPDATE INCONSISTENCY
| 
| FIX? [yn] y
| 
| MISSING '..'  I=175105  OWNER=root MODE=40755
| SIZE=512 MTIME=Jan 18 15:14 2005 
| DIR=/sup/ports-all
| 
| UNEXPECTED SOFT UPDATE INCONSISTENCY
| 
| FIX? [yn] y
|
| ** Phase 3 - Check Connectivity
| ** Phase 4 - Check Reference Counts
| UNREF FILE  I=176801  OWNER=root MODE=100644
| SIZE=14098161 MTIME=Jan 18 15:14 2005 
| RECONNECT? [yn] y
| 
| NO lost+found DIRECTORY
| CREATE? [yn] y
| 
| UNREF FILE  I=179558  OWNER=root MODE=100644
| SIZE=8327913 MTIME=Mar 20 03:11 2004 
| RECONNECT? [yn] y
| 
| ** Phase 5 - Check Cyl groups
| FREE BLK COUNT(S) WRONG IN SUPERBLK
| SALVAGE? [yn] y
| 
| SUMMARY INFORMATION BAD
| SALVAGE? [yn] y
| 
| BLK(S) MISSING IN BIT MAPS
| SALVAGE? [yn] y
| 
| 243085 files, 1465923 used, 274252 free (102444 frags, 21476 blocks, 5.9% fragmentation)
| 
| * FILE SYSTEM MARKED CLEAN *
| 
| * FILE SYSTEM WAS MODIFIED *

It turns out the two missing files ended up in lost+found.

Is this a failure mode that is allowed to happen for softupdates?

-- 
Matthias Andree