Re: ahc + vinum raid5 deadlocks?

2004-02-21 Thread Tony Frank
Hi,
Cross posting to -scsi as it seems maybe related to my scsi
setup.
At least, the problems currently only appear when the scsi parts
are in use.

On Sat, Feb 21, 2004 at 02:47:49PM +1100, Tony Frank wrote:
> Refer earlier thread on -stable for more background.
> 
> 4.9-STABLE (cvsup 20th Feb)
> Kernel was compiled with DDB, INVARIANTS, DIAGNOSTICS.
> All options removed from /etc/make.conf except 'NOPROFILE=TRUE'
> 
> I get the following on the console before system freezes:
> ahc0: WARNING no command for scb 14 (cmdcmplt)
> QOUTPOS = 44

This occured while doing majority of i/o to vinum raid5 volume.

Since that time I have removed the vinum raid5 configuration
and am using the disks directly - ufs mounted on /dev/da0s1h.

When performing the same load benchmark I have received two
separate panic's:

First panic occurs due to KASSERT in ffs_read, refer:
http://fxr.watson.org/fxr/source//ufs/ufs/ufs_readwrite.c?v=RELENG4#L316


(da0:ahc0:0:0:0): tagged openings now 40
(da1:ahc0:0:1:0): tagged openings now 40
(da2:ahc0:0:2:0): tagged openings now 40
(da3:ahc0:0:3:0): tagged openings now 40
Feb 21 15:48:16 raider su: tony to root on /dev/ttypc
panic: bp->b_resid != 0

syncing disks... Stopped at  siointr1+0x102: movl$0,brk_state2.757
db> trace
siointr1(c0f98000,ccdc6c64,c027d546,c0f98000,10) at siointr1+0x102
siointr(c0f98000,10,0,0,0) at siointr+0xb
Xfastintr4(c1090800,140,600,20002,ccd54300) at Xfastintr4+0x16
lockmgr(c1090800,1030002,ccd5436c,c032c7a0,ccdc6cac) at lockmgr+0x1fc
vop_stdlock(ccdc6cc4,ccdc6cd4,c01bb269,ccdc6cc4,0) at vop_stdlock+0x20
ufs_vnoperatespec(ccdc6cc4) at ufs_vnoperatespec+0x15
vn_lock(ccd54300,20002,c032c7a0,c1129000,0) at vn_lock+0x71
ffs_sync(c1129000,2,c0b2d600,c032c7a0,c1129000) at ffs_sync+0x17f
sync(c032c7a0,0,c02b9660,c02d6048,100) at sync+0x63
boot(100,0,0,ccdc6de0,c02482a2) at boot+0x8a
panic(c02d6048,ccf5cbc0,5b,ccd8f000,400) at panic+0x79
ffs_read(ccdc6df4,0,ccdc6ea8,c02ec1e0,ccf5cbc0) at ffs_read+0x37a
ufs_readlink(ccdc6e38,ccdc6e68,c01b1c8a,ccdc6e38,c7f7bee0) at ufs_readlink+0x6b
ufs_vnoperate(ccdc6e38,c7f7bee0,c7f7bee0,ccdc6f80,c7f7bee0) at ufs_vnoperate+0x15
namei(ccdc6e80,c7f7bee0,2,ccdc6f80,8137400) at namei+0x302
stat(c7f7bee0,ccdc6f80,bfbe7810,bfbe8298,bfbea49c) at stat+0x41
syscall2(c027002f,2f,2f,bfbea49c,bfbe8298) at syscall2+0x209
Xint0x80_syscall() at Xint0x80_syscall+0x25
db>


I have a core from this panic saved.  'trace' is about the extent of
my ddb skills at the moment though.

Checking the archives a similar problem was seen ~6months ago with
particular SCSI disk and having too high tags value.
Refer:
 http://docs.freebsd.org/cgi/mid.cgi?FE045D4D9F7AED4CBFF1B3B813C8533702741FEA

>From what I can see, the ahc driver is forcing a max of 40 tags.
The SCSI hardware is fairly old but was working without problems in
the old system (Win based)

Second panic occurs in ffs_softdep, refer:
http://fxr.watson.org/fxr/source//ufs/ffs/ffs_softdep.c?v=RELENG4#L3590


Feb 21 17:16:30 raider su: tony to root on /dev/ttyp8
(da1:ahc0:0:1:0): tagged openings now 40
Feb 21 17:21:29 raider su: tony to root on /dev/ttyp3
(da3:ahc0:0:3:0): tagged openings now 40
(da2:ahc0:0:2:0): tagged openings now 40
panic: handle_written_inodeblock: live inodedep

syncing disks... Stopped at  siointr1+0x102: movl$0,brk_state2.757
db> trace
siointr1(c0f98000,ccf6bba0,c027d546,c0f98000,10) at siointr1+0x102
siointr(c0f98000,10,0,8,68c040) at siointr+0xb
Xfastintr4(c389c20c,10,c02be0a4,0) at Xfastintr4+0x16
biowait(c389c20c,c106b800,c11d1900,2,c02ec760) at biowait+0x37
bread(ccaaba00,18c040,2000,0,ccf6bc14) at bread+0xb2
ffs_update(ccd51c00,0,68c040,ccd51c00,ccf33580) at ffs_update+0xba
ffs_fsync(ccf6bc78) at ffs_fsync+0x358
ffs_sync(c1069c00,2,c0b2d600,c032c7a0,c1069c00) at ffs_sync+0xdb
sync(c032c7a0,0,c02b9660,c02d56e0,100) at sync+0x63
boot(100,0,c127bb00,ccf6bd18,c024416a) at boot+0x8a
panic(c02d56e0,1,c11a5c80,c38d6178,0) at panic+0x79
handle_written_inodeblock(c11a5c80,c38d6150) at handle_written_inodeblock+0x30e
softdep_disk_write_complete(c38d6150) at softdep_disk_write_complete+0x6a
biodone(c38d6150,1,68c040,c10efa48,c38d6150) at biodone+0x121
complete_rqe(c10efa20,0,c1028c00,f76,c1028d3c) at complete_rqe+0x651
biodone(c10efa20,c1028cb8,c10efa20,c0146d58,c16caac0) at biodone+0xf5
ad_interrupt(c16caac0,c032b7d4,ccf6be38,c0182482,c0fa7900) at ad_interrupt+0x3e7
ata_intr(c0fa7900,c16ccdc0,ccf6be8c,c027e4c2,c032b7d4) at ata_intr+0xd8
add_interrupt_randomness(c032b7d4,0,10,c3870010,c01a0010) at add_interrupt_rando
mness+0xe
Xresume15() at Xresume15+0x2b
--- interrupt, eip = 0xc01ab6e5, esp = 0xccf6be80, ebp = 0xccf6be8c ---
bwillwrite(c16ccdc0,ccf6bf80,cce7b2a0,0,0) at bwillwrite+0x75
dofilewrite(cce7b2a0,c16ccdc0,a,824a00c,f76) at dofilewrite+0xa2
write(cce7b2a0,ccf6bf80,824a00c,bfbff7e0,bfbff800) at write+0x36
syscall2(bfbf002f,bfbf002f,822002f,bfbff800,bfbff7e0) at syscall2+0x209
Xint0x80_syscall() at Xint0x80_syscall+0x25
db>


My searches show 

ahc + vinum raid5 deadlocks?

2004-02-20 Thread Tony Frank
Hi again,
Refer earlier thread on -stable for more background.

4.9-STABLE (cvsup 20th Feb)
Kernel was compiled with DDB, INVARIANTS, DIAGNOSTICS.
All options removed from /etc/make.conf except 'NOPROFILE=TRUE'

I get the following on the console before system freezes:
ahc0: WARNING no command for scb 14 (cmdcmplt)
QOUTPOS = 44

Dropping into DDB I can run 'ps' and see lots of
process in 'flswai' (tar, cvsup etc) plus
others in state 'vrlock' (bufdaemon)

See below for a full 'ps' output.

At this point system was still accessible from network
(ping + login via ssh) but new shells froze.
As mentioned in other email to -stable I can run some
simple remote commands.
I managed to get a remote 'sync' command in and there was a
quick burst of disk activity before it promptly wedged again.
This is repeatable a few times but it just freezes again soon
after.
No further console output.

Hardware is currently Adaptec AHA-2940W S71 (F/w 1.19S8)
Same issues occured with Adaptec AHA-2940UW/B (F/w 1.32S8)
I have 4 x 4G IBM SCSI disks on the single internal 68-pin connector.
Connections are good & termination is correct
The 4 disks are combined into a single vinum raid5 plex.
System is well ventilated and cool to the touch.
Pentium 2 200Mhz, 128mb SDRAM, Asus P2V m/b.

Currently takes ~2 hours of solid activity to trigger the issue.

My 'test bench':
Copying files over NFSv3/udp to vinum raid5 volume via cp & tar
+ make -j4 buildworld
+ Copying large dir trees (/usr/ports, /usr/obj, /usr/src)
+ cvsuping second copy of /usr/src
+ extracting tar archive of /usr/obj

I have second vinum volume (mirrored/concat) on two IDE drives
connected to the m/b and the test bench runs just fine when using
these disks/volume.
It suggest to me an issue between ahc and vinum or vinum & scsi ?

Also seem to be experiencing corruption writing to the raid5 volume.
- Some files do not have correct md5 checksum after copying to the raid5.

Last parts of "vinum info -V" that I got in just before it froze:

13:52:13.709220 4DN Write 0xc387b23813.5270x3d9471  204812  3d9368 
 0   0
13:52:13.710332 4DN Write 0xc387b23813.5350x3d9471  204813  3d9368 
 0   0
13:52:13.710347 Unlock0xc387b238  0xf64ce1
13:52:13.713943 1VS Read  0xc38633a491.0  0x4d2a0   8192
13:52:13.714026 2LR Read  0xc38633a491.0  0x4d2a0   8192
13:52:13.714039 3RQ Read  0xc38633a4116.5350xcd3a9  81925   4d2a0  
 0   0
13:52:13.716327 1VS Write 0xc386378891.5  0x361e10  65536
13:52:13.716733 2LR Write 0xc386378891.5  0x361e10  65536
13:52:13.716746 Lock  0xc3863788  0x361ce1
13:52:13.716752 3RQ Read  0xc386378813.5350xd8971   65536   13  d8868  
 0   0
13:52:13.716844 3RQ Read  0xc3863788116.5190x6e3349 65536   10  d8868  
 0   0
13:52:13.733942 1VS Write 0xc38ab9c091.5  0x15d40e0 8192
13:52:13.734204 2LR Write 0xc38ab9c091.5  0x15d40e0 8192
13:52:13.734218 Lock  0xc38ab9c0  0x15d3da1
13:52:13.734225 3RQ Read  0xc38ab9c013.5430x5753b1  819214  5752a8 
 0   0
13:52:13.734340 3RQ Read  0xc38ab9c0116.5190xb7fd89 819210  5752a8 
 0   0
13:52:13.735175 1VS Write 0xc38bca1891.5  0x361ed0  8192
13:52:13.735287 2LR Write 0xc38bca1891.5  0x361ed0  8192
13:52:13.735322 Lockwait  0xc38bca18  0x361ce1
13:52:13.735679 1VS Read  0xc3890dc891.2  0x42018   1024
13:52:13.735747 2LR Read  0xc3890dc891.2  0x42018   1024
13:52:13.735755 3RQ Read  0xc3890dc8116.5190x142121 10242   42018  
 0   0
13:52:13.740781 4DN Read  0xc386378813.5350xd8971   65536   13  d8868  
 0   0
13:52:13.745513 4DN Read  0xc38ab9c013.5430x5753b1  819214  5752a8 
 0   0
13:52:13.765800 4DN Read  0xc38633a4116.5350xcd3a9  81925   4d2a0  
 0   0

and then again after I did a sync:

Flags: 0x80204
Total of 120 blocks malloced, total memory: 253624
Maximum allocs:  732, malloc table at 0xc03bc7c0
10 requests active, maximum 419 active

Time Event   BufDev   OffsetBytes   SD  SDoff  
 Doffset Goffset

13:57:52.971644 3RQ Write 0xc38a8348116.5350x5f4149 81929   3f4040 
 0   0
13:57:52.972151 4DN Write 0xc38a8348116.5350x5f4149 81929   3f4040 
 0   0
13:57:52.972241 4DN Write 0xc38a8348116.5190x5f4149 81924   3f4040 
 0   0
13:57:52.973372 1VS Write 0xc389ef7091.0  0x2c070   8192
13:57:52.973452 2LR Write 0xc389ef7091.0  0x2c070   8192
13:57:52.973461 3RQ Write 0xc389ef70116.5190xac179  81920   2c070  
 0   0
13:57:52.973518 3RQ Write 0xc389ef70116.5350xac179  81925   2c070  
 0   0
13:57:52.973981 4DN Write 0xc389ef70116.5190xac179  81920