Please note: "Ultra" SCSI cable lengths are severely limited! The maximum
cable length is ten feet when four devices or fewer (including the host
adapter) are on the bus. If five devices are used (four devices plus your
host adapter), then the maximum bus length is 1.5 meters (five feet!).

As I count your bus, you have 6 devices -- with a minimum of 6 inches between
devices you're already at 36 inches, so I sincerely doubt you're under 5 feet
of total length (given that you probably have a 1 meter cable from the chassis
to the card).
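
If you want to double-check what the bus has actually negotiated, the aic7xxx
driver exports the per-target transfer settings under /proc (the exact layout
depends on the driver version); a target that has been renegotiated down to a
slower rate can be a hint of a marginal bus:

  # cat /proc/scsi/aic7xxx/0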

________________________________________
Michael D. Black   Principal Engineer
[EMAIL PROTECTED]  321-676-2923,x203
http://www.csihq.com  Computer Science Innovations
http://www.csihq.com/~mike  My home page
FAX 321-676-2355
----- Original Message -----
From: "Bernd Burgstaller" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, March 28, 2000 8:48 AM
Subject: Raid-Related System Locks


Dear all!

I am writing this mail because of hangups related to my raid devices. I am
looking for suggestions that would help me locate the problem. Any suggestions
are welcome! Below you will find a description of my system as well as of the
problems. If you need further information, please let me know.

Best regards & thanks in advance,
Bernd Burgstaller


0.0 Definitions


In the following I use the term 'rescue system' for a non-raid linux
installation.
'Production system' denotes the linux installation on raid devices.


1.0 Symptoms


The symptoms are always the same: after some uptime (30 seconds up to several
hours), the system locks. With X, this results in a frozen screen; switching
to a text console is not possible. Without X, it is sometimes possible to
switch to other text consoles and type at the corresponding login prompt.
However, after pressing return at the login prompt, that console is locked,
too.
From outside, the locked system is often still ping-able. When telnetting to
the locked system, a login prompt appears for the first telnet attempt; after
entering a login and pressing return, the session times out. Further telnet
attempts are refused by the locked system.
In general, TCP connections are no longer possible; they are refused by the
kernel, e.g. rpcinfo -p HOST.
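
Concretely, from another machine the picture looks roughly like this once the
box has locked:

  otherhost:~ > ping guldin         # usually still answered
  otherhost:~ > telnet guldin       # first attempt: login prompt, then timeout
  otherhost:~ > rpcinfo -p guldin   # refused, as are further telnet attempts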

2.0 Disks


The system contains 3 SCSI disks (as well as a SCSI CD-ROM and a tape drive),
detected by the kernel as follows:

(scsi0) <Adaptec AIC-7890/1 Ultra2 SCSI host adapter> found at PCI 6/0
(scsi0) Wide Channel, SCSI ID=7, 32/255 SCBs
(scsi0) Downloading sequencer code... 374 instructions downloaded
scsi0 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.1.19/3.2.4
       <Adaptec AIC-7890/1 Ultra2 SCSI host adapter>
scsi : 1 host.
(scsi0:0:0:0) Synchronous at 80.0 Mbyte/sec, offset 31.
  Vendor: IBM       Model: DNES-309170W      Rev: SA30
  Type:   Direct-Access                      ANSI SCSI revision: 03
Detected scsi disk sda at scsi0, channel 0, id 0, lun 0
(scsi0:0:2:0) Synchronous at 20.0 Mbyte/sec, offset 15.
  Vendor: PLEXTOR   Model: CD-ROM PX-40TS    Rev: 1.01
  Type:   CD-ROM                             ANSI SCSI revision: 02
Detected scsi CD-ROM sr0 at scsi0, channel 0, id 2, lun 0
(scsi0:0:3:0) Synchronous at 10.0 Mbyte/sec, offset 32.
  Vendor: HP        Model: C1537A            Rev: L708
  Type:   Sequential-Access                  ANSI SCSI revision: 02
Detected scsi tape st0 at scsi0, channel 0, id 3, lun 0
(scsi0:0:13:0) Synchronous at 80.0 Mbyte/sec, offset 31.
  Vendor: IBM       Model: DNES-309170W      Rev: SA30
  Type:   Direct-Access                      ANSI SCSI revision: 03
Detected scsi disk sdb at scsi0, channel 0, id 13, lun 0
(scsi0:0:14:0) Synchronous at 80.0 Mbyte/sec, offset 31.
  Vendor: IBM       Model: DNES-309170W      Rev: SA30
  Type:   Direct-Access                      ANSI SCSI revision: 03
Detected scsi disk sdc at scsi0, channel 0, id 14, lun 0
scsi : detected 1 SCSI tape 1 SCSI cdrom 3 SCSI disks total.
Uniform CDROM driver Revision: 2.55
SCSI device sda: hdwr sector= 512 bytes. Sectors= 17916240 [8748 MB] [8.7 GB]
SCSI device sdb: hdwr sector= 512 bytes. Sectors= 17916240 [8748 MB] [8.7 GB]
SCSI device sdc: hdwr sector= 512 bytes. Sectors= 17916240 [8748 MB] [8.7 GB]


2.1 Disk Partitions


The following figures are reported by fdisk's p(rint) command:

Disk /dev/sda: 255 heads, 63 sectors, 1115 cylinders
Units = cylinders of 16065 * 512 bytes

   Device Boot    Start       End    Blocks   Id  System
/dev/sda1             1       128   1028128+  fd  Linux raid autodetect
/dev/sda2           129       146    144585   fd  Linux raid autodetect
/dev/sda3           147       529   3076447+  fd  Linux raid autodetect
/dev/sda4           530      1115   4707045    5  Extended
/dev/sda5           530      1039   4096543+  fd  Linux raid autodetect
/dev/sda6          1040      1115    610438+  82  Linux swap

Disk /dev/sdb: 255 heads, 63 sectors, 1115 cylinders
Units = cylinders of 16065 * 512 bytes

   Device Boot    Start       End    Blocks   Id  System
/dev/sdb1             1       128   1028128+  fd  Linux raid autodetect
/dev/sdb2           129       146    144585   fd  Linux raid autodetect
/dev/sdb3           147       529   3076447+  fd  Linux raid autodetect
/dev/sdb4           530      1115   4707045    5  Extended
/dev/sdb5           530      1039   4096543+  fd  Linux raid autodetect
/dev/sdb6          1040      1115    610438+  83  Linux

Disk /dev/sdc: 255 heads, 63 sectors, 1115 cylinders
Units = cylinders of 16065 * 512 bytes

   Device Boot    Start       End    Blocks   Id  System
/dev/sdc1             1       128   1028128+  82  Linux swap
/dev/sdc2           129       256   1028160   83  Linux
/dev/sdc3           257      1039   6289447+  83  Linux
/dev/sdc4          1040      1115    610470   83  Linux


2.2 Relation of Partitions to MD Devices


------------------------------------------------------------------------------
 Disk 1        Raid Dev  System     Size      Mnt(Prod)  Mnt(Resc)  Notes
------------------------------------------------------------------------------
 /dev/sda1     /dev/md1  raid1-0-0  1.000GB   /var
 /dev/sda2     /dev/md0  raid1-2-0  0.140GB   /
 /dev/sda3     /dev/md2  raid1-3-0  3.000GB   /usr
 /dev/sda5     /dev/md3  raid1-4-0  4.000GB   /home
 /dev/sda6               swap       0.600GB   none       none       Swapspace 2
------------------------------------------------------------------------------


------------------------------------------------------------------------------
 Disk 2        Raid Dev  System     Size      Mnt(Prod)  Mnt(Resc)  Notes
------------------------------------------------------------------------------
 /dev/sdb1     /dev/md1  raid1-0-1  1.000GB   /var
 /dev/sdb2     /dev/md0  raid1-2-1  0.140GB   /
 /dev/sdb3     /dev/md2  raid1-3-1  3.000GB   /usr
 /dev/sdb5     /dev/md3  raid1-4-1  4.000GB   /home
 /dev/sdb6               ext2       0.600GB              /          Rescue Mirror
------------------------------------------------------------------------------


------------------------------------------------------------------------------
 Disk 3        Raid Dev  System     Size      Mnt(Prod)  Mnt(Resc)  Notes
------------------------------------------------------------------------------
 /dev/sdc1               swap       1.000GB   none       none       Swapspace 1
 /dev/sdc2               ext2       1.000GB   /tmp       /tmp       temporary data
 /dev/sdc3               ext2       6.140GB   /V                    VMWare, ...
 /dev/sdc4               ext2       0.600GB              /          Rescue System
------------------------------------------------------------------------------


2.3 Raid Configuration


Here's my /etc/raidtab:

raiddev /dev/md1
        raid-level      1
        nr-raid-disks   2
        nr-spare-disks  0
        chunk-size      4
        persistent-superblock   1
        device          /dev/sda1
        raid-disk       0
        device          /dev/sdb1
        raid-disk       1

raiddev /dev/md0
        raid-level      1
        nr-raid-disks   2
        nr-spare-disks  0
        chunk-size      4
        persistent-superblock   1
        device          /dev/sda2
        raid-disk       0
        device          /dev/sdb2
        raid-disk       1

raiddev /dev/md2
        raid-level      1
        nr-raid-disks   2
        nr-spare-disks  0
        chunk-size      4
        persistent-superblock   1
        device          /dev/sda3
        raid-disk       0
        device          /dev/sdb3
        raid-disk       1

raiddev /dev/md3
        raid-level      1
        nr-raid-disks   2
        nr-spare-disks  0
        chunk-size      4
        persistent-superblock   1
        device          /dev/sda5
        raid-disk       0
        device          /dev/sdb5
        raid-disk       1


And the output of cat /proc/mdstat:

guldin:~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [translucent]
read_ahead 1024 sectors
md1 : active raid1 sdb1[1] sda1[0] 1028032 blocks [2/2] [UU]
md0 : active raid1 sdb2[1] sda2[0] 144512 blocks [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0] 3076352 blocks [2/2] [UU]
md3 : active raid1 sdb5[1] sda5[0] 4096448 blocks [2/2] [UU]
unused devices: <none>
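
For completeness: since the partitions are of type 0xfd and the superblocks
are persistent, the kernel can autostart the arrays at boot (with
autodetection compiled in). They can also be handled manually with the
raidtools, e.g.:

guldin:~ # raidstart /dev/md3     # start one array as described in /etc/raidtab
guldin:~ # raidstop /dev/md3      # stop it again (only when it is not mounted)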


2.4 Fstab


The file /etc/fstab shows how the md devices make up the filesystem of the
production system.

/dev/sdc1       swap                      swap            defaults   0   0
/dev/sda6       swap                      swap            defaults   0   0
/dev/sdc2       /tmp                      ext2            defaults   1   2
/dev/sdc3       /V                        ext2            defaults   1   2
/dev/md0        /                         ext2            defaults   1   1
/dev/md2        /usr                      ext2            defaults   1   2
/dev/md1        /var                      ext2            defaults   1   2
/dev/md3        /home                     ext2            defaults   1   2

/dev/scd0       /cdrom                    iso9660         ro,noauto,user,exec 0   0

/dev/fd0        /floppy                   auto            noauto,user 0   0

none            /proc                     proc            defaults   0   0
# End of YaST-generated fstab lines


3.0 Kernel


I used a stock 2.2.11 kernel and patched it with raid0145-19990824-2.2.11,
which I got from ftp://ftp.fi.kernel.org/pub/linux/daemons/raid/alpha. The
raidtools from the same location are raidtools-19990824-0.90.tar.gz.

Attached to this mail is the kernel configuration I used.
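
For reference, the patch and build sequence was roughly the following (paths
abbreviated; the exact options are in the attached configuration):

guldin:/usr/src/linux # patch -p1 < ../raid0145-19990824-2.2.11
guldin:/usr/src/linux # make menuconfig    # enable RAID, RAID-1, autodetection
guldin:/usr/src/linux # make dep clean bzImage modules modules_install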


4.0 Installation


The Linux distribution used is SuSE 6.3.

Initially I installed the rescue system on /dev/sdc4. On that system I
patched the 2.2.11 kernel, enabled raid support, compiled it, and booted the
rescue system with that kernel.

Next I partitioned /dev/sda and /dev/sdb according to Section 2. Running
mkraid set up the md devices.

From the rescue system I mounted the md devices under /mnt according to the
filesystem structure given in Section 2.4.

Finally I installed the production system into the /mnt directory, then
changed the YaST-generated fstab file to match the actual situation.
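
In shell terms, the md setup and the mounting looked roughly like this
(mke2fs shown for completeness; analogous for the other md devices):

guldin:~ # mkraid /dev/md0        # likewise for md1, md2 and md3
guldin:~ # mke2fs /dev/md0
guldin:~ # mount /dev/md0 /mnt
guldin:~ # mkdir /mnt/usr /mnt/var /mnt/home
guldin:~ # mount /dev/md2 /mnt/usr
guldin:~ # mount /dev/md1 /mnt/var
guldin:~ # mount /dev/md3 /mnt/home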


5.0 Booting


Initially I booted the production system from a boot floppy. However, after
getting the Red Hat patch for lilo V21, I attempted to install lilo in the
MBR. Since my /boot dir is also on an md device, this did not work out,
because lilo did not know how to handle a 0x9.. device (the kernel seems to
report md devices as 0x9.. instead of 0x8..). I changed lilo to accept 0x9..
devices and do the same as if it were a 0x8.. device. I admit that this was a
quick hack, but note that I experienced the first hangup of the system before
fiddling with lilo.

Below is my lilo.conf

# Start LILO global Section
boot=/dev/md0
linear
#compact       # faster, but won't work on all systems.
vga=normal
read-only
prompt
timeout=100
# End LILO global Section
#
image = /boot/vmlinuz.raid
  root = /dev/md0
  label = linux

And lilo's comments on it:

boot = /dev/sda, map = /boot/map.0802
Added linux *
boot = /dev/sdb, map = /boot/map.0812
Added linux


6.0 Diagnostics


In order to track the problem down, I enabled syslogd *.* logging on a
non-raid device as well as logging over the net to another machine. However,
no more logs are generated as soon as the system locks :-(

In my opinion this could have two reasons:

(1) Either the system is so damaged that even the logging mechanisms are
    broken.

    Is there a general strategy for logging in such weird cases? Did anybody
    ever try to directly access a tty from within the kernel to push the logs
    out over a serial line and capture them on another computer? (A possible
    setup is sketched after this list.)

(2) The reason for the lock generates no log entry. I do not know whether
    there are more compile-switch dependent printk's waiting in the kernel.
    Hints on kernel switches that make the kernel more verbose would be
    appreciated. Furthermore, I could add further printk's at interesting
    places; suggestions are welcome.
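
A possible setup for (1), assuming the kernel is rebuilt with serial console
support (CONFIG_SERIAL_CONSOLE) and a null-modem cable connects ttyS0 to a
second machine, would be to route the console over the serial line and log it
remotely:

  # in lilo.conf, inside the image section:
  append = "console=ttyS0,9600 console=tty0"

  # on the capturing machine: run a terminal program (minicom, cu, ...) on the
  # corresponding serial port at 9600 baud and log everything it receives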


7.0 Notes


7.1 Never did I experience a hangup of the rescue system (despite having it
    compile kernels all night :-). For that reason I suspect that the locks
    are somehow related to raid.


7.2 Despite the fact that I experienced locks without amd, using amd locks
    the system within minutes. However, I have to admit that I did not try
    amd on the rescue system...
    Note that I have disabled kernel autofs support.



