This is a networking bug in 2.2.11. Upgrade the kernel to 2.2.14 and get the
new raid patch and raid tools from www.redhat.com/~mingo/
allan
Bernd Burgstaller <[EMAIL PROTECTED]> said:
> Dear all!
>
> I am writing this mail due to hangups related to my raid devices. I am
> seeking suggestions enabling me to locate the problem. Any suggestions
> are welcome! Below you find a description of my system as well as of the
> problems. If you need further information, please let me know.
>
> Best regards & thanks in advance,
> Bernd Burgstaller
>
>
> 0.0 Definitions
>
>
> In the following I use the term 'rescue system' for a non-raid linux
> installation.
> 'Production system' denotes the linux installation on raid devices.
>
>
> 1.0 Symptoms
>
>
> The symptoms are always the same: after some uptime (30 seconds up to several
> hours), the system locks. With X, this results in a frozen screen, switching
> to a textconsole is not possible. Without X, it is sometimes possible to
> switch to other textconsoles and type at the corresponding login prompt.
> However, after pressing return at the login prompt, that console is locked,
> too.
> From outside, the locked system is often still ping-able. When telnetting to the
> locked system, a login prompt occurs for the first telnet attempt, after
> entering a login and pressing return, the session times out. Further telnet
> attempts are refused by the locked system.
> In general TCP connects aren't possible anymore. They are refused by the
> kernel, e.g. rpcinfo -p HOST
>
> 2.0 Disks
>
>
> The system contains 3 SCSI disks, detected by the kernel as follows:
>
> (scsi0) <Adaptec AIC-7890/1 Ultra2 SCSI host adapter> found at PCI 6/0
> (scsi0) Wide Channel, SCSI ID=7, 32/255 SCBs
> (scsi0) Downloading sequencer code... 374 instructions downloaded
> scsi0 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.1.19/3.2.4
> <Adaptec AIC-7890/1 Ultra2 SCSI host adapter>
> scsi : 1 host.
> (scsi0:0:0:0) Synchronous at 80.0 Mbyte/sec, offset 31.
> Vendor: IBM Model: DNES-309170W Rev: SA30
> Type: Direct-Access ANSI SCSI revision: 03
> Detected scsi disk sda at scsi0, channel 0, id 0, lun 0
> (scsi0:0:2:0) Synchronous at 20.0 Mbyte/sec, offset 15.
> Vendor: PLEXTOR Model: CD-ROM PX-40TS Rev: 1.01
> Type: CD-ROM ANSI SCSI revision: 02
> Detected scsi CD-ROM sr0 at scsi0, channel 0, id 2, lun 0
> (scsi0:0:3:0) Synchronous at 10.0 Mbyte/sec, offset 32.
> Vendor: HP Model: C1537A Rev: L708
> Type: Sequential-Access ANSI SCSI revision: 02
> Detected scsi tape st0 at scsi0, channel 0, id 3, lun 0
> (scsi0:0:13:0) Synchronous at 80.0 Mbyte/sec, offset 31.
> Vendor: IBM Model: DNES-309170W Rev: SA30
> Type: Direct-Access ANSI SCSI revision: 03
> Detected scsi disk sdb at scsi0, channel 0, id 13, lun 0
> (scsi0:0:14:0) Synchronous at 80.0 Mbyte/sec, offset 31.
> Vendor: IBM Model: DNES-309170W Rev: SA30
> Type: Direct-Access ANSI SCSI revision: 03
> Detected scsi disk sdc at scsi0, channel 0, id 14, lun 0
> scsi : detected 1 SCSI tape 1 SCSI cdrom 3 SCSI disks total.
> Uniform CDROM driver Revision: 2.55
> SCSI device sda: hdwr sector= 512 bytes. Sectors= 17916240 [8748 MB] [8.7 GB]
> SCSI device sdb: hdwr sector= 512 bytes. Sectors= 17916240 [8748 MB] [8.7 GB]
> SCSI device sdc: hdwr sector= 512 bytes. Sectors= 17916240 [8748 MB] [8.7 GB]
>
>
> 2.1 Disk Partitions
>
>
> The following figures are reported by fdisk's p(rint) statement:
>
> Disk /dev/sda: 255 heads, 63 sectors, 1115 cylinders
> Units = cylinders of 16065 * 512 bytes
>
> Device Boot Start End Blocks Id System
> /dev/sda1 1 128 1028128+ fd Linux raid autodetect
> /dev/sda2 129 146 144585 fd Linux raid autodetect
> /dev/sda3 147 529 3076447+ fd Linux raid autodetect
> /dev/sda4 530 1115 4707045 5 Extended
> /dev/sda5 530 1039 4096543+ fd Linux raid autodetect
> /dev/sda6 1040 1115 610438+ 82 Linux swap
>
> Disk /dev/sdb: 255 heads, 63 sectors, 1115 cylinders
> Units = cylinders of 16065 * 512 bytes
>
> Device Boot Start End Blocks Id System
> /dev/sdb1 1 128 1028128+ fd Linux raid autodetect
> /dev/sdb2 129 146 144585 fd Linux raid autodetect
> /dev/sdb3 147 529 3076447+ fd Linux raid autodetect
> /dev/sdb4 530 1115 4707045 5 Extended
> /dev/sdb5 530 1039 4096543+ fd Linux raid autodetect
> /dev/sdb6 1040 1115 610438+ 83 Linux
>
> Disk /dev/sdc: 255 heads, 63 sectors, 1115 cylinders
> Units = cylinders of 16065 * 512 bytes
>
> Device Boot Start End Blocks Id System
> /dev/sdc1 1 128 1028128+ 82 Linux swap
> /dev/sdc2 129 256 1028160 83 Linux
> /dev/sdc3 257 1039 6289447+ 83 Linux
> /dev/sdc4 1040 1115 610470 83 Linux
>
>
> 2.2 Relation Partitions - MD Devices
>
>
>
> --------------------------------------------------------------------------------
> Disk 1      Raid Dev    System      Size      Mnt(Prod)   Mnt(Resc)   Notes
> --------------------------------------------------------------------------------
> /dev/sda1   /dev/md1    raid1-0-0   1.000GB   /var
> /dev/sda2   /dev/md0    raid1-2-0   0.140GB   /
> /dev/sda3   /dev/md2    raid1-3-0   3.000GB   /usr
> /dev/sda5   /dev/md3    raid1-4-0   4.000GB   /home
> /dev/sda6               swap        0.600GB   none        none        Swapspace 2
> --------------------------------------------------------------------------------
>
>
>
> ---------------------------------------------------------------------------------
> Disk 2      Raid Dev    System      Size      Mnt(Prod)   Mnt(Resc)   Notes
> ---------------------------------------------------------------------------------
> /dev/sdb1   /dev/md1    raid1-0-1   1.000GB   /var
> /dev/sdb2   /dev/md0    raid1-2-1   0.140GB   /
> /dev/sdb3   /dev/md2    raid1-3-1   3.000GB   /usr
> /dev/sdb5   /dev/md3    raid1-4-1   4.000GB   /home
> /dev/sdb6               ext2        0.600GB               /           Rescue Mirror
> ---------------------------------------------------------------------------------
>
>
>
> ---------------------------------------------------------------------------------
> Disk 3      Raid Dev    System      Size      Mnt(Prod)   Mnt(Resc)   Notes
> ---------------------------------------------------------------------------------
> /dev/sdc1               swap        1.000GB   none        none        Swapspace 1
> /dev/sdc2               ext2        1.000GB   /tmp        /tmp        temporary data
> /dev/sdc3               ext2        6.140GB   /V                      VMWare, ..
> /dev/sdc4               ext2        0.600GB               /           Rescue System
> ---------------------------------------------------------------------------------
>
>
> 2.3 Raid Configuration
>
>
> Here's my /etc/raidtab:
>
> raiddev /dev/md1
> raid-level 1
> nr-raid-disks 2
> nr-spare-disks 0
> chunk-size 4
> persistent-superblock 1
> device /dev/sda1
> raid-disk 0
> device /dev/sdb1
> raid-disk 1
>
> raiddev /dev/md0
> raid-level 1
> nr-raid-disks 2
> nr-spare-disks 0
> chunk-size 4
> persistent-superblock 1
> device /dev/sda2
> raid-disk 0
> device /dev/sdb2
> raid-disk 1
>
> raiddev /dev/md2
> raid-level 1
> nr-raid-disks 2
> nr-spare-disks 0
> chunk-size 4
> persistent-superblock 1
> device /dev/sda3
> raid-disk 0
> device /dev/sdb3
> raid-disk 1
>
> raiddev /dev/md3
> raid-level 1
> nr-raid-disks 2
> nr-spare-disks 0
> chunk-size 4
> persistent-superblock 1
> device /dev/sda5
> raid-disk 0
> device /dev/sdb5
> raid-disk 1
>
>
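[Not part of the original mail: a raidtab like the one above is easy to get
subtly wrong, since each raiddev stanza must list exactly as many "device"
lines as its nr-raid-disks value claims. A minimal sanity-check sketch,
assuming POSIX sh and awk; the check_raidtab name is invented here:]

```shell
#!/bin/sh
# Sketch only: for every raiddev stanza, compare the declared
# nr-raid-disks count against the number of "device" lines seen.
check_raidtab() {
    awk '
        function flush() {
            if (dev == "") return
            s = "ok"; if (want != got) s = "MISMATCH"
            print dev, s
        }
        $1 == "raiddev"       { flush(); dev = $2; want = 0; got = 0 }
        $1 == "nr-raid-disks" { want = $2 }
        $1 == "device"        { got++ }
        END                   { flush() }
    '
}
```

[Usage would be `check_raidtab < /etc/raidtab`, printing one ok/MISMATCH
line per md device.]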
> And the output of cat /proc/mdstat:
>
> guldin:~ # cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid5] [translucent]
> read_ahead 1024 sectors
> md1 : active raid1 sdb1[1] sda1[0] 1028032 blocks [2/2] [UU]
> md0 : active raid1 sdb2[1] sda2[0] 144512 blocks [2/2] [UU]
> md2 : active raid1 sdb3[1] sda3[0] 3076352 blocks [2/2] [UU]
> md3 : active raid1 sdb5[1] sda5[0] 4096448 blocks [2/2] [UU]
> unused devices: <none>
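[Not part of the original mail: the [UU] field above means both mirror halves
are up; a '_' in that field (e.g. [U_]) would mean a failed member. While
hunting raid trouble it can help to watch that field automatically. A sketch,
assuming POSIX sh/awk and the one-line-per-array mdstat format shown above;
the check_degraded name is invented:]

```shell
#!/bin/sh
# Sketch: scan /proc/mdstat-style output and flag any md array whose
# last field (the member-status bitmap, e.g. [UU]) contains '_'.
check_degraded() {
    awk '/^md/ {
        status = $NF                       # e.g. [UU] or [U_]
        if (status ~ /_/) print $1, "DEGRADED", status
        else              print $1, "ok", status
    }'
}
```

[Run as `check_degraded < /proc/mdstat`, e.g. periodically from cron on the
rescue system.]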
>
>
> 2.4 Fstab
>
>
> File /etc/fstab depicts how the md devices make up the filesystem of the
> production system.
>
> /dev/sdc1 swap swap defaults 0 0
> /dev/sda6 swap swap defaults 0 0
> /dev/sdc2 /tmp ext2 defaults 1 2
> /dev/sdc3 /V ext2 defaults 1 2
> /dev/md0 / ext2 defaults 1 1
> /dev/md2 /usr ext2 defaults 1 2
> /dev/md1 /var ext2 defaults 1 2
> /dev/md3 /home ext2 defaults 1 2
>
> /dev/scd0        /cdrom     iso9660    ro,noauto,user,exec  0 0
>
> /dev/fd0 /floppy auto noauto,user 0 0
>
> none /proc proc defaults 0 0
> # End of YaST-generated fstab lines
>
>
> 3.0 Kernel
>
>
> I used a stock 2.2.11 kernel, patched it with raid0145-19990824-2.2.11 which
> I got from ftp://ftp.fi.kernel.org/pub/linux/daemons/raid/alpha. The
> raidtools from the same location are raidtools-19990824-0.90.tar.gz.
>
> Attached to this mail is the kernel configuration I used.
>
>
> 4.0 Installation
>
>
> SuSE 6.3 is the Linux distribution used.
>
> Initially I installed the rescue system on /dev/sdc4. On that I patched the
> 2.2.11 kernel, enabled raid support, compiled, and booted the rescue system
> with that kernel.
>
> Next I partitioned /dev/sda and /dev/sdb according to Section 2. Running
> mkraid then created the md devices.
>
> From the rescue system I mounted the md devices under /mnt according to the
> fs structure given in Section 2.4.
>
> Finally I installed the production system into the /mnt dir, then changed
> the YaST-generated fstab file to match the actual situation.
>
>
> 5.0 Booting
>
>
> Initially I booted the production system from a boot floppy. However, after
> getting the Red Hat patch for lilo V21 I attempted to install lilo in the
> MBR. Since my /boot dir is also on an md device, this did not work out,
> because lilo did not know how to handle a 0x90 device (the kernel seems to
> report md devices as 0x9.. instead of 0x8..). I changed lilo to accept
> 0x9.. and do the same as if it were a 0x8.. device. I admit that this was
> a quick hack, but note that I experienced the first hangup of the system
> before fiddling with lilo.
>
> Below is my lilo.conf
>
> # Start LILO global Section
> boot=/dev/md0
> linear
> #compact # faster, but won't work on all systems.
> vga=normal
> read-only
> prompt
> timeout=100
> # End LILO global Section
> #
> image = /boot/vmlinuz.raid
> root = /dev/md0
> label = linux
>
> And lilo's comments on it:
>
> boot = /dev/sda, map = /boot/map.0802
> Added linux *
> boot = /dev/sdb, map = /boot/map.0812
> Added linux
>
>
> 6.0 Diagnostics
>
>
> In order to track the problem I enabled syslogd *.* logging on a non-raid
> device as well as logging over the net to another machine. However, there
> are no more logs generated as soon as the system locks :-(
>
> In my opinion this could have two reasons:
>
> (1) Either the system is so damaged that even the logging mechanisms are
> broken.
>
> Is there a general strategy for logging in such weird cases? Did anybody
> ever try to directly access a tty from within the kernel to get the logs
> out via a serial line and capture them on another computer?
>
> (2) The reason for the lock generates no log entry. I do not know whether
> there are some more compile-switch-dependent printk's waiting in the
> kernel. Hints on kernel switches that make the kernel more verbose would
> be appreciated. Furthermore I could add further printk's at interesting
> places; suggestions are welcome.
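On (1): 2.2 kernels can already do roughly what is asked here, without
touching kernel code, via the serial console. A sketch, not from the original
mail, assuming a null-modem cable on the first serial port at 9600 baud:

```
# In lilo.conf's global section: pass console= parameters to the kernel
# so printk/oops output goes to the serial port as well as the screen.
append = "console=ttyS0,9600n8 console=tty0"
```

On the capturing machine, something like minicom or `cu -l /dev/ttyS0 -s 9600`
can then log whatever the locked box still manages to print.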
>
>
> 7.0 Notes
>
>
> 7.1 Never did I experience a hangup of the rescue system (despite having
> it compile kernels all night :-). For that reason I suspect that the
> locks are somehow related to raid.
>
>
> 7.2 Despite the fact that I experienced locks without amd, usage of amd
> locks the system within minutes. However, I have to admit that I did not
> try amd on the rescue system...
> Note that I have disabled kernel autofs support.
>
>
>