Re: Computer stops responding (freezes up) during uncorrectable data error

Amit Kulkarni Wed, 26 Jan 2011 19:58:36 -0800

Pardon my ignorance, but if you have already restored your data, why
bother investigating the disk failure?


On Wed, Jan 26, 2011 at 6:50 PM, Gordon Ferris
<gordon.fer...@wfengineering.com> wrote:
> I have a disk that has failed; there seem to be damaged areas that
> cause errors when specific files are accessed.  This disk was one of a
> two-disk mirror running raidframe.  The disk has been replaced and the
> original machine is back up and running again.
> However as I use a second computer to investigate the failed disk, I
> have been puzzled that this second computer locks up and stops responding
> when I try copying files that include various damaged areas of the disk.
>
> This second computer has an installation of OpenBSD 4.6, with the
> kernel recompiled to support raidframe (so I can access the data on the
> partition); I have also adjusted the drive numbering so that the failed
> drive believes it is the only disk present in its mirror.  On this second
> computer, the operating system is on a completely different physical disk;
> the failed disk is not necessary for a completely functional system.
> However, even though this computer doesn't use the failed disk for
> its root filesystem - the computer still freezes up and stops responding
> when the bad sectors are accessed.
> I even tried using the "dump" and "dd" utilities to access the disk
> with a raw, unmounted partition - but the host computer still freezes up
> and stops responding after adding a few lines to /var/log/messages.
>
> I was expecting the error messages, but not expecting the host system
> to freeze up - even the mouse stops responding.  It's irritating to have
> to reboot the computer each time I access one of the damaged sectors.
> I thought this problem might be caused if the drive controller
> hardware never returns control back to the operating system once the disk
> error occurs too many times.  But the error messages do end up in
> /var/log/messages, so control does return to the operating system for at
> least a little while.
>
> And yes, repeatedly accessing the same file generates the error
> messages referring to the same sectors.
>
> 1.  How can I attempt to access the damaged sectors without causing the
>     entire computer to freeze up and stop responding?
>
> 2.  I have used stat, ncheck, and fsdb to find and examine the inodes for
>     various files.  Is there a utility to show which sectors of the
>     filesystem and/or the drive are actually used by various files?
>
> 3.  How can I identify all the files that contain bad sectors without
>     freezing up the computer on each file that contains one?
>
> # mount
> /dev/wd1a on / type ffs (local)
> /dev/wd1e on /usr type ffs (local, read-only)
> /dev/wd1g on /mnt3 type ffs (local, read-only)
> /dev/wd1f on /mnt type ffs (local, read-only)
> # fsck -f /dev/rraid2d
> ** /dev/rraid2d
> ** File system is already clean
> ** Last Mounted on /home-big
> ** Phase 1 - Check Blocks and Sizes
> ** Phase 2 - Check Pathnames
> ** Phase 3 - Check Connectivity
> ** Phase 4 - Check Reference Counts
> ** Phase 5 - Check Cyl groups
> 452600 files, 69774853 used, 43730370 free (26658 frags, 5462964 blocks, 0.0% fragmentation)
>
> # mount -r /dev/raid2d /mnt2
> # mount
> /dev/wd1a on / type ffs (local)
> /dev/wd1e on /usr type ffs (local, read-only)
> /dev/wd1g on /mnt3 type ffs (local, read-only)
> /dev/wd1f on /mnt type ffs (local, read-only)
> /dev/raid2d on /mnt2 type ffs (local, read-only)
>
> # dd conv=noerror,notrunc,sync \
>> if=/mnt2/.../20198332.txt of=/dev/null count=1
>
> The computer stopped responding but these messages were on the
> console and in /var/log/messages on rebooting:
> /var/log/messages
> Jan 26 08:23:15 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
> Jan 26 08:23:18 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 4
> Jan 26 08:23:18 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 4
> Jan 26 08:23:18 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
> Jan 26 08:23:20 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 3
> Jan 26 08:23:20 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 3
> Jan 26 08:23:20 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
> Jan 26 08:23:22 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 2
> Jan 26 08:23:22 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 2
> Jan 26 08:23:22 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
> Jan 26 08:23:25 one /bsd: wd0f: uncorrectable data error reading fsbn 40104976 of 40104952-40104983 (wd0 bn 67174501; cn 4181 tn 106 sn 58), retrying
>
> And the error messages are repeatable (especially the failed block
> numbers) if I repeat the command:
> Jan 26 10:40:19 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
> Jan 26 10:40:21 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
> Jan 26 10:40:24 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
> Jan 26 10:40:26 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 4
> Jan 26 10:40:26 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 4
> Jan 26 10:40:26 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
> Jan 26 10:40:29 one /bsd: wd0f: uncorrectable data error reading fsbn 40104976 of 40104952-40104983 (wd0 bn 67174501; cn 4181 tn 106 sn 58), retrying
>
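
The numbers in these messages can be cross-checked with a little
arithmetic. The sketch below (Python, not something from the original
post) parses two of the quoted log lines, rejoined where the mail wrapped
them, and subtracts the partition-relative block number (fsbn) from the
whole-disk block number (bn). It assumes that fsbn counts 512-byte sectors
from the start of wd0f, and it assumes a translated geometry of 255 heads
and 63 sectors per track to check the cn/tn/sn triple; neither is stated
in the post. The names parse_errors and chs_to_lba are made up for
illustration.

```python
import re

# Two lines copied from the quoted log, rejoined where the mail wrapped them.
LOG = """\
Jan 26 08:23:15 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 08:23:25 one /bsd: wd0f: uncorrectable data error reading fsbn 40104976 of 40104952-40104983 (wd0 bn 67174501; cn 4181 tn 106 sn 58), retrying
"""

PATTERN = re.compile(
    r"(?P<part>\w+): uncorrectable data error reading fsbn (?P<fsbn>\d+) "
    r"of (?P<first>\d+)-(?P<last>\d+) "
    r"\((?P<disk>\w+) bn (?P<bn>\d+); "
    r"cn (?P<cn>\d+) tn (?P<tn>\d+) sn (?P<sn>\d+)\)"
)


def parse_errors(text):
    """Return one dict per error line; numeric fields are converted to int."""
    rows = []
    for match in PATTERN.finditer(text):
        row = {}
        for key, value in match.groupdict().items():
            row[key] = int(value) if value.isdigit() else value
        rows.append(row)
    return rows


def chs_to_lba(cn, tn, sn, heads=255, sectors_per_track=63):
    """Cylinder/track/sector to block number.

    ASSUMPTION: 255 heads, 63 sectors per track, zero-based sector number.
    """
    return (cn * heads + tn) * sectors_per_track + sn


rows = parse_errors(LOG)
for row in rows:
    offset = row["bn"] - row["fsbn"]          # candidate start of the partition
    span = row["last"] - row["first"] + 1     # sectors in the failed transfer
    chs_ok = chs_to_lba(row["cn"], row["tn"], row["sn"]) == row["bn"]
    print(row["part"], "fsbn", row["fsbn"], "bn", row["bn"],
          "offset", offset, "span", span, "chs matches:", chs_ok)
```

Both lines give the same difference, 27069525, which would be the starting
sector of wd0f on wd0 if the assumption about fsbn holds; disklabel wd0
should confirm or refute that. The reported range 40104952-40104983 covers
32 sectors.
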
> However, none of these commands seem to cause any problem - no error
> messages and no freezing up:
> # dd conv=noerror,notrunc,sync \
>> if=/dev/wd0f skip=40104951 of=/dev/null count=1
> 1+0 records in
> 1+0 records out
> 512 bytes transferred in 0.014 secs (34222 bytes/sec)
> # dd conv=noerror,notrunc,sync \
>> if=/dev/wd0f skip=40104952 of=/dev/null count=1
> 1+0 records in
> 1+0 records out
> 512 bytes transferred in 0.000 secs (2612245 bytes/sec)
> # dd conv=noerror,notrunc,sync \
>> if=/dev/wd0f skip=67174501 of=/dev/null count=1
> 1+0 records in
> 1+0 records out
> 512 bytes transferred in 0.011 secs (43813 bytes/sec)
> # dd conv=noerror,notrunc,sync \
>> if=/dev/raid2d skip=40104952 of=/dev/null count=1
> dd: /dev/raid2d: Device busy
> # umount /mnt2
> # dd conv=noerror,notrunc,sync \
>> if=/dev/raid2d skip=40104952 of=/dev/null count=1
> 1+0 records in
> 1+0 records out
> 512 bytes transferred in 0.013 secs (37083 bytes/sec)
> #
>
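
Two details in the dd runs above may explain why they did not hit the
error. Each run uses count=1, so it reads a single 512-byte sector, while
the log reports a failed transfer of 40104952-40104983 (32 sectors). And
skip=67174501 is the whole-disk block number from the log, applied here to
the partition device /dev/wd0f, so it probably lands on a different sector
from the one that failed. As a rough sketch of probing a whole range one
sector at a time, something like the following could be used. scan_sectors
is a hypothetical helper; it only reports what the operating system
returns and does nothing to prevent the kernel-level freeze described
earlier.

```python
import os

SECTOR = 512  # bytes; matches the "512 bytes transferred" lines in the dd output


def scan_sectors(path, first, count, sector_size=SECTOR):
    """Read `count` sectors starting at sector `first`, one read per sector.

    Returns the sector numbers whose read raised OSError or came back short.
    """
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for n in range(first, first + count):
            try:
                data = os.pread(fd, sector_size, n * sector_size)
            except OSError:
                # The kernel reported an I/O error for this sector.
                bad.append(n)
                continue
            if len(data) != sector_size:
                # Short read, for example past the end of the device or file.
                bad.append(n)
    finally:
        os.close(fd)
    return bad
```

For the range in the log this would be called as
scan_sectors("/dev/rwd0f", 40104952, 32), assuming the raw partition
device has that name on the machine and that fsbn is relative to the start
of wd0f.
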
>        Here is the dmesg:
> OpenBSD 4.6 (RAID110125) #0: Tue Jan 25 03:11:29 MST 2011
>    r...@one.my.domain:/usr/src/sys/arch/i386/compile/RAID110125
> cpu0: Intel(R) Pentium(R) 4 CPU 3.40GHz ("GenuineIntel" 686-class) 3.40 GHz
> cpu0: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,SBF,SSE3,MWAIT,DS-CPL,EST,CNXT-ID,CX16,xTPR
> real mem  = 1073246208 (1023MB)
> avail mem = 1028554752 (980MB)
> mainbus0 at root
> bios0 at mainbus0: AT/286+ BIOS, date 03/10/05, BIOS32 rev. 0 @ 0xfaad0, SMBIOS rev. 2.3 @ 0xf0100 (25 entries)
> bios0: vendor Phoenix Technologies Ltd. version "F2" date 03/10/2005
> bios0: Gigabyte Technology Co., Ltd. 0000000000
> acpi0 at bios0: rev 0
> acpi0: tables DSDT FACP MCFG APIC SSDT SSDT
> acpi0: wakeup devices PEX0(S5) PEX1(S5) PEX2(S5) PEX3(S5) HUB0(S5) UAR1(S1) PS2M(S1) PS2K(S1) USB0(S4) USB1(S4) USB2(S4) USB3(S4) USBE(S4) AC97(S5) MC97(S5) AZAL(S5) PCI0(S5)
> acpitimer0 at acpi0: 3579545 Hz, 24 bits
> acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat
> cpu0 at mainbus0: apid 0 (boot processor)
> cpu0: apic clock running at 199MHz
> ioapic0 at mainbus0: apid 2 pa 0xfec00000, version 20, 24 pins
> acpiprt0 at acpi0: bus 0 (PCI0)
> acpiprt1 at acpi0: bus 2 (PEX0)
> acpiprt2 at acpi0: bus 3 (PEX1)
> acpiprt3 at acpi0: bus -1 (PEX2)
> acpiprt4 at acpi0: bus -1 (PEX3)
> acpiprt5 at acpi0: bus 4 (HUB0)
> acpicpu0 at acpi0: FVS, 3400, 2800 MHz
> acpibtn0 at acpi0: PWRB
> bios0: ROM list: 0xc0000/0xec00 0xd0000/0x1800 0xef000/0x1000!
> pci0 at mainbus0 bus 0: configuration mode 1 (bios)
> pchb0 at pci0 dev 0 function 0 "Intel 82925X Host" rev 0x05
> ppb0 at pci0 dev 1 function 0 "Intel 82925X PCIE" rev 0x05: apic 2 int 16 (irq 5)
> pci1 at ppb0 bus 1
> vga1 at pci1 dev 0 function 0 "NVIDIA GeForce 7600 GS" rev 0xa1
> wsdisplay0 at vga1 mux 1: console (80x25, vt100 emulation)
> wsdisplay0: screen 1-5 added (80x25, vt100 emulation)
> azalia0 at pci0 dev 27 function 0 "Intel 82801FB HD Audio" rev 0x03: apic 2 int 16 (irq 5)
> azalia0: codecs: Realtek ALC260
> audio0 at azalia0
> ppb1 at pci0 dev 28 function 0 "Intel 82801FB PCIE" rev 0x03: apic 2 int 16 (irq 5)
> pci2 at ppb1 bus 2
> ppb2 at pci0 dev 28 function 1 "Intel 82801FB PCIE" rev 0x03: apic 2 int 17 (irq 10)
> pci3 at ppb2 bus 3
> bge0 at pci3 dev 0 function 0 "Broadcom BCM5751" rev 0x01, BCM5750 A1 (0x4001): apic 2 int 17 (irq 10), address 00:14:85:1d:03:a8
> brgphy0 at bge0 phy 1: BCM5750 10/100/1000baseT PHY, rev. 0
> uhci0 at pci0 dev 29 function 0 "Intel 82801FB USB" rev 0x03: apic 2 int 23 (irq 3)
> uhci1 at pci0 dev 29 function 1 "Intel 82801FB USB" rev 0x03: apic 2 int 19 (irq 11)
> uhci2 at pci0 dev 29 function 2 "Intel 82801FB USB" rev 0x03: apic 2 int 18 (irq 11)
> uhci3 at pci0 dev 29 function 3 "Intel 82801FB USB" rev 0x03: apic 2 int 16 (irq 5)
> ehci0 at pci0 dev 29 function 7 "Intel 82801FB USB" rev 0x03: apic 2 int 23 (irq 3)
> usb0 at ehci0: USB revision 2.0
> uhub0 at usb0 "Intel EHCI root hub" rev 2.00/1.00 addr 1
> ppb3 at pci0 dev 30 function 0 "Intel 82801BA Hub-to-PCI" rev 0xd3
> pci4 at ppb3 bus 4
> "TI TSB43AB23 FireWire" rev 0x00 at pci4 dev 5 function 0 not configured
> ichpcib0 at pci0 dev 31 function 0 "Intel 82801FB LPC" rev 0x03: PM disabled
> pciide0 at pci0 dev 31 function 1 "Intel 82801FB IDE" rev 0x03: DMA, channel 0 configured to compatibility, channel 1 configured to compatibility
> atapiscsi0 at pciide0 channel 0 drive 0
> scsibus0 at atapiscsi0: 2 targets
> cd0 at scsibus0 targ 0 lun 0: <LITE-ON, DVDRW SHW-160P6S, PS08> ATAPI 5/cdrom removable
> cd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 4
> pciide0: channel 1 disabled (no drives)
> pciide2 at pci0 dev 31 function 2 "Intel 82801FR SATA" rev 0x03: DMA, channel 0 configured to native-PCI, channel 1 configured to native-PCI
> pciide2: using apic 2 int 19 (irq 11) for native-PCI interrupt
> wd1 at pciide2 channel 0 drive 0: <ST31000520AS>
> wd1: 16-sector PIO, LBA48, 953869MB, 1953525168 sectors
> wd1(pciide2:0:0): using PIO mode 4, Ultra-DMA mode 5
> wd0 at pciide2 channel 1 drive 1: <SAMSUNG SP2504C>
> wd0: 16-sector PIO, LBA48, 238475MB, 488397168 sectors
> wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 5
> ichiic0 at pci0 dev 31 function 3 "Intel 82801FB SMBus" rev 0x03: apic 2 int 19 (irq 11)
> iic0 at ichiic0
> spdmem0 at iic0 addr 0x50: 1GB DDR2 SDRAM non-parity PC2-4200CL3
> usb1 at uhci0: USB revision 1.0
> uhub1 at usb1 "Intel UHCI root hub" rev 1.00/1.00 addr 1
> usb2 at uhci1: USB revision 1.0
> uhub2 at usb2 "Intel UHCI root hub" rev 1.00/1.00 addr 1
> usb3 at uhci2: USB revision 1.0
> uhub3 at usb3 "Intel UHCI root hub" rev 1.00/1.00 addr 1
> usb4 at uhci3: USB revision 1.0
> uhub4 at usb4 "Intel UHCI root hub" rev 1.00/1.00 addr 1
> isa0 at ichpcib0
> isadma0 at isa0
> com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
> pckbc0 at isa0 port 0x60/5
> pckbd0 at pckbc0 (kbd slot)
> pckbc0: using irq 1 for kbd slot
> wskbd0 at pckbd0: console keyboard, using wsdisplay0
> pmsi0 at pckbc0 (aux slot)
> pckbc0: using irq 12 for aux slot
> wsmouse0 at pmsi0 mux 0
> pcppi0 at isa0 port 0x61
> midi0 at pcppi0: <PC speaker>
> spkr0 at pcppi0
> lpt0 at isa0 port 0x378/4 irq 7
> it0 at isa0 port 0x2e/2: IT8712F rev 7, EC port 0x290
> npx0 at isa0 port 0xf0/16: reported by CPUID; using exception 16
> fdc0 at isa0 port 0x3f0/6 irq 6 drq 2
> mtrr: Pentium Pro MTRR support
> Kernelized RAIDframe activated
> umass0 at uhub0 port 8 configuration 1 interface 0 "Generic USB2.0 Card Reader" rev 2.00/1.9c addr 2
> umass0: using SCSI over Bulk-Only
> scsibus1 at umass0: 2 targets, initiator 0
> sd0 at scsibus1 targ 1 lun 0: <Generic, IC1210 CF, 1.9C> SCSI0 0/direct removable
> sd0: drive offline
> sd1 at scsibus1 targ 1 lun 1: <Generic, IC1210 MS, 1.9C> SCSI0 0/direct removable
> sd1: drive offline
> sd2 at scsibus1 targ 1 lun 2: <Generic, IC1210 MMC/SD, 1.9C> SCSI0 0/direct removable
> sd2: drive offline
> sd3 at scsibus1 targ 1 lun 3: <Generic, IC1210 SM, 1.9C> SCSI0 0/direct removable
> sd3: drive offline
> cd0(atapiscsi0:0:0): Check Condition (error 0x70) on opcode 0x0
>    SENSE KEY: Not Ready
>     ASC/ASCQ: Medium Not Present
> softraid0 at root
> root on wd1a swap on wd1b dump on wd1b
> WARNING: / was not properly unmounted
> raidlookup on device: /dev/wd2f failed !
> Hosed component: /dev/wd2f.
> Hosed component: /dev/wd2f.
> raid2: Component /dev/wd0f being configured at row: 0 col: 0
>         Row: 0 Column: 0 Num Rows: 1 Num Columns: 2
>         Version: 2 Serial Number: 2007112802 Mod Counter: 364
>         Clean: No Status: 0
> /dev/wd0f is not clean !
> raid2: Ignoring /dev/wd2f.
> raid2 at root
>
>        Here is the raid configuration file:
> # cat /etc/raid-stuff/raid2-big.conf
> START array
> # numRow numCol numSpare
> 1 2 0
>
> START disks
> /dev/wd0f
> /dev/wd2f
>
> START layout
> # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
> 128 1 1 1
>
> START queue
> fifo 100
