Bug#105924: marked as done (raidtools2: data loss when recovering from multiple "bad" disks)

Debian Bug Tracking System Mon, 14 Apr 2008 20:21:41 -0700

Your message dated Mon, 14 Apr 2008 23:10:16 -0400
with message-id <[EMAIL PROTECTED]>
and subject line raidtools2 has been removed from Debian, closing #105924
has caused the Debian Bug report #105924,
regarding raidtools2: data loss when recovering from multiple "bad" disks
to be marked as done.


This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [EMAIL PROTECTED]
immediately.)


-- 
105924: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=105924
Debian Bug Tracking System
Contact [EMAIL PROTECTED] with problems

--- Begin Message ---

Package: raidtools2
Version: 0.90.990824-11
Severity: grave

I just lost two weeks worth of data on my primary raid due to an error in
raidtools reconstruction procedure.  I'm still trying to work out exactly
what happened.  This is mostly my own fault for not making backups and
ignoring a known problem, but, raidtools could have performed better in
this case.

I have /dev/md0 mounted on /home, so I still have /var/log intact and
can go through and figure out exactly what broke when.  For me, /dev/md0
is a raid1 (mirror) combination of /dev/hde1 and /dev/hdg1.

The first sign of trouble is here, this seems to be flakey hardware or
a kernel bug causing DMA problems:

Jul  4 23:52:37 ale kernel: hdg: timeout waiting for DMA
Jul  4 23:52:37 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func
only: 14
Jul  4 23:52:37 ale kernel: hdg: irq timeout: status=0x50 { DriveReady 
SeekComplete }
Jul  4 23:52:37 ale kernel: hde: timeout waiting for DMA
Jul  4 23:52:37 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func
only: 14
Jul  4 23:52:37 ale kernel: hde: irq timeout: status=0x50 { DriveReady 
SeekComplete }
Jul  4 23:52:45 ale kernel: hdg: timeout waiting for DMA
Jul  4 23:52:45 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func
only: 14
Jul  4 23:52:46 ale kernel: hdg: irq timeout: status=0x50 { DriveReady 
SeekComplete }
Jul  5 00:07:08 ale -- MARK --
Jul  5 00:27:08 ale -- MARK --
Jul  5 00:47:08 ale -- MARK --
Jul  5 00:48:28 ale kernel: hde: status timeout: status=0x80 { Busy }
Jul  5 00:48:28 ale kernel: hde: DMA disabled
Jul  5 00:48:29 ale kernel: ide2: reset: master: error (0x00?)
Jul  5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul  5 00:48:29 ale kernel: hde: drive not ready for command
Jul  5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul  5 00:48:29 ale kernel: hde: drive not ready for command
Jul  5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul  5 00:48:29 ale kernel: hde: drive not ready for command
Jul  5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul  5 00:48:29 ale kernel: hde: drive not ready for command
Jul  5 00:48:29 ale kernel: hde: drive not ready for command
Jul  5 00:48:32 ale kernel: ide2: reset: master: error (0x00?)
Jul  5 00:48:32 ale kernel: hde: status error: status=0x00 { }
Jul  5 00:48:32 ale kernel: end_request: I/O error, dev 21:01 (hde), sector 
16817840
Jul  5 00:48:32 ale kernel: ^IOperation continuing on 1 devices

At this point /dev/hde1 is marked bad, and the raid continues in degraded
mode using /dev/hdg1 only.  Later, it happens on the other drive:

Jul  5 13:29:33 ale kernel: hdg: lost interrupt
Jul  5 13:30:03 ale last message repeated 3 times
Jul  5 13:31:03 ale last message repeated 6 times
Jul  5 13:32:03 ale last message repeated 6 times
Jul  5 13:32:43 ale last message repeated 4 times
Jul  5 13:32:54 ale kernel: hdg: irq timeout: status=0x80 { Busy }
Jul  5 13:32:56 ale kernel: ide3: reset: master: error (0x00?)
Jul  5 13:32:56 ale kernel: hdg: status error: status=0x00 { }
Jul  5 13:32:56 ale kernel: hdg: drive not ready for command
[clip]
Jul  5 13:32:58 ale kernel: ide3: reset: master: error (0x00?)
Jul  5 13:32:58 ale kernel: hdg: status error: status=0x00 { }
Jul  5 13:32:58 ale kernel: end_request: I/O error, dev 22:01 (hdg), sector 
77070400
[clip]
Jul  5 13:32:58 ale kernel: ide3: reset: master: error (0x00?)<6>md: recovery 
thread got woken up ...
Jul  5 13:32:58 ale kernel:
Jul  5 13:32:58 ale kernel: hdg: status error: status=0x00<6>md: recovery 
thread finished ...

Errors like this pour into /var/log/messages at very high speed until:

Jul  5 13:36:21 ale kernel: hdg: status error: status=0x00 { }
Jul  5 13:36:22 ale kernel: hdg: drive not ready for command
Jul  5 13:36:30 ale kernel: ide3: reset: success

and then all is well again.  Or is it?  At the end of this thrashing,
/dev/hdg1 has also been marked as bad, but the machine keeps using it.
I neglect the machine, knowing it's in degraded mode, but not having
the time to go fix it.

Eventually an unrelated problem crops up which requires attention:

Jul 19 02:02:53 ale kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jul 19 02:02:56 ale kernel: nfs: server cypress not responding, still trying
Jul 19 02:02:56 ale last message repeated 4 times

Its network card vanishes.  I don't have a clue what caused this, but I
can't ignore it any longer, so I come in and power cycle the machine.
Big mistake.

Jul 19 11:53:10 ale kernel: md: raid0 personality registered
Jul 19 11:53:10 ale kernel: md: raid1 personality registered
Jul 19 11:53:10 ale kernel: md: md driver 0.90.0 MAX_MD_DEVS=256, 
MD_SB_DISKS=27Jul 19 11:53:10 ale kernel: md: Autodetecting RAID arrays.
Jul 19 11:53:10 ale kernel: (read) hde1's sb offset: 60030336 [events: 
0000004b]Jul 19 11:53:10 ale kernel: (read) hdg1's sb offset: 60030336 [events: 
0000004b]Jul 19 11:53:10 ale kernel: md: autorun ...
Jul 19 11:53:10 ale kernel: md: considering hdg1 ...
Jul 19 11:53:10 ale kernel: md:  adding hdg1 ...
Jul 19 11:53:10 ale kernel: md:  adding hde1 ...
Jul 19 11:53:10 ale kernel: md: created md0
Jul 19 11:53:10 ale kernel: md: bind<hde1,1>
Jul 19 11:53:10 ale kernel: md: bind<hdg1,2>
Jul 19 11:53:10 ale kernel: md: running: <hdg1><hde1>
Jul 19 11:53:10 ale kernel: md: now!
Jul 19 11:53:10 ale kernel: md: hdg1's event counter: 0000004b
Jul 19 11:53:10 ale kernel: md: hde1's event counter: 0000004b
Jul 19 11:53:10 ale kernel: md0: max total readahead window set to 124k
Jul 19 11:53:10 ale kernel: md0: 1 data-disks, max readahead per data-disk: 
124kJul 19 11:53:10 ale kernel: raid1: device hdg1 operational as mirror 1
Jul 19 11:53:10 ale kernel: raid1: device hde1 operational as mirror 0
Jul 19 11:53:10 ale kernel: raid1: raid set md0 not clean; reconstructing 
mirrors
Jul 19 11:53:10 ale kernel: raid1: raid set md0 active with 2 out of 2 mirrors
Jul 19 11:53:10 ale kernel: md: syncing RAID array md0
Jul 19 11:53:10 ale kernel: md: minimum _guaranteed_ reconstruction speed: 100 
KB/sec/disc.
Jul 19 11:53:10 ale kernel: md: updating md0 RAID superblock on device
Jul 19 11:53:10 ale kernel: md: using maximum available idle IO bandwith (but 
not more than 100000 KB/sec) for reconstruction.
Jul 19 11:53:10 ale kernel: md: <6>md: using 124k window, over a total of 
60030336 blocks.
Jul 19 11:53:10 ale kernel: hdg1 [events: 0000004c](write) hdg1's sb offset: 
60030336
Jul 19 11:53:10 ale kernel: md: hde1 [events: 0000004c](write) hde1's sb 
offset: 60030336
Jul 19 11:53:10 ale kernel: md: ... autorun DONE.

It copies the contents of /dev/hde1, which was marked bad first, onto
/dev/hdg1, which was marked bad later, overwriting two weeks worth of
changes.

I blame myself for this, *but* this should not have happened.  The recovery
process should have copied /dev/hdg1 onto /dev/hde1, and not the other way
around!

Now, I'm not really sure where raidtools starts and the kernel ends, so
this may actually be a kernel problem.  It was running 2.4.5 at the time
this happened.   If you could forward this report to whoever is most
responsible for the bits that handle reconstruction, I'd appreciate it.

Thanks,

Eric

-- System Information
Debian Release: testing/unstable
Kernel Version: Linux ale 2.4.6 #1 SMP Thu Jul 5 14:08:45 EDT 2001 i686 unknown

Versions of the packages raidtools2 depends on:
ii  debconf        0.9.66         Debian configuration management system
ii  libc6          2.2.3-6        GNU C Library: Shared libraries and Timezone

--- End Message ---

--- Begin Message ---

Version: 1.00.3-17+rm

The raidtools2 package has been removed from Debian testing, unstable and
experimental, so I am now closing the bugs that were still opened
against it.

For more information about this package's removal, read
http://bugs.debian.org/298968 . That bug might give the reasons why
this package was removed, and suggestions of possible replacements.

Don't hesitate to reply to this mail if you have any question.

Thank you for your contribution to Debian.
Barry deFreese

--- End Message ---

Bug#105924: marked as done (raidtools2: data loss when recovering from multiple "bad" disks)

Reply via email to