Bug#582275: ext3 filesystem corruption on md RAID1 device
severity 582275 normal
thanks

On Fri, Jun 18, 2010 at 07:25:18AM -0400, Theodore Tso wrote:
> On Jun 18, 2010, at 7:09 AM, Theodore Tso wrote:
> > It could be an e2fsck bug, or it could be a hardware issue. In my experience, every time I've tried digging into problems with e2fsck -fy not fixing all problems in a single pass, it's been a hardware problem. That being said, multiply claimed blocks is something that isn't exercised that much, so it's *possible* that it is an e2fsck bug. I really would need a reproducible test case before I could do anything though.
>
> And reviewing the thread, there is the fact that you are reliably getting directory corruption every single time you're booting, and the reliability of the hardware has been called into question (forgive me for being a little suspicious of people trying to do reliable storage using IDE devices).

Adjusting severity, since it appears to be limited to the submitter's hardware.

Cheers,
Moritz

--
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20100805214805.ga28...@galadriel.inutil.org
Bug#582275: ext3 filesystem corruption on md RAID1 device
Hi Jan,

I tried for a while with alternate hardware and the original controller, but the error never happened again. I think your idea of a bug in e2fsck's handling of multiply claimed blocks is the only explanation: maybe a corruption long ago left a number of multiply claimed blocks, and their number was reduced with each reboot. Is this possible? How can this be fixed?

Best regards,
Reiner.

-----Original Message-----
From: Jan Kara [mailto:j...@suse.cz]
Sent: Tuesday, June 01, 2010 12:23 PM
To: Buehl, Reiner
Subject: Re: ext3 filesystem corruption on md RAID1 device

Hi,

On Tue 01-06-10 07:25:28, Buehl, Reiner wrote:
> I have organized a second Promise SATA controller now. Should I move one or both of the disks from my RAID1 from the Intel to the Promise controller for testing?

I have no strong opinion. I'd just try moving the primary disk to the new controller and see what happens...

> Regarding the repeated fscks: It seems that the fsck does not completely fix everything even though it states so. The two attached fsck outputs that you had a look at earlier are from two fscks that I did directly after each other without any reboot. They are just a minute or two apart and both find problems in the same group as far as I can see.

Yes, I noticed that last time, but I blamed it on a bug in e2fsck's handling of multiply claimed blocks. So if you run fsck a third time, does it still complain (note that during the second run, there were no more multiply claimed blocks)?

Honza

-----Original Message-----
From: linux-ide-ow...@vger.kernel.org [mailto:linux-ide-ow...@vger.kernel.org] On Behalf Of Jan Kara
Sent: Monday, May 31, 2010 10:56 PM
To: Buehl, Reiner
Cc: Jan Kara; ty...@mit.edu; linux-...@vger.kernel.org; linux-fsde...@vger.kernel.org; 582...@bugs.debian.org
Subject: Re: ext3 filesystem corruption on md RAID1 device

Hi,

On Sat 29-05-10 13:48:56, Buehl, Reiner wrote:
> I have an unused PATA controller on the mainboard - unfortunately I do not have two SATA-to-PATA converters. Do you think that connecting the two disks to the PATA controller is a good option? If so, I would have to buy two adapters.

I have heard that boards with both PATA and SATA slots have just one disk controller, which merely offers different kinds of slots. So I'm not sure that would change anything. More interesting would be if you could borrow a SATA controller that you could plug into a PCI slot or so... Alternatively, you could also try shuffling disks in the slots you already have.

> Is there a way to fix the current file system problem with the original setup first since repeated FSCKs always seem to return errors for the same area again and again?

Do you mean that fsck does not fix all the problems? So if you run fsck -f -y several times in a row, it keeps reporting problems? What kind of problems are they?

Honza

-----Original Message-----
From: Jan Kara [mailto:j...@suse.cz]
Sent: Thursday, May 27, 2010 10:13 PM
To: Buehl, Reiner
Cc: ty...@mit.edu; linux-...@vger.kernel.org; linux-fsde...@vger.kernel.org
Subject: Re: ext3 filesystem corruption on md RAID1 device

Hi,

On Sun 23-05-10 05:46:29, Buehl, Reiner wrote:
> Hi Ted, please find attached the output of two fsck.ext3 -fy /dev/md1 runs conducted directly after each other. The ext3 fs error message in dmesg was:

I took a look at the fsck logs. So there are inodes 17269110-17269115 (from group 2108 if my counting is correct) that have problems. 17269110 is the corrupted directory, 17269111 is an unconnected directory (it was a subdir of 17269110), and 17269112-17269115 share blocks with some other inodes. Interestingly enough, these other inodes are all in group 2120, and the blocks that are shared are also in group 2120. Multiply claimed blocks in this amount are usually caused by a corrupted block bitmap. In your case, it seems as if the bitmap for group 2120 was not written (or was zeroed?) and thus some inodes later reused the space.

This kind of corruption is usually caused by HW: flaky memory or a flaky disk controller (I wouldn't suspect the disks in your case, since the problem seems to happen consistently on both the original disk and the mirror). Do you have any chance of trying different HW?

Honza
--
Jan Kara j...@suse.cz
SUSE Labs, CR
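Jan's explanation of how a lost block bitmap produces multiply claimed blocks can be illustrated with a toy model. The sketch below is not ext3 code: the 16-block group, the first-fit allocator, and the owner names A, B and C are all made up for illustration. It only shows the mechanism he describes, namely that once a bitmap reads back as zeros, later allocations hand out blocks that older inodes still reference.

```python
def allocate(bitmap, count):
    """First-fit allocator: take the first `count` free blocks and mark them used."""
    picked = []
    for block, used in enumerate(bitmap):
        if len(picked) == count:
            break
        if not used:
            bitmap[block] = 1
            picked.append(block)
    if len(picked) < count:
        raise RuntimeError("not enough free blocks")
    return picked


def multiply_claimed(owners):
    """Return {block: [owners]} for every block referenced by more than one owner."""
    claims = {}
    for owner, blocks in owners.items():
        for block in blocks:
            claims.setdefault(block, []).append(owner)
    return {b: sorted(o) for b, o in claims.items() if len(o) > 1}


def simulate():
    bitmap = [0] * 16                    # hypothetical group with 16 blocks
    owners = {}
    owners["A"] = allocate(bitmap, 4)    # gets blocks 0-3
    owners["B"] = allocate(bitmap, 4)    # gets blocks 4-7
    # Simulated fault: the bitmap update never reached the disk (or was
    # zeroed), so after a reload every block looks free again.
    bitmap = [0] * 16
    owners["C"] = allocate(bitmap, 6)    # gets blocks 0-5, already in use
    return multiply_claimed(owners)


if __name__ == "__main__":
    for block, claimants in sorted(simulate().items()):
        print(f"block {block} claimed by {', '.join(claimants)}")
```

In this model the newer owner C ends up sharing blocks with the older owners A and B, which matches the pattern Jan reports: a handful of inodes sharing blocks with other inodes, all inside the group whose bitmap was lost.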
Bug#582275: ext3 filesystem corruption on md RAID1 device
On Jun 18, 2010, at 3:12 AM, Buehl, Reiner wrote:
> Hi Jan,
>
> I tried for a while with alternate hardware and the original controller, but the error never happened again. I think your idea of a bug in e2fsck's handling of multiply claimed blocks is the only explanation: maybe a corruption long ago left a number of multiply claimed blocks, and their number was reduced with each reboot. Is this possible? How can this be fixed?

It could be an e2fsck bug, or it could be a hardware issue. In my experience, every time I've tried digging into problems with e2fsck -fy not fixing all problems in a single pass, it's been a hardware problem. That being said, multiply claimed blocks is something that isn't exercised that much, so it's *possible* that it is an e2fsck bug. I really would need a reproducible test case before I could do anything though.

-- Ted

Archive: http://lists.debian.org/ee6e2e73-39b6-4d90-b2e2-3ba2c1dd4...@mit.edu
Bug#582275: ext3 filesystem corruption on md RAID1 device
On Jun 18, 2010, at 7:09 AM, Theodore Tso wrote:
> It could be an e2fsck bug, or it could be a hardware issue. In my experience, every time I've tried digging into problems with e2fsck -fy not fixing all problems in a single pass, it's been a hardware problem. That being said, multiply claimed blocks is something that isn't exercised that much, so it's *possible* that it is an e2fsck bug. I really would need a reproducible test case before I could do anything though.

And reviewing the thread, there is the fact that you are reliably getting directory corruption every single time you're booting, and the reliability of the hardware has been called into question (forgive me for being a little suspicious of people trying to do reliable storage using IDE devices).

-- Ted

Archive: http://lists.debian.org/bb158359-ccdb-49e6-975d-b2a9f3062...@mit.edu
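For anyone trying to build the reproducible test case Ted asks for, it helps to record what each forced check actually reported. Below is a minimal sketch, assuming the exit-status bits documented in the e2fsck man page. The helper names `describe_exit`, `converged` and `run_forced_check` are hypothetical, and `run_forced_check` is never called here because it needs root and an unmounted file system.

```python
import subprocess

# Exit status bits as listed in the e2fsck(8) man page; the status is their sum.
E2FSCK_FLAGS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def describe_exit(status):
    """Decode an e2fsck exit status into human-readable conditions."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in sorted(E2FSCK_FLAGS.items()) if status & bit]


def converged(statuses):
    """True if the last of a series of forced checks came back clean."""
    return bool(statuses) and statuses[-1] == 0


def run_forced_check(device):
    """Run `e2fsck -f -y` on an unmounted device and return its exit status.

    Not called in this sketch; it requires root and the device must not be mounted.
    """
    return subprocess.run(["e2fsck", "-f", "-y", device]).returncode


if __name__ == "__main__":
    # Hypothetical series: two runs that each fixed something, then a clean run.
    history = [1, 1, 0]
    for i, status in enumerate(history, start=1):
        print(f"run {i}: {', '.join(describe_exit(status))}")
    print("converged:", converged(history))
```

A series that keeps returning 1 or 4 on an otherwise idle, unmounted device is the symptom discussed in this thread; on its own it does not say whether e2fsck or the hardware is at fault.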
Bug#582275: ext3 filesystem corruption on md RAID1 device
Hi,

On Sat 29-05-10 13:48:56, Buehl, Reiner wrote:
> I have an unused PATA controller on the mainboard - unfortunately I do not have two SATA-to-PATA converters. Do you think that connecting the two disks to the PATA controller is a good option? If so, I would have to buy two adapters.

I have heard that boards with both PATA and SATA slots have just one disk controller, which merely offers different kinds of slots. So I'm not sure that would change anything. More interesting would be if you could borrow a SATA controller that you could plug into a PCI slot or so... Alternatively, you could also try shuffling disks in the slots you already have.

> Is there a way to fix the current file system problem with the original setup first since repeated FSCKs always seem to return errors for the same area again and again?

Do you mean that fsck does not fix all the problems? So if you run fsck -f -y several times in a row, it keeps reporting problems? What kind of problems are they?

Honza

-----Original Message-----
From: Jan Kara [mailto:j...@suse.cz]
Sent: Thursday, May 27, 2010 10:13 PM
To: Buehl, Reiner
Cc: ty...@mit.edu; linux-...@vger.kernel.org; linux-fsde...@vger.kernel.org
Subject: Re: ext3 filesystem corruption on md RAID1 device

Hi,

On Sun 23-05-10 05:46:29, Buehl, Reiner wrote:
> Hi Ted, please find attached the output of two fsck.ext3 -fy /dev/md1 runs conducted directly after each other. The ext3 fs error message in dmesg was:

I took a look at the fsck logs. So there are inodes 17269110-17269115 (from group 2108 if my counting is correct) that have problems. 17269110 is the corrupted directory, 17269111 is an unconnected directory (it was a subdir of 17269110), and 17269112-17269115 share blocks with some other inodes. Interestingly enough, these other inodes are all in group 2120, and the blocks that are shared are also in group 2120. Multiply claimed blocks in this amount are usually caused by a corrupted block bitmap. In your case, it seems as if the bitmap for group 2120 was not written (or was zeroed?) and thus some inodes later reused the space.

This kind of corruption is usually caused by HW: flaky memory or a flaky disk controller (I wouldn't suspect the disks in your case, since the problem seems to happen consistently on both the original disk and the mirror). Do you have any chance of trying different HW?

Honza
--
Jan Kara j...@suse.cz
SUSE Labs, CR

Archive: http://lists.debian.org/20100531205547.gh5...@quack.suse.cz
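Jan's "from group 2108 if my counting is correct" can be checked with the usual ext2/ext3 rule that inode numbers start at 1 and are laid out group by group. The sketch below assumes 8192 inodes per group. That value is not stated anywhere in the thread; it is simply a common mke2fs result that happens to reproduce Jan's figure. The real value for a given file system is shown as "Inodes per group" by `dumpe2fs -h`.

```python
INODES_PER_GROUP = 8192  # assumption, not taken from the thread


def inode_location(ino, inodes_per_group=INODES_PER_GROUP):
    """Return (block group, index within that group's inode table) for an inode number."""
    if ino < 1:
        raise ValueError("inode numbers start at 1")
    return divmod(ino - 1, inodes_per_group)


if __name__ == "__main__":
    # The inode range Jan lists as affected.
    for ino in range(17269110, 17269116):
        group, index = inode_location(ino)
        print(f"inode {ino}: group {group}, index {index}")
```

Under that assumption all six inodes fall in group 2108, at consecutive inode-table indexes starting from 373, which agrees with Jan's count.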
Bug#582275: ext3 filesystem corruption on md RAID1 device
Hi,

I have an unused PATA controller on the mainboard - unfortunately I do not have two SATA-to-PATA converters. Do you think that connecting the two disks to the PATA controller is a good option? If so, I would have to buy two adapters.

Is there a way to fix the current file system problem with the original setup first, since repeated FSCKs always seem to return errors for the same area again and again?

Best regards,
Reiner.

-----Original Message-----
From: Jan Kara [mailto:j...@suse.cz]
Sent: Thursday, May 27, 2010 10:13 PM
To: Buehl, Reiner
Cc: ty...@mit.edu; linux-...@vger.kernel.org; linux-fsde...@vger.kernel.org
Subject: Re: ext3 filesystem corruption on md RAID1 device

Hi,

On Sun 23-05-10 05:46:29, Buehl, Reiner wrote:
> Hi Ted, please find attached the output of two fsck.ext3 -fy /dev/md1 runs conducted directly after each other. The ext3 fs error message in dmesg was:

I took a look at the fsck logs. So there are inodes 17269110-17269115 (from group 2108 if my counting is correct) that have problems. 17269110 is the corrupted directory, 17269111 is an unconnected directory (it was a subdir of 17269110), and 17269112-17269115 share blocks with some other inodes. Interestingly enough, these other inodes are all in group 2120, and the blocks that are shared are also in group 2120. Multiply claimed blocks in this amount are usually caused by a corrupted block bitmap. In your case, it seems as if the bitmap for group 2120 was not written (or was zeroed?) and thus some inodes later reused the space.

This kind of corruption is usually caused by HW: flaky memory or a flaky disk controller (I wouldn't suspect the disks in your case, since the problem seems to happen consistently on both the original disk and the mirror). Do you have any chance of trying different HW?

Honza
--
Jan Kara j...@suse.cz
SUSE Labs, CR

Archive: http://lists.debian.org/ba8a2107b0fd8a48aeb0405bdc36ce4f3f495c3...@gvw1115exc.americas.hpqcorp.net