Bug#582275: ext3 filesystem corruption on md RAID1 device

2010-08-05 Thread Moritz Muehlenhoff
severity 582275 normal
thanks

On Fri, Jun 18, 2010 at 07:25:18AM -0400, Theodore Tso wrote:
 
 On Jun 18, 2010, at 7:09 AM, Theodore Tso wrote:
  
  It could be an e2fsck bug, or it could be a hardware issue.  In my
  experience, every time I've tried digging into problems with e2fsck -fy
  not fixing all problems in a single pass, it's been a hardware problem.
  That being said, multiply claimed blocks is something that isn't
  exercised that much, so it's *possible* that it is an e2fsck bug.
  
  I really would need a reproducible test case before I could do anything
  though.
 
 And reviewing the thread, I see that you are reliably getting directory
 corruption every single time you boot, and that the reliability of the
 hardware has been called into question (forgive me for being a little
 suspicious of people trying to do reliable storage using IDE devices).

Adjusting severity, since it appears to be limited to the submitter's
hardware.

Cheers,
Moritz



-- 
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20100805214805.ga28...@galadriel.inutil.org



Bug#582275: ext3 filesystem corruption on md RAID1 device

2010-06-18 Thread Buehl, Reiner
Hi Jan,

I tried for a while with alternate hardware and with the original controller,
but the error never happened again. I think your idea of a bug in e2fsck's
handling of multiply claimed blocks is the only explanation: maybe a corruption
long ago left a number of multiply claimed blocks, and their number was reduced
with each reboot. Is this possible? How can this be fixed?

Best regards,
Reiner.

 -----Original Message-----
 From: Jan Kara [mailto:j...@suse.cz]
 Sent: Tuesday, June 01, 2010 12:23 PM
 To: Buehl, Reiner
 Subject: Re: ext3 filesystem corruption on md RAID1 device

   Hi,

 On Tue 01-06-10 07:25:28, Buehl, Reiner wrote:
  I have organized a second Promise SATA controller now. Should I move one
  or both of the disks from my RAID1 from the Intel to the Promise
  controller for testing?
   I have no strong opinion. I'd just try moving the primary disk to the new
 controller and see what happens...

  Regarding the repeated fscks: It seems that the fsck does not completely
  fix everything even though it states so. The two attached fsck outputs
  that you had a look at earlier are from two fscks that I did directly
  after each other without any reboot. They are just a minute or two apart
  and both find problems in the same group as far as I can see.
   Yes, I noticed that last time, but I blamed it on a bug in e2fsck's
 handling of multiply claimed blocks. So if you run fsck a third time,
 does it still complain (note that during the second run, there were no
 more multiply claimed blocks)?

   Honza


Bug#582275: ext3 filesystem corruption on md RAID1 device

2010-06-18 Thread Theodore Tso

On Jun 18, 2010, at 3:12 AM, Buehl, Reiner wrote:

 Hi Jan,
 
 I tried for a while with alternate hardware and with the original controller,
 but the error never happened again. I think your idea of a bug in e2fsck's
 handling of multiply claimed blocks is the only explanation: maybe a
 corruption long ago left a number of multiply claimed blocks, and their
 number was reduced with each reboot. Is this possible? How can this be fixed?
 

It could be an e2fsck bug, or it could be a hardware issue.  In my experience,
every time I've tried digging into problems with e2fsck -fy not fixing all
problems in a single pass, it's been a hardware problem.  That being said,
multiply claimed blocks is something that isn't exercised that much, so it's
*possible* that it is an e2fsck bug.

I really would need a reproducible test case before I could do anything though.

 -- Ted








Bug#582275: ext3 filesystem corruption on md RAID1 device

2010-06-18 Thread Theodore Tso

On Jun 18, 2010, at 7:09 AM, Theodore Tso wrote:
 
 It could be an e2fsck bug, or it could be a hardware issue.  In my
 experience, every time I've tried digging into problems with e2fsck -fy
 not fixing all problems in a single pass, it's been a hardware problem.
 That being said, multiply claimed blocks is something that isn't
 exercised that much, so it's *possible* that it is an e2fsck bug.
 
 I really would need a reproducible test case before I could do anything
 though.

And reviewing the thread, I see that you are reliably getting directory
corruption every single time you boot, and that the reliability of the
hardware has been called into question (forgive me for being a little
suspicious of people trying to do reliable storage using IDE devices).

-- Ted







Bug#582275: ext3 filesystem corruption on md RAID1 device

2010-05-31 Thread Jan Kara
  Hi,

On Sat 29-05-10 13:48:56, Buehl, Reiner wrote:
 I have an unused PATA controller on the mainboard - unfortunately I do
 not have two SATA-to-PATA converters. Do you think that connecting the
 two disks to the PATA controller is a good option? If so, I would have to
 buy two adapters.
  I have heard that boards with both PATA and SATA slots have just one
disk controller, which only has different kinds of slots. So I'm not sure
that would change anything. It would be more interesting if you could borrow
a SATA controller that you could plug into a PCI slot or so...  Alternatively
you could also try shuffling disks in the slots you already have.

 Is there a way to fix the current file system problem with the original
 setup first since repeated FSCKs always seem to return errors for the
 same area again and again?
  Do you mean that fsck does not fix all the problems? So if you run fsck
-f -y several times in a row, it keeps reporting problems? What kind of
problems are they?

Honza
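
Whether repeated fsck -f -y runs keep reporting problems can also be judged
from the e2fsck exit status rather than from the log text alone. A minimal
Python sketch, assuming the exit statuses of consecutive runs have been noted
down by hand; the status table follows the e2fsck(8) man page, while the
function names and the sample values are made up for illustration and are not
taken from this report:

```python
# Exit status bits as documented in the e2fsck(8) man page.
E2FSCK_EXIT_BITS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def decode_e2fsck_status(status):
    """Return the list of conditions encoded in one e2fsck exit status."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in sorted(E2FSCK_EXIT_BITS.items())
            if status & bit]


def first_clean_pass(statuses):
    """Return the 1-based number of the first run that exited 0, or None."""
    for n, status in enumerate(statuses, start=1):
        if status == 0:
            return n
    return None


# Hypothetical statuses of three back-to-back runs (not from the bug report).
runs = [1, 1, 0]
for n, status in enumerate(runs, start=1):
    print(n, decode_e2fsck_status(status))
print("first clean pass:", first_clean_pass(runs))
```

If first_clean_pass() returns None for runs done directly after each other
without a reboot, fsck really is not converging, which is the situation the
question above is trying to pin down.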

  -----Original Message-----
  From: Jan Kara [mailto:j...@suse.cz]
  Sent: Thursday, May 27, 2010 10:13 PM
  To: Buehl, Reiner
  Cc: ty...@mit.edu; linux-...@vger.kernel.org; linux-fsde...@vger.kernel.org
  Subject: Re: ext3 filesystem corruption on md RAID1 device
  
    Hi,
  
  On Sun 23-05-10 05:46:29, Buehl, Reiner wrote:
   Hi Ted,
  
   please find attached the output of two fsck.ext3 -fy /dev/md1 runs
   conducted directly after each other. The ext3 fs error message in dmesg
   was:
    I took a look at the fsck logs. So there are inodes 17269110-17269115
  (from group 2108 if my counting is correct) that have problems. 17269110 is
  the corrupted directory, 17269111 is an unconnected directory (was subdir of
  17269110), 17269112-17269115 share blocks with some other inodes.
    Interestingly enough, these other inodes are all in group 2120 and also
  the blocks that are shared are in group 2120.
    Multiply claimed blocks in this amount are usually caused by a corrupted
  block bitmap. In your case, it seems as if bitmap for group 2120 was not
  written (or was zeroed?) and thus later some inodes reused the space. This
  kind of corruption is usually caused by HW - flaky memory or disk
  controller (I wouldn't suspect disks in your case since the problem seems to
  consistently happen on both the original disk and the mirror). Do you have
  any chance of trying a different HW?
  
    Honza
  --
  Jan Kara j...@suse.cz
  SUSE Labs, CR
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
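
The mechanism described in the quoted analysis (a block bitmap that was not
written or was zeroed, so that later allocations reuse space that is still
referenced) can be illustrated with a toy model. This is only a sketch under
loud assumptions: ToyGroup, its first-fit allocator and the inode labels are
invented for illustration and have nothing to do with the real ext3 allocator
or with the data of this report.

```python
class ToyGroup:
    """A block group reduced to a bitmap and an owner table (illustration only)."""

    def __init__(self, nblocks):
        self.bitmap = [False] * nblocks   # True means "block in use"
        self.owners = {}                  # block number -> list of owning inodes

    def allocate(self, inode, count):
        """First-fit allocation that trusts the bitmap and nothing else."""
        got = []
        for block, used in enumerate(self.bitmap):
            if len(got) == count:
                break
            if not used:
                self.bitmap[block] = True
                self.owners.setdefault(block, []).append(inode)
                got.append(block)
        return got

    def multiply_claimed(self):
        """Blocks referenced by more than one inode, as fsck would report them."""
        return {b: o for b, o in self.owners.items() if len(o) > 1}


group = ToyGroup(16)
group.allocate("inode A", 4)

# Simulate the bitmap being lost or zeroed: inode A still points at its
# blocks, but the allocator no longer knows they are in use.
group.bitmap = [False] * 16

group.allocate("inode B", 4)
print(group.multiply_claimed())
```

In this model every block handed to inode B is also still owned by inode A,
which is the "multiply claimed blocks" pattern fsck reports.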






Bug#582275: ext3 filesystem corruption on md RAID1 device

2010-05-29 Thread Buehl, Reiner
Hi,

I have an unused PATA controller on the mainboard - unfortunately I do not have 
two SATA-to-PATA converters. Do you think that connecting the two disks to the 
PATA controller is a good option? If so, I would have to buy two adapters.

Is there a way to fix the current file system problem with the original setup 
first since repeated FSCKs always seem to return errors for the same area again 
and again? 

Best regards,
Reiner.  

 -----Original Message-----
 From: Jan Kara [mailto:j...@suse.cz]
 Sent: Thursday, May 27, 2010 10:13 PM
 To: Buehl, Reiner
 Cc: ty...@mit.edu; linux-...@vger.kernel.org; linux-fsde...@vger.kernel.org
 Subject: Re: ext3 filesystem corruption on md RAID1 device
 
   Hi,
 
 On Sun 23-05-10 05:46:29, Buehl, Reiner wrote:
  Hi Ted,
 
  please find attached the output of two fsck.ext3 -fy /dev/md1 runs
  conducted directly after each other. The ext3 fs error message in dmesg
  was:
   I took a look at the fsck logs. So there are inodes 17269110-17269115
 (from group 2108 if my counting is correct) that have problems. 17269110 is
 the corrupted directory, 17269111 is an unconnected directory (was subdir of
 17269110), 17269112-17269115 share blocks with some other inodes.
   Interestingly enough, these other inodes are all in group 2120 and also
 the blocks that are shared are in group 2120.
   Multiply claimed blocks in this amount are usually caused by a corrupted
 block bitmap. In your case, it seems as if bitmap for group 2120 was not
 written (or was zeroed?) and thus later some inodes reused the space. This
 kind of corruption is usually caused by HW - flaky memory or disk
 controller (I wouldn't suspect disks in your case since the problem seems to
 consistently happen on both the original disk and the mirror). Do you have
 any chance of trying a different HW?
 
   Honza
 --
 Jan Kara j...@suse.cz
 SUSE Labs, CR
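
The "group 2108" count in the quoted analysis can be rechecked with a little
arithmetic: ext2/ext3 inode numbers start at 1, so the group is
(inode - 1) divided by the number of inodes per group. The sketch below
assumes 8192 inodes per group. That value is not stated anywhere in this
thread; it is picked because it reproduces group 2108, and the real figure
should be read from the "Inodes per group" line of dumpe2fs -h for the file
system in question.

```python
def inode_location(ino, inodes_per_group):
    """Map an inode number to (group, index within group); inodes start at 1."""
    return divmod(ino - 1, inodes_per_group)


INODES_PER_GROUP = 8192  # assumption, not taken from the bug report

for ino in range(17269110, 17269116):
    group, index = inode_location(ino, INODES_PER_GROUP)
    print(ino, "-> group", group, "index", index)
```

Under that assumption all six inodes land in group 2108, at consecutive
indexes starting with 373, which agrees with the count given above.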


