[Ocfs2-users] fsck doesn't fix bad chain

2011-09-16 Thread Andre Nathan
Hello

For a while I had seen errors like this in the kernel logs:

  OCFS2: ERROR (device drbd5): ocfs2_validate_gd_parent: Group 
  descriptor #69084874 has bad chain 126
  File system is now read-only due to the potential of on-disk 
  corruption. Please run fsck.ocfs2 once the file system is unmounted.

This always happened on the same device, and whenever it happened I ran
fsck.ocfs2 -fy /dev/drbd5, which showed messages like these:

  [GROUP_FREE_BITS] Group descriptor at block 201309696 claims to have 
  9893 free bits which is more than 9886 bits indicated by the bitmap. 
  Drop its free bit count down to the total? y
  [CHAIN_BITS] Chain 166 in allocator inode 11 has 1264713 bits 
  marked free out of 1516032 total bits but the block groups in the 
  chain have 1264706 free out of 1516032 total.  Fix this by updating 
  the chain record? y
  [CHAIN_GROUP_BITS] Allocator inode 11 has 79407510 bits marked used 
  out of 365955414 total bits but the chains have 79407911 used out of 
  365955414 total.  Fix this by updating the inode counts? y
  [INODE_COUNT] Inode 69085510 has a link count of 0 on disk but 
  directory entry references come to 1. Update the count on disk to 
  match? y
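
For reference, the size of each mismatch in the output above can be worked out directly from the quoted numbers. This is only an arithmetic sketch; the variable names are illustrative and do not come from fsck.ocfs2.

```python
# Numbers copied from the fsck.ocfs2 messages quoted above.
# Variable names are illustrative only, not fsck.ocfs2 identifiers.
group_claimed_free, group_bitmap_free = 9893, 9886        # [GROUP_FREE_BITS]
chain_record_free, chain_groups_free = 1264713, 1264706   # [CHAIN_BITS]
inode_used, chains_used = 79407510, 79407911              # [CHAIN_GROUP_BITS]

# How far each recorded count is from the value fsck computed.
group_delta = group_claimed_free - group_bitmap_free
chain_delta = chain_record_free - chain_groups_free
inode_delta = chains_used - inode_used

print(group_delta, chain_delta, inode_delta)
```

The group descriptor and the chain record are each off by 7 free bits, while the allocator inode's used-bit total differs from the chains by 401 bits.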

As time passed, the frequency of these issues started to increase, and
the last time it happened, I decided to run fsck twice in a row, and was
surprised to see it showed the same messages in both runs. It seems it
was unable to fix the problem.

I identified the files corresponding to the inodes using debugfs.ocfs2
and copied them to a new place, and then moved the copy over the
original file, in order to recreate the inodes. Whenever I did that for
one inode, the error above happened and the filesystem became read-only,
so I had to umount/mount the volume again in order to be able to write
to it again.
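
The copy-and-replace step described above can be sketched roughly as follows. This is a generic illustration and not the exact commands that were run: the helper name `recreate_inode` is hypothetical, and it assumes the copy is made in the same directory as the original so that the final rename stays on one filesystem.

```python
import os
import shutil
import tempfile


def recreate_inode(path):
    """Copy `path` to a new file, then rename the copy over the original.

    The copy is a newly created file, so it gets a freshly allocated inode.
    Returns (old_inode, new_inode). Hypothetical helper for illustration.
    """
    old_ino = os.stat(path).st_ino
    directory = os.path.dirname(os.path.abspath(path))

    # Create the temporary copy next to the original (same filesystem).
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".recreate-")
    os.close(fd)
    try:
        # copy2 copies data, mode and timestamps, but not ownership.
        shutil.copy2(path, tmp)
        # Rename the copy over the original, replacing the old inode's name.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return old_ino, os.stat(path).st_ino
```

Calling `recreate_inode("/some/file")` leaves the same contents under the same name but backed by a different inode. On the affected volume, this was the operation that triggered the error and made a remount necessary.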

After doing this, I ran fsck.ocfs2 -fy again twice, and no errors were
reported. Since then I haven't seen this problem again.

I'm running kernel 2.6.35 and ocfs2-tools 1.6.4.

Has anyone else seen an issue like that?

Thanks
Andre


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Linux kernel crash due to ocfs2

2011-09-16 Thread Sunil Mushran
I got it. But I still don't see the symbols. Maybe we are corrupting the stack.
Maybe this is ppc specific. Do you have an x86/x86_64 box that can access
the same volume? If so I could give you a drop of the same for that arch.

Also, have you run fsck on this volume before? One reason o2image could
fail is if there is a bad block pointer. While it is supposed to handle all such
cases, it is known to miss some cases.

On 09/16/2011 12:06 AM, Betzos Giorgos wrote:
 Please try http://portal-md.glk.gr/ocfs2/core.32578.bz2

 Please let me know, in case you have any problem downloading it.

 Thanks,

 George

 On Thu, 2011-09-15 at 09:45 -0700, Sunil Mushran wrote:
 I was hoping to get a readable stack. Please could you provide a link to
 the coredump.

 On 09/15/2011 02:51 AM, Betzos Giorgos wrote:
 Hello,

 I am sorry for the delay in responding. Unfortunately, it faulted again.

 Here is the log, although my email client folds the Memory Map lines.
 The core file is available.

 Thanks,

 George

 # ./o2image.ppc.dbg /dev/mapper/mpath0 /files_shared/u02.o2image
 *** glibc detected *** ./o2image.ppc.dbg: corrupted double-linked list:
 0x10075000 ***
 === Backtrace: =
 /lib/libc.so.6[0xfeb1ab4]
 /lib/libc.so.6(cfree+0xc8)[0xfeb5b68]
 ./o2image.ppc.dbg[0x1000d098]
 ./o2image.ppc.dbg[0x1000297c]
 ./o2image.ppc.dbg[0x10001eb8]
 ./o2image.ppc.dbg[0x1000228c]
 ./o2image.ppc.dbg[0x10002804]
 ./o2image.ppc.dbg[0x10001eb8]
 ./o2image.ppc.dbg[0x1000228c]
 ./o2image.ppc.dbg[0x10002804]
 ./o2image.ppc.dbg[0x10003bbc]
 ./o2image.ppc.dbg[0x10004480]
 /lib/libc.so.6[0xfe4dc60]
 /lib/libc.so.6[0xfe4dea0]
 === Memory map: 
 0010-0012 r-xp 0010 00:00 0
 [vdso]
 0f43-0f44 r-xp  08:13
 180307 /lib/libcom_err.so.2.1
 0f44-0f45 rw-p  08:13
 180307 /lib/libcom_err.so.2.1
 0f90-0f9c r-xp  08:13
 180293 /lib/libglib-2.0.so.0.1200.3
 0f9c-0f9d rw-p 000b 08:13
 180293 /lib/libglib-2.0.so.0.1200.3
 0fa4-0fa5 r-xp  08:13
 180292 /lib/librt-2.5.so
 0fa5-0fa6 r--p  08:13
 180292 /lib/librt-2.5.so
 0fa6-0fa7 rw-p 0001 08:13
 180292 /lib/librt-2.5.so
 0fce-0fd0 r-xp  08:13
 180291 /lib/libpthread-2.5.so
 0fd0-0fd1 r--p 0001 08:13
 180291 /lib/libpthread-2.5.so
 0fd1-0fd2 rw-p 0002 08:13
 180291 /lib/libpthread-2.5.so
 0fe3-0ffa r-xp  08:13
 180288 /lib/libc-2.5.so
 0ffa-0ffb r--p 0016 08:13
 180288 /lib/libc-2.5.so
 0ffb-0ffc rw-p 0017 08:13
 180288 /lib/libc-2.5.so
 0ffc-0ffe r-xp  08:13
 180287 /lib/ld-2.5.so
 0ffe-0fff r--p 0001 08:13
 180287 /lib/ld-2.5.so
 0fff-1000 rw-p 0002 08:13
 180287 /lib/ld-2.5.so
 1000-1005 r-xp  08:13
 7487795/root/o2image.ppc.dbg
 1005-1006 rw-p 0004 08:13
 7487795/root/o2image.ppc.dbg
 1006-1009 rwxp 1006 00:00 0
 [heap]
 f768-f7ff rw-p f768 00:00 0
 ff9a-ffaf rw-p ff9a 00:00 0
 [stack]
 Aborted (core dumped)


 On Thu, 2011-09-08 at 12:10 -0700, Sunil Mushran wrote:
 http://oss.oracle.com/~smushran/o2image.ppc.dbg

 Use the above executable. Hopefully it won't fault, but if it does,
 email me the backtrace. That trace will be readable, as the executable
 has debugging symbols enabled.

 On 09/07/2011 11:24 PM, Betzos Giorgos wrote:
 # rpm -q ocfs2-tools
 ocfs2-tools-1.4.4-1.el5.ppc

 On Wed, 2011-09-07 at 09:13 -0700, Sunil Mushran wrote:
 version of ocfs2-tools?

 On 09/07/2011 09:10 AM, Betzos Giorgos wrote:
 Hello,

 I tried what you suggested but here is what I got:

 # o2image /dev/mapper/mpath0 /files_shared/u02.o2image
 *** glibc detected *** o2image: corrupted double-linked list: 
 0x10045000 ***
 === Backtrace: =
 /lib/libc.so.6[0xfeb1ab4]
 /lib/libc.so.6(cfree+0xc8)[0xfeb5b68]
 o2image[0x10007bb0]
 o2image[0x10002748]
 o2image[0x10001f50]
 o2image[0x10002334]
 o2image[0x100026a0]
 o2image[0x10001f50]
 o2image[0x10002334]
 o2image[0x100026a0]
 o2image[0x1000358c]
 o2image[0x10003e28]
 /lib/libc.so.6[0xfe4dc60]
 /lib/libc.so.6[0xfe4dea0]
 === Memory map: 
 0010-0012 r-xp 0010 00:00 0 
  [vdso]
 0f55-0f56 r-xp  08:13 2881590   
  /lib/libcom_err.so.2.1
 0f56-0f57 rw-p  08:13 2881590   
  /lib/libcom_err.so.2.1
 0f90-0f9c r-xp  08:13 2881576