Hi,

Our OCFS2 cluster had been stable for about 8 months, but this week things went wrong. First we had high-load problems. A couple of directories had filled up with files, one of them with over 1.5 million files (symlinks), and because the mounts are exported over NFS this caused high load. Directory listing was no longer possible. I cleaned up the directories, after which the load returned to normal and everything seemed to be fine.
But within a day our customer reported that files kept disappearing. Those files were not in the directories that I had cleaned up, but spread randomly across the filesystem. There are also files that are no longer accessible, and a read-only fsck showed some inode errors. We have 3 OCFS2 filesystems mounted and 2 of them had problems.

Last night I brought down the cluster, unmounted the filesystems and ran a filesystem check. The 2 affected filesystems reported several errors like:

[DIRENT_INODE_FREE] Directory entry 'f5377cd11ee628fe7c76c7f5b47f3bee.jpg' refers to inode number 811823124 which isn't allocated, clear the entry? <y> y
[INODE_ORPHANED] Inode 800661759 was found in the orphan directory. Delete its contents and unlink it? <y> y

I fixed the 2 filesystems that had problems and decided to also check the third filesystem, which had no problems. That is where something went terribly wrong. The first error was like this:

[SUPERBLOCK_CLUSTERS] Superblock has clusters set to 40959872 instead of 999936 recorded in global_bitmap, it may be caused by an unsuccessful resize. Trust global_bitmap? <y>

I think I gave the wrong answer. After that came a lot of inode errors, and when fsck finished there was no data left! Also, after a remount the filesystem is no longer 2.5 TB but 500 GB.

LVM is used to create a 2.5 TB filesystem from one 2 TB LUN and one 500 GB LUN:

VG Size 2.44 TB

But fdisk says:

Disk /dev/mapper/vg04-FS1: 485.3 GB, 485322915840 bytes

OCFS2:

number of blocks: 118702080
bytes per block: 4096
number of clusters: 7418880
bytes per cluster: 65536

After that I tried:

tunefs.ocfs2 -S /dev/vg04/FS1
tunefs.ocfs2 1.4.1
tunefs.ocfs2: Cannot shrink volume size from 118702080 blocks to 118487040 blocks
tunefs.ocfs2: Nothing to do. Exiting.

But that had no effect. Is there anything I can do to fix this? I have tried a lot of things, without success.
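To cross-check the sizes quoted above, the cluster and block counts can be converted to bytes. This is only a minimal sketch: it assumes the 4096-byte block size and 65536-byte cluster size reported for the volume apply to every figure, including the cluster counts in the SUPERBLOCK_CLUSTERS message, and the helper names are made up for illustration.

```python
# All figures are copied from the fsck.ocfs2, fdisk and tunefs.ocfs2 output above.
BLOCK_SIZE = 4096       # bytes per block (as reported for the volume)
CLUSTER_SIZE = 65536    # bytes per cluster (assumed valid for all cluster counts)
FDISK_BYTES = 485322915840  # size fdisk reports for /dev/mapper/vg04-FS1


def clusters_to_bytes(clusters):
    """Convert a cluster count to bytes."""
    return clusters * CLUSTER_SIZE


def blocks_to_bytes(blocks):
    """Convert a block count to bytes."""
    return blocks * BLOCK_SIZE


def gib(n_bytes):
    """Express a byte count in GiB (2**30 bytes)."""
    return n_bytes / 2**30


sizes = {
    "superblock clusters (40959872)": clusters_to_bytes(40959872),
    "global_bitmap clusters (999936)": clusters_to_bytes(999936),
    "current clusters (7418880)": clusters_to_bytes(7418880),
    "current blocks (118702080)": blocks_to_bytes(118702080),
    "tunefs target blocks (118487040)": blocks_to_bytes(118487040),
    "fdisk size of vg04-FS1": FDISK_BYTES,
}

for label, n_bytes in sizes.items():
    print(f"{label:34s} {n_bytes:>15d} bytes {gib(n_bytes):10.2f} GiB")
```

With these numbers, the 118487040 blocks that tunefs.ocfs2 names as its target come out to exactly the 485322915840 bytes that fdisk reports for the device, while the 40959872 clusters from the superblock come out to about 2.44 TiB, in line with the VG size.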
I also tried a new kernel (2.6.29.3), but after booting and mounting it crashed (dm-17 is NOT the corrupted third filesystem, but the second one, which no longer had any problems):

May 15 23:47:31 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #664384: signature = u^P£è\¼z
May 15 23:47:31 fileserver-1 kernel:
May 15 23:47:31 fileserver-1 kernel: File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.
May 15 23:47:31 fileserver-1 kernel: (14610,1):ocfs2_read_locked_inode:466 ERROR: status = -22
May 15 23:47:31 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:31 fileserver-1 kernel: (14613,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:31 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #659658: signature = ^Bu^SÔ¤\237\235
May 15 23:47:31 fileserver-1 kernel:
May 15 23:47:31 fileserver-1 kernel: (14606,1):ocfs2_read_locked_inode:466 ERROR: status = -22
May 15 23:47:31 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:32 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:32 fileserver-1 kernel: (14613,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:32 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:32 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:32 fileserver-1 kernel: (14612,1):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:32 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:32 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:32 fileserver-1 kernel: (14611,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:33 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:33 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:33 fileserver-1 kernel: (14612,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:33 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:33 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:34 fileserver-1 kernel: (14611,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:34 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #664384: signature = u^P£è\¼z
May 15 23:47:34 fileserver-1 kernel:
May 15 23:47:34 fileserver-1 kernel: (14613,1):ocfs2_read_locked_inode:466 ERROR: status = -22
May 15 23:47:34 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #659658: signature = ^Bu^SÔ¤\237\235
May 15 23:47:34 fileserver-1 kernel:
May 15 23:47:34 fileserver-1 kernel: (14610,1):ocfs2_read_locked_inode:466 ERROR: status = -22
May 15 23:47:34 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #664384: signature = u^P£è\¼z
May 15 23:47:34 fileserver-1 kernel:
May 15 23:47:34 fileserver-1 kernel: (14612,1):ocfs2_read_locked_inode:466 ERROR: status = -22
May 15 23:47:34 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:34 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:34 fileserver-1 kernel: (14611,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:34 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:34 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:34 fileserver-1 kernel: (14612,1):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:59:30 fileserver-1 kernel: (10887,0):ocfs2_orphan_del:1978 ERROR: status = -2
May 15 23:59:30 fileserver-1 kernel: (10887,0):ocfs2_remove_inode:619 ERROR: status = -2
May 15 23:59:30 fileserver-1 kernel: (10887,0):ocfs2_wipe_inode:753 ERROR: status = -2
May 15 23:59:30 fileserver-1 kernel: (10887,0):ocfs2_delete_inode:990 ERROR: status = -2
May 16 00:28:39 fileserver-1 kernel: ocfs2_dlm: Nodes in domain ("296B7CF537094A9BA5F193A426D92440"): 0
May 16 00:40:19 fileserver-1 kernel: ------------[ cut here ]------------
May 16 00:40:19 fileserver-1 kernel: kernel BUG at fs/ocfs2/inode.c:244!
May 16 00:40:19 fileserver-1 kernel: invalid opcode: 0000 [#1] SMP
May 16 00:40:19 fileserver-1 kernel: last sysfs file: /sys/fs/o2cb/interface_revision
May 16 00:40:19 fileserver-1 kernel: Modules linked in: ocfs2 jbd2 xt_multiport nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs dm_round_robin scsi_dh_rdac dm_multipath dm_mod scsi_dh qla2xxx
May 16 00:40:19 fileserver-1 kernel:
May 16 00:40:19 fileserver-1 kernel: Pid: 14609, comm: nfsd Not tainted (2.6.29.3-amd-mods-qla2xxx-mpath-fw-cluster-hm64 #1) Sun Fire V40z
May 16 00:40:19 fileserver-1 kernel: EIP: 0060:[<fa8c2580>] EFLAGS: 00010246 CPU: 0
May 16 00:40:19 fileserver-1 kernel: EIP is at ocfs2_populate_inode+0x550/0x560 [ocfs2]
May 16 00:40:19 fileserver-1 kernel: EAX: 00000000 EBX: f49ae000 ECX: 00000000 EDX: fa9002aa
May 16 00:40:19 fileserver-1 kernel: ESI: e44eddfc EDI: f66f1000 EBP: f2821cb8 ESP: f2821c6c
May 16 00:40:19 fileserver-1 kernel: DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
May 16 00:40:19 fileserver-1 kernel: Process nfsd (pid: 14609, ti=f2820000 task=f6660080 task.ti=f2820000)
May 16 00:40:19 fileserver-1 kernel: Stack:
May 16 00:40:19 fileserver-1 kernel: 00000001 00000000 e44eda80 00000000 00000000 e44eddfc 00000001 f2821cac
May 16 00:40:19 fileserver-1 kernel: f2821cf4 00000001 f2821cb8 00000000 00000001 f2821cac 00000000 fa8c07f0
May 16 00:40:19 fileserver-1 kernel: f66f1000 e44eddfc 00000001 f2821d04 fa8c2b7b 00000000 f2821ce0 f3d0b0c0
May 16 00:40:19 fileserver-1 kernel: Call Trace:
May 16 00:40:19 fileserver-1 kernel: [<fa8c07f0>] ? ocfs2_validate_inode_block+0x0/0x280 [ocfs2]
May 16 00:40:19 fileserver-1 kernel: [<fa8c2b7b>] ? ocfs2_iget+0x5eb/0x930 [ocfs2]
May 16 00:40:19 fileserver-1 kernel: [<fa8b708a>] ? ocfs2_get_dentry+0x9a/0x1e0 [ocfs2]
May 16 00:40:19 fileserver-1 kernel: [<c04d80d2>] ? skb_copy_datagram_iovec+0x132/0x1d0
May 16 00:40:19 fileserver-1 kernel: [<fa8b7277>] ? ocfs2_fh_to_dentry+0x47/0x60 [ocfs2]
May 16 00:40:19 fileserver-1 kernel: [<c0251cc5>] ? exportfs_decode_fh+0x35/0x1f0
May 16 00:40:19 fileserver-1 kernel: [<c02c470f>] ? security_task_setgroups+0xf/0x20
May 16 00:40:19 fileserver-1 kernel: [<c0132de6>] ? set_groups+0x16/0x1f0
May 16 00:40:19 fileserver-1 kernel: [<c057794d>] ? cache_check+0x2d/0x3e0
May 16 00:40:19 fileserver-1 kernel: [<c013305a>] ? groups_alloc+0x3a/0xc0
May 16 00:40:19 fileserver-1 kernel: [<c025babc>] ? nfsd_setuser+0x17c/0x360
May 16 00:40:19 fileserver-1 kernel: [<c0254bca>] ? nfsd_setuser_and_check_port+0x5a/0x60
May 16 00:40:19 fileserver-1 kernel: [<c02599c4>] ? exp_find+0x54/0x80
May 16 00:40:19 fileserver-1 kernel: [<c0259a26>] ? rqst_exp_find+0x36/0xd0
May 16 00:40:19 fileserver-1 kernel: [<c0254fe4>] ? fh_verify+0x414/0x650
May 16 00:40:19 fileserver-1 kernel: [<c02556f0>] ? nfsd_acceptable+0x0/0xe0
May 16 00:40:19 fileserver-1 kernel: [<c011fa3b>] ? default_wake_function+0xb/0x10
May 16 00:40:19 fileserver-1 kernel: [<c057794d>] ? cache_check+0x2d/0x3e0
May 16 00:40:19 fileserver-1 kernel: [<c025d6f9>] ? nfsd3_proc_getattr+0x69/0xe0
May 16 00:40:19 fileserver-1 kernel: [<c025fbb0>] ? nfs3svc_decode_fhandle+0x0/0x40
May 16 00:40:19 fileserver-1 kernel: [<c025fbb0>] ? nfs3svc_decode_fhandle+0x0/0x40
May 16 00:40:19 fileserver-1 kernel: [<c025208a>] ? nfsd_dispatch+0x9a/0x220
May 16 00:40:19 fileserver-1 kernel: [<c0251ff0>] ? nfsd_dispatch+0x0/0x220
May 16 00:40:19 fileserver-1 kernel: [<c057106b>] ? svc_process+0x3eb/0x6c0
May 16 00:40:19 fileserver-1 kernel: [<c0252746>] ? nfsd+0x136/0x240
May 16 00:40:19 fileserver-1 kernel: [<c011c5d8>] ? complete+0x48/0x60
May 16 00:40:19 fileserver-1 kernel: [<c0252610>] ? nfsd+0x0/0x240
May 16 00:40:19 fileserver-1 kernel: [<c0138972>] ? kthread+0x42/0x70
May 16 00:40:19 fileserver-1 kernel: [<c0138930>] ? kthread+0x0/0x70
May 16 00:40:19 fileserver-1 kernel: [<c010389b>] ? kernel_thread_helper+0x7/0x1c
May 16 00:40:19 fileserver-1 kernel: Code: 8f fa 85 d2 ba 20 dc 8f fa 0f 44 c2 89 86 9c 00 00 00 e9 39 ff ff ff 83 8e 44 01 00 00 20 e9 a1 fc ff ff 0f 0b eb fe 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 90 8d b4 26 00 00 00 00 55 89 e5 57 56
May 16 00:40:19 fileserver-1 kernel: EIP: [<fa8c2580>] ocfs2_populate_inode+0x550/0x560 [ocfs2] SS:ESP 0068:f2821c6c
May 16 00:40:19 fileserver-1 kernel: ---[ end trace 3b05f9cfd74396a1 ]---

Could this be a problem with NFS on OCFS2? I went back to my previous kernel, 2.6.25.5, and it seemed to be stable. At this moment I have 2 mounted (production) filesystems and 1 unmounted, corrupted filesystem. This morning I looked in the logs and found errors again! Many like this:

(249,1):ocfs2_orphan_del:1869 ERROR: status = -2
(249,1):ocfs2_remove_inode:610 ERROR: status = -2
(249,1):ocfs2_wipe_inode:736 ERROR: status = -2
(249,1):ocfs2_delete_inode:970 ERROR: status = -2

These came from the 2 filesystems that seemed to be clean last night.

- What can I do to prevent filesystem corruption on my 2 production OCFS2 filesystems and get rid of the above errors?
- Is it possible to fix the corrupted third filesystem?
- What is the most stable kernel (or setup) in my case? I am currently using 2.6.25.5 and have been for the past year. The 2.6.29.3 kernel I tried crashed after a couple of minutes.

Versions:
OS: Debian Etch (4.0)
kernel: custom 2.6.25.5
o2cb_ctl version 1.4.1
ocfs2-tools 1.4.1
OCFS2 DLM 1.5.0
OCFS2 DLMFS 1.5.0

I hope that you can help me with these problems.

Best regards,
Christian van Barneveld

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users