Thanks Steve,

Yes, sadly I can confirm the file system has been corrupted, but I still don't understand why I/O stops flowing at the LVM level (and doesn't fence it either), or why fsck keeps crashing without a useful error message. Is there any signal I can send to gfs_fsck to bypass certain stages?

Also, to speed up the fsck process, I was thinking of making use of the RAM and increasing the read-ahead parameter (hdparm -a) of the PV device (an AoE device) by 1GB, since that should hugely optimize sequential reads, and fscking is mostly a sequential read process with very little writing. What do you think?
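For concreteness, here is a rough sketch of the read-ahead arithmetic I have in mind. It assumes that hdparm -a (and likewise blockdev --setra) takes the read-ahead size as a count of 512-byte sectors, and the device path is only a placeholder rather than our actual PV. The snippet only prints the commands; it does not run them, and whether the kernel or the AoE driver honours a value this large is something I have not verified.

```shell
#!/bin/sh
# Sketch only: work out the sector count for a 1 GiB read-ahead.
# Assumption: hdparm -a and blockdev --setra both count in 512-byte sectors.
readahead_bytes=$(( 1024 * 1024 * 1024 ))   # 1 GiB target
sector_size=512
readahead_sectors=$(( readahead_bytes / sector_size ))
echo "read-ahead sectors: ${readahead_sectors}"

# Hypothetical AoE device path; substitute the real PV device.
pv_dev=/dev/etherd/e0.0

# Print the commands instead of executing them.
echo "would run: hdparm -a ${readahead_sectors} ${pv_dev}"
echo "or:        blockdev --setra ${readahead_sectors} ${pv_dev}"
```

The current value can be read back first with `blockdev --getra` on the same device, so that it can be restored after the fsck run.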
Here is the tail of the last fsck log file:

(metawalk.c:516) Extended attributes exist for inode #34020861.
(metawalk.c:413) Checking EA leaf block #34020862.
(pass1.c:485) Setting block #34020862 to eattr block
(pass1.c:907) Checking metadata block 34020862
(pass1.c:923) Metadata block 34020862 not an inode or free metadata
(pass1.c:907) Checking metadata block 34020863
(link.c:22) Setting link count to 1 for 34020863
(metawalk.c:516) Extended attributes exist for inode #34020863.
(metawalk.c:413) Checking EA leaf block #34020864.
(pass1.c:485) Setting block #34020864 to eattr block
(pass1.c:907) Checking metadata block 34020864
(pass1.c:923) Metadata block 34020864 not an inode or free metadata
(pass1.c:907) Checking metadata block 34020865
(link.c:22) Setting link count to 1 for 34020865
(pass1.c:213) Setting 34020917 to data block
(pass1.c:213) Setting 34020918 to data block
(pass1.c:213) Setting 34020919 to data block
(pass1.c:213) Setting 34020920 to data block
(metawalk.c:516) Extended attributes exist for inode #34020865.
(metawalk.c:413) Checking EA leaf block #34020866.
(pass1.c:485) Setting block #34020866 to eattr block
(pass1.c:907) Checking metadata block 34020866
(pass1.c:923) Metadata block 34020866 not an inode or free metadata
(pass1.c:907) Checking metadata block 34020867

Thanks,
-- Abraham

On 6/07/2010, at 8:22 PM, Steven Whitehouse wrote:

> Hi,
>
> It looks to me as if the fs is corrupt in some manner. Try unmounting on
> all nodes and running fsck on one node on the filesystem. Make sure you
> save the output of fsck in case that is useful for future debugging and
> make sure you have a backup of the data in question first.
>
> It's tricky to say exactly what might have gone wrong (the fsck output
> might give a clue) but you will certainly need fsck to fix whatever the
> problem is,
>
> Steve.
>
> On Tue, 2010-07-06 at 13:22 +1200, Abraham Alawi wrote:
>> The system was running well for a while but lately we had a flaky disk in
>> the RAID array which we replaced with a healthy one but suddenly the
>> CLVM/GFS became unusable, we can mount GFS but while listing it recursively
>> 'ls -R' it hangs with Input/output error, can't even access the c/LVM LUN
>> rawly using 'dd' BUT we still can access the LVM PV devices using 'dd'.
>> Reconfiguring the LVM volume as a local one and accessing it exclusively
>> from one node doesn't make a difference.
>>
>> RHEL5: 2.6.18-164.11.1.el5
>> # modinfo gfs
>> filename: /lib/modules/2.6.18-164.11.1.el5/weak-updates/gfs/gfs.ko
>> license: GPL
>> author: Red Hat, Inc.
>> description: Global File System 0.1.34-2.el5
>> srcversion: 3B1BAC4069F1A4B556A958A
>> depends: dlm
>> vermagic: 2.6.18-159.el5 SMP mod_unload gcc-4.1
>>
>> # uname -r
>> 2.6.18-164.11.1.el5
>>
>> # modinfo /lib/modules/2.6.18-164.11.1.el5/kernel/drivers/block/aoe/aoe.ko
>> filename:
>> /lib/modules/2.6.18-164.11.1.el5/kernel/drivers/block/aoe/aoe.ko
>> description: AoE block/char driver for 2.6.2 and newer 2.6 kernels
>> author: Sam Hopkins <[email protected]>
>> license: GPL
>> srcversion: 42BF122979AC807F2BB50E6
>> depends:
>> vermagic: 2.6.18-164.11.1.el5 SMP mod_unload gcc-4.1
>> parm: aoe_iflist:aoe_iflist=dev1[,dev2...]
>> (string)
>> parm: version:aoe module version 74
>> (string)
>> parm: aoe_dyndevs:Use dynamic minor numbers for devices. (int)
>> parm: aoe_deadsecs:After aoe_deadsecs seconds, give up and fail
>> dev. (int)
>> parm: aoe_maxout:Only aoe_maxout outstanding packets for every MAC
>> on eX.Y. (int)
>> parm: aoe_maxsectors:When nonzero, set the maximum number of
>> sectors per I/O request in new devices. (int)
>>
>> # modinfo dlm
>> filename: /lib/modules/2.6.18-164.11.1.el5/kernel/fs/dlm/dlm.ko
>> license: GPL
>> author: Red Hat, Inc.
>> description: Distributed Lock Manager
>> srcversion: E768995007648CA8DB078AE
>> depends: configfs
>> vermagic: 2.6.18-164.11.1.el5 SMP mod_unload gcc-4.1
>> module_sig:
>> 883f3504b56fe19c59c69348c13cf1f1126a509f6ddaee3965ee8b5fcd04163669647a889a9801e09f722187d1de068c0d52cd2b99bc3d475cb6ca1a0
>>
>>
>>
>> Herein what the kernel spits out:
>>
>> Jul 6 11:27:36 kiwiland kernel: GFS 0.1.34-2.el5 (built Sep 9 2009
>> 06:54:42) installed
>> Jul 6 11:27:36 kiwiland kernel: Lock_DLM (built Sep 9 2009 06:54:38)
>> installed
>> Jul 6 11:27:36 kiwiland kernel: Lock_Nolock (built Sep 9 2009 06:54:37)
>> installed
>> Jul 6 11:27:36 kiwiland kernel: Trying to join cluster "lock_dlm",
>> "FSC:files"
>> Jul 6 11:27:36 kiwiland kernel: Joined cluster. Now mounting FS...
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Trying to
>> acquire journal lock...
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Looking at
>> journal...
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Acquiring the
>> transaction lock...
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Replaying
>> journal...
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Replayed 0 of
>> 11 blocks
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: replays = 0,
>> skips = 4, sames = 7
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Journal
>> replayed in 1s
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Done
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=1: Trying to
>> acquire journal lock...
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=1: Looking at
>> journal...
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=1: Done
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Scanning for log
>> elements...
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Found 2 unlinked
>> inodes
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Found quota changes
>> for 2 IDs
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Done
>> Jul 6 11:27:36 kiwiland kernel: Trying to join cluster "lock_dlm",
>> "FSC:webcluster"
>> Jul 6 11:27:36 kiwiland kernel: Joined cluster. Now mounting FS...
>> Jul 6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=1: Trying
>> to acquire journal lock...
>> Jul 6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=1: Looking
>> at journal...
>> Jul 6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=1: Done
>> Jul 6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Scanning for
>> log elements...
>> Jul 6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Found 0
>> unlinked inodes
>> Jul 6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Found quota
>> changes for 0 IDs
>> Jul 6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Done
>> Jul 6 11:27:37 kiwiland kernel: Installing knfsd (copyright (C) 1996
>> [email protected]).
>> Jul 6 11:27:39 kiwiland kernel: NFSD: Using /var/lib/nfs/v4recovery as the
>> NFSv4 state recovery directory
>> Jul 6 11:27:39 kiwiland kernel: NFSD: starting 90-second grace period
>> Jul 6 11:32:21 kiwiland kernel: dlm: closing connection to node 1
>> Jul 6 11:33:01 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Trying
>> to acquire journal lock...
>> Jul 6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: fatal: invalid
>> metadata block
>> Jul 6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: bh = 1432543247
>> (magic)
>> Jul 6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: function =
>> gfs_rgrp_read
>> Jul 6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: file =
>> /builddir/build/BUILD/gfs-kmod-0.1.34/_kmod_build_/src/gfs/rgrp.c, line = 830
>> Jul 6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: time = 1278372781
>> Jul 6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: about to withdraw
>> from the cluster
>> Jul 6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: telling LM to
>> withdraw
>> Jul 6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Looking
>> at journal...
>> Jul 6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0:
>> Acquiring the transaction lock...
>> Jul 6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0:
>> Replaying journal...
>> Jul 6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Replayed
>> 0 of 0 blocks
>> Jul 6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: replays
>> = 0, skips = 0, sames = 0
>> Jul 6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Journal
>> replayed in 1s
>> Jul 6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Done
>> Jul 6 11:33:02 kiwiland kernel: GFS: fsid=FSC:files.0: withdrawn
>> Jul 6 11:33:02 kiwiland kernel:
>> Jul 6 11:33:02 kiwiland kernel: Call Trace:
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff88805018>]
>> :gfs:gfs_lm_withdraw+0xc4/0xd3
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff80063a36>]
>> __wait_on_bit+0x60/0x6e
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8001538b>] sync_buffer+0x0/0x3f
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff80063ab0>]
>> out_of_line_wait_on_bit+0x6c/0x78
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff800a00e5>]
>> wake_bit_function+0x0/0x23
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8881cc97>]
>> :gfs:gfs_meta_check_ii+0x32/0x3e
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff88819439>]
>> :gfs:gfs_rgrp_read+0x139/0x225
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff887fb8e8>]
>> :gfs:glock_wait_internal+0x229/0x2c3
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff887fbd17>]
>> :gfs:gfs_glock_nq+0x395/0x3d6
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff887fbd6e>]
>> :gfs:gfs_glock_nq_init+0x16/0x2a
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff88817466>]
>> :gfs:gfs_rgrp_lvb_init+0x1e/0x3f
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8881a46f>]
>> :gfs:gfs_stat_gfs+0x213/0x273
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8881353d>]
>> :gfs:gfs_statfs+0x67/0xea
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff800deba3>] vfs_statfs+0x63/0x7f
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8886d2ce>]
>> :nfsd:nfsd_statfs+0x28/0x38
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff888745f8>]
>> :nfsd:nfsd3_proc_fsstat+0x3f/0x54
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8886a1db>]
>> :nfsd:nfsd_dispatch+0xd8/0x1d6
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff886e0529>]
>> :sunrpc:svc_process+0x454/0x71b
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff80064644>] __down_read+0x12/0x92
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8886a5a1>] :nfsd:nfsd+0x0/0x2cb
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8886a746>] :nfsd:nfsd+0x1a5/0x2cb
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8886a5a1>] :nfsd:nfsd+0x0/0x2cb
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8886a5a1>] :nfsd:nfsd+0x0/0x2cb
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
>> Jul 6 11:33:02 kiwiland kernel:
>>
>>
>> Another kernel spit out:
>> Jul 5 02:01:19 Hercules kernel: GFS: fsid=FSC:files.0: fast statfs start
>> time = 1278252079
>> Jul 5 03:01:16 Hercules kernel: GFS: fsid=FSC:files.0: fast statfs start
>> time = 1278255676
>> Jul 5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: fatal: invalid
>> metadata block
>> Jul 5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: bh = 86700288
>> (magic)
>> Jul 5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: function =
>> gfs_get_meta_buffer
>> Jul 5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: file =
>> /builddir/build/BUILD/gfs-kmod-0.1.34/_kmod_build_/src/gfs/dio.c, line = 1225
>> Jul 5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: time = 1278255737
>> Jul 5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: about to withdraw
>> from the cluster
>> Jul 5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: telling LM to
>> withdraw
>> Jul 5 03:02:21 Hercules kernel: GFS: fsid=FSC:files.0: withdrawn
>> Jul 5 03:02:21 Hercules kernel:
>> Jul 5 03:02:21 Hercules kernel: Call Trace:
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8880a018>]
>> :gfs:gfs_lm_withdraw+0xc4/0xd3
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8001538b>] sync_buffer+0x0/0x3f
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff80063ab0>]
>> out_of_line_wait_on_bit+0x6c/0x78
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff800a00e5>]
>> wake_bit_function+0x0/0x23
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff88821c97>]
>> :gfs:gfs_meta_check_ii+0x32/0x3e
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff887f7717>]
>> :gfs:gfs_get_meta_buffer+0x1d1/0x247
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff88804193>]
>> :gfs:gfs_copyin_dinode+0x1d/0x12f
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff88800d6e>]
>> :gfs:gfs_glock_nq_init+0x16/0x2a
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff888043e3>]
>> :gfs:inode_create+0x13e/0x1df
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff88804a5d>]
>> :gfs:gfs_inode_get+0x9d/0xba
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff888053bb>]
>> :gfs:gfs_lookupi+0x33d/0x3df
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff887fce57>]
>> :gfs:ea_find_i+0x0/0x6b
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff888172af>]
>> :gfs:gfs_lookup+0x363/0x41a
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff80025426>] igrab+0x25/0x34
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff888055a0>]
>> :gfs:gfs_iget+0x3d/0x1f1
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff88801224>]
>> :gfs:gfs_glock_dq+0x13c/0x14b
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8000cf01>] do_lookup+0xe5/0x1e6
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8000a22b>]
>> __link_path_walk+0xa01/0xf42
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8000e9cc>]
>> link_path_walk+0x42/0xb2
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8000cc9c>]
>> do_path_lookup+0x275/0x2f1
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff80012752>] getname+0x15b/0x1c2
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff800236ba>]
>> __user_walk_fd+0x37/0x4c
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8003f235>] vfs_lstat_fd+0x18/0x47
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8002a95a>] sys_newlstat+0x19/0x31
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8005dde9>] error_exit+0x0/0x84
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8005d116>] system_call+0x7e/0x83
>>
>>
>> Thanks in advance,
>>
>> -- Abraham
>>
>> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
>> Abraham Alawi
>>
>> Unix/Linux Systems Administrator
>> Science IT
>> University of Auckland
>> e: [email protected]
>> p: +64-9-373 7599, ext#: 87572
>>
>> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
>>
>>
>> --
>> Linux-cluster mailing list
>> [email protected]
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> --
> Linux-cluster mailing list
> [email protected]
> https://www.redhat.com/mailman/listinfo/linux-cluster

''''''''''''''''''''''''''''''''''''''''''''''''''''''
Abraham Alawi

Unix/Linux Systems Administrator
Science IT
University of Auckland
e: [email protected]
p: +64-9-373 7599, ext#: 87572

''''''''''''''''''''''''''''''''''''''''''''''''''''''

--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
