Re: [Ocfs2-users] dlm locking
o2image is only useful for debugging. It allows us to get a copy of the file system on which we can test fsck in-house. The files in lost+found have to be resolved manually: if they are junk, delete them; if they are useful, move them to another directory.

On 11/11/2011 05:36 PM, Nick Khamis wrote:

All Fixed! Just a few questions. Is there any documentation on how to diagnose an OCFS2 filesystem, covering:

* How to transfer an image file onto a different machine for testing, as you did with o2image.out
* Whether fsck.ocfs2 -fy /dev/loop0 pretty much fixes all the common problems
* What can be done with the files in lost+found

Thanks Again,
Nick.

On Fri, Nov 11, 2011 at 8:02 PM, Sunil Mushran sunil.mush...@oracle.com wrote:

So it detected one cluster that was doubly allocated, and it fixed it. Details below. The other fixes could be because the o2image was taken on a live volume. As to how this could happen... I would look at the storage.

# fsck.ocfs2 -fy /dev/loop0
fsck.ocfs2 1.6.3
Checking OCFS2 filesystem in /dev/loop0:
  Label:              AsteriskServer
  UUID:               3A791AB36DED41008E58CEF52EBEEFD3
  Number of blocks:   592384
  Block size:         4096
  Number of clusters: 592384
  Cluster size:       4096
  Number of slots:    2

/dev/loop0 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
Duplicate clusters detected. Pass 1b will be run
Running additional passes to resolve clusters claimed by more than one inode...
Pass 1b: Determining ownership of multiply-claimed clusters
Pass 1c: Determining the names of inodes owning multiply-claimed clusters
Pass 1d: Reconciling multiply-claimed clusters
Cluster 161335 is claimed by the following inodes:
  /asterisk/extensions.conf
  /moh/macroform-cold_day.wav
[DUP_CLUSTERS_CLONE] Inode /asterisk/extensions.conf may be cloned or deleted to break the claim it has on its clusters. Clone inode /asterisk/extensions.conf to break claims on clusters it shares with other inodes? y
[DUP_CLUSTERS_CLONE] Inode /moh/macroform-cold_day.wav may be cloned or deleted to break the claim it has on its clusters. Clone inode /moh/macroform-cold_day.wav to break claims on clusters it shares with other inodes? y
Pass 2: Checking directory entries.
[DIRENT_INODE_FREE] Directory entry 'musiconhold.conf' refers to inode number 35348 which isn't allocated, clear the entry? y
Pass 3: Checking directory connectivity.
[LOSTFOUND_MISSING] /lost+found does not exist. Create it so that we can possibly fill it with orphaned inodes? y
Pass 4a: checking for orphaned inodes
Pass 4b: Checking inodes link counts.
[INODE_COUNT] Inode 96783 has a link count of 1 on disk but directory entry references come to 2. Update the count on disk to match? y
[INODE_NOT_CONNECTED] Inode 96784 isn't referenced by any directory entries. Move it to lost+found? y
[INODE_NOT_CONNECTED] Inode 96785 isn't referenced by any directory entries. Move it to lost+found? y
[INODE_NOT_CONNECTED] Inode 96794 isn't referenced by any directory entries. Move it to lost+found? y
[INODE_NOT_CONNECTED] Inode 96796 isn't referenced by any directory entries. Move it to lost+found? y
All passes succeeded.

Slot 0's journal dirty flag removed
Slot 1's journal dirty flag removed

[root@ca-test92 ocfs2]# fsck.ocfs2 -fy /dev/loop0
fsck.ocfs2 1.6.3
Checking OCFS2 filesystem in /dev/loop0:
  Label:              AsteriskServer
  UUID:               3A791AB36DED41008E58CEF52EBEEFD3
  Number of blocks:   592384
  Block size:         4096
  Number of clusters: 592384
  Cluster size:       4096
  Number of slots:    2

/dev/loop0 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
Pass 2: Checking directory entries.
Pass 3: Checking directory connectivity.
Pass 4a: checking for orphaned inodes
Pass 4b: Checking inodes link counts.
All passes succeeded.
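The transfer-for-testing workflow Nick asks about is not spelled out in the thread; one plausible sequence, based on the o2image(8) and fsck.ocfs2(8) tools used above, is sketched here. The scratch-file path, the loop device, the testbox host name, and the use of o2image's -I (restore) mode are illustrative assumptions, not commands taken from the thread.

  # on the node with the problem volume: capture a metadata image
  o2image /dev/drbd0 /tmp/o2image.out

  # copy the image to the test machine
  scp /tmp/o2image.out testbox:/tmp/

  # on the test machine: create a scratch file the size of the original
  # device (blocks * block size from the fsck output above), attach it to
  # a loop device, and restore the captured metadata into it
  truncate -s $((592384*4096)) /tmp/ocfs2-scratch.img
  losetup /dev/loop0 /tmp/ocfs2-scratch.img
  o2image -I /dev/loop0 /tmp/o2image.out

  # fsck the scratch copy without touching the real volume
  fsck.ocfs2 -fy /dev/loop0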
Re: [Ocfs2-users] dlm locking
All Fixed! Just a few questions. Is there any documentation on how to diagnose an OCFS2 filesystem, covering:

* How to transfer an image file onto a different machine for testing, as you did with o2image.out
* Whether fsck.ocfs2 -fy /dev/loop0 pretty much fixes all the common problems
* What can be done with the files in lost+found

Thanks Again,
Nick.

On Fri, Nov 11, 2011 at 8:02 PM, Sunil Mushran sunil.mush...@oracle.com wrote:

So it detected one cluster that was doubly allocated, and it fixed it. Details below. The other fixes could be because the o2image was taken on a live volume. As to how this could happen... I would look at the storage.

# fsck.ocfs2 -fy /dev/loop0
fsck.ocfs2 1.6.3
Checking OCFS2 filesystem in /dev/loop0:
  Label:              AsteriskServer
  UUID:               3A791AB36DED41008E58CEF52EBEEFD3
  Number of blocks:   592384
  Block size:         4096
  Number of clusters: 592384
  Cluster size:       4096
  Number of slots:    2

/dev/loop0 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
Duplicate clusters detected. Pass 1b will be run
Running additional passes to resolve clusters claimed by more than one inode...
Pass 1b: Determining ownership of multiply-claimed clusters
Pass 1c: Determining the names of inodes owning multiply-claimed clusters
Pass 1d: Reconciling multiply-claimed clusters
Cluster 161335 is claimed by the following inodes:
  /asterisk/extensions.conf
  /moh/macroform-cold_day.wav
[DUP_CLUSTERS_CLONE] Inode /asterisk/extensions.conf may be cloned or deleted to break the claim it has on its clusters. Clone inode /asterisk/extensions.conf to break claims on clusters it shares with other inodes? y
[DUP_CLUSTERS_CLONE] Inode /moh/macroform-cold_day.wav may be cloned or deleted to break the claim it has on its clusters. Clone inode /moh/macroform-cold_day.wav to break claims on clusters it shares with other inodes? y
Pass 2: Checking directory entries.
[DIRENT_INODE_FREE] Directory entry 'musiconhold.conf' refers to inode number 35348 which isn't allocated, clear the entry? y
Pass 3: Checking directory connectivity.
[LOSTFOUND_MISSING] /lost+found does not exist. Create it so that we can possibly fill it with orphaned inodes? y
Pass 4a: checking for orphaned inodes
Pass 4b: Checking inodes link counts.
[INODE_COUNT] Inode 96783 has a link count of 1 on disk but directory entry references come to 2. Update the count on disk to match? y
[INODE_NOT_CONNECTED] Inode 96784 isn't referenced by any directory entries. Move it to lost+found? y
[INODE_NOT_CONNECTED] Inode 96785 isn't referenced by any directory entries. Move it to lost+found? y
[INODE_NOT_CONNECTED] Inode 96794 isn't referenced by any directory entries. Move it to lost+found? y
[INODE_NOT_CONNECTED] Inode 96796 isn't referenced by any directory entries. Move it to lost+found? y
All passes succeeded.

Slot 0's journal dirty flag removed
Slot 1's journal dirty flag removed

[root@ca-test92 ocfs2]# fsck.ocfs2 -fy /dev/loop0
fsck.ocfs2 1.6.3
Checking OCFS2 filesystem in /dev/loop0:
  Label:              AsteriskServer
  UUID:               3A791AB36DED41008E58CEF52EBEEFD3
  Number of blocks:   592384
  Block size:         4096
  Number of clusters: 592384
  Cluster size:       4096
  Number of slots:    2

/dev/loop0 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
Pass 2: Checking directory entries.
Pass 3: Checking directory connectivity.
Pass 4a: checking for orphaned inodes
Pass 4b: Checking inodes link counts.
All passes succeeded.
Re: [Ocfs2-users] dlm locking
Do:

# fsck.ocfs2 -f /dev/...

Without -f, it only replays the journal.

On 11/09/2011 05:49 PM, Nick Khamis wrote:

Hello Sunil,

This is only on the prototype so it's not crucial; however, it would be nice to figure out why, for future reference:

fsck.ocfs2 /dev/drbd0
fsck.ocfs2 1.6.4
Checking OCFS2 filesystem in /dev/drbd0:
  Label:              AsteriskServer
  UUID:               3A791AB36DED41008E58CEF52EBEEFD3
  Number of blocks:   592384
  Block size:         4096
  Number of clusters: 592384
  Cluster size:       4096
  Number of slots:    2

/dev/drbd0 is clean. It will be checked after 20 additional mounts.

I can mount it and write to it just fine (read and write). It's just when I start the application that reads from the filesystem (I don't think there is any writing going on) that it goes into read-only mode... It used to work; other than the update to 1.6.4 I am not sure what I have changed. Not quite sure what kind of information you would need to help figure out the problem?

Cheers,
Nick.
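For reference, the three invocations that appear in this thread behave differently. The volume should be unmounted on every node before any of them is run, as the kernel message later in the thread also notes:

  # default: replays the journals and exits early if the volume is marked clean
  fsck.ocfs2 /dev/drbd0

  # -f: force a full check of all metadata even if the volume looks clean
  fsck.ocfs2 -f /dev/drbd0

  # -fy: forced check, answering "yes" to every repair prompt
  fsck.ocfs2 -fy /dev/drbd0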
Re: [Ocfs2-users] dlm locking
Hello Sunil,

Thank you so much for your time, and I do not want to take any more of it. I ran fsck with -f and have the following:

fsck.ocfs2 -f /dev/drbd0
fsck.ocfs2 1.6.4
Checking OCFS2 filesystem in /dev/drbd0:
  Label:              ASTServer
  UUID:               3A791AB36DED41008E58CEF52EBEEFD3
  Number of blocks:   592384
  Block size:         4096
  Number of clusters: 592384
  Cluster size:       4096
  Number of slots:    2

/dev/drbd0 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
Duplicate clusters detected. Pass 1b will be run
Running additional passes to resolve clusters claimed by more than one inode...
Pass 1b: Determining ownership of multiply-claimed clusters
pass1b: Inode type does not contain extents while processing inode 5
fsck.ocfs2: Inode type does not contain extents while performing pass 1

Not sure if the read-only is due to the detected duplicate?

Thanks in Advance,
Nick.
Re: [Ocfs2-users] dlm locking
The ro issue was different. It appears the volume has more problems. If you want me to look at the issue, I'll need the image of the volume.

# o2image /dev/device /tmp/o2image.out

On 11/10/2011 01:55 PM, Nick Khamis wrote:

Hello Sunil,

Thank you so much for your time, and I do not want to take any more of it. I ran fsck with -f and have the following:

fsck.ocfs2 -f /dev/drbd0
fsck.ocfs2 1.6.4
Checking OCFS2 filesystem in /dev/drbd0:
  Label:              ASTServer
  UUID:               3A791AB36DED41008E58CEF52EBEEFD3
  Number of blocks:   592384
  Block size:         4096
  Number of clusters: 592384
  Cluster size:       4096
  Number of slots:    2

/dev/drbd0 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
Duplicate clusters detected. Pass 1b will be run
Running additional passes to resolve clusters claimed by more than one inode...
Pass 1b: Determining ownership of multiply-claimed clusters
pass1b: Inode type does not contain extents while processing inode 5
fsck.ocfs2: Inode type does not contain extents while performing pass 1

Not sure if the read-only is due to the detected duplicate?

Thanks in Advance,
Nick.
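Since o2image copies only the filesystem metadata, the image is typically far smaller than the volume and compresses well. A hedged sketch of capturing and packing it for upload (paths and the use of gzip are illustrative assumptions, not instructions from the thread):

  o2image /dev/drbd0 /tmp/o2image.out
  gzip -9 /tmp/o2image.out    # produces /tmp/o2image.out.gz for transfer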
[Ocfs2-users] dlm locking
Hello Everyone,

For the first time I experienced a dlm lock:

[ 9721.831813] OCFS2 DLM 1.5.0
[ 9721.917032] ocfs2: Registered cluster interface o2cb
[ 9722.170848] OCFS2 DLMFS 1.5.0
[ 9722.179018] OCFS2 User DLM kernel interface loaded
[ 9755.743195] ocfs2_dlm: Nodes in domain (3A791AB36DED41008E58CEF52EBEEFD3): 1
[ 9755.852798] ocfs2: Mounting device (147,0) on (node 1, slot 0) with ordered data mode.
[ 9783.240424] block drbd0: Handshake successful: Agreed network protocol version 91
[ 9783.242922] block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
[ 9783.243074] block drbd0: conn( WFConnection -> WFReportParams )
[ 9783.243205] block drbd0: Starting asender thread (from drbd0_receiver [4390])
[ 9783.271014] block drbd0: data-integrity-alg: not-used
[ 9783.271298] block drbd0: drbd_sync_handshake:
[ 9783.271318] block drbd0: self 964FFEDA732A512B:0ABD16D2597E52D9:54E3AEC293CEDC7E:120384BD0E3A5705 bits:3 flags:0
[ 9783.271342] block drbd0: peer B4C81B0FD76EFAC2:0ABD16D2597E52D9:54E3AEC293CEDC7F:120384BD0E3A5705 bits:0 flags:0
[ 9783.271364] block drbd0: uuid_compare()=100 by rule 90
[ 9783.271380] block drbd0: Split-Brain detected, 1 primaries, automatically solved. Sync from this node
[ 9783.271417] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
[ 9783.399967] block drbd0: peer( Secondary -> Primary )
[ 9783.515979] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent )
[ 9783.522521] block drbd0: Began resync as SyncSource (will sync 12 KB [3 bits set]).
[ 9783.629758] block drbd0: Implicitly set pdsk Inconsistent!
[ 9783.799387] block drbd0: Resync done (total 1 sec; paused 0 sec; 12 K/sec)
[ 9783.799956] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
[ 9795.430801] o2net: accepted connection from node astdrbd2 (num 2) at 192.168.2.111:
[ 9800.231650] ocfs2_dlm: Node 2 joins domain 3A791AB36DED41008E58CEF52EBEEFD3
[ 9800.231668] ocfs2_dlm: Nodes in domain (3A791AB36DED41008E58CEF52EBEEFD3): 1 2
[ 9861.922744] OCFS2: ERROR (device drbd0): ocfs2_validate_inode_block: Invalid dinode #35348: OCFS2_VALID_FL not set
[ 9861.922767]
[ 9861.927278] File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.
[ 9861.928231] (8009,0):ocfs2_read_locked_inode:496 ERROR: status = -22

Not sure where to start, but with your appreciated help I am sure we can get it resolved.

Thanks in Advance,
Nick.
Re: [Ocfs2-users] dlm locking
This has nothing to do with the dlm. The error states that the fs encountered a bad inode on disk, which points to possible disk corruption. On encountering it, the fs goes read-only and asks the user to run fsck.

On 11/09/2011 11:51 AM, Nick Khamis wrote:

Hello Everyone,

For the first time I experienced a dlm lock:

[ 9721.831813] OCFS2 DLM 1.5.0
[ 9721.917032] ocfs2: Registered cluster interface o2cb
[ 9722.170848] OCFS2 DLMFS 1.5.0
[ 9722.179018] OCFS2 User DLM kernel interface loaded
[ 9755.743195] ocfs2_dlm: Nodes in domain (3A791AB36DED41008E58CEF52EBEEFD3): 1
[ 9755.852798] ocfs2: Mounting device (147,0) on (node 1, slot 0) with ordered data mode.
[ 9783.240424] block drbd0: Handshake successful: Agreed network protocol version 91
[ 9783.242922] block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
[ 9783.243074] block drbd0: conn( WFConnection -> WFReportParams )
[ 9783.243205] block drbd0: Starting asender thread (from drbd0_receiver [4390])
[ 9783.271014] block drbd0: data-integrity-alg: not-used
[ 9783.271298] block drbd0: drbd_sync_handshake:
[ 9783.271318] block drbd0: self 964FFEDA732A512B:0ABD16D2597E52D9:54E3AEC293CEDC7E:120384BD0E3A5705 bits:3 flags:0
[ 9783.271342] block drbd0: peer B4C81B0FD76EFAC2:0ABD16D2597E52D9:54E3AEC293CEDC7F:120384BD0E3A5705 bits:0 flags:0
[ 9783.271364] block drbd0: uuid_compare()=100 by rule 90
[ 9783.271380] block drbd0: Split-Brain detected, 1 primaries, automatically solved. Sync from this node
[ 9783.271417] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
[ 9783.399967] block drbd0: peer( Secondary -> Primary )
[ 9783.515979] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent )
[ 9783.522521] block drbd0: Began resync as SyncSource (will sync 12 KB [3 bits set]).
[ 9783.629758] block drbd0: Implicitly set pdsk Inconsistent!
[ 9783.799387] block drbd0: Resync done (total 1 sec; paused 0 sec; 12 K/sec)
[ 9783.799956] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
[ 9795.430801] o2net: accepted connection from node astdrbd2 (num 2) at 192.168.2.111:
[ 9800.231650] ocfs2_dlm: Node 2 joins domain 3A791AB36DED41008E58CEF52EBEEFD3
[ 9800.231668] ocfs2_dlm: Nodes in domain (3A791AB36DED41008E58CEF52EBEEFD3): 1 2
[ 9861.922744] OCFS2: ERROR (device drbd0): ocfs2_validate_inode_block: Invalid dinode #35348: OCFS2_VALID_FL not set
[ 9861.922767]
[ 9861.927278] File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.
[ 9861.928231] (8009,0):ocfs2_read_locked_inode:496 ERROR: status = -22

Not sure where to start, but with your appreciated help I am sure we can get it resolved.

Thanks in Advance,
Nick.
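The remount-to-read-only reaction Sunil describes is OCFS2's default error policy, which can be tuned with the errors= mount option. A brief hedged sketch; the /mnt/asterisk mount point is an illustrative assumption, not a path from the thread:

  # default behaviour: remount read-only when on-disk corruption is detected
  mount -t ocfs2 -o errors=remount-ro /dev/drbd0 /mnt/asterisk

  # alternative: panic the node on error instead of continuing read-only
  mount -t ocfs2 -o errors=panic /dev/drbd0 /mnt/asterisk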
Re: [Ocfs2-users] dlm locking
Hello Sunil,

This is only on the prototype so it's not crucial; however, it would be nice to figure out why, for future reference:

fsck.ocfs2 /dev/drbd0
fsck.ocfs2 1.6.4
Checking OCFS2 filesystem in /dev/drbd0:
  Label:              AsteriskServer
  UUID:               3A791AB36DED41008E58CEF52EBEEFD3
  Number of blocks:   592384
  Block size:         4096
  Number of clusters: 592384
  Cluster size:       4096
  Number of slots:    2

/dev/drbd0 is clean. It will be checked after 20 additional mounts.

I can mount it and write to it just fine (read and write). It's just when I start the application that reads from the filesystem (I don't think there is any writing going on) that it goes into read-only mode... It used to work; other than the update to 1.6.4 I am not sure what I have changed. Not quite sure what kind of information you would need to help figure out the problem?

Cheers,
Nick.
Re: [Ocfs2-users] dlm locking bug?
Log what you have in a bz. I can take a look. I doubt you will be able to attach that file, though; you'll need to provide me with a link.

On 09/02/2011 07:28 AM, Sérgio Surkamp wrote:

Hello,

We hit a problem this morning with our cluster.

Cluster setup:

Servers:
* Two R800 Dell servers running CentOS 5.5 and ULEK 2.6.32-100.0.19.el5, with 8G RAM each;
* OCFS2 1.6.4;
* iSCSI connection using two bonded Gbit NICs.

Storage:
* Dell EqualLogic 4000VX -- iSCSI

Network:
* Two Dell 1Gbit trunked switches.

Problem description:

Node #1 has hung access to the filesystem, and the hung tasks have almost the same stack trace as one of the following:

---
INFO: task maildirsize:17252 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
maildirsize D 0004 0 17252 17249 0x0080
 8800b181d7f8 0086 880098de3c40 88001293a1c0
 88022bb7c4c0 88001293a598 a036a654 8800b181d7e8
 81043d10 88001293a1f8 7fff
Call Trace:
 [a036a654] ? dlmlock+0x12e2/0x13bb [ocfs2_dlm]
 [81043d10] ? update_curr+0xc9/0xd2
 [8143798a] schedule_timeout+0x36/0xe7
 [810425b3] ? need_resched+0x23/0x2d
 [814377fc] wait_for_common+0xb7/0x12c
 [8104bc15] ? default_wake_function+0x0/0x19
 [a03b07fe] ? lockres_clear_flags+0x15/0x17 [ocfs2]
 [81437914] wait_for_completion+0x1d/0x1f
 [a03b15ad] ocfs2_wait_for_mask+0x1a/0x29 [ocfs2]
 [a03b1f59] __ocfs2_cluster_lock+0x83c/0x861 [ocfs2]
 [a03c6319] ? ocfs2_inode_cache_io_unlock+0x12/0x14 [ocfs2]
 [a03f48f5] ? ocfs2_metadata_cache_io_unlock+0x1e/0x20 [ocfs2]
 [a03c344c] ? ocfs2_validate_inode_block+0x0/0x1cd [ocfs2]
 [a03c3242] ? ocfs2_read_inode_block_full+0x3e/0x5a [ocfs2]
 [a03b38dc] ocfs2_inode_lock_full_nested+0x194/0xb8d [ocfs2]
 [a03d3f8c] ? ocfs2_rename+0x49e/0x183d [ocfs2]
 [a03c344c] ? ocfs2_validate_inode_block+0x0/0x1cd [ocfs2]
 [a03d3f8c] ocfs2_rename+0x49e/0x183d [ocfs2]
 [a03a6792] ? brelse+0x13/0x15 [ocfs2]
 [a03b0692] ? init_completion+0x1f/0x21 [ocfs2]
 [a03b0741] ? ocfs2_init_mask_waiter+0x26/0x3f [ocfs2]
 [a03b0692] ? init_completion+0x1f/0x21 [ocfs2]
 [a03b202c] ? ocfs2_should_refresh_lock_res+0x8f/0x1ad [ocfs2]
 [810425b3] ? need_resched+0x23/0x2d
 [810e8512] ? kstrdup+0x2b/0xc0
 [811229eb] vfs_rename+0x221/0x3c0
 [81124968] sys_renameat+0x18b/0x201
 [81075a7c] ? autoremove_wake_function+0x0/0x3d
 [811178fc] ? fsnotify_modify+0x6c/0x74
 [8112186b] ? path_put+0x22/0x27
 [811249f9] sys_rename+0x1b/0x1d
 [81011db2] system_call_fastpath+0x16/0x1b

INFO: task imapd:17386 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
imapd D 000b 0 17386 4367 0x0080
 880208709c08 0086 0286 8801501ac800
 88012b008180 8801501acbd8 0001bbc4c49f 880127bf8d9c
Call Trace:
 [81437e31] __mutex_lock_common+0x12f/0x1a1
 [81437ef2] __mutex_lock_slowpath+0x19/0x1b
 [81437f5b] mutex_lock+0x23/0x3a
 [81121aaa] do_lookup+0x85/0x162
 [8112445c] __link_path_walk+0x49e/0x5fb
 [81238204] ? __strncpy_from_user+0x31/0x4a
 [8112460c] path_walk+0x53/0x9c
 [81124727] do_path_lookup+0x2f/0x7a
 [8112514f] user_path_at+0x57/0x91
 [810f0001] ? handle_mm_fault+0x14b/0x7d9
 [8111c366] vfs_fstatat+0x3a/0x67
 [a03b0567] ? ocfs2_inode_unlock+0x140/0x1a5 [ocfs2]
 [8111c479] vfs_stat+0x1b/0x1d
 [8111c49a] sys_newstat+0x1f/0x39
 [8143b1e3] ? do_page_fault+0x25d/0x26c
 [810a7d03] ? audit_syscall_entry+0x103/0x12f
 [81011db2] system_call_fastpath+0x16/0x1b
---

When we rebooted node #1, the following recovery messages were logged by node #0:

---
o2net: connection to node XX (num 1) at ip.ip.ip.2: has been idle for 60.0 seconds, shutting it down.
(swapper,0,0):o2net_idle_timer:1498 here are some times that might help debug the situation: (tmr 1314962116.650772 now 1314962176.650058 dr 1314962116.650749 adv 1314962116.650781:1314962116.650782 func (3f8ab666:504) 1314962114.651682:1314962114.651687)
o2net: no longer connected to node XX (num 1) at ip.ip.ip.2:
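For hang reports like this one, the lock state from every node is usually what the developers ask for alongside the stack traces. A hedged sketch of how it is commonly gathered with debugfs.ocfs2; the device path /dev/sdX and the output file names are illustrative assumptions, and the fs_locks/dlm_locks requests assume the standard debugfs.ocfs2 -R interface:

  # make sure debugfs is mounted (o2cb init scripts usually do this already)
  mount -t debugfs debugfs /sys/kernel/debug 2>/dev/null

  # dump filesystem-level and DLM-level lock state for the hung volume
  debugfs.ocfs2 -R "fs_locks" /dev/sdX > /tmp/fs_locks.node1
  debugfs.ocfs2 -R "dlm_locks" /dev/sdX > /tmp/dlm_locks.node1

  # repeat on each node and attach the outputs (or a link) to the bug report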
Re: [Ocfs2-users] dlm locking bug?
Done.

http://oss.oracle.com/bugzilla/show_bug.cgi?id=1333

Thanks.

Regards,
Sérgio Surkamp

On Fri, 02 Sep 2011 10:12:18 -0700, Sunil Mushran sunil.mush...@oracle.com wrote:

Log what you have in a bz. I can take a look. I doubt you will be able to attach that file, though; you'll need to provide me with a link.

On 09/02/2011 07:28 AM, Sérgio Surkamp wrote:

Hello,

We hit a problem this morning with our cluster.

Cluster setup:

Servers:
* Two R800 Dell servers running CentOS 5.5 and ULEK 2.6.32-100.0.19.el5, with 8G RAM each;
* OCFS2 1.6.4;
* iSCSI connection using two bonded Gbit NICs.

Storage:
* Dell EqualLogic 4000VX -- iSCSI

Network:
* Two Dell 1Gbit trunked switches.

Problem description:

Node #1 has hung access to the filesystem, and the hung tasks have almost the same stack trace as one of the following:

---
INFO: task maildirsize:17252 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
maildirsize D 0004 0 17252 17249 0x0080
 8800b181d7f8 0086 880098de3c40 88001293a1c0
 88022bb7c4c0 88001293a598 a036a654 8800b181d7e8
 81043d10 88001293a1f8 7fff
Call Trace:
 [a036a654] ? dlmlock+0x12e2/0x13bb [ocfs2_dlm]
 [81043d10] ? update_curr+0xc9/0xd2
 [8143798a] schedule_timeout+0x36/0xe7
 [810425b3] ? need_resched+0x23/0x2d
 [814377fc] wait_for_common+0xb7/0x12c
 [8104bc15] ? default_wake_function+0x0/0x19
 [a03b07fe] ? lockres_clear_flags+0x15/0x17 [ocfs2]
 [81437914] wait_for_completion+0x1d/0x1f
 [a03b15ad] ocfs2_wait_for_mask+0x1a/0x29 [ocfs2]
 [a03b1f59] __ocfs2_cluster_lock+0x83c/0x861 [ocfs2]
 [a03c6319] ? ocfs2_inode_cache_io_unlock+0x12/0x14 [ocfs2]
 [a03f48f5] ? ocfs2_metadata_cache_io_unlock+0x1e/0x20 [ocfs2]
 [a03c344c] ? ocfs2_validate_inode_block+0x0/0x1cd [ocfs2]
 [a03c3242] ? ocfs2_read_inode_block_full+0x3e/0x5a [ocfs2]
 [a03b38dc] ocfs2_inode_lock_full_nested+0x194/0xb8d [ocfs2]
 [a03d3f8c] ? ocfs2_rename+0x49e/0x183d [ocfs2]
 [a03c344c] ? ocfs2_validate_inode_block+0x0/0x1cd [ocfs2]
 [a03d3f8c] ocfs2_rename+0x49e/0x183d [ocfs2]
 [a03a6792] ? brelse+0x13/0x15 [ocfs2]
 [a03b0692] ? init_completion+0x1f/0x21 [ocfs2]
 [a03b0741] ? ocfs2_init_mask_waiter+0x26/0x3f [ocfs2]
 [a03b0692] ? init_completion+0x1f/0x21 [ocfs2]
 [a03b202c] ? ocfs2_should_refresh_lock_res+0x8f/0x1ad [ocfs2]
 [810425b3] ? need_resched+0x23/0x2d
 [810e8512] ? kstrdup+0x2b/0xc0
 [811229eb] vfs_rename+0x221/0x3c0
 [81124968] sys_renameat+0x18b/0x201
 [81075a7c] ? autoremove_wake_function+0x0/0x3d
 [811178fc] ? fsnotify_modify+0x6c/0x74
 [8112186b] ? path_put+0x22/0x27
 [811249f9] sys_rename+0x1b/0x1d
 [81011db2] system_call_fastpath+0x16/0x1b

INFO: task imapd:17386 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
imapd D 000b 0 17386 4367 0x0080
 880208709c08 0086 0286 8801501ac800
 88012b008180 8801501acbd8 0001bbc4c49f 880127bf8d9c
Call Trace:
 [81437e31] __mutex_lock_common+0x12f/0x1a1
 [81437ef2] __mutex_lock_slowpath+0x19/0x1b
 [81437f5b] mutex_lock+0x23/0x3a
 [81121aaa] do_lookup+0x85/0x162
 [8112445c] __link_path_walk+0x49e/0x5fb
 [81238204] ? __strncpy_from_user+0x31/0x4a
 [8112460c] path_walk+0x53/0x9c
 [81124727] do_path_lookup+0x2f/0x7a
 [8112514f] user_path_at+0x57/0x91
 [810f0001] ? handle_mm_fault+0x14b/0x7d9
 [8111c366] vfs_fstatat+0x3a/0x67
 [a03b0567] ? ocfs2_inode_unlock+0x140/0x1a5 [ocfs2]
 [8111c479] vfs_stat+0x1b/0x1d
 [8111c49a] sys_newstat+0x1f/0x39
 [8143b1e3] ? do_page_fault+0x25d/0x26c
 [810a7d03] ? audit_syscall_entry+0x103/0x12f
 [81011db2] system_call_fastpath+0x16/0x1b
---

When we rebooted node #1, the following recovery messages were logged by node #0:

---
o2net: connection to node XX (num 1) at ip.ip.ip.2: has been idle for 60.0 seconds, shutting it down.
(swapper,0,0):o2net_idle_timer:1498 here are some times that might help debug the situation: (tmr 1314962116.650772 now 1314962176.650058