Re: [Ocfs2-users] dlm locking

2011-11-14 Thread Sunil Mushran
o2image is only useful for debugging. It allows us to get a copy of the file
system on which we can test fsck in-house. The files in lost+found have to be
resolved manually. If they are junk, delete them. If useful, move them to
another directory.
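
For reference, a minimal sketch of that manual triage. The mount point and
destination are illustrative, and it assumes (as with other fscks) that the
recovered entries are named after their inode numbers, taken here from the
fsck run below:

ls -l /mnt/ocfs2/lost+found                      # list the recovered inodes
file /mnt/ocfs2/lost+found/*                     # identify what each entry contains
rm /mnt/ocfs2/lost+found/96784                   # junk: delete it
mv /mnt/ocfs2/lost+found/96785 /mnt/ocfs2/moh/   # useful: move it back into the tree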

On 11/11/2011 05:36 PM, Nick Khamis wrote:
 All Fixed!

 Just a few questions. Is there any documentation on how to diagnose an
 ocfs2 filesystem:
 * How to transfer an image file for testing onto a different machine,
 as you did with o2image.out?
 * Does fsck.ocfs2 -fy /dev/loop0 pretty much fix all the common problems?
 * What can I do with the files in lost+found?

 Thanks Again,

 Nick.

 On Fri, Nov 11, 2011 at 8:02 PM, Sunil Mushran sunil.mush...@oracle.com wrote:
 So it detected one cluster that was doubly allocated. It fixed it.
 Details below. The other fixes could be because the o2image was
 taken on a live volume.

 As to how this could happen... I would look at the storage.


 # fsck.ocfs2 -fy /dev/loop0
 fsck.ocfs2 1.6.3
 Checking OCFS2 filesystem in /dev/loop0:
   Label:  AsteriskServer
   UUID:   3A791AB36DED41008E58CEF52EBEEFD3
   Number of blocks:   592384
   Block size: 4096
   Number of clusters: 592384
   Cluster size:   4096
   Number of slots:2

 /dev/loop0 was run with -f, check forced.
 Pass 0a: Checking cluster allocation chains
 Pass 0b: Checking inode allocation chains
 Pass 0c: Checking extent block allocation chains
 Pass 1: Checking inodes and blocks.
 Duplicate clusters detected.  Pass 1b will be run
 Running additional passes to resolve clusters claimed by more than one
 inode...
 Pass 1b: Determining ownership of multiply-claimed clusters
 Pass 1c: Determining the names of inodes owning multiply-claimed clusters
 Pass 1d: Reconciling multiply-claimed clusters
 Cluster 161335 is claimed by the following inodes:
   /asterisk/extensions.conf
   /moh/macroform-cold_day.wav
 [DUP_CLUSTERS_CLONE] Inode /asterisk/extensions.conf may be cloned or
 deleted to break the claim it has on its clusters. Clone inode
 /asterisk/extensions.conf to break claims on clusters it shares with other
 inodes? y
 [DUP_CLUSTERS_CLONE] Inode /moh/macroform-cold_day.wav may be cloned or
 deleted to break the claim it has on its clusters. Clone inode
 /moh/macroform-cold_day.wav to break claims on clusters it shares with
 other inodes? y
 Pass 2: Checking directory entries.
 [DIRENT_INODE_FREE] Directory entry 'musiconhold.conf' refers to inode
 number 35348 which isn't allocated, clear the entry? y
 Pass 3: Checking directory connectivity.
 [LOSTFOUND_MISSING] /lost+found does not exist.  Create it so that we can
 possibly fill it with orphaned inodes? y
 Pass 4a: checking for orphaned inodes
 Pass 4b: Checking inodes link counts.
 [INODE_COUNT] Inode 96783 has a link count of 1 on disk but directory entry
 references come to 2. Update the count on disk to match? y
 [INODE_NOT_CONNECTED] Inode 96784 isn't referenced by any directory entries.
   Move it to lost+found? y
 [INODE_NOT_CONNECTED] Inode 96785 isn't referenced by any directory entries.
   Move it to lost+found? y
 [INODE_NOT_CONNECTED] Inode 96794 isn't referenced by any directory entries.
   Move it to lost+found? y
 [INODE_NOT_CONNECTED] Inode 96796 isn't referenced by any directory entries.
   Move it to lost+found? y
 All passes succeeded.
 Slot 0's journal dirty flag removed
 Slot 1's journal dirty flag removed


 [root@ca-test92 ocfs2]# fsck.ocfs2 -fy /dev/loop0
 fsck.ocfs2 1.6.3
 Checking OCFS2 filesystem in /dev/loop0:
   Label:  AsteriskServer
   UUID:   3A791AB36DED41008E58CEF52EBEEFD3
   Number of blocks:   592384
   Block size: 4096
   Number of clusters: 592384
   Cluster size:   4096
   Number of slots:2

 /dev/loop0 was run with -f, check forced.
 Pass 0a: Checking cluster allocation chains
 Pass 0b: Checking inode allocation chains
 Pass 0c: Checking extent block allocation chains
 Pass 1: Checking inodes and blocks.
 Pass 2: Checking directory entries.
 Pass 3: Checking directory connectivity.
 Pass 4a: checking for orphaned inodes
 Pass 4b: Checking inodes link counts.
 All passes succeeded.




Re: [Ocfs2-users] dlm locking

2011-11-11 Thread Nick Khamis
All Fixed!

Just a few questions. Is there any documentation on how to diagnose an
ocfs2 filesystem:
* How to transfer an image file for testing onto a different machine,
as you did with o2image.out?
* Does fsck.ocfs2 -fy /dev/loop0 pretty much fix all the common problems?
* What can I do with the files in lost+found?

Thanks Again,

Nick.

On Fri, Nov 11, 2011 at 8:02 PM, Sunil Mushran sunil.mush...@oracle.com wrote:
 So it detected one cluster that was doubly allocated. It fixed it.
 Details below. The other fixes could be because the o2image was
 taken on a live volume.

 As to how this could happen... I would look at the storage.


 # fsck.ocfs2 -fy /dev/loop0
 fsck.ocfs2 1.6.3
 Checking OCFS2 filesystem in /dev/loop0:
  Label:  AsteriskServer
  UUID:   3A791AB36DED41008E58CEF52EBEEFD3
  Number of blocks:   592384
  Block size: 4096
  Number of clusters: 592384
  Cluster size:   4096
  Number of slots:2

 /dev/loop0 was run with -f, check forced.
 Pass 0a: Checking cluster allocation chains
 Pass 0b: Checking inode allocation chains
 Pass 0c: Checking extent block allocation chains
 Pass 1: Checking inodes and blocks.
 Duplicate clusters detected.  Pass 1b will be run
 Running additional passes to resolve clusters claimed by more than one
 inode...
 Pass 1b: Determining ownership of multiply-claimed clusters
 Pass 1c: Determining the names of inodes owning multiply-claimed clusters
 Pass 1d: Reconciling multiply-claimed clusters
 Cluster 161335 is claimed by the following inodes:
  /asterisk/extensions.conf
  /moh/macroform-cold_day.wav
 [DUP_CLUSTERS_CLONE] Inode /asterisk/extensions.conf may be cloned or
 deleted to break the claim it has on its clusters. Clone inode
 /asterisk/extensions.conf to break claims on clusters it shares with other
 inodes? y
 [DUP_CLUSTERS_CLONE] Inode /moh/macroform-cold_day.wav may be cloned or
 deleted to break the claim it has on its clusters. Clone inode
 /moh/macroform-cold_day.wav to break claims on clusters it shares with
 other inodes? y
 Pass 2: Checking directory entries.
 [DIRENT_INODE_FREE] Directory entry 'musiconhold.conf' refers to inode
 number 35348 which isn't allocated, clear the entry? y
 Pass 3: Checking directory connectivity.
 [LOSTFOUND_MISSING] /lost+found does not exist.  Create it so that we can
 possibly fill it with orphaned inodes? y
 Pass 4a: checking for orphaned inodes
 Pass 4b: Checking inodes link counts.
 [INODE_COUNT] Inode 96783 has a link count of 1 on disk but directory entry
 references come to 2. Update the count on disk to match? y
 [INODE_NOT_CONNECTED] Inode 96784 isn't referenced by any directory entries.
  Move it to lost+found? y
 [INODE_NOT_CONNECTED] Inode 96785 isn't referenced by any directory entries.
  Move it to lost+found? y
 [INODE_NOT_CONNECTED] Inode 96794 isn't referenced by any directory entries.
  Move it to lost+found? y
 [INODE_NOT_CONNECTED] Inode 96796 isn't referenced by any directory entries.
  Move it to lost+found? y
 All passes succeeded.
 Slot 0's journal dirty flag removed
 Slot 1's journal dirty flag removed


 [root@ca-test92 ocfs2]# fsck.ocfs2 -fy /dev/loop0
 fsck.ocfs2 1.6.3
 Checking OCFS2 filesystem in /dev/loop0:
  Label:  AsteriskServer
  UUID:   3A791AB36DED41008E58CEF52EBEEFD3
  Number of blocks:   592384
  Block size: 4096
  Number of clusters: 592384
  Cluster size:   4096
  Number of slots:2

 /dev/loop0 was run with -f, check forced.
 Pass 0a: Checking cluster allocation chains
 Pass 0b: Checking inode allocation chains
 Pass 0c: Checking extent block allocation chains
 Pass 1: Checking inodes and blocks.
 Pass 2: Checking directory entries.
 Pass 3: Checking directory connectivity.
 Pass 4a: checking for orphaned inodes
 Pass 4b: Checking inodes link counts.
 All passes succeeded.



Re: [Ocfs2-users] dlm locking

2011-11-10 Thread Sunil Mushran
Do:

fsck.ocfs2 -f /dev/...

Without -f, it only replays the journal.
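
For reference, the three invocations that appear in this thread (the device
path is just an example):

fsck.ocfs2 /dev/drbd0      # no -f: replays the journals; skips the full check if the volume looks clean
fsck.ocfs2 -f /dev/drbd0   # forces the full check (passes 0 through 4)
fsck.ocfs2 -fy /dev/drbd0  # forces the check and answers yes to every repair prompt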

On 11/09/2011 05:49 PM, Nick Khamis wrote:
 Hello Sunil,

 This is only on the prototype so it's not crucial; however, it would be
 nice to figure out why, for future reference:

 fsck.ocfs2 /dev/drbd0
 fsck.ocfs2 1.6.4
 Checking OCFS2 filesystem in /dev/drbd0:
   Label:  AsteriskServer
   UUID:   3A791AB36DED41008E58CEF52EBEEFD3
   Number of blocks:   592384
   Block size: 4096
   Number of clusters: 592384
   Cluster size:   4096
   Number of slots:2

 /dev/drbd0 is clean.  It will be checked after 20 additional mounts.

 I can mount it and write to it just fine (read and write). It's just
 when I start the application that reads from the filesystem
 (I don't think there is any writing going on) that it goes into
 read-only mode... It used to work; other than the update to 1.6.4,
 I am not sure what I have changed.

 Not quite sure what kind of information you would need to help figure
 out the problem?

 Cheers,

 Nick.





Re: [Ocfs2-users] dlm locking

2011-11-10 Thread Nick Khamis
Hello Sunil,

Thank you so much for your time, and I do not want to take any more
of it. I ran fsck with -f and have the following:

fsck.ocfs2 -f /dev/drbd0
fsck.ocfs2 1.6.4
Checking OCFS2 filesystem in /dev/drbd0:
  Label:  ASTServer
  UUID:   3A791AB36DED41008E58CEF52EBEEFD3
  Number of blocks:   592384
  Block size: 4096
  Number of clusters: 592384
  Cluster size:   4096
  Number of slots:2

/dev/drbd0 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
Duplicate clusters detected.  Pass 1b will be run
Running additional passes to resolve clusters claimed by more than one inode...
Pass 1b: Determining ownership of multiply-claimed clusters
pass1b: Inode type does not contain extents while processing inode 5
fsck.ocfs2: Inode type does not contain extents while performing pass 1

Not sure if the read-only is due to the detected duplicate?

Thanks in Advance,

Nick.



Re: [Ocfs2-users] dlm locking

2011-11-10 Thread Sunil Mushran
The ro issue was different. It appears the volume has more problems.
If you want me to look at the issue, I'll need an image of the volume.
# o2image /dev/device /tmp/o2image.out
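
A rough sketch of how such an image can then be tested on another machine.
The -r (raw image) flag and the losetup step are assumptions on my part,
consistent with the fsck runs against /dev/loop0 elsewhere in this thread;
hosts and devices are placeholders:

o2image -r /dev/drbd0 /tmp/o2image.out   # capture the fs metadata as a raw image file (-r assumed; see o2image(8))
scp /tmp/o2image.out testhost:/tmp/      # copy the image to the test machine
losetup /dev/loop0 /tmp/o2image.out      # attach the image file as a block device
fsck.ocfs2 -fy /dev/loop0                # run the forced check against the image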

On 11/10/2011 01:55 PM, Nick Khamis wrote:
 Hello Sunil,

 Thank you so much for your time, and I do not want to take any more
 of it. I ran fsck with -f and have the following:

 fsck.ocfs2 -f /dev/drbd0
 fsck.ocfs2 1.6.4
 Checking OCFS2 filesystem in /dev/drbd0:
Label:  ASTServer
UUID:   3A791AB36DED41008E58CEF52EBEEFD3
Number of blocks:   592384
Block size: 4096
Number of clusters: 592384
Cluster size:   4096
Number of slots:2

 /dev/drbd0 was run with -f, check forced.
 Pass 0a: Checking cluster allocation chains
 Pass 0b: Checking inode allocation chains
 Pass 0c: Checking extent block allocation chains
 Pass 1: Checking inodes and blocks.
 Duplicate clusters detected.  Pass 1b will be run
 Running additional passes to resolve clusters claimed by more than one 
 inode...
 Pass 1b: Determining ownership of multiply-claimed clusters
 pass1b: Inode type does not contain extents while processing inode 5
 fsck.ocfs2: Inode type does not contain extents while performing pass 1

 Not sure if the read-only is due to the detected duplicate?

 Thanks in Advance,

 Nick.




[Ocfs2-users] dlm locking

2011-11-09 Thread Nick Khamis
Hello Everyone,

For the first time I experienced a dlm lock:

[ 9721.831813] OCFS2 DLM 1.5.0
[ 9721.917032] ocfs2: Registered cluster interface o2cb
[ 9722.170848] OCFS2 DLMFS 1.5.0
[ 9722.179018] OCFS2 User DLM kernel interface loaded
[ 9755.743195] ocfs2_dlm: Nodes in domain
(3A791AB36DED41008E58CEF52EBEEFD3): 1
[ 9755.852798] ocfs2: Mounting device (147,0) on (node 1, slot 0) with
ordered data mode.
[ 9783.240424] block drbd0: Handshake successful: Agreed network
protocol version 91
[ 9783.242922] block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
[ 9783.243074] block drbd0: conn( WFConnection -> WFReportParams )
[ 9783.243205] block drbd0: Starting asender thread (from drbd0_receiver [4390])
[ 9783.271014] block drbd0: data-integrity-alg: not-used
[ 9783.271298] block drbd0: drbd_sync_handshake:
[ 9783.271318] block drbd0: self
964FFEDA732A512B:0ABD16D2597E52D9:54E3AEC293CEDC7E:120384BD0E3A5705
bits:3 flags:0
[ 9783.271342] block drbd0: peer
B4C81B0FD76EFAC2:0ABD16D2597E52D9:54E3AEC293CEDC7F:120384BD0E3A5705
bits:0 flags:0
[ 9783.271364] block drbd0: uuid_compare()=100 by rule 90
[ 9783.271380] block drbd0: Split-Brain detected, 1 primaries,
automatically solved. Sync from this node
[ 9783.271417] block drbd0: peer( Unknown -> Secondary ) conn(
WFReportParams -> WFBitMapS )
[ 9783.399967] block drbd0: peer( Secondary -> Primary )
[ 9783.515979] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk(
Outdated -> Inconsistent )
[ 9783.522521] block drbd0: Began resync as SyncSource (will sync 12
KB [3 bits set]).
[ 9783.629758] block drbd0: Implicitly set pdsk Inconsistent!
[ 9783.799387] block drbd0: Resync done (total 1 sec; paused 0 sec; 12 K/sec)
[ 9783.799956] block drbd0: conn( SyncSource -> Connected ) pdsk(
Inconsistent -> UpToDate )
[ 9795.430801] o2net: accepted connection from node astdrbd2 (num 2)
at 192.168.2.111:
[ 9800.231650] ocfs2_dlm: Node 2 joins domain 3A791AB36DED41008E58CEF52EBEEFD3
[ 9800.231668] ocfs2_dlm: Nodes in domain
(3A791AB36DED41008E58CEF52EBEEFD3): 1 2
[ 9861.922744] OCFS2: ERROR (device drbd0):
ocfs2_validate_inode_block: Invalid dinode #35348: OCFS2_VALID_FL not
set
[ 9861.922767]
[ 9861.927278] File system is now read-only due to the potential of
on-disk corruption. Please run fsck.ocfs2 once the file system is
unmounted.
[ 9861.928231] (8009,0):ocfs2_read_locked_inode:496 ERROR: status = -22

Not sure where to start, but with your appreciated help I am sure we
can get it resolved.

Thanks in Advance,

Nick.



Re: [Ocfs2-users] dlm locking

2011-11-09 Thread Sunil Mushran
This has nothing to do with the dlm. The error states that the fs encountered
a bad inode on disk. Possible disk corruption. On encountering one, the fs
goes read-only and asks the user to run fsck.
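
Sketched out, the recovery path that message points to (the mount point is
illustrative; on a cluster the unmount has to happen on every node before
the check):

umount /mnt/ocfs2          # unmount on all nodes first
fsck.ocfs2 -f /dev/drbd0   # then run the forced check from a single node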

On 11/09/2011 11:51 AM, Nick Khamis wrote:
 Hello Everyone,

 For the first time I experienced a dlm lock:

 [ 9721.831813] OCFS2 DLM 1.5.0
 [ 9721.917032] ocfs2: Registered cluster interface o2cb
 [ 9722.170848] OCFS2 DLMFS 1.5.0
 [ 9722.179018] OCFS2 User DLM kernel interface loaded
 [ 9755.743195] ocfs2_dlm: Nodes in domain
 (3A791AB36DED41008E58CEF52EBEEFD3): 1
 [ 9755.852798] ocfs2: Mounting device (147,0) on (node 1, slot 0) with
 ordered data mode.
 [ 9783.240424] block drbd0: Handshake successful: Agreed network
 protocol version 91
 [ 9783.242922] block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
 [ 9783.243074] block drbd0: conn( WFConnection -> WFReportParams )
 [ 9783.243205] block drbd0: Starting asender thread (from drbd0_receiver 
 [4390])
 [ 9783.271014] block drbd0: data-integrity-alg:not-used
 [ 9783.271298] block drbd0: drbd_sync_handshake:
 [ 9783.271318] block drbd0: self
 964FFEDA732A512B:0ABD16D2597E52D9:54E3AEC293CEDC7E:120384BD0E3A5705
 bits:3 flags:0
 [ 9783.271342] block drbd0: peer
 B4C81B0FD76EFAC2:0ABD16D2597E52D9:54E3AEC293CEDC7F:120384BD0E3A5705
 bits:0 flags:0
 [ 9783.271364] block drbd0: uuid_compare()=100 by rule 90
 [ 9783.271380] block drbd0: Split-Brain detected, 1 primaries,
 automatically solved. Sync from this node
 [ 9783.271417] block drbd0: peer( Unknown -> Secondary ) conn(
 WFReportParams -> WFBitMapS )
 [ 9783.399967] block drbd0: peer( Secondary -> Primary )
 [ 9783.515979] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk(
 Outdated -> Inconsistent )
 [ 9783.522521] block drbd0: Began resync as SyncSource (will sync 12
 KB [3 bits set]).
 [ 9783.629758] block drbd0: Implicitly set pdsk Inconsistent!
 [ 9783.799387] block drbd0: Resync done (total 1 sec; paused 0 sec; 12 K/sec)
 [ 9783.799956] block drbd0: conn( SyncSource -> Connected ) pdsk(
 Inconsistent -> UpToDate )
 [ 9795.430801] o2net: accepted connection from node astdrbd2 (num 2)
 at 192.168.2.111:
 [ 9800.231650] ocfs2_dlm: Node 2 joins domain 3A791AB36DED41008E58CEF52EBEEFD3
 [ 9800.231668] ocfs2_dlm: Nodes in domain
 (3A791AB36DED41008E58CEF52EBEEFD3): 1 2
 [ 9861.922744] OCFS2: ERROR (device drbd0):
 ocfs2_validate_inode_block: Invalid dinode #35348: OCFS2_VALID_FL not
 set
 [ 9861.922767]
 [ 9861.927278] File system is now read-only due to the potential of
 on-disk corruption. Please run fsck.ocfs2 once the file system is
 unmounted.
 [ 9861.928231] (8009,0):ocfs2_read_locked_inode:496 ERROR: status = -22

 Not sure where to start, but with your appreciated help I am sure we
 can get it resolved.

 Thanks in Advance,

 Nick.





Re: [Ocfs2-users] dlm locking

2011-11-09 Thread Nick Khamis
Hello Sunil,

This is only on the prototype so it's not crucial; however, it would be
nice to figure out why, for future reference:

fsck.ocfs2 /dev/drbd0
fsck.ocfs2 1.6.4
Checking OCFS2 filesystem in /dev/drbd0:
 Label:  AsteriskServer
 UUID:   3A791AB36DED41008E58CEF52EBEEFD3
 Number of blocks:   592384
 Block size: 4096
 Number of clusters: 592384
 Cluster size:   4096
 Number of slots:2

/dev/drbd0 is clean.  It will be checked after 20 additional mounts.

I can mount it and write to it just fine (read and write). It's just
when I start the application that reads from the filesystem
(I don't think there is any writing going on) that it goes into
read-only mode... It used to work; other than the update to 1.6.4,
I am not sure what I have changed.

Not quite sure what kind of information you would need to help figure
out the problem?

Cheers,

Nick.



Re: [Ocfs2-users] dlm locking bug?

2011-09-02 Thread Sunil Mushran
Log what you have in a bz. I can take a look. I doubt you will be able to
attach that file though. You'll need to provide me with a link.

On 09/02/2011 07:28 AM, Sérgio Surkamp wrote:
 Hello,

 We have got a problem this morning with our cluster.

 Cluster setup:

 Servers:
 * Two R800 Dell servers running CentOS 5.5 and ULEK
2.6.32-100.0.19.el5, with 8G ram each;
 * OCFS2 1.6.4;
 * iSCSI connection using two bonded Gbit nics.

 Storage:
 * Dell EqualLogic 4000VX -- iSCSI

 Network:
 * Two dell 1Gbit trunked switches;

 Problem description:

 Node #1 has hung while accessing the filesystem, and the hung tasks have
 almost the same stack trace as one of the following:

 ---
 INFO: task maildirsize:17252 blocked for more than 120 seconds.
 echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this
 message.
 maildirsize   D 0004 0 17252  17249 0x0080
   8800b181d7f8 0086  880098de3c40
   88001293a1c0 88022bb7c4c0 88001293a598 a036a654
   8800b181d7e8 81043d10 88001293a1f8 7fff
 Call Trace:
   [a036a654] ? dlmlock+0x12e2/0x13bb [ocfs2_dlm]
   [81043d10] ? update_curr+0xc9/0xd2
   [8143798a] schedule_timeout+0x36/0xe7
   [810425b3] ? need_resched+0x23/0x2d
   [814377fc] wait_for_common+0xb7/0x12c
   [8104bc15] ? default_wake_function+0x0/0x19
   [a03b07fe] ? lockres_clear_flags+0x15/0x17 [ocfs2]
   [81437914] wait_for_completion+0x1d/0x1f
   [a03b15ad] ocfs2_wait_for_mask+0x1a/0x29 [ocfs2]
   [a03b1f59] __ocfs2_cluster_lock+0x83c/0x861 [ocfs2]
   [a03c6319] ? ocfs2_inode_cache_io_unlock+0x12/0x14 [ocfs2]
   [a03f48f5] ? ocfs2_metadata_cache_io_unlock+0x1e/0x20 [ocfs2]
   [a03c344c] ? ocfs2_validate_inode_block+0x0/0x1cd [ocfs2]
   [a03c3242] ? ocfs2_read_inode_block_full+0x3e/0x5a [ocfs2]
   [a03b38dc] ocfs2_inode_lock_full_nested+0x194/0xb8d [ocfs2]
   [a03d3f8c] ? ocfs2_rename+0x49e/0x183d [ocfs2]
   [a03c344c] ? ocfs2_validate_inode_block+0x0/0x1cd [ocfs2]
   [a03d3f8c] ocfs2_rename+0x49e/0x183d [ocfs2]
   [a03a6792] ? brelse+0x13/0x15 [ocfs2]
   [a03b0692] ? init_completion+0x1f/0x21 [ocfs2]
   [a03b0741] ? ocfs2_init_mask_waiter+0x26/0x3f [ocfs2]
   [a03b0692] ? init_completion+0x1f/0x21 [ocfs2]
   [a03b202c] ? ocfs2_should_refresh_lock_res+0x8f/0x1ad [ocfs2]
   [810425b3] ? need_resched+0x23/0x2d
   [810e8512] ? kstrdup+0x2b/0xc0
   [811229eb] vfs_rename+0x221/0x3c0
   [81124968] sys_renameat+0x18b/0x201
   [81075a7c] ? autoremove_wake_function+0x0/0x3d
   [811178fc] ? fsnotify_modify+0x6c/0x74
   [8112186b] ? path_put+0x22/0x27
   [811249f9] sys_rename+0x1b/0x1d
   [81011db2] system_call_fastpath+0x16/0x1b

 INFO: task imapd:17386 blocked for more than 120 seconds.
 echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this
 message.
 imapd D 000b 0 17386   4367 0x0080
   880208709c08 0086  0286
   8801501ac800 88012b008180 8801501acbd8 0001bbc4c49f
      880127bf8d9c
 Call Trace:
   [81437e31] __mutex_lock_common+0x12f/0x1a1
   [81437ef2] __mutex_lock_slowpath+0x19/0x1b
   [81437f5b] mutex_lock+0x23/0x3a
   [81121aaa] do_lookup+0x85/0x162
   [8112445c] __link_path_walk+0x49e/0x5fb
   [81238204] ? __strncpy_from_user+0x31/0x4a
   [8112460c] path_walk+0x53/0x9c
   [81124727] do_path_lookup+0x2f/0x7a
   [8112514f] user_path_at+0x57/0x91
   [810f0001] ? handle_mm_fault+0x14b/0x7d9
   [8111c366] vfs_fstatat+0x3a/0x67
   [a03b0567] ? ocfs2_inode_unlock+0x140/0x1a5 [ocfs2]
   [8111c479] vfs_stat+0x1b/0x1d
   [8111c49a] sys_newstat+0x1f/0x39
   [8143b1e3] ? do_page_fault+0x25d/0x26c
   [810a7d03] ? audit_syscall_entry+0x103/0x12f
   [81011db2] system_call_fastpath+0x16/0x1b
 ---

 When we rebooted node #1, the following recovery messages were
 logged by node #0:

 ---
 o2net: connection to node XX (num 1) at ip.ip.ip.2: has
 been idle for 60.0 seconds, shutting it down.
 (swapper,0,0):o2net_idle_timer:1498 here are some times that might help
 debug the situation: (tmr 1314962116.650772 now 1314962176.650058 dr
 1314962116.650749 adv 1314962116.650781:1314962116.650782 func
 (3f8ab666:504) 1314962114.651682:1314962114.651687)
 o2net: no longer connected to node XX (num 1) at
 ip.ip.ip.2:
 

Re: [Ocfs2-users] dlm locking bug?

2011-09-02 Thread Sérgio Surkamp
Done.

http://oss.oracle.com/bugzilla/show_bug.cgi?id=1333

Thanks.

Regards,
Sérgio Surkamp

On Fri, 02 Sep 2011 10:12:18 -0700,
Sunil Mushran sunil.mush...@oracle.com wrote:

 Log what you have in a bz. I can take a look. I doubt you will be
 able to attach that file though. You'll need to provide me with a
 link.
 
 On 09/02/2011 07:28 AM, Sérgio Surkamp wrote:
  Hello,
 
  We have got a problem this morning with our cluster.
 
  Cluster setup:
 
  Servers:
  * Two R800 Dell servers running CentOS 5.5 and ULEK
 2.6.32-100.0.19.el5, with 8G ram each;
  * OCFS2 1.6.4;
  * iSCSI connection using two bonded Gbit nics.
 
  Storage:
  * Dell EqualLogic 4000VX -- iSCSI
 
  Network:
  * Two dell 1Gbit trunked switches;
 
  Problem description:
 
  Node #1 has hung while accessing the filesystem, and the hung tasks
  have almost the same stack trace as one of the following:
 
  ---
  INFO: task maildirsize:17252 blocked for more than 120 seconds.
  echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this
  message.
  maildirsize   D 0004 0 17252  17249 0x0080
    8800b181d7f8 0086  880098de3c40
    88001293a1c0 88022bb7c4c0 88001293a598 a036a654
    8800b181d7e8 81043d10 88001293a1f8 7fff
  Call Trace:
    [a036a654] ? dlmlock+0x12e2/0x13bb [ocfs2_dlm]
    [81043d10] ? update_curr+0xc9/0xd2
    [8143798a] schedule_timeout+0x36/0xe7
    [810425b3] ? need_resched+0x23/0x2d
    [814377fc] wait_for_common+0xb7/0x12c
    [8104bc15] ? default_wake_function+0x0/0x19
    [a03b07fe] ? lockres_clear_flags+0x15/0x17 [ocfs2]
    [81437914] wait_for_completion+0x1d/0x1f
    [a03b15ad] ocfs2_wait_for_mask+0x1a/0x29 [ocfs2]
    [a03b1f59] __ocfs2_cluster_lock+0x83c/0x861 [ocfs2]
    [a03c6319] ? ocfs2_inode_cache_io_unlock+0x12/0x14 [ocfs2]
    [a03f48f5] ? ocfs2_metadata_cache_io_unlock+0x1e/0x20 [ocfs2]
    [a03c344c] ? ocfs2_validate_inode_block+0x0/0x1cd [ocfs2]
    [a03c3242] ? ocfs2_read_inode_block_full+0x3e/0x5a [ocfs2]
    [a03b38dc] ocfs2_inode_lock_full_nested+0x194/0xb8d [ocfs2]
    [a03d3f8c] ? ocfs2_rename+0x49e/0x183d [ocfs2]
    [a03c344c] ? ocfs2_validate_inode_block+0x0/0x1cd [ocfs2]
    [a03d3f8c] ocfs2_rename+0x49e/0x183d [ocfs2]
    [a03a6792] ? brelse+0x13/0x15 [ocfs2]
    [a03b0692] ? init_completion+0x1f/0x21 [ocfs2]
    [a03b0741] ? ocfs2_init_mask_waiter+0x26/0x3f [ocfs2]
    [a03b0692] ? init_completion+0x1f/0x21 [ocfs2]
    [a03b202c] ? ocfs2_should_refresh_lock_res+0x8f/0x1ad [ocfs2]
    [810425b3] ? need_resched+0x23/0x2d
    [810e8512] ? kstrdup+0x2b/0xc0
    [811229eb] vfs_rename+0x221/0x3c0
    [81124968] sys_renameat+0x18b/0x201
    [81075a7c] ? autoremove_wake_function+0x0/0x3d
    [811178fc] ? fsnotify_modify+0x6c/0x74
    [8112186b] ? path_put+0x22/0x27
    [811249f9] sys_rename+0x1b/0x1d
    [81011db2] system_call_fastpath+0x16/0x1b

  INFO: task imapd:17386 blocked for more than 120 seconds.
  echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this
  message.
  imapd D 000b 0 17386   4367 0x0080
    880208709c08 0086  0286
    8801501ac800 88012b008180 8801501acbd8 0001bbc4c49f
       880127bf8d9c
  Call Trace:
    [81437e31] __mutex_lock_common+0x12f/0x1a1
    [81437ef2] __mutex_lock_slowpath+0x19/0x1b
    [81437f5b] mutex_lock+0x23/0x3a
    [81121aaa] do_lookup+0x85/0x162
    [8112445c] __link_path_walk+0x49e/0x5fb
    [81238204] ? __strncpy_from_user+0x31/0x4a
    [8112460c] path_walk+0x53/0x9c
    [81124727] do_path_lookup+0x2f/0x7a
    [8112514f] user_path_at+0x57/0x91
    [810f0001] ? handle_mm_fault+0x14b/0x7d9
    [8111c366] vfs_fstatat+0x3a/0x67
    [a03b0567] ? ocfs2_inode_unlock+0x140/0x1a5 [ocfs2]
    [8111c479] vfs_stat+0x1b/0x1d
    [8111c49a] sys_newstat+0x1f/0x39
    [8143b1e3] ? do_page_fault+0x25d/0x26c
    [810a7d03] ? audit_syscall_entry+0x103/0x12f
    [81011db2] system_call_fastpath+0x16/0x1b
  ---
 
  When we rebooted node #1, the following recovery messages were
  logged by node #0:
 
  ---
  o2net: connection to node XX (num 1) at ip.ip.ip.2: has
  been idle for 60.0 seconds, shutting it down.
  (swapper,0,0):o2net_idle_timer:1498 here are some times that might
  help debug the situation: (tmr 1314962116.650772 now
  1314962176.650058