Hello,
We just experienced a hang that looks superficially very similar to
http://www.mail-archive.com/ocfs2-users@oss.oracle.com/msg02359.html
The cluster has 3 nodes running ocfs2-1.4.1 on RHEL 5.2. Package versions and
uname output are in the attached text file, which also includes fs_locks dumps
and various other diagnostics.
The lockup happened while we were restarting a Java application that
was writing to the /journal directory, which was being read by another
Java app on a second node. Restarting the machine that the JVM was
running on did not help, which suggests a locking issue.
An ls of the directory hangs on the machine that was writing.
An ls on the machine that was reading initially worked, but an rm command
on the reader then caused that node to lock up as well.
Here's an extract from ps showing what the hung processes are waiting on
(PID, state, command, wait channel):
2222 D bash ocfs2_wait_for_mask
2282 Zl java <defunct> exit
2567 Zl java <defunct> exit
2736 D ls ocfs2_wait_for_mask
2770 D ls ocfs2_wait_for_mask
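For anyone wanting to pull the same extract out of a full ps listing, a minimal Python sketch follows. It assumes the four-column layout used here (PID, STAT, COMMAND, WCHAN) and treats any STAT beginning with D as uninterruptible sleep; the function name blocked_processes is only illustrative.

```python
# Rows copied from the extract above: PID, STAT, COMMAND, WCHAN.
PS_OUTPUT = """\
2222 D bash ocfs2_wait_for_mask
2282 Zl java <defunct> exit
2567 Zl java <defunct> exit
2736 D ls ocfs2_wait_for_mask
2770 D ls ocfs2_wait_for_mask
"""


def blocked_processes(ps_text):
    """Return (pid, command, wchan) for rows whose STAT starts with 'D'."""
    blocked = []
    for line in ps_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip the header line and anything malformed
        pid, stat = int(fields[0]), fields[1]
        if stat.startswith("D"):
            # The last field is the wait channel; the command may contain
            # spaces (e.g. "java <defunct>"), so join what lies in between.
            blocked.append((pid, " ".join(fields[2:-1]), fields[-1]))
    return blocked


for pid, command, wchan in blocked_processes(PS_OUTPUT):
    print(pid, command, wchan)
```

The zombie java processes are skipped because their state is Z, not D.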
Andy
[EMAIL PROTECTED] ~]# ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN
PID STAT COMMAND WIDE-WCHAN-COLUMN
1 Ss init -
2 S< migration/0 migration_thread
3 SN ksoftirqd/0 ksoftirqd
4 S< watchdog/0 watchdog
5 S< events/0 worker_thread
6 S< khelper worker_thread
7 S< kthread worker_thread
9 S< xenwatch xenwatch_thread
10 S< xenbus xb_read
19 S< migration/1 migration_thread
20 SN ksoftirqd/1 ksoftirqd
21 S< watchdog/1 watchdog
22 S< events/1 worker_thread
23 S< migration/2 migration_thread
24 SN ksoftirqd/2 ksoftirqd
25 S< watchdog/2 watchdog
26 S< events/2 worker_thread
27 S< migration/3 migration_thread
28 SN ksoftirqd/3 ksoftirqd
29 S< watchdog/3 watchdog
30 S< events/3 worker_thread
35 S< kblockd/0 worker_thread
36 S< kblockd/1 worker_thread
37 S< kblockd/2 worker_thread
38 S< kblockd/3 worker_thread
39 S< cqueue/0 worker_thread
40 S< cqueue/1 worker_thread
41 S< cqueue/2 worker_thread
42 S< cqueue/3 worker_thread
46 S< khubd hub_thread
48 S< kseriod serio_thread
124 S pdflush pdflush
125 S pdflush pdflush
126 S< kswapd0 kswapd
127 S< aio/0 worker_thread
128 S< aio/1 worker_thread
129 S< aio/2 worker_thread
130 S< aio/3 worker_thread
260 S< kpsmoused worker_thread
314 S< ksnapd worker_thread
317 S< kjournald kjournald
342 S< kauditd kauditd_thread
371 S<s udevd -
812 S< kmpathd/0 worker_thread
813 S< kmpathd/1 worker_thread
814 S< kmpathd/2 worker_thread
815 S< kmpathd/3 worker_thread
840 S< kjournald kjournald
982 S< ib_addr worker_thread
1000 S< ib_mcast worker_thread
1001 S< ib_inform worker_thread
1002 S< local_sa worker_thread
1007 S< iw_cm_wq worker_thread
1013 S< ib_cm/0 worker_thread
1015 S< ib_cm/1 worker_thread
1016 S< ib_cm/2 worker_thread
1017 S< ib_cm/3 worker_thread
1023 S< rdma_cm worker_thread
1033 Ss iscsid -
1034 S<Ls iscsid 68407357167632383
1640 S<sl auditd stext
1642 S<sl audispd 18446612140812126016
1663 Ss syslogd -
1666 Ss klogd syslog
1677 Ss irqbalance -
1706 Ss portmap 9233302164451854337
1726 Ss rpc.statd -
1763 Ss rpc.idmapd -
1780 Ss dbus-daemon 313532581889
1824 S< user_dlm worker_thread
1834 S< o2net worker_thread
1859 S< o2hb-6EAF64F9C6 -
1868 S< ocfs2_wq worker_thread
1869 S< ocfs2dc ocfs2_downconvert_thread
1870 S< dlm_thread -
1871 S< dlm_reco_thread -
1872 S< dlm_wq worker_thread
1873 S< kjournald kjournald
1874 S< ocfs2cmt ocfs2_commit_thread
1905 Ssl pcscd stext
1938 Ss hidd 9232503764391266305
1967 Ssl nscd stext
1990 Sl snmpd stext
2023 Ss sshd -
2044 Ss sendmail -
2052 Ss sendmail pause
2070 Ss gpm -
2119 S python -
2129 Ss crond -
2148 Ss atd -
2159 Ss rhnsd -
2169 Ss hald 17474222057506996223
2170 S hald-runner -
2190 Ss+ agetty -
2221 S su wait
2222 D bash ocfs2_wait_for_mask
2282 Zl java <defunct> exit
2567 Zl java <defunct> exit
2736 D ls ocfs2_wait_for_mask
2770 D ls ocfs2_wait_for_mask
2798 Ss sshd -
2800 S sshd -
2801 Ss bash wait
2824 S su wait
2825 D+ bash ocfs2_wait_for_mask
2852 Ss sshd -
2854 S sshd -
2855 Ss bash wait
2877 S su wait
2878 S bash wait
2932 S+ strace wait
2933 Ss sshd -
2935 Ss+ bash -
2979 Ss sshd -
2981 S sshd -
2982 Ss bash wait
3010 S su wait
3011 S bash wait
3053 R+ ps -
[EMAIL PROTECTED] ~]# ls /sys/
block class firmware hypervisor module power
bus devices fs kernel o2cb
[EMAIL PROTECTED] ~]# ls /sys/o2cb/
[EMAIL PROTECTED] kernel]# rpm -qa | grep ocfs2
ocfs2-2.6.18-92.1.10.el5xen-1.4.1-1.el5
ocfs2console-1.4.1-1.el5
ocfs2-tools-1.4.1-1.el5
[EMAIL PROTECTED] kernel]# uname -a
Linux gs2ems101.gs2.tradefair 2.6.18-92.1.10.el5xen #1 SMP Wed Jul 23 04:11:52 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
[EMAIL PROTECTED] kernel]# mount
/dev/mapper/vg.base-lv.root on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/xvda1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
configfs on /sys/kernel/config type configfs (rw)
ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
/dev/xvdb1 on /var/tradex/journal/ems type ocfs2 (rw,_netdev,heartbeat=local)
debugfs on /sys/kernel/debug type debugfs (rw)
[EMAIL PROTECTED] kernel]# cat debug/o2net/sock_containers
ffff8801fb2e1000:
krefs: 3
sock: 10.80.42.200:7778 -> 10.80.42.202:44101
remote node: gs2ems103
page off: 0
handshake ok: 1
timer: 1220296829.716143
data ready: 1220296829.716135
advance start: 1220296829.716143
advance stop: 1220296829.716144
func start: 1220294620.416497
func stop: 1220294620.416504
func key: 3625018370
func type: 505
ffff8801f352d400:
krefs: 3
sock: 10.80.42.200:7778 -> 10.80.42.201:50046
remote node: gs2ems102
page off: 0
handshake ok: 1
timer: 1220296830.684202
data ready: 1220296830.684196
advance start: 1220296830.684202
advance stop: 1220296830.684203
func start: 1220294737.64839
func stop: 1220294737.64841
func key: 3625018370
func type: 505
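A rough way to compare the timestamps in the sock_containers output above is sketched below in Python. It rests on two assumptions that are not confirmed anywhere in this dump: that each value is printed as seconds and an unpadded microsecond count (which would explain 1220294737.64839 having only five digits after the dot), and that "timer" tracks recent connection activity while "func stop" marks the end of the last handled message. The helper names are hypothetical.

```python
# Subset of the sock_containers fields shown above.
SOCK_CONTAINERS = """\
ffff8801fb2e1000:
sock: 10.80.42.200:7778 -> 10.80.42.202:44101
remote node: gs2ems103
timer: 1220296829.716143
func start: 1220294620.416497
func stop: 1220294620.416504
ffff8801f352d400:
sock: 10.80.42.200:7778 -> 10.80.42.201:50046
remote node: gs2ems102
timer: 1220296830.684202
func start: 1220294737.64839
func stop: 1220294737.64841
"""

TIME_FIELDS = {"timer", "data ready", "advance start",
               "advance stop", "func start", "func stop"}


def parse_timestamp(text):
    """Convert 'SEC.USEC' to integer microseconds (ASSUMES USEC is unpadded)."""
    sec, usec = text.split(".")
    return int(sec) * 1_000_000 + int(usec)


def parse_sock_containers(dump):
    """Map each remote node name to its timestamp fields in microseconds."""
    nodes, current = {}, None
    for line in dump.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "remote node":
            current = nodes.setdefault(value, {})
        elif key in TIME_FIELDS and current is not None:
            current[key] = parse_timestamp(value)
    return nodes


def quiet_seconds(fields):
    """Seconds between the last 'func stop' and the last 'timer' update."""
    return (fields["timer"] - fields["func stop"]) / 1_000_000


for node, fields in parse_sock_containers(SOCK_CONTAINERS).items():
    print(node, quiet_seconds(fields))
```

If those assumptions hold, both connections were still active when the dump was taken, but neither had finished handling a message for well over half an hour.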
[EMAIL PROTECTED] kernel]# cat debug/o2net/send_tracking
[EMAIL PROTECTED] kernel]#
[EMAIL PROTECTED] kernel]# echo fs_locks | debugfs.ocfs2 /dev/xvdb1 | grep -i10 busy
debugfs.ocfs2 1.4.1
Lockres: W0000000000000000100207725f3fd8 Mode: Invalid
Flags: Initialized
RO Holders: 0 EX Holders: 0
Pending Action: None Pending Unlock Action: None
Requested Mode: Invalid Blocking Mode: Invalid
PR > Gets: 0 Fails: 0 Waits (usec) Total: 0 Max: 0
EX > Gets: 0 Fails: 0 Waits (usec) Total: 0 Max: 0
Disk Refreshes: 0
Lockres: M000000000000000010020700000000 Mode: No Lock
Flags: Initialized Attached Busy
RO Holders: 0 EX Holders: 0
Pending Action: Convert Pending Unlock Action: None
Requested Mode: Protected Read Blocking Mode: No Lock
PR > Gets: 320 Fails: 0 Waits (usec) Total: 0 Max: 0
EX > Gets: 2 Fails: 0 Waits (usec) Total: 0 Max: 0
Disk Refreshes: 0
Lockres: M000000000000000000005cc58bf613 Mode: Invalid
Flags: Initialized
RO Holders: 0 EX Holders: 0
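The grep above keeps ten lines of context around each match, so neighbouring lock resources come along too. A small Python sketch that keeps only the lock resources flagged Busy is shown here; it assumes the line layout seen in this dump (a "Lockres:" line starting each record, followed by "Flags:" and "Requested Mode:" lines), and the function name busy_lockres is illustrative.

```python
import re

# Abbreviated records copied from the fs_locks output above.
FS_LOCKS = """\
Lockres: W0000000000000000100207725f3fd8 Mode: Invalid
Flags: Initialized
RO Holders: 0 EX Holders: 0
Lockres: M000000000000000010020700000000 Mode: No Lock
Flags: Initialized Attached Busy
RO Holders: 0 EX Holders: 0
Pending Action: Convert Pending Unlock Action: None
Requested Mode: Protected Read Blocking Mode: No Lock
"""


def busy_lockres(dump):
    """Return one dict per lock resource whose Flags line includes 'Busy'."""
    records, current = [], None
    for line in dump.splitlines():
        start = re.match(r"Lockres:\s+(\S+)\s+Mode:\s+(.*)", line)
        if start:
            current = {"name": start.group(1), "mode": start.group(2).strip(),
                       "flags": [], "requested": None}
            records.append(current)
        elif current is not None and line.startswith("Flags:"):
            current["flags"] = line.split()[1:]
        elif current is not None and line.startswith("Requested Mode:"):
            req = re.match(r"Requested Mode:\s+(.*?)\s+Blocking Mode:", line)
            if req:
                current["requested"] = req.group(1)
    return [r for r in records if "Busy" in r["flags"]]


for rec in busy_lockres(FS_LOCKS):
    print(rec["name"], rec["mode"], "->", rec["requested"])
```

On the sample this leaves only the M lock, which holds No Lock and has a pending convert to Protected Read.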
[EMAIL PROTECTED] kernel]# debugfs.ocfs2
debugfs.ocfs2 1.4.1
debugfs: open /dev/xvdb1
debugfs: stat <M000000000000000010020700000000>
Inode: 1049095 Mode: 0777 Generation: 1918844888 (0x725f3fd8)
FS Generation: 3314284051 (0xc58bf613)
Type: Directory Attr: 0x0 Flags: Valid
User: 512 (tradex) Group: 512 (tradex) Size: 4096
Links: 2 Clusters: 1
ctime: 0x48bc37bf -- Mon Sep 1 18:43:11 2008
atime: 0x48bc378d -- Mon Sep 1 18:42:21 2008
mtime: 0x48bc37bf -- Mon Sep 1 18:43:11 2008
dtime: 0x0 -- Thu Jan 1 00:00:00 1970
ctime_nsec: 0x15afeb17 -- 363850519
atime_nsec: 0x21e642d7 -- 568738519
mtime_nsec: 0x15afeb17 -- 363850519
Last Extblk: 0
Sub Alloc Slot: 0 Sub Alloc Bit: 2
Tree Depth: 0 Count: 243 Next Free Rec: 1
## Offset Clusters Block# Flags
0 0 1 1128961 0x0
debugfs: locate <M000000000000000010020700000000>
1049095 /journal/
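The lock names can also be decoded by hand as a cross-check on the stat and locate output above. The Python sketch below assumes the usual OCFS2 name layout of one type character, six pad characters, a 16-digit hex block number and an 8-digit hex generation; that layout is my assumption and not something this dump states, although the decoded block number does match the inode reported by stat. The function name decode_lock_name is hypothetical.

```python
def decode_lock_name(name):
    """Split a 31-character lock name into type, block number and generation.

    ASSUMED layout: 1 type char + 6 pad chars + 16 hex digits (block number)
    + 8 hex digits (generation).
    """
    if len(name) != 31:
        raise ValueError("expected a 31-character lock name")
    return {
        "type": name[0],
        "blkno": int(name[-24:-8], 16),
        "generation": int(name[-8:], 16),
    }


# Lock names copied from the fs_locks dump.
for lock in ("M000000000000000010020700000000",
             "W0000000000000000100207725f3fd8"):
    print(lock, decode_lock_name(lock))
```

Both names decode to block 0x100207, which is inode 1049095, and the W lock carries generation 0x725f3fd8 as shown in the stat output.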
[EMAIL PROTECTED] kernel]# cat /sys/kernel/debug/o2dlm/6EAF64F9C61F4421A45B97A4418ADE4F/dlm_state
Domain: 6EAF64F9C61F4421A45B97A4418ADE4F Key: 0xd8116402
Thread Pid: 1870 Node: 0 State: JOINED
Number of Joins: 1 Joining Node: 255
Domain Map: 0 1 2
Live Map: 0 1 2
Mastered Resources Total: 33 Locally: 0 Remotely: 33 Unknown: 0
Lists: Dirty=Empty Purge=Empty PendingASTs=Empty PendingBASTs=Empty Master=Empty
Purge Count: 0 Refs: 1
Dead Node: 255
Recovery Pid: 1871 Master: 255 State: INACTIVE
Recovery Map:
Recovery Node State: