Hello,

 We just experienced a hang that looks superficially very similar to 
http://www.mail-archive.com/ocfs2-users@oss.oracle.com/msg02359.html

There are 3 nodes in the cluster, running ocfs2 1.4.1 on RHEL 5.2.
Package versions and uname output are in the attached text file, which
also includes fs_locks dumps and various other diagnostics.

The lockup happened when we were restarting a Java application that
was writing to the /journal directory, which was being read by another
Java app on a second node. Restarting the machine that the JVM was
running on did not help, which suggests a locking issue.

An ls of the directory hangs on the machine that was writing.
An ls on the machine that was reading initially worked, but an rm
command on the reader then caused that node to lock up as well.

Here's an extract of the ps output on the writer showing what the hung
processes are waiting on (PID, state, command, wait channel).

 2222 D    bash            ocfs2_wait_for_mask
 2282 Zl   java <defunct>  exit
 2567 Zl   java <defunct>  exit
 2736 D    ls              ocfs2_wait_for_mask
 2770 D    ls              ocfs2_wait_for_mask
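
For anyone wanting to pull the same extract from a full ps listing, here is
a minimal sketch. It assumes input in the pid,stat,comm,wchan column order
used in the attached file; filter_dstate is a hypothetical helper name, not
part of any tool.

```shell
# Print PID, command and wait channel for tasks in uninterruptible sleep
# (STAT starting with "D"). Reads ps output on stdin.
# filter_dstate is a hypothetical helper name.
filter_dstate() {
    # The header line is skipped naturally: its second field is "STAT".
    awk '$2 ~ /^D/ { print $1, $3, $NF }'
}

# Usage, with the same ps invocation as in the attached file:
#   ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | filter_dstate
```

The zombie java entries are left out on purpose, since they are in state Z
rather than D.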

Andy

 


________________________________________________________________________
[EMAIL PROTECTED] ~]# ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN
  PID STAT COMMAND         WIDE-WCHAN-COLUMN
    1 Ss   init            -
    2 S<   migration/0     migration_thread
    3 SN   ksoftirqd/0     ksoftirqd
    4 S<   watchdog/0      watchdog
    5 S<   events/0        worker_thread
    6 S<   khelper         worker_thread
    7 S<   kthread         worker_thread
    9 S<   xenwatch        xenwatch_thread
   10 S<   xenbus          xb_read
   19 S<   migration/1     migration_thread
   20 SN   ksoftirqd/1     ksoftirqd
   21 S<   watchdog/1      watchdog
   22 S<   events/1        worker_thread
   23 S<   migration/2     migration_thread
   24 SN   ksoftirqd/2     ksoftirqd
   25 S<   watchdog/2      watchdog
   26 S<   events/2        worker_thread
   27 S<   migration/3     migration_thread
   28 SN   ksoftirqd/3     ksoftirqd
   29 S<   watchdog/3      watchdog
   30 S<   events/3        worker_thread
   35 S<   kblockd/0       worker_thread
   36 S<   kblockd/1       worker_thread
   37 S<   kblockd/2       worker_thread
   38 S<   kblockd/3       worker_thread
   39 S<   cqueue/0        worker_thread
   40 S<   cqueue/1        worker_thread
   41 S<   cqueue/2        worker_thread
   42 S<   cqueue/3        worker_thread
   46 S<   khubd           hub_thread
   48 S<   kseriod         serio_thread
  124 S    pdflush         pdflush
  125 S    pdflush         pdflush
  126 S<   kswapd0         kswapd
  127 S<   aio/0           worker_thread
  128 S<   aio/1           worker_thread
  129 S<   aio/2           worker_thread
  130 S<   aio/3           worker_thread
  260 S<   kpsmoused       worker_thread
  314 S<   ksnapd          worker_thread
  317 S<   kjournald       kjournald
  342 S<   kauditd         kauditd_thread
  371 S<s  udevd           -
  812 S<   kmpathd/0       worker_thread
  813 S<   kmpathd/1       worker_thread
  814 S<   kmpathd/2       worker_thread
  815 S<   kmpathd/3       worker_thread
  840 S<   kjournald       kjournald
  982 S<   ib_addr         worker_thread
 1000 S<   ib_mcast        worker_thread
 1001 S<   ib_inform       worker_thread
 1002 S<   local_sa        worker_thread
 1007 S<   iw_cm_wq        worker_thread
 1013 S<   ib_cm/0         worker_thread
 1015 S<   ib_cm/1         worker_thread
 1016 S<   ib_cm/2         worker_thread
 1017 S<   ib_cm/3         worker_thread
 1023 S<   rdma_cm         worker_thread
 1033 Ss   iscsid          -
 1034 S<Ls iscsid          68407357167632383
 1640 S<sl auditd          stext
 1642 S<sl audispd         18446612140812126016
 1663 Ss   syslogd         -
 1666 Ss   klogd           syslog
 1677 Ss   irqbalance      -
 1706 Ss   portmap         9233302164451854337
 1726 Ss   rpc.statd       -
 1763 Ss   rpc.idmapd      -
 1780 Ss   dbus-daemon     313532581889
 1824 S<   user_dlm        worker_thread
 1834 S<   o2net           worker_thread
 1859 S<   o2hb-6EAF64F9C6 -
 1868 S<   ocfs2_wq        worker_thread
 1869 S<   ocfs2dc         ocfs2_downconvert_thread
 1870 S<   dlm_thread      -
 1871 S<   dlm_reco_thread -
 1872 S<   dlm_wq          worker_thread
 1873 S<   kjournald       kjournald
 1874 S<   ocfs2cmt        ocfs2_commit_thread
 1905 Ssl  pcscd           stext
 1938 Ss   hidd            9232503764391266305
 1967 Ssl  nscd            stext
 1990 Sl   snmpd           stext
 2023 Ss   sshd            -
 2044 Ss   sendmail        -
 2052 Ss   sendmail        pause
 2070 Ss   gpm             -
 2119 S    python          -
 2129 Ss   crond           -
 2148 Ss   atd             -
 2159 Ss   rhnsd           -
 2169 Ss   hald            17474222057506996223
 2170 S    hald-runner     -
 2190 Ss+  agetty          -
 2221 S    su              wait
 2222 D    bash            ocfs2_wait_for_mask
 2282 Zl   java <defunct>  exit
 2567 Zl   java <defunct>  exit
 2736 D    ls              ocfs2_wait_for_mask
 2770 D    ls              ocfs2_wait_for_mask
 2798 Ss   sshd            -
 2800 S    sshd            -
 2801 Ss   bash            wait
 2824 S    su              wait
 2825 D+   bash            ocfs2_wait_for_mask
 2852 Ss   sshd            -
 2854 S    sshd            -
 2855 Ss   bash            wait
 2877 S    su              wait
 2878 S    bash            wait
 2932 S+   strace          wait
 2933 Ss   sshd            -
 2935 Ss+  bash            -
 2979 Ss   sshd            -
 2981 S    sshd            -
 2982 Ss   bash            wait
 3010 S    su              wait
 3011 S    bash            wait
 3053 R+   ps              -
[EMAIL PROTECTED] ~]# ls /sys/       
block  class    firmware  hypervisor  module  power
bus    devices  fs        kernel      o2cb
[EMAIL PROTECTED] ~]# ls /sys/o2cb/

[EMAIL PROTECTED] kernel]# rpm -qa | grep ocfs2
ocfs2-2.6.18-92.1.10.el5xen-1.4.1-1.el5
ocfs2console-1.4.1-1.el5
ocfs2-tools-1.4.1-1.el5

[EMAIL PROTECTED] kernel]# uname -a
Linux gs2ems101.gs2.tradefair 2.6.18-92.1.10.el5xen #1 SMP Wed Jul 23 04:11:52 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
[EMAIL PROTECTED] kernel]# mount
/dev/mapper/vg.base-lv.root on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/xvda1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
configfs on /sys/kernel/config type configfs (rw)
ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
/dev/xvdb1 on /var/tradex/journal/ems type ocfs2 (rw,_netdev,heartbeat=local)
debugfs on /sys/kernel/debug type debugfs (rw)

[EMAIL PROTECTED] kernel]# cat debug/o2net/sock_containers 
ffff8801fb2e1000:
  krefs:           3
  sock:            10.80.42.200:7778 -> 10.80.42.202:44101
  remote node:     gs2ems103
  page off:        0
  handshake ok:    1
  timer:           1220296829.716143
  data ready:      1220296829.716135
  advance start:   1220296829.716143
  advance stop:    1220296829.716144
  func start:      1220294620.416497
  func stop:       1220294620.416504
  func key:        3625018370
  func type:       505
ffff8801f352d400:
  krefs:           3
  sock:            10.80.42.200:7778 -> 10.80.42.201:50046
  remote node:     gs2ems102
  page off:        0
  handshake ok:    1
  timer:           1220296830.684202
  data ready:      1220296830.684196
  advance start:   1220296830.684202
  advance stop:    1220296830.684203
  func start:      1220294737.64839
  func stop:       1220294737.64841
  func key:        3625018370
  func type:       505

[EMAIL PROTECTED] kernel]# cat debug/o2net/send_tracking 
[EMAIL PROTECTED] kernel]# 
[EMAIL PROTECTED] kernel]# echo fs_locks | debugfs.ocfs2 /dev/xvdb1 | grep -i10 busy
debugfs.ocfs2 1.4.1
Lockres: W0000000000000000100207725f3fd8  Mode: Invalid
Flags: Initialized
RO Holders: 0  EX Holders: 0
Pending Action: None  Pending Unlock Action: None
Requested Mode: Invalid  Blocking Mode: Invalid
PR > Gets: 0  Fails: 0    Waits (usec) Total: 0  Max: 0
EX > Gets: 0  Fails: 0    Waits (usec) Total: 0  Max: 0
Disk Refreshes: 0

Lockres: M000000000000000010020700000000  Mode: No Lock
Flags: Initialized Attached Busy
RO Holders: 0  EX Holders: 0
Pending Action: Convert  Pending Unlock Action: None
Requested Mode: Protected Read  Blocking Mode: No Lock
PR > Gets: 320  Fails: 0    Waits (usec) Total: 0  Max: 0
EX > Gets: 2  Fails: 0    Waits (usec) Total: 0  Max: 0
Disk Refreshes: 0

Lockres: M000000000000000000005cc58bf613  Mode: Invalid
Flags: Initialized
RO Holders: 0  EX Holders: 0
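
The grep context above cuts records off mid-way. As an alternative, a rough
sketch that prints only the names of lock resources flagged Busy, assuming
the fs_locks record layout shown here (a "Lockres:" line followed by a
"Flags:" line). busy_lockres is a hypothetical helper name.

```shell
# List lock resources whose Flags line contains "Busy".
# Reads fs_locks output on stdin. busy_lockres is a hypothetical helper name.
busy_lockres() {
    awk '/^Lockres:/ { name = $2 }              # remember the current lock name
         /^Flags:/ && /Busy/ { print name }'    # report it if flagged Busy
}

# Usage, with the same device as in this transcript:
#   echo fs_locks | debugfs.ocfs2 /dev/xvdb1 | busy_lockres
```
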
[EMAIL PROTECTED] kernel]# debugfs.ocfs2 
debugfs.ocfs2 1.4.1
debugfs: open /dev/xvdb1
debugfs: stat <M000000000000000010020700000000>

        Inode: 1049095   Mode: 0777   Generation: 1918844888 (0x725f3fd8)
        FS Generation: 3314284051 (0xc58bf613)
        Type: Directory   Attr: 0x0   Flags: Valid 
        User: 512 (tradex)   Group: 512 (tradex)   Size: 4096
        Links: 2   Clusters: 1
        ctime: 0x48bc37bf -- Mon Sep  1 18:43:11 2008
        atime: 0x48bc378d -- Mon Sep  1 18:42:21 2008
        mtime: 0x48bc37bf -- Mon Sep  1 18:43:11 2008
        dtime: 0x0 -- Thu Jan  1 00:00:00 1970
        ctime_nsec: 0x15afeb17 -- 363850519
        atime_nsec: 0x21e642d7 -- 568738519
        mtime_nsec: 0x15afeb17 -- 363850519
        Last Extblk: 0
        Sub Alloc Slot: 0   Sub Alloc Bit: 2
        Tree Depth: 0   Count: 243   Next Free Rec: 1
        ## Offset        Clusters       Block#          Flags
        0  0             1              1128961         0x0

debugfs: locate <M000000000000000010020700000000>
        1049095 /journal/
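
The inode number can also be read straight from the lock name without
opening the device. A minimal sketch, assuming the name is laid out as one
type character, six pad characters, the block number in hex, and the last
eight hex digits as the generation. That layout is an assumption, but it
agrees with the stat output above (0x100207 = 1049095, and the W lock ends
in the inode generation 725f3fd8). decode_lockres is a hypothetical helper
name.

```shell
# Split an OCFS2 lock resource name into type, inode (block) number and
# generation. Assumed layout: 1 type char, 6 pad chars, hex block number,
# then 8 hex digits of generation. decode_lockres is a hypothetical name.
decode_lockres() {
    name=$1
    type=$(printf '%s' "$name" | cut -c1)
    rest=${name#???????}        # drop the type char and the 6 pad chars
    blk_hex=${rest%????????}    # everything except the last 8 hex digits
    gen_hex=${rest#"$blk_hex"}  # the last 8 hex digits
    printf 'type=%s inode=%d generation=0x%s\n' "$type" "0x$blk_hex" "$gen_hex"
}

decode_lockres M000000000000000010020700000000
# prints: type=M inode=1049095 generation=0x00000000
```
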
[EMAIL PROTECTED] kernel]# cat /sys/kernel/debug/o2dlm/6EAF64F9C61F4421A45B97A4418ADE4F/dlm_state
Domain: 6EAF64F9C61F4421A45B97A4418ADE4F  Key: 0xd8116402
Thread Pid: 1870  Node: 0  State: JOINED
Number of Joins: 1  Joining Node: 255
Domain Map: 0 1 2 
Live Map: 0 1 2 
Mastered Resources Total: 33  Locally: 0  Remotely: 33  Unknown: 0
Lists: Dirty=Empty  Purge=Empty  PendingASTs=Empty  PendingBASTs=Empty  Master=Empty
Purge Count: 0  Refs: 1
Dead Node: 255
Recovery Pid: 1871  Master: 255  State: INACTIVE
Recovery Map: 
Recovery Node State:

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
