Hello, I looked for the info you requested.

1) The eviction message was on all nodes. Grepping through the logs, I
noticed that on some nodes it appeared twice, with different numbers in
parentheses.
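This is more or less the grep I ran to collect them (a sketch, not my
exact command; the node list and the /var/log/kern.log path are
assumptions from my setup, on other distros the messages may land in
/var/log/messages instead):

  # run the same grep on every node in the cluster
  for n in node05 node07 node10; do
    echo "== $n =="
    ssh "$n" grep o2dlm_eviction_cb /var/log/kern.log
  done

These are the hits: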
Mar 4 04:10:22 node05 kernel: (22595,1):o2dlm_eviction_cb:258 o2dlm has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
Mar 4 04:10:23 node05 kernel: (22328,0):o2dlm_eviction_cb:258 o2dlm has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
Mar 4 04:10:35 node07 kernel: (6900,0):o2dlm_eviction_cb:258 o2dlm has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
Mar 4 04:10:35 node07 kernel: (6892,0):o2dlm_eviction_cb:258 o2dlm has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471

2) The recovery master message appeared on one node; here is its log at
that time. Please note that node10 (the hostname) is node 3 in the OCFS2
cluster configuration:

Mar 4 04:09:51 node10 kernel: o2net: connection to node node08 (num 9) at 192.168.1.8:7777 has been idle for 30.0 seconds, shutting it down.
Mar 4 04:09:51 node10 kernel: (0,0):o2net_idle_timer:1498 here are some times that might help debug the situation: (tmr 1267672161.718025 now 1267672191.723171 dr 1267672161.718019 adv 1267672161.718025:1267672161.718026 func (a6c57cb2:502) 1267552114.706439:1267552114.706441)
Mar 4 04:09:51 node10 kernel: o2net: no longer connected to node node08 (num 9) at 192.168.1.8:7777
Mar 4 04:10:21 node10 kernel: (30475,0):o2net_connect_expired:1659 ERROR: no connection established with node 9 after 30.0 seconds, giving up and returning errors.
Mar 4 04:10:23 node10 kernel: (30740,1):o2dlm_eviction_cb:258 o2dlm has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
Mar 4 04:10:23 node10 kernel: (30772,0):dlm_get_lock_resource:839 B2F5C3291557493B99AE7326AF8B7471:$RECOVERY: at least one node (9) to recover before lock mastery can begin
Mar 4 04:10:23 node10 kernel: (30772,0):dlm_get_lock_resource:873 B2F5C3291557493B99AE7326AF8B7471: recovery map is not empty, but must master $RECOVERY lock now
Mar 4 04:10:23 node10 kernel: (30772,0):dlm_do_recovery:524 (30772) Node 3 is the Recovery Master for the Dead Node 9 for Domain B2F5C3291557493B99AE7326AF8B7471

The log doesn't contain anything else until the morning. Instead,
another node has the following:

Mar 4 04:10:29 node05 kernel: (1861,1):ocfs2_replay_journal:1224 Recovering node 9 from slot 7 on device (152,0)

But the OCFS2 volume was unavailable anyway.
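Since the o2net lines above mention the 30.0 second idle timeout, I also
read the live values back from configfs to confirm I really am running
the defaults (a sketch from my setup; the configfs mount point and the
cluster name "ocfs2" are assumptions, yours has to match the name in
/etc/ocfs2/cluster.conf):

  # o2cb exposes the o2net timing knobs as cluster attributes
  CLUSTER=ocfs2
  cat /sys/kernel/config/cluster/$CLUSTER/idle_timeout_ms    # the 30.0s above
  cat /sys/kernel/config/cluster/$CLUSTER/keepalive_delay_ms
  cat /sys/kernel/config/cluster/$CLUSTER/reconnect_delay_ms

They show the defaults here, as I mentioned before.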
Any other hint?

Regards,

G.

On Wed, Mar 10, 2010 at 8:56 PM, Sunil Mushran <sunil.mush...@oracle.com> wrote:
> Were the first set of messages on all nodes? On that node at least
> the o2hb node-down event fired. It should have fired on all nodes.
> This is the dlm eviction message.
>
> If they all fired, then look for a node to have a message that
> reads "Node x is the Recovery Master for the Dead Node y".
>
> That shows a node was elected to run the dlm recovery. That has
> to complete before the journal is replayed: "Recovering node x
> from slot y on device".
>
> I did a quick scan of the patches since 2.6.28. There are a lot
> of them. I did not see any fixes in this area.
>
> git log --oneline --no-merges v2.6.28..HEAD fs/ocfs2
>
> Sunil
>
> Gabriele Alberti wrote:
>>
>> Hello,
>> I have a weird behavior in my OCFS2 cluster. I have a few nodes
>> accessing a shared device, and everything works fine until one
>> node crashes for whatever reason. When this happens, the OCFS2
>> filesystem hangs and it seems impossible to access it until I
>> bring down all the nodes but one. I have a (commented) log of
>> what happened a few nights ago, when a node shut itself down
>> because of a fan failure. In order to avoid uncontrolled re-joins
>> to the cluster, my nodes stay off when they go off for a reason.
>>
>> The log is available at http://pastebin.com/gDg577hH
>>
>> Is this the expected behavior? I thought that when one node fails,
>> the rest of the world should go on working after the timeout
>> (I used the default values for the timeouts).
>>
>> Here are my versions:
>>
>> # modinfo ocfs2
>> filename:       /lib/modules/2.6.28.9/kernel/fs/ocfs2/ocfs2.ko
>> author:         Oracle
>> license:        GPL
>> description:    OCFS2 1.5.0
>> version:        1.5.0
>> vermagic:       2.6.28.9 SMP mod_unload modversions PENTIUM4 4KSTACKS
>> depends:        jbd2,ocfs2_stackglue,ocfs2_nodemanager
>> srcversion:     FEA8BA1FCC9D61DAAF32077
>>
>> Best regards,
>>
>> G.