Was the first set of messages seen on all nodes? On that node at least,
the o2hb node-down event fired. It should have fired on all nodes.
This is the dlm eviction message.
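
A quick way to check is to grep each node's kernel log for the heartbeat
and eviction messages around the time of the crash. Something like the
below (the exact message text and log path vary with distro and version,
so treat it as a rough sketch):

# run on every node; adjust the path to wherever syslog puts kernel messages
grep -iE "o2hb|evict" /var/log/messages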

If they all fired, then look for a node that logged a message
reading "Node x is the Recovery Master for the Dead Node y".

That shows a node was elected to run the dlm recovery. That has
to complete before the journal is replayed, which shows up as
"Recovering node x from slot y on device".
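
If you have syslog from all the nodes, grepping for those two strings
shows how far recovery got on each one. Again just a sketch, the log
path is only an example:

# on each surviving node
grep -i "recovery master" /var/log/messages
grep -i "recovering node" /var/log/messages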

I did a quick scan of the patches since 2.6.28. There are a lot
of them, but I did not see any fixes in this area.
git log --oneline --no-merges v2.6.28..HEAD fs/ocfs2
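
If you want to narrow that scan down yourself, one way is to filter for
recovery-related commits (assuming a mainline tree checked out locally):

git log --oneline --no-merges v2.6.28..HEAD fs/ocfs2 | grep -i recover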

Sunil

Gabriele Alberti wrote:
> Hello,
> I have a weird behavior in my ocfs2 cluster. I have a few nodes
> accessing a shared device, and everything works fine until one
> node crashes for whatever reason. When this happens, the ocfs2
> filesystem hangs and it seems impossible to access it until I
> bring down all the nodes but one. I have a (commented) log of what
> happened a few nights ago, when a node shut itself down because of a fan
> failure. In order to avoid uncontrolled re-joins to the cluster, my
> nodes stay down once they go down for any reason.
>
> The log is available at http://pastebin.com/gDg577hH
>
> Is this the expected behavior? I thought when one node fails, the rest
> of the world should go on working after the timeout (I used default
> values for timeouts).
>
> Here are my versions
>
> # modinfo ocfs2
> filename:       /lib/modules/2.6.28.9/kernel/fs/ocfs2/ocfs2.ko
> author:         Oracle
> license:        GPL
> description:    OCFS2 1.5.0
> version:        1.5.0
> vermagic:       2.6.28.9 SMP mod_unload modversions PENTIUM4 4KSTACKS
> depends:        jbd2,ocfs2_stackglue,ocfs2_nodemanager
> srcversion:     FEA8BA1FCC9D61DAAF32077
>
> Best regards,
>
> G.

