Hello,

I am running a 4-node ocfs2 cluster. Our servers are running Red Hat AS4, kernel
2.6.9-34.0.1.ELhugemem. Our ocfs2 package versions are:

# rpm -qa | grep ocfs2
ocfs2-tools-debuginfo-1.2.1-1
ocfs2-tools-1.2.1-1
ocfs2-2.6.9-34.0.1.ELhugemem-1.2.2-1
ocfs2console-1.2.1-1

One of the nodes (#3) crashed. We rebooted node 3,
but now it hangs as it tries to rejoin the cluster. On two of the nodes that are up (0 and 1), I am getting
repeated messages in /var/log/messages that look like this:

Jul 19 09:39:40 radon6 kernel: (3994,2):dlm_query_join_handler:614 node 3 trying to join, but recovery is ongoing.
Jul 19 09:39:50 radon6 last message repeated 25 times
Jul 19 09:39:51 radon6 kernel: (27704,1):dlm_get_lock_resource:895 46A341FD43114DE4A10E7D63C5099461:M0000000000000000667f6c991b8fc9: at least one node (3) torecover before lock mastery can begin
Jul 19 09:39:51 radon6 kernel: (3994,2):dlm_query_join_handler:614 node 3 trying to join, but recovery is ongoing.
Jul 19 09:39:51 radon6 kernel: (10183,1):dlm_get_lock_resource:895 46A341FD43114DE4A10E7D63C5099461:M00000000000000000081e17e89ae74: at least one node (3) torecover before lock mastery can begin
Jul 19 09:39:51 radon6 kernel: (3994,2):dlm_query_join_handler:614 node 3 trying to join, but recovery is ongoing.

This appears to be an infinite loop, and node 3 never starts. I’m not seeing the messages on node 2.

The cluster is up and running on 3 of the 4 servers, but I
need to get all 4 nodes running again. Can anyone provide any insight on what is going on or how
this should be handled?

Thanks!

Oracle Applications DBA
[EMAIL PROTECTED]
