Hi, I have a 16-node cluster in which all nodes have been rebooting. I received a segfault from multipathd on one node, and then every node in the cluster rebooted. Here are the messages that appeared in the log on all nodes:
Mar 13 13:30:27 bws01 kernel: ocfs2_dlm: Node 8 leaves domain B24F4E67EBB34CAA99690B112FA6D50E
Mar 13 13:30:27 bws01 kernel: ocfs2_dlm: Nodes in domain ("B24F4E67EBB34CAA99690B112FA6D50E"): 0 1 2 3 5 6 7 9 10 13 15 17
Mar 13 13:30:33 bws01 kernel: ocfs2_dlm: Node 8 leaves domain F575B164F63E4E888004C70D9F84D779
Mar 13 13:30:33 bws01 kernel: ocfs2_dlm: Nodes in domain ("F575B164F63E4E888004C70D9F84D779"): 0 1 2 3 5 6 7 9 10 13 15 16 17
Mar 13 13:30:39 bws01 kernel: ocfs2_dlm: Node 8 leaves domain A70D0DC186724FF388CDE65EC540C444
Mar 13 13:30:39 bws01 kernel: ocfs2_dlm: Nodes in domain ("A70D0DC186724FF388CDE65EC540C444"): 0 1 2 3 5 6 7 9 10
Mar 13 13:30:45 bws01 kernel: ocfs2_dlm: Node 8 leaves domain B31B07823153433C948F63199CE4A31C
Mar 13 13:30:45 bws01 kernel: ocfs2_dlm: Nodes in domain ("B31B07823153433C948F63199CE4A31C"): 0 1 2 3 5 6 7 9 10
Mar 13 13:31:11 bws01 xinetd[4934]: START: nrpe pid=1065 from=10.10.8.20
Mar 13 13:31:11 bws01 xinetd[4934]: EXIT: nrpe status=0 pid=1065 duration=0(sec)
Mar 13 13:32:27 bws01 kernel: o2net: connection to node bapp05 (num 8) at 10.10.16.15:7777 has been idle for 30.0 seconds, shutting it down.
Mar 13 13:32:27 bws01 kernel: (0,0):o2net_idle_timer:1476 here are some times that might help debug the situation: (tmr 1236965517.208305 now 1236965547.207461 dr 1236965517.208295 adv 1236965517.208311:1236965517.208312 func (ee9d109e:513) 1236965445.298207:1236965445.298219)
Mar 13 13:32:27 bws01 kernel: o2net: no longer connected to node bapp05 (num 8) at 10.10.16.15:7777
Mar 13 13:32:55 bws01 xinetd[4934]: START: nrpe pid=1068 from=10.10.8.20
Mar 13 13:32:55 bws01 xinetd[4934]: EXIT: nrpe status=0 pid=1068 duration=0(sec)
Mar 13 13:32:57 bws01 kernel: (4586,0):o2net_connect_expired:1637 ERROR: no connection established with node 8 after 30.0 seconds, giving up and returning errors.
Mar 13 13:33:00 bws01 kernel: (4586,0):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 8

Why would this cause all nodes in the cluster to reboot? It seems to me that only node 8 should have been kicked out.

Thanks,
Andrew

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users