Thanks. I will gather the information and file a bugzilla.

On Fri, 13 Mar 2009, Sunil Mushran wrote:
> Impossible to determine the cause with what you have provided. File a
> bugzilla and attach messages from all nodes. No exceptions. If you have
> netconsole set up (you should), attach those logs. That way we'll know if
> the nodes oopsed and, if so, what the stack was.
>
> Sunil
>
> On Fri, Mar 13, 2009 at 03:01:57PM -0400, and...@temporalspaces.com wrote:
>> Hi-
>>
>> I have a 16-node cluster that has been rebooting all nodes in the cluster.
>> I received a seg-fault from multipathd on one node and then all nodes in
>> the cluster rebooted. Here is the error message that appeared on all nodes:
>>
>> Mar 13 13:30:27 bws01 kernel: ocfs2_dlm: Node 8 leaves domain B24F4E67EBB34CAA99690B112FA6D50E
>> Mar 13 13:30:27 bws01 kernel: ocfs2_dlm: Nodes in domain ("B24F4E67EBB34CAA99690B112FA6D50E"): 0 1 2 3 5 6 7 9 10 13 15 17
>> Mar 13 13:30:33 bws01 kernel: ocfs2_dlm: Node 8 leaves domain F575B164F63E4E888004C70D9F84D779
>> Mar 13 13:30:33 bws01 kernel: ocfs2_dlm: Nodes in domain ("F575B164F63E4E888004C70D9F84D779"): 0 1 2 3 5 6 7 9 10 13 15 16 17
>> Mar 13 13:30:39 bws01 kernel: ocfs2_dlm: Node 8 leaves domain A70D0DC186724FF388CDE65EC540C444
>> Mar 13 13:30:39 bws01 kernel: ocfs2_dlm: Nodes in domain ("A70D0DC186724FF388CDE65EC540C444"): 0 1 2 3 5 6 7 9 10
>> Mar 13 13:30:45 bws01 kernel: ocfs2_dlm: Node 8 leaves domain B31B07823153433C948F63199CE4A31C
>> Mar 13 13:30:45 bws01 kernel: ocfs2_dlm: Nodes in domain ("B31B07823153433C948F63199CE4A31C"): 0 1 2 3 5 6 7 9 10
>> Mar 13 13:31:11 bws01 xinetd[4934]: START: nrpe pid=1065 from=10.10.8.20
>> Mar 13 13:31:11 bws01 xinetd[4934]: EXIT: nrpe status=0 pid=1065 duration=0(sec)
>> Mar 13 13:32:27 bws01 kernel: o2net: connection to node bapp05 (num 8) at 10.10.16.15:7777 has been idle for 30.0 seconds, shutting it down.
>> Mar 13 13:32:27 bws01 kernel: (0,0):o2net_idle_timer:1476 here are some times that might help debug the situation: (tmr 1236965517.208305 now 1236965547.207461 dr 1236965517.208295 adv 1236965517.208311:1236965517.208312 func (ee9d109e:513) 1236965445.298207:1236965445.298219)
>> Mar 13 13:32:27 bws01 kernel: o2net: no longer connected to node bapp05 (num 8) at 10.10.16.15:7777
>> Mar 13 13:32:55 bws01 xinetd[4934]: START: nrpe pid=1068 from=10.10.8.20
>> Mar 13 13:32:55 bws01 xinetd[4934]: EXIT: nrpe status=0 pid=1068 duration=0(sec)
>> Mar 13 13:32:57 bws01 kernel: (4586,0):o2net_connect_expired:1637 ERROR: no connection established with node 8 after 30.0 seconds, giving up and returning errors.
>> Mar 13 13:33:00 bws01 kernel: (4586,0):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 8
>>
>> Why would this cause all nodes in the cluster to reboot? Seems to me that
>> it should have kicked out node 8 only...
>>
>> thanks
>> Andrew
>>
>> _______________________________________________
>> Ocfs2-users mailing list
>> Ocfs2-users@oss.oracle.com
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users