Hi William,

Comments inline.

Surya
On 06/20/15 02:36, William R Elliott wrote:
> Hi,
> We are using opensaf 4.4.0. We have a cluster that has 2 controllers and 3
> payloads. We had a situation where the osafamfnd was killed on the active
> controller, the standby controller issued the following warnings in the
> messages log in that second:
> Jun 19 08:29:06 vervet osafimmd[25698]: WA IMMD lost contact with peer IMMD
> (NCSMDS_RED_DOWN)
> Jun 19 08:29:06 vervet osafimmnd[25713]: WA DISCARD DUPLICATE FEVS
> message:10756
> Jun 19 08:29:06 vervet osafimmnd[25713]: WA Error code 2 returned for message
> type 57 - ignoring
> Jun 19 08:29:06 vervet osafimmd[25698]: WA IMMND DOWN on active controller f2
> detected at standby immd!! f1. Possible failover
> Jun 19 08:29:06 vervet osafimmnd[25713]: WA DISCARD DUPLICATE FEVS
> message:10757
> Jun 19 08:29:06 vervet osafimmnd[25713]: WA Error code 2 returned for message
> type 57 - ignoring
> Jun 19 08:29:06 vervet osafimmd[25698]: NO Skipping re-send of fevs message
> 10756 since it has recently been resent.
> Jun 19 08:29:06 vervet osafimmd[25698]: NO Skipping re-send of fevs message
> 10757 since it has recently been resent.
> Jun 19 08:29:06 vervet osafimmnd[25713]: NO Global discard node received for
> nodeId:2020f pid:9609
> Jun 19 08:29:06 vervet osafimmnd[25713]: NO Implementer disconnected 9 <0,
> 2020f(down)> (safLckService)
> Jun 19 08:29:06 vervet osafimmnd[25713]: NO Implementer disconnected 8 <0,
> 2020f(down)> (safEvtService)
> Jun 19 08:29:06 vervet osafimmnd[25713]: NO Implementer disconnected 6 <0,
> 2020f(down)> (safMsgGrpService)
>
> After those messages were issued, the standby controller and 3 payloads
> continued to stay up for 3 minutes and then the following errors were in the
> standby controller messages log:
> Jun 19 08:32:06 vervet osafamfnd[25808]: ER AMF director unexpectedly crashed
> Jun 19 08:32:06 vervet osafimmnd[25713]: NO No IMMD service => cluster
> restart, exiting
> Jun 19 08:32:06 vervet osafamfnd[25808]: Rebooting OpenSAF NodeId = 131343 EE
> Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received,
> OwnNodeId = 131343, SupervisionTime = 0
> Jun 19 08:32:06 vervet osafamfd[25790]: NO Re-initializing with IMM
> Jun 19 08:32:06 vervet opensaf_bounce: Bouncing local node; timeout=
> Jun 19 08:32:06 vervet opensafd: Stopping OpenSAF Services
> Jun 19 08:33:02 vervet osafamfwd[25932]: TIMEOUT receiving AMF health check
> request, generating core for amfnd
> Jun 19 08:33:02 vervet osafamfwd[25932]: Last received healthcheck cnt=105 at
> Fri Jun 19 08:32:02 2015
> Jun 19 08:33:02 vervet osafamfwd[25932]: Rebooting OpenSAF NodeId = 0 EE Name
> = No EE Mapped, Reason: AMFND unresponsive, AMFWDOG initiated system reboot,
> OwnNodeId = 131343, SupervisionTime = 0
>
> During this time, the osafdtmd and osafamfwd stayed up on the original active
> controller.

[Surya] By design, this node should have rebooted. Could you please check the messages log on that machine?

>
> At this point the standby controller and all payloads rebooted. So my
> questions are:
>
> 1) Why was there a delay of 3 minutes before anything happened?

[Surya] There is a timer (NO-ACTIVE) that runs while there is no ACTIVE amfd in the cluster. The amfd on the other node should become active before this timer expires. If it expires first, all the nodes in the cluster reboot, and that is what happened here.

>
> 2) Why didn't the standby immediately take the active role, and keep
> itself and the payloads up? I.e. why did the standby and payloads reboot?

[Surya] The original node, where amfnd was killed, did not reboot, and that is why this node (the standby) did not become active. See 1 for why the nodes rebooted.

>
> 3) A more basic question is, exactly how does the standby node know that
> the active is dead and to take the active role?

[Surya] The dtmd present on each node keeps track of the other nodes in the cluster, and this is how it knows whether the active is dead or not.

>
> I would greatly appreciate any help with these questions.
>
> thanks
>
> ________________________________
> The information transmitted herein is intended only for the person or entity
> to which it is addressed and may contain confidential, proprietary and/or
> privileged material. Any review, retransmission, dissemination or other use
> of, or taking of any action in reliance upon, this information by persons or
> entities other than the intended recipient is prohibited. If you received
> this in error, please contact the sender and delete the material from any
> computer.
> ------------------------------------------------------------------------------
> _______________________________________________
> Opensaf-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/opensaf-users

--
Regards
Suryanarayana Garlapati | Senior Solutions Architect
GlobalLogic
P +91.120.406.2257 M +91.874.400.0585 Skype surya_g10
www.globallogic.com
http://www.globallogic.com/email_disclaimer.txt
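
To illustrate the answer to question 1, here is a minimal sketch in Python of how a NO-ACTIVE style supervision timer behaves. It is not OpenSAF code: the class name, method names and the 180 second value are assumptions made for this example. The 180 seconds is simply the gap between the two timestamps in the quoted log (08:29:06 and 08:32:06); the value configured on a given cluster may differ.

```python
from datetime import datetime, timedelta

# Assumption: 180 s is inferred from the quoted log (08:29:06 -> 08:32:06).
# It is not taken from OpenSAF configuration or source code.
NO_ACTIVE_TIMEOUT = timedelta(seconds=180)


class NoActiveTimer:
    """Hypothetical model of a 'no active director' supervision timer."""

    def __init__(self, timeout=NO_ACTIVE_TIMEOUT):
        self.timeout = timeout
        self.started_at = None  # None means an active director is present

    def active_lost(self, now):
        # Start the countdown once, when the active director disappears.
        if self.started_at is None:
            self.started_at = now

    def active_elected(self):
        # A new active director cancels the countdown.
        self.started_at = None

    def should_reboot(self, now):
        # True once there has been no active director for the full timeout.
        return (
            self.started_at is not None
            and now - self.started_at >= self.timeout
        )


def parse_log_time(stamp, year=2015):
    """Parse a syslog-style timestamp such as 'Jun 19 08:29:06'."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


lost = parse_log_time("Jun 19 08:29:06")      # peer IMMD / active lost
rebooted = parse_log_time("Jun 19 08:32:06")  # nodes start rebooting

timer = NoActiveTimer()
timer.active_lost(lost)
print(timer.should_reboot(lost + timedelta(seconds=179)))  # False
print(timer.should_reboot(rebooted))                       # True
print((rebooted - lost).total_seconds())                   # 180.0
```

In this model, calling active_elected() before the timeout elapses cancels the countdown. That corresponds to the standby taking the active role in time, which did not happen in the reported scenario.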
