Hi,
We are using OpenSAF 4.4.0 in a cluster with 2 controllers and 3 payloads. We had a situation where osafamfnd was killed on the active controller; within the same second, the standby controller issued the following warnings in its messages log:
Jun 19 08:29:06 vervet osafimmd[25698]: WA IMMD lost contact with peer IMMD (NCSMDS_RED_DOWN)
Jun 19 08:29:06 vervet osafimmnd[25713]: WA DISCARD DUPLICATE FEVS message:10756
Jun 19 08:29:06 vervet osafimmnd[25713]: WA Error code 2 returned for message type 57 - ignoring
Jun 19 08:29:06 vervet osafimmd[25698]: WA IMMND DOWN on active controller f2 detected at standby immd!! f1. Possible failover
Jun 19 08:29:06 vervet osafimmnd[25713]: WA DISCARD DUPLICATE FEVS message:10757
Jun 19 08:29:06 vervet osafimmnd[25713]: WA Error code 2 returned for message type 57 - ignoring
Jun 19 08:29:06 vervet osafimmd[25698]: NO Skipping re-send of fevs message 10756 since it has recently been resent.
Jun 19 08:29:06 vervet osafimmd[25698]: NO Skipping re-send of fevs message 10757 since it has recently been resent.
Jun 19 08:29:06 vervet osafimmnd[25713]: NO Global discard node received for nodeId:2020f pid:9609
Jun 19 08:29:06 vervet osafimmnd[25713]: NO Implementer disconnected 9 <0, 2020f(down)> (safLckService)
Jun 19 08:29:06 vervet osafimmnd[25713]: NO Implementer disconnected 8 <0, 2020f(down)> (safEvtService)
Jun 19 08:29:06 vervet osafimmnd[25713]: NO Implementer disconnected 6 <0, 2020f(down)> (safMsgGrpService)

After those messages were issued, the standby controller and the 3 payloads stayed up for another 3 minutes, and then the following errors appeared in the standby controller's messages log:
Jun 19 08:32:06 vervet osafamfnd[25808]: ER AMF director unexpectedly crashed
Jun 19 08:32:06 vervet osafimmnd[25713]: NO No IMMD service => cluster restart, exiting
Jun 19 08:32:06 vervet osafamfnd[25808]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131343, SupervisionTime = 0
Jun 19 08:32:06 vervet osafamfd[25790]: NO Re-initializing with IMM
Jun 19 08:32:06 vervet opensaf_bounce: Bouncing local node; timeout=
Jun 19 08:32:06 vervet opensafd: Stopping OpenSAF Services
Jun 19 08:33:02 vervet osafamfwd[25932]: TIMEOUT receiving AMF health check request, generating core for amfnd
Jun 19 08:33:02 vervet osafamfwd[25932]: Last received healthcheck cnt=105 at Fri Jun 19 08:32:02 2015
Jun 19 08:33:02 vervet osafamfwd[25932]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: AMFND unresponsive, AMFWDOG initiated system reboot, OwnNodeId = 131343, SupervisionTime = 0

During this time, osafdtmd and osafamfwd stayed up on the original active controller.

At this point, the standby controller and all payloads rebooted. So my questions are:

1) Why was there a delay of 3 minutes before anything happened?

2) Why didn't the standby immediately take the active role and keep itself and the payloads up? I.e., why did the standby and payloads reboot?

3) A more basic question: exactly how does the standby node know that the active is dead and that it should take the active role?

I would greatly appreciate any help with these questions.

Thanks