Hi Minh,

Is this issue observed in a normal cluster (SC absence and spare SC features 
disabled)?
In a normal cluster, in a failover situation FM makes the standby SC active. Since 
the other controller is still not standby, FM should reboot it as it did in #1334.

Also, even if AMFND on the standby SC receives the su_pres message from AMFD, it 
will not be able to read the comp configuration because IMMND is down.

Thanks,
Praveen


---

** [tickets:#2158] AMF: IMMND dies at Opensaf start up phase causes AMFD 
heartbeat timeout**

**Status:** unassigned
**Milestone:** 5.0.2
**Created:** Wed Nov 02, 2016 05:20 AM UTC by Minh Hon Chau
**Last Updated:** Wed Nov 02, 2016 05:20 AM UTC
**Owner:** nobody
**Attachments:**

- [osafamfnd_sc2](https://sourceforge.net/p/opensaf/tickets/2158/attachment/osafamfnd_sc2) (264.2 kB; application/octet-stream)


If IMMND dies during the OpenSAF startup phase, it is not restarted by AMF. The 
issue has been observed in the following situation:
- Restart the cluster
- While the active controller starts up, a critical component dies, which causes 
a node failfast
Oct 25 12:51:21 SC-1 osafamfnd[7642]: ER 
safComp=ABC,safSu=1,safSg=2N,safApp=ABC Faulted due to:csiSetcallbackTimeout 
Recovery is:nodeFailfast
Oct 25 12:51:21 SC-1 osafamfnd[7642]: Rebooting OpenSAF NodeId = 131343 EE Name 
= , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, 
SupervisionTime = 60
- In the meantime, the standby controller is requested to become active
Oct 25 12:51:27 SC-2 tipclog[16221]: Lost link <1.1.2:eth0-1.1.1:eth0> on 
network plane A
Oct 25 12:51:27 SC-2 osafclmna[4336]: NO Starting to promote this node to a 
system controller
Oct 25 12:51:27 SC-2 osafrded[4387]: NO Requesting ACTIVE role
- IMMND also dies a bit later
Oct 25 12:51:29 SC-2 osafimmnd[4536]: ER MESSAGE:44816 OUT OF ORDER my highest 
processed:44814 - exiting
Oct 25 12:51:29 SC-2 osafamfnd[7414]: NO saClmDispatch BAD_HANDLE
- Other services could not initialize the services they depend on because IMMND is dead
Oct 25 12:51:39 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:51:39 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:51:39 SC-2 osafntfimcnd[7501]: WA ntfimcn_ntf_init saNtfInitialize( 
returned SA_AIS_ERR_TIMEOUT (5)
Oct 25 12:51:39 SC-2 osafclmd[7386]: WA saImmOiImplementerSet returned 9
Oct 25 12:51:39 SC-2 osafntfd[7372]: WA saLogInitialize returns try again, 
retries...
Oct 25 12:51:39 SC-2 osaflogd[7358]: WA saImmOiImplementerSet returned 
SA_AIS_ERR_BAD_HANDLE (9)
Oct 25 12:51:39 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5

Oct 25 12:51:49 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:51:50 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:51:50 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5

Oct 25 12:52:00 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:52:00 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:52:00 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5

Oct 25 12:52:20 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5
Oct 25 12:52:20 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:52:20 SC-2 osafimmd[4489]: NO Extended intro from node 2210f

- In the end, the AMFD heartbeat times out
Oct 25 12:53:57 SC-2 osafntfimcnd[7501]: WA ntfimcn_ntf_init saNtfInitialize( 
returned SA_AIS_ERR_TIMEOUT (5)
Oct 25 12:54:01 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5
Oct 25 12:54:01 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:54:01 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:54:07 SC-2 osafntfimcnd[7501]: WA ntfimcn_ntf_init saNtfInitialize( 
returned SA_AIS_ERR_TIMEOUT (5)
Oct 25 12:54:11 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5
Oct 25 12:54:11 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:54:11 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:54:15 SC-2 osafamfnd[7414]: ER AMF director heart beat timeout, 
generating core for amfd
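
When reading the warnings above, it can help to turn the bare numeric return codes back into their names. The following is only a minimal sketch: the table contains just the two codes that the log itself spells out (5 as SA_AIS_ERR_TIMEOUT and 9 as SA_AIS_ERR_BAD_HANDLE), and the helper name `decode` and the regular expression are illustrative assumptions, not part of OpenSAF.

```python
import re

# Only the two codes that the log lines above name explicitly;
# the rest of the SaAisErrorT enum is deliberately left out.
SA_AIS_ERRORS = {
    5: "SA_AIS_ERR_TIMEOUT",
    9: "SA_AIS_ERR_BAD_HANDLE",
}

# Matches lines such as:
#   "... osafamfd[7400]: WA saClmInitialize_4 returned 5"
RETURNED = re.compile(
    r"(?P<proc>osaf\w+)\[\d+\]: WA (?P<call>sa\w+) returned (?P<code>\d+)"
)


def decode(line):
    """Return (process, API call, error name) or None if the line does not match."""
    m = RETURNED.search(line)
    if m is None:
        return None
    code = int(m.group("code"))
    name = SA_AIS_ERRORS.get(code, f"unknown ({code})")
    return m.group("proc"), m.group("call"), name


sample = [
    "Oct 25 12:51:39 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5",
    "Oct 25 12:51:39 SC-2 osafclmd[7386]: WA saImmOiImplementerSet returned 9",
]

for line in sample:
    print(decode(line))
```

Lines that were wrapped by the mail client, or that already print the symbolic name, do not match the pattern and are returned as `None`.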

In the AMFND trace on SC-2, AMFND did not receive su_pres from AMFD, so AMFND 
could not initiate the middleware components (including IMMND). As a result, 
AMFND was not aware of IMMND's death and could not restart it. The problem here 
is slightly different from #1828, which happened on a newly promoted SC (with 
the roamingSC feature) where AMFND had IMMND registered.
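
The dependency described above can be illustrated with a toy model. This is a hedged sketch only: `ToyNodeDirector` and its methods are hypothetical names invented for illustration and do not correspond to actual AMFND code. It just shows that a supervisor can only react to the exit of a component it has registered, and that registration here hinges on su_pres arriving first.

```python
class ToyNodeDirector:
    """Hypothetical stand-in for AMFND; not OpenSAF code."""

    def __init__(self):
        self.registered = set()   # components this node director knows about
        self.restarted = []       # components it has restarted so far

    def on_su_pres(self, components):
        # In this model, receiving su_pres is what causes the components
        # to be started and registered for supervision.
        self.registered.update(components)

    def on_process_exit(self, name):
        # Only registered components can be noticed and restarted.
        if name in self.registered:
            self.restarted.append(name)
            return True
        return False


# Case from this ticket: su_pres never arrives, so IMMND's exit goes unnoticed.
no_su_pres = ToyNodeDirector()
print(no_su_pres.on_process_exit("IMMND"))

# Contrasting case: IMMND was registered before it exited.
with_su_pres = ToyNodeDirector()
with_su_pres.on_su_pres(["IMMND"])
print(with_su_pres.on_process_exit("IMMND"))
```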


