[tickets] [opensaf:tickets] #2158 AMF: IMMND dies at Opensaf start up phase causes AMFD heartbeat timeout

Minh Hon Chau Mon, 14 Nov 2016 20:12:07 -0800

I have tried 2 runs, and seems to see the reported problem, although the reason 
of node rebooting are different. 
Changeset: 8313:fa6b413693e3
Steps:
- Start cluster: SC1, SC2, PL3, PL4
- Stop SCs
- Restart SCs
- Reboot SC1 while SC2 has not been promoted to standby
- Meanwhile, kill immnd in SC2
Observations: Services can't initialize with IMM since IMMND is not restarted, 
AMFND has not initiated IMMND component


Run #1:
2016-11-15 14:31:57 SC-2 osafimmnd[431]: exiting for shutdown
2016-11-15 14:31:57 SC-2 osafimmd[421]: NO New IMMND process is on ACTIVE 
Controller at 2020f
2016-11-15 14:31:57 SC-2 osafimmd[421]: NO MDS event from svc_id 25 (change:4, 
dest:565213401186735)
2016-11-15 14:31:57 SC-2 osafntfimcnd[487]: NO saImmOiDispatch() Fail 
SA_AIS_ERR_BAD_HANDLE (9)
2016-11-15 14:31:57 SC-2 osafimmd[421]: NO Successfully announced dump at node 
2030f. New Epoch:116
2016-11-15 14:31:57 SC-2 osafimmd[421]: NO ACT: New Epoch for IMMND process at 
node 2040f old epoch: 115  new epoch:116
2016-11-15 14:32:07 SC-2 osafamfd[467]: WA saClmInitialize_4 returned 5
2016-11-15 14:32:07 SC-2 osafamfnd[476]: WA saClmInitialize_4 returned 5
2016-11-15 14:32:07 SC-2 osaflogd[440]: ER saClmInitialize failed with error: 5
2016-11-15 14:32:07 SC-2 osafntfd[449]: ER saClmInitialize failed with error: 5
2016-11-15 14:32:07 SC-2 osafclmd[458]: safNode=PL-4,safCluster=myClmCluster 
JOINED, init view=2, cluster view=2
2016-11-15 14:32:12 SC-2 osafclmd[458]: ER saNtfNotificationSend() returned: 
SA_AIS_ERR_TRY_AGAIN (6)
2016-11-15 14:31:54 SC-2 osafclmna[396]: NO Starting to promote this node to a 
system controller
2016-11-15 14:32:12 SC-2 osafclmna[396]: NO 
safNode=SC-2,safCluster=myClmCluster Joined cluster, nodeid=2020f
2016-11-15 14:32:12 SC-2 osafclmd[458]: safNode=SC-2,safCluster=myClmCluster 
JOINED, init view=3, cluster view=3
2016-11-15 14:32:17 SC-2 osafclmd[458]: ER saNtfNotificationSend() returned: 
SA_AIS_ERR_TRY_AGAIN (6)
2016-11-15 14:32:19 SC-2 osafclmd[458]: NO proc_initialize_msg: send failed. 
dest:2020f000001c1
2016-11-15 14:32:20 SC-2 osafclmd[458]: NO proc_initialize_msg: send failed. 
dest:2020f000001b8
2016-11-15 14:32:17 SC-2 osafamfnd[476]: WA saClmInitialize_4 returned 5
2016-11-15 14:32:50 SC-2 osafamfnd[476]: CR saImmOmInitialize failed. Use 
previous value of nodeName.
2016-11-15 14:32:17 SC-2 osafamfd[467]: WA saClmInitialize_4 returned 5
2016-11-15 14:35:06 SC-2 osafamfd[467]: ER saImmOmSearchInitialize_2 failed: 6
2016-11-15 14:35:06 SC-2 osafamfd[467]: ER Failed to read configuration, AMF 
will not start
2016-11-15 14:35:06 SC-2 osafamfd[467]: ER avd_imm_config_get FAILED
2016-11-15 14:35:06 SC-2 osafamfd[467]: ER avd_active_role_initialization FAILED
2016-11-15 14:35:06 SC-2 osafamfd[467]: ER initialize_for_assignment FAILED 2
2016-11-15 14:35:06 SC-2 osafamfnd[476]: WA AMF director unexpectedly crashed
2016-11-15 14:35:06 SC-2 osafamfnd[476]: WA AMF director unexpectedly crashed
2016-11-15 14:35:06 SC-2 osafamfnd[476]: Rebooting OpenSAF NodeId = 131599 EE 
Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, 
OwnNodeId = 131599, SupervisionTime = 60
2016-11-15 14:35:06 SC-2 opensaf_reboot: Rebooting local node; timeout=60

Run #2:
2016-11-15 14:45:00 SC-2 osafimmnd[430]: exiting for shutdown
2016-11-15 14:45:00 SC-2 osafimmd[420]: NO MDS event from svc_id 25 (change:4, 
dest:565213401186734)
2016-11-15 14:45:00 SC-2 osafntfimcnd[486]: NO saImmOiDispatch() Fail 
SA_AIS_ERR_BAD_HANDLE (9)
2016-11-15 14:45:00 SC-2 osafamfnd[475]: NO saClmDispatch BAD_HANDLE
2016-11-15 14:45:01 SC-2 osafimmd[420]: NO Sc Absence Allowed is configured 
(900) => IMMND coord at payload node:2030f dest566312912814476
2016-11-15 14:45:01 SC-2 osafimmd[420]: NO Node 2010f request sync sync-pid:430 
epoch:0 
2016-11-15 14:45:01 SC-2 osafimmd[420]: NO Successfully announced sync. New 
ruling epoch:135
2016-11-15 14:45:01 SC-2 osafimmd[420]: NO ACT: New Epoch for IMMND process at 
node 2040f old epoch: 134  new epoch:135
2016-11-15 14:45:01 SC-2 osafimmd[420]: NO ACT: New Epoch for IMMND process at 
node 2010f old epoch: 0  new epoch:135
2016-11-15 14:45:01 SC-2 osafimmd[420]: NO ACT: New Epoch for IMMND process at 
node 2030f old epoch: 134  new epoch:135
2016-11-15 14:45:01 SC-2 osafimmd[420]: NO ACT: SaImmRepositoryInitModeT 
changed and noted as being: SA_IMM_KEEP_REPOSITORY
2016-11-15 14:45:02 SC-2 osafimmd[420]: NO Successfully announced dump at node 
2030f. New Epoch:136
2016-11-15 14:45:02 SC-2 osafimmd[420]: NO ACT: New Epoch for IMMND process at 
node 2040f old epoch: 135  new epoch:136
2016-11-15 14:45:02 SC-2 osafimmd[420]: NO ACT: New Epoch for IMMND process at 
node 2010f old epoch: 135  new epoch:136
TEST: SC failover
2016-11-15 14:45:04 SC-2 osafdtmd[382]: NO Lost contact with 'SC-1'
2016-11-15 14:45:04 SC-2 osafimmd[420]: NO MDS event from svc_id 25 (change:4, 
dest:564113889558958)
2016-11-15 14:45:04 SC-2 osafclmd[457]: NO Node 131343 went down. Not sending 
track callback for agents on that node
2016-11-15 14:45:31 SC-2 osafamfnd[475]: CR saImmOmInitialize failed. Use 
previous value of nodeName.
2016-11-15 14:47:40 SC-2 osafamfnd[475]: ER AMF director heart beat timeout, 
generating core for amfd
2016-11-15 14:47:41 SC-2 osafamfnd[475]: Rebooting OpenSAF NodeId = 131599 EE 
Name = , Reason: AMF director heart beat timeout, OwnNodeId = 131599, 
SupervisionTime = 60
2016-11-15 14:47:41 SC-2 opensaf_reboot: Rebooting local node; timeout=60


---

** [tickets:#2158] AMF: IMMND dies at Opensaf start up phase causes AMFD 
heartbeat timeout**

**Status:** unassigned
**Milestone:** 5.0.2
**Created:** Wed Nov 02, 2016 05:20 AM UTC by Minh Hon Chau
**Last Updated:** Fri Nov 11, 2016 11:11 AM UTC
**Owner:** nobody
**Attachments:**

- 
[osafamfnd_sc2](https://sourceforge.net/p/opensaf/tickets/2158/attachment/osafamfnd_sc2)
 (264.2 kB; application/octet-stream)


If IMMND dies at Opensaf startup phase, IMMND is not restarted by AMF. The 
issue has been observed in following situation
- Restart cluster
- During active controller starts up, a critical component is death which cause 
a node failfast
Oct 25 12:51:21 SC-1 osafamfnd[7642]: ER 
safComp=ABC,safSu=1,safSg=2N,safApp=ABC Faulted due to:csiSetcallbackTimeout 
Recovery is:nodeFailfast
Oct 25 12:51:21 SC-1 osafamfnd[7642]: Rebooting OpenSAF NodeId = 131343 EE Name 
= , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, 
SupervisionTime = 60
- In the meantime, standby controller is requested to become active
Oct 25 12:51:27 SC-2 tipclog[16221]: Lost link <1.1.2:eth0-1.1.1:eth0> on 
network plane A
Oct 25 12:51:27 SC-2 osafclmna[4336]: NO Starting to promote this node to a 
system controller
Oct 25 12:51:27 SC-2 osafrded[4387]: NO Requesting ACTIVE role
- IMMND is also death a bit later
Oct 25 12:51:29 SC-2 osafimmnd[4536]: ER MESSAGE:44816 OUT OF ORDER my highest 
processed:44814 - exiting
Oct 25 12:51:29 SC-2 osafamfnd[7414]: NO saClmDispatch BAD_HANDLE
- Other services could not initialize other services since IMMND is death
Oct 25 12:51:39 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:51:39 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:51:39 SC-2 osafntfimcnd[7501]: WA ntfimcn_ntf_init saNtfInitialize( 
returned SA_AIS_ERR_TIMEOUT (5)
Oct 25 12:51:39 SC-2 osafclmd[7386]: WA saImmOiImplementerSet returned 9
Oct 25 12:51:39 SC-2 osafntfd[7372]: WA saLogInitialize returns try again, 
retries...
Oct 25 12:51:39 SC-2 osaflogd[7358]: WA saImmOiImplementerSet returned 
SA_AIS_ERR_BAD_HANDLE (9)
Oct 25 12:51:39 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5

Oct 25 12:51:49 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:51:50 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:51:50 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5

Oct 25 12:52:00 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:52:00 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:52:00 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5

Oct 25 12:52:20 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5
Oct 25 12:52:20 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:52:20 SC-2 osafimmd[4489]: NO Extended intro from node 2210f

- At the end, AMFD heart beat timeout 
Oct 25 12:53:57 SC-2 osafntfimcnd[7501]: WA ntfimcn_ntf_init saNtfInitialize( 
returned SA_AIS_ERR_TIMEOUT (5)
Oct 25 12:54:01 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5
Oct 25 12:54:01 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:54:01 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:54:07 SC-2 osafntfimcnd[7501]: WA ntfimcn_ntf_init saNtfInitialize( 
returned SA_AIS_ERR_TIMEOUT (5)
Oct 25 12:54:11 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5
Oct 25 12:54:11 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:54:11 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:54:15 SC-2 osafamfnd[7414]: ER AMF director heart beat timeout, 
generating core for amfd

In AMFND trace in SC2, AMFND did not receive su_pres from AMFD, therefore AMFND 
could not initiate middleware components (including IMMND), so AMFND was not 
aware of IMMND's death so that AMFND can restart IMMND. The problem here is 
slightly different from #1828, which happened in newly promoted SC (with 
roamingSC feature) where AMFND had IMMND registered.



---

Sent from sourceforge.net because [email protected] is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.

------------------------------------------------------------------------------

_______________________________________________
Opensaf-tickets mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets

[tickets] [opensaf:tickets] #2158 AMF: IMMND dies at Opensaf start up phase causes AMFD heartbeat timeout

Reply via email to