The following patches fix the problem in the reported scenario.
As a follow-up, the link below refers to an additional patch review request,
pending before the TLC, that covers a scenario involving a slow OS reboot
cycle. I will close the ticket once that review is closed:
https://sourceforge.net/p/opensaf/mailman/opensaf-devel/thread/patchbomb.1391021721%40dhcp-hyd-scp-5fl-10-176-178-129.in.oracle.com/#msg31901496
The ticket is now marked as minor.

branch:      opensaf-4.4.x
parent:      4887:f6f2cbd1cc13
user:        Mathivanan N.P.<[email protected]>
date:        Tue Feb 04 14:22:22 2014 +0530
summary:     fm: failover using OPENSAF_MANAGE_TIPC flag and subscribe to AMF, IMM downs [#721]

changeset:   4892:3862059258db
branch:      opensaf-4.4.x
user:        Mathivanan N.P.<[email protected]>
date:        Tue Feb 04 14:24:57 2014 +0530
summary:     fm: install ava_install_amf_down_cb and wait till OS reboot terminates FM [#721]

changeset:   4893:9f799d27d5bb
parent:      4888:b5fadff61e11
user:        Mathivanan N.P.<[email protected]>
date:        Tue Feb 04 14:25:38 2014 +0530
summary:     fm: failover using OPENSAF_MANAGE_TIPC flag and subscribe to AMF, IMM downs [#721]

changeset:   4894:c8001d5ecf44
tag:         tip
user:        Mathivanan N.P.<[email protected]>
date:        Tue Feb 04 14:26:16 2014 +0530
summary:     fm: install ava_install_amf_down_cb and wait till OS reboot terminates FM [#721]



[staging:65ea41]
[staging:386205]
[staging:9f799d]
[staging:c8001d]


---

**[tickets:#721] Support the case when amfnd is killed when OpenSAF doesn't control TIPC**

**Status:** review
**Created:** Thu Jan 16, 2014 07:32 AM UTC by Sirisha Alla
**Last Updated:** Wed Jan 29, 2014 07:18 PM UTC
**Owner:** Mathi Naickan

The issue is seen on changeset 4733 plus the CLM patches corresponding to the
changesets of #220. Continuous failovers are performed while API invocations
from an IMM application are ongoing. The IMMD asserted on the new active
controller, leading to a cluster reset.

SC-1 is active, and amfnd is killed to trigger a failover:

Jan 15 18:23:03 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Ccb 35 COMMITTED (exowner)
Jan 15 18:23:07 SLES-64BIT-SLOT1 osafimmnd[2411]: NO implementer for class 'testMA_verifyObjApplNoResponseModCallback_101' is released => class extent is UNSAFE
Jan 15 18:23:57 SLES-64BIT-SLOT1 sshd[3010]: Accepted keyboard-interactive/pam for root from 192.168.56.103 port 60396 ssh2
Jan 15 18:23:59 SLES-64BIT-SLOT1 root: killing osafamfnd from invoke_failover.sh
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafclmd[2455]: AL AMF Node Director is down, terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafntfd[2441]: AL AMF Node Director is down, terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafevtd[2609]: AL AMF Node Director is down, terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafckptd[2600]: AL AMF Node Director is down, terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osaflogd[2421]: AL AMF Node Director is down, terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafrded[2382]: AL AMF Node Director is down, terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafclmna[2465]: AL AMF Node Director is down, terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafimmd[2401]: AL AMF Node Director is down, terminate this process

SC-2 tried to become active, but the IMMD asserted, leading to a cluster reset:

Jan 15 18:24:01 SLES-64BIT-SLOT2 osaffmd[2625]: NO Peer FM down on node_id: 131343
Jan 15 18:24:01 SLES-64BIT-SLOT2 osaffmd[2625]: NO Role: STANDBY, Node Down for node id: 2010f
Jan 15 18:24:01 SLES-64BIT-SLOT2 osaffmd[2625]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131599, SupervisionTime = 60
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmd[2635]: WA IMMD lost contact with peer IMMD (NCSMDS_RED_DOWN)
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: WA DISCARD DUPLICATE FEVS message:92993
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: WA Error code 2 returned for message type 57 - ignoring
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: WA DISCARD DUPLICATE FEVS message:92994
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: WA Error code 2 returned for message type 57 - ignoring
Jan 15 18:24:01 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the absence of PLM is outside the scope of OpenSAF
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafrded[2616]: NO rde_rde_set_role: role set to 1
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmd[2635]: NO ACTIVE request
Jan 15 18:24:01 SLES-64BIT-SLOT2 osaflogd[2654]: NO ACTIVE request
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafntfd[2667]: NO ACTIVE request
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafclmd[2681]: NO ACTIVE request
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafamfd[2700]: NO FAILOVER StandBy --> Active
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmd[2635]: NO ellect_coord invoke from lga_callback ACTIVE
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmd[2635]: ER Changing IMMND coord while old coord is still up!
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmd[2635]: immd_proc.c:297: immd_proc_elect_coord: Assertion 'immnd_info_node->immnd_key == cb->node_id' failed.
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: WA Director Service in NOACTIVE state - fevs replies pending:2 fevs highest processed:92994
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafamfnd[2714]: NO 'safComp=IMMD,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafamfnd[2714]: ER safComp=IMMD,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafamfnd[2714]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60
Jan 15 18:24:01 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node; timeout=60
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafclmd[2681]: ER clms_mds_msg_send FAILED: 2
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafclmd[2681]: ER clms_clma_api_msg_dispatcher FAILED: type 0
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: NO No IMMD service => cluster restart
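The SC-2 FM log reports the failed peer once in decimal ("node_id: 131343") and once in hex ("node id: 2010f"). The Python sketch below is a hypothetical log-reading helper, not part of OpenSAF. It checks that both values name the same node (SC-1). It also splits the IDs assuming a (chassis << 16) | (slot << 8) | subslot layout, which is inferred here from the SLOT1/SLOT2 values in these logs rather than taken from OpenSAF source.

```python
from typing import Tuple

# Hypothetical helper for reading the FM log lines above.
# ASSUMPTION: node_id = (chassis << 16) | (slot << 8) | subslot, a layout
# inferred from the SLOT1/SLOT2 values in this ticket's logs.


def decode_node_id(node_id: int) -> Tuple[int, int, int]:
    """Split a node_id into (chassis, slot, subslot) under the assumed layout."""
    chassis = (node_id >> 16) & 0xFF
    slot = (node_id >> 8) & 0xFF
    subslot = node_id & 0xFF
    return chassis, slot, subslot


def same_node(decimal_id: str, hex_id: str) -> bool:
    """True if a decimal id and a hex id, as printed in the logs, are equal."""
    return int(decimal_id, 10) == int(hex_id, 16)


if __name__ == "__main__":
    # "Peer FM down on node_id: 131343" vs "Node Down for node id: 2010f"
    print(same_node("131343", "2010f"))
    print(decode_node_id(131343))  # failed peer (SC-1)
    print(decode_node_id(131599))  # OwnNodeId (SC-2)
```

Under the assumed layout, 131343 decodes to slot 1 and 131599 to slot 2, which matches the SLES-64BIT-SLOT1 and SLES-64BIT-SLOT2 hostnames.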


Attached are the logs with IMMD traces.


---

Sent from sourceforge.net because [email protected] is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.
_______________________________________________
Opensaf-tickets mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets

Reply via email to