Hi Mathi,
In this test case it is the amfnd (==AVND I assume) that is killed.
So it goes down first.
That explains why this particular test is so provocative then.
The problem with removing this assert is that it is like opening a can of worms.
It means that I would start trying to support a use-case/failure-case that is a
hybrid between fail-over and switch-over.
If I remove the assert then I would have to return from the function without
having elected a new coord.
Instead we would continue to run with the new active, but with the IMMND coord
still at the old SC.
Then, when the IMMND down event finally arrives, the new coord would be elected.
The problem here is that I have a hard time grasping how many other assumptions
I am breaking if I do this.
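To make that concrete, here is a minimal sketch in plain C of the control flow I
mean: log and return instead of asserting, and let the later IMMND down event
trigger the election. The types are simplified stand-ins; the real structures and
the real immd_proc_elect_coord in immd_proc.c look different.

/* Sketch only: simplified stand-ins for the real IMMD types, illustrating
 * "return without electing" instead of asserting when the old coord is
 * still up at the failed SC. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t NODE_ID;

struct immnd_info {        /* stand-in for the IMMND info node */
    NODE_ID immnd_key;     /* node id of this IMMND */
    bool is_coord;         /* currently elected coord */
    bool is_up;            /* MDS still reports it as up */
};

struct immd_cb {           /* stand-in for the IMMD control block */
    NODE_ID node_id;       /* node id of this (newly active) IMMD */
};

/* Returns true if a new coord was elected, false if the election is deferred
 * until the IMMND down event for the old coord arrives. */
bool elect_coord_sketch(struct immd_cb *cb, struct immnd_info *old_coord)
{
    if (old_coord->is_coord && old_coord->is_up &&
        old_coord->immnd_key != cb->node_id) {
        /* Old behaviour: assert(immnd_info_node->immnd_key == cb->node_id);
         * Relaxed behaviour: keep the old coord for now. */
        fprintf(stderr, "Old coord still up at node %x, deferring election\n",
                (unsigned)old_coord->immnd_key);
        return false;
    }
    /* ... normal election among the IMMNDs known to be up ... */
    return true;
}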
I have a counter proposal: FM could subscribe to *both* amfnd down and immnd down
and require that *both* have occurred before proclaiming the new active.
That would avoid the immnd problem.
But in general this is a bit hackish, because the other services are left
vulnerable to the hybrid failover/switchover state.
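As a rough illustration of what I mean (not actual osaffmd code; the event names
and the bookkeeping struct are made up for the example), FM would track the DOWN
events per peer SC and only claim the ACTIVE role once both have been seen:

/* Sketch only: hypothetical bookkeeping in FM that gates the role change on
 * having seen both AMFND down and IMMND down for the failed peer SC. */
#include <stdbool.h>

enum peer_svc {
    PEER_AMFND_DOWN = 0x1,
    PEER_IMMND_DOWN = 0x2
};

struct peer_sc_state {
    unsigned seen_down;    /* bitmask of peer_svc flags seen so far */
};

/* Called from the (hypothetical) MDS service-down handler for the peer SC.
 * Returns true when FM may proclaim the new active. */
bool fm_peer_svc_down(struct peer_sc_state *peer, enum peer_svc svc)
{
    peer->seen_down |= svc;
    return (peer->seen_down & (PEER_AMFND_DOWN | PEER_IMMND_DOWN)) ==
           (PEER_AMFND_DOWN | PEER_IMMND_DOWN);
}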
/AndersBj
________________________________
From: Mathi Naickan [mailto:[email protected]]
Sent: den 17 januari 2014 11:26
To: [opensaf:tickets]
Subject: [opensaf:tickets] #721 IMMD asserted when trying to become active
during failover
Anders,
Some background information.
FM used to start failover processing upon receiving the NODE_DOWN event.
From 4.4 onwards, FM subscribes to AVND down events to start failover processing.
This is based on the fact that AVND is the last process to exit (barring AMFD),
i.e. after all the components (middleware and application) have exited.
This change is to support the use case of an opensafd stop without a node reboot
(or when non-OpenSAF applications are still using TIPC and rmmod has not been done).
For OpenSAF services this translates to faster failover detection and
processing than before.
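For illustration only (the handler and event names below are invented, not the
real osaffmd code), the difference is essentially which event is used as the
failover trigger, with the AVND down event arriving as soon as amfnd has exited
rather than only when the whole node goes down:

/* Sketch only: the two trigger variants described above, with invented names. */
#include <stdbool.h>

enum fm_evt {
    FM_EVT_NODE_DOWN,      /* whole peer node gone (pre-4.4 trigger)      */
    FM_EVT_AVND_DOWN       /* peer amfnd gone, last to exit (4.4 trigger) */
};

/* Returns true if the event should start failover processing. */
bool fm_should_start_failover(enum fm_evt evt, bool pre_4_4_behaviour)
{
    if (pre_4_4_behaviour)
        return evt == FM_EVT_NODE_DOWN;
    /* From 4.4: react to AVND down, which also covers "opensafd stop"
     * without a node reboot and gives earlier detection. */
    return evt == FM_EVT_AVND_DOWN;
}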
As for this ticket and #722 (same as this one), the issue has not been
reproducible again yet.
The question of why the service DOWN event is delayed (or missing) has to be
checked in parallel from the MDS perspective; I shall track that part.
So, if we take one issue at a time, could you check whether IMM can still behave
correctly with this assert removed? If that is possible (a patch for the assert
removal), it can very much go through the test cycle for RC1, to see if it
produces other side effects.
________________________________
[tickets:#721]<http://sourceforge.net/p/opensaf/tickets/721/> IMMD asserted
when trying to become active during failover
Status: unassigned
Created: Thu Jan 16, 2014 07:32 AM UTC by Sirisha Alla
Last Updated: Fri Jan 17, 2014 08:37 AM UTC
Owner: nobody
The issue is seen on changeset 4733 + patches of CLM corresponding to the
changesets of #220. Continuous failovers are happening while some API
invocations of an IMM application are ongoing. The IMMD asserted on the new
active, leading to a cluster reset.
SC-1 is active and amfnd is killed to trigger a failover:
Jan 15 18:23:03 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Ccb 35 COMMITTED (exowner)
Jan 15 18:23:07 SLES-64BIT-SLOT1 osafimmnd[2411]: NO implementer for class
'testMA_verifyObjApplNoResponseModCallback_101' is released => class extent is
UNSAFE
Jan 15 18:23:57 SLES-64BIT-SLOT1 sshd[3010]: Accepted keyboard-interactive/pam
for root from 192.168.56.103 port 60396 ssh2
Jan 15 18:23:59 SLES-64BIT-SLOT1 root: killing osafamfnd from invoke_failover.sh
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafclmd[2455]: AL AMF Node Director is down,
terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafntfd[2441]: AL AMF Node Director is down,
terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafevtd[2609]: AL AMF Node Director is down,
terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafckptd[2600]: AL AMF Node Director is down,
terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osaflogd[2421]: AL AMF Node Director is down,
terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafrded[2382]: AL AMF Node Director is down,
terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafclmna[2465]: AL AMF Node Director is down,
terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafimmd[2401]: AL AMF Node Director is down,
terminate this process
SC-2 tried to become active but IMMD asserted, leading to cluster reset:
Jan 15 18:24:01 SLES-64BIT-SLOT2 osaffmd[2625]: NO Peer FM down on node_id:
131343
Jan 15 18:24:01 SLES-64BIT-SLOT2 osaffmd[2625]: NO Role: STANDBY, Node Down for
node id: 2010f
Jan 15 18:24:01 SLES-64BIT-SLOT2 osaffmd[2625]: Rebooting OpenSAF NodeId =
131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId =
131599, SupervisionTime = 60
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmd[2635]: WA IMMD lost contact with peer
IMMD (NCSMDS_RED_DOWN)
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: WA DISCARD DUPLICATE FEVS
message:92993
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: WA Error code 2 returned for
message type 57 - ignoring
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: WA DISCARD DUPLICATE FEVS
message:92994
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: WA Error code 2 returned for
message type 57 - ignoring
Jan 15 18:24:01 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the
absence of PLM is outside the scope of OpenSAF
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafrded[2616]: NO rde_rde_set_role: role set
to 1
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmd[2635]: NO ACTIVE request
Jan 15 18:24:01 SLES-64BIT-SLOT2 osaflogd[2654]: NO ACTIVE request
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafntfd[2667]: NO ACTIVE request
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafclmd[2681]: NO ACTIVE request
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafamfd[2700]: NO FAILOVER StandBy --> Active
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmd[2635]: NO ellect_coord invoke from
lga_callback ACTIVE
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmd[2635]: ER Changing IMMND coord while
old coord is still up!
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmd[2635]: immd_proc.c:297:
immd_proc_elect_coord: Assertion 'immnd_info_node->immnd_key == cb->node_id'
failed.
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: WA Director Service in
NOACTIVE state - fevs replies pending:2 fevs highest processed:92994
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafamfnd[2714]: NO
'safComp=IMMD,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' :
Recovery is 'nodeFailfast'
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafamfnd[2714]: ER
safComp=IMMD,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery
is:nodeFailfast
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafamfnd[2714]: Rebooting OpenSAF NodeId =
131599 EE Name = , Reason: Component faulted: recovery is node failfast,
OwnNodeId = 131599, SupervisionTime = 60
Jan 15 18:24:01 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node;
timeout=60
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafclmd[2681]: ER clms_mds_msg_send FAILED: 2
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafclmd[2681]: ER
clms_clma_api_msg_dispatcher FAILED: type 0
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: NO No IMMD service => cluster
restart
Attached are the logs with IMMD traces.