The direct cause of the IMMD assert appears to be that MDS is slower in
detecting IMMND down and generating the corresponding service callback
than FM is in triggering the new ACTIVE role.
The system is using MDS over TIPC, not MDS over TCP.
This surprised me, as my initial thought was that such slowness of MDS
in generating service callbacks would be due to MDS over TCP.
But it is MDS over TIPC, which means that the TIPC link tolerance is
apparently larger than the failover time as managed by FM, at least in
this case.
We end up with services at SC-2 becoming active while they still
effectively have MDS communication open with their service peers at SC-1.
This is possibly not a true split-brain scenario, but it is an apparent
split-brain scenario. Even if the peer processes on the old active have
in fact already been terminated, the messages they sent just before
termination may still be in transit (in buffers) and may arrive at the
new active.
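
To make the timing concrete: the new active is only safe once MDS has
delivered the down events for the old active's services, and over TIPC
that detection is bounded by the link tolerance. A minimal sketch of the
ordering condition follows; the 1500 ms figure is the assumed TIPC
default link tolerance and the FM reaction time is purely hypothetical,
neither has been measured on this system.

    /* Illustration only; both constants are assumptions, not measurements. */
    #include <stdbool.h>
    #include <stdio.h>

    #define TIPC_LINK_TOLERANCE_MS 1500 /* assumed TIPC default tolerance */
    #define FM_FAILOVER_DELAY_MS    100 /* hypothetical FM reaction time  */

    int main(void)
    {
        /* The failover is only free of the apparent split brain if every
         * MDS down event for the old active arrives before the new active
         * starts acting, i.e. if FM reacts slower than the link tolerance. */
        bool ordering_safe = FM_FAILOVER_DELAY_MS > TIPC_LINK_TOLERANCE_MS;
        printf("MDS down seen before new active acts: %s\n",
               ordering_safe ? "yes" : "no");
        return 0;
    }
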
I have not seen this scenario before and I don't know how to handle it,
or even whether it is a scenario that can be handled.
I could just remove the assert, but that would be dangerous and would
cause even more complicated error cases, since we would no longer be
failing fast.
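
For context, the check that fired (immd_proc.c:297 in the syslog quoted
further down) is what gives us that fail-fast behaviour: the IMMD refuses
to elect a new IMMND coord while the old coord still appears to be up. A
simplified, self-contained sketch of the trade-off, using hypothetical
stand-in types rather than the actual immd_proc.c code:

    /* Simplified sketch of the fail-fast choice; hypothetical stand-in
     * types, not the actual immd_proc.c code. */
    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct immd_cb { unsigned node_id; };           /* this IMMD's node     */
    struct immnd_info_node { unsigned immnd_key; }; /* candidate coord node */

    static void elect_coord_sketch(const struct immd_cb *cb,
                                   const struct immnd_info_node *cand,
                                   bool old_coord_still_up)
    {
        if (old_coord_still_up) {
            fprintf(stderr,
                    "ER Changing IMMND coord while old coord is still up!\n");
            /* Fail-fast (current behaviour): abort here so the apparent
             * split brain surfaces immediately instead of as corrupted IMM
             * state later. Removing the assert would only move the failure
             * to a harder-to-debug place. */
            assert(cand->immnd_key == cb->node_id);
        }
        /* ... otherwise proceed with normal coord election ... */
    }

    int main(void)
    {
        /* Node ids as in the syslog: 0x2010f is SC-1 (old active),
         * 0x2020f (131599) is SC-2. */
        struct immd_cb cb = { .node_id = 0x2020f };
        struct immnd_info_node cand = { .immnd_key = 0x2010f };
        elect_coord_sketch(&cb, &cand, true);
        return 0;
    }
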
My suggestion is that, for failover, FM also needs to subscribe to MDS
events so that it can detect when MDS is down towards the old active,
and postpone invoking the new active until that MDS down has been
confirmed.
I assume FM is not involved in switchover, other than as a recipient
or observer.
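
As a rough sketch of the suggested ordering for failover (the types and
function below are hypothetical, not the existing osaffmd code), FM would
treat the MDS down confirmation for the old active as an extra
precondition before promoting itself:

    /* Rough sketch of the proposed precondition; hypothetical names, not
     * the existing osaffmd code. */
    #include <stdbool.h>
    #include <stdio.h>

    struct failover_preconditions {
        bool peer_fm_down;       /* what triggers failover today          */
        bool mds_down_confirmed; /* proposed: MDS reports old active gone */
    };

    /* Promote to ACTIVE only when both hold; otherwise keep waiting (or
     * arm a supervision timer and escalate if MDS never confirms). */
    static bool fm_may_set_active(const struct failover_preconditions *p)
    {
        return p->peer_fm_down && p->mds_down_confirmed;
    }

    int main(void)
    {
        struct failover_preconditions p = { .peer_fm_down = true,
                                            .mds_down_confirmed = false };
        printf("promote now: %s\n", fm_may_set_active(&p) ? "yes" : "no");

        /* Later, when the MDS service-down events for the old active have
         * been delivered, the promotion can proceed. */
        p.mds_down_confirmed = true;
        printf("promote now: %s\n", fm_may_set_active(&p) ? "yes" : "no");
        return 0;
    }
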
This was tested with 2PBE, but I don't see how that can have anything to
do with it.
Important questions are also:
1) Has this same test regularly been executed this way before, or is it
a new test?
2) Why do we see this now? That is, why the apparent slowness of TIPC
now? As I understand it, we have made some efforts to improve TIPC
performance in general. Perhaps improvements in TIPC capacity (avoiding
message loss) have resulted in a practical increase in the link
tolerance?
---
** [tickets:#721] IMMD asserted when trying to become active during failover**
**Status:** accepted
**Created:** Thu Jan 16, 2014 07:32 AM UTC by Sirisha Alla
**Last Updated:** Thu Jan 16, 2014 08:58 AM UTC
**Owner:** Anders Bjornerstedt
The issue is seen on changeset 4733 plus the CLM patches corresponding
to the changesets of #220. Continuous failovers are being triggered
while some API invocations of the IMM application are ongoing. The IMMD
asserted on the new active, leading to a cluster reset.
SC-1 is active and amfnd is killed to trigger a failover
Jan 15 18:23:03 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Ccb 35 COMMITTED (exowner)
Jan 15 18:23:07 SLES-64BIT-SLOT1 osafimmnd[2411]: NO implementer for class
'testMA_verifyObjApplNoResponseModCallback_101' is released => class extent is
UNSAFE
Jan 15 18:23:57 SLES-64BIT-SLOT1 sshd[3010]: Accepted keyboard-interactive/pam
for root from 192.168.56.103 port 60396 ssh2
Jan 15 18:23:59 SLES-64BIT-SLOT1 root: killing osafamfnd from invoke_failover.sh
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafclmd[2455]: AL AMF Node Director is down,
terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafntfd[2441]: AL AMF Node Director is down,
terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafevtd[2609]: AL AMF Node Director is down,
terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafckptd[2600]: AL AMF Node Director is down,
terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osaflogd[2421]: AL AMF Node Director is down,
terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafrded[2382]: AL AMF Node Director is down,
terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafclmna[2465]: AL AMF Node Director is down,
terminate this process
Jan 15 18:23:59 SLES-64BIT-SLOT1 osafimmd[2401]: AL AMF Node Director is down,
terminate this process
SC-2 tried to become active, but the IMMD asserted, leading to a cluster reset
Jan 15 18:24:01 SLES-64BIT-SLOT2 osaffmd[2625]: NO Peer FM down on node_id:
131343
Jan 15 18:24:01 SLES-64BIT-SLOT2 osaffmd[2625]: NO Role: STANDBY, Node Down for
node id: 2010f
Jan 15 18:24:01 SLES-64BIT-SLOT2 osaffmd[2625]: Rebooting OpenSAF NodeId =
131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId =
131599, SupervisionTime = 60
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmd[2635]: WA IMMD lost contact with peer
IMMD (NCSMDS_RED_DOWN)
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: WA DISCARD DUPLICATE FEVS
message:92993
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: WA Error code 2 returned for
message type 57 - ignoring
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: WA DISCARD DUPLICATE FEVS
message:92994
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: WA Error code 2 returned for
message type 57 - ignoring
Jan 15 18:24:01 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the
absence of PLM is outside the scope of OpenSAF
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafrded[2616]: NO rde_rde_set_role: role set
to 1
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmd[2635]: NO ACTIVE request
Jan 15 18:24:01 SLES-64BIT-SLOT2 osaflogd[2654]: NO ACTIVE request
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafntfd[2667]: NO ACTIVE request
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafclmd[2681]: NO ACTIVE request
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafamfd[2700]: NO FAILOVER StandBy --> Active
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmd[2635]: NO ellect_coord invoke from
lga_callback ACTIVE
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmd[2635]: ER Changing IMMND coord while
old coord is still up!
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmd[2635]: immd_proc.c:297:
immd_proc_elect_coord: Assertion 'immnd_info_node->immnd_key == cb->node_id'
failed.
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: WA Director Service in
NOACTIVE state - fevs replies pending:2 fevs highest processed:92994
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafamfnd[2714]: NO
'safComp=IMMD,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' :
Recovery is 'nodeFailfast'
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafamfnd[2714]: ER
safComp=IMMD,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery
is:nodeFailfast
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafamfnd[2714]: Rebooting OpenSAF NodeId =
131599 EE Name = , Reason: Component faulted: recovery is node failfast,
OwnNodeId = 131599, SupervisionTime = 60
Jan 15 18:24:01 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node;
timeout=60
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafclmd[2681]: ER clms_mds_msg_send FAILED: 2
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafclmd[2681]: ER
clms_clma_api_msg_dispatcher FAILED: type 0
Jan 15 18:24:01 SLES-64BIT-SLOT2 osafimmnd[2645]: NO No IMMD service => cluster
restart
The logs with IMMD traces are attached.