- **status**: review --> fixed
- **Comment**:
default (5.2) [staging:161ad8]
changeset: 8125:161ad8e1eb9b
user: Hung Nguyen <[email protected]>
date: Thu Sep 22 18:48:24 2016 +0700
summary: imm: Dont allow standby IMMD to send fevs if active IMMD is still
up [#2029]
opensaf-5.1.x [staging:d84ede]
changeset: 8126:d84ede934bf9
user: Hung Nguyen <[email protected]>
date: Thu Sep 22 18:48:24 2016 +0700
summary: imm: Dont allow standby IMMD to send fevs if active IMMD is still
up [#2029]
opensaf-5.0.x [staging:1de73f]
changeset: 8127:1de73f282bcc
user: Hung Nguyen <[email protected]>
date: Thu Sep 22 18:48:24 2016 +0700
summary: imm: Dont allow standby IMMD to send fevs if active IMMD is still
up [#2029]
opensaf-4.7.x [staging:2601d5]
changeset: 8128:2601d541c55e
user: Hung Nguyen <[email protected]>
date: Thu Sep 22 18:48:24 2016 +0700
summary: imm: Dont allow standby IMMD to send fevs if active IMMD is still
up [#2029]
---
** [tickets:#2029] imm: fevs message lost during failover**
**Status:** fixed
**Milestone:** 4.7.2
**Created:** Tue Sep 13, 2016 11:05 AM UTC by Hung Nguyen
**Last Updated:** Thu Sep 22, 2016 11:42 AM UTC
**Owner:** Hung Nguyen
**Attachments:**
- [logs.7z](https://sourceforge.net/p/opensaf/tickets/2029/attachment/logs.7z)
(250.9 kB; application/octet-stream)
Fevs messages are lost when failing over between the two SCs.
</br>
~~~
Sep 8 11:50:00 SC-2-1 osafimmnd[4241]: NO Implementer locally disconnected.
Marking it as doomed 232 <754, 2010f> (@OpenSafImmPBE)
Sep 8 11:50:00 SC-2-1 osafimmnd[4241]: NO Implementer locally disconnected.
Marking it as doomed 233 <755, 2010f> (OsafImmPbeRt_B)
...
Sep 8 11:50:00 SC-2-1 osafimmnd[4241]: NO Implementer disconnected 233 <755,
2010f> (OsafImmPbeRt_B)
~~~
</br>
The IMMNDs never receive the D2ND_DISCARD_IMPL for @OpenSafImmPBE, so that
applier remains marked as dying.
</br>
~~~
Sep 8 11:50:02 SC-2-1 osafimmnd[4241]: NO ImmModel::getPbeBSlave reports
missing PbeBSlave locally => unsafe
Sep 8 11:50:03 SC-2-1 osafimmnd[4241]: NO ImmModel::getPbeBSlave reports
missing PbeBSlave locally => unsafe
Sep 8 11:50:04 SC-2-1 osafimmnd[4241]: NO ImmModel::getPbeBSlave reports
missing PbeBSlave locally => unsafe
...
Sep 8 11:59:08 SC-2-1 osafimmnd[4241]: NO ImmModel::getPbeBSlave reports
missing PbeBSlave locally => unsafe
Sep 8 11:59:09 SC-2-1 osafimmnd[4241]: NO ImmModel::getPbeBSlave reports
missing PbeBSlave locally => unsafe
Sep 8 11:59:10 SC-2-1 osafimmnd[4241]: NO ImmModel::getPbeBSlave reports
missing PbeBSlave locally => unsafe
...
~~~
</br>
The main problem is that the standby IMMD also broadcasts a D2ND_DISCARD_NODE
message when it receives an NCSMDS_DOWN from an IMMND. See
immd_process_immnd_down().
If the NCSMDS_DOWN event reaches the two IMMDs at the same time, the two
D2ND_DISCARD_NODE messages are stamped with the same number. The IMMNDs
discard one of the two as a duplicate, which is harmless.
But if the NCSMDS_DOWN event reaches the two IMMDs at different times, another
fevs message (in this case the D2ND_DISCARD_IMPL for @OpenSafImmPBE) is
discarded by the IMMNDs instead, and that causes the fevs message loss.
The details of the problem are explained here:
</br>
http://sequencediagram.org/index.html?initialData=A4QwTgLglgxloDsIAICSBZdARAjACgGUAVAIQEoAoUSWeEJNTLAJjwEEBhIy66ORFBnQA5LAGcKFMACMA9gA9ksgG4BTMI2z5i5ADRCW7LmQBcMAK5gwqhgDNVysQRsATDrPNITyHAAYALADMAGySMgpKahpComImMVhKCMgEHMzIUGLILrIA7ghhcooq6pq4hKSmwhwE2AQA+lgA8gDqwsgONhCSkgbalQC0AMTWLgB8CXEsoo2oqWwASlj1wk1YAKLIYhAgALbAqi7IuVAQABY+AYEA7L1MrJzcADxD0gA25qoDkyaizMtYOYcRbLDAABQAMshbLINAABJoHBAEEC2VC7XZgkjrQoRErRe5GbgmeyOZwINweBiZZAIWQoczAFwgCCHZAQWQAc1U51KJ3OZRwAB0ECKxLIMihtntgFlechdqoxGIQNzjqcLn4grcKAYHsZhqMJphYiZpgCgSD6uCodL9mz+Zqrjq+hVyC9OdYbN9CY9TOgSBwxMoFUqVWqYRotTdccUooK3aYwWAoAwPCh5blwAhU5yRSKWmxkOgw6rVMgYFSICZo9dkABqHzIACEAF5LtqeuE46UfkQzuXJtlMjBwEdzbN5ktrehIaHlWX8whpKpR+YxOXZLZsoy3rAWWzSVkEOZdiuwEv+yyAORZXJnACeyARSJRaIxWM2NLpKFs5jebxPi4I5jocsaRL2vrGL8NR1I0rTtJ0SCXmcNJISglaKnKEp6sgbwHho5z0IKQA
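
The numbering collision described above can be reproduced with a small toy model. This is only a sketch, not OpenSAF code: the `Immd` and `Immnd` classes, the `run` helper, the `guard` flag and the starting counter 10436 are all hypothetical names and values chosen for illustration. It assumes the standby's D2ND_DISCARD_NODE reaches the IMMND before the active's next fevs message, and it assumes an IMMND simply drops any message whose number it has already seen. The `guard` flag models the idea from the commit summary (the standby does not send fevs while the active IMMD is still up).

```python
from dataclasses import dataclass, field


@dataclass
class Immd:
    """Toy director: stamps each outgoing fevs message with its next counter."""
    role: str        # "active" or "standby" (hypothetical field)
    fevs_count: int  # last fevs number this director knows about

    def broadcast(self, msg_type, active_is_up=True, guard=False):
        # guard=True models the fix: a standby stays silent while the
        # active director is still up.
        if guard and self.role == "standby" and active_is_up:
            return None
        self.fevs_count += 1
        return (self.fevs_count, msg_type, self.role)


@dataclass
class Immnd:
    """Toy node director: drops any message whose number was already seen."""
    last_count: int
    delivered: list = field(default_factory=list)
    discarded: list = field(default_factory=list)

    def receive(self, msg):
        if msg is None:  # nothing was sent
            return
        count, msg_type, _sender = msg
        if count <= self.last_count:
            self.discarded.append(msg_type)  # treated as a duplicate
        else:
            self.last_count = count
            self.delivered.append(msg_type)


def run(delayed, guard):
    """Replay one IMMND-down event; 'delayed' means the active sees it late."""
    active = Immd("active", 10436)
    standby = Immd("standby", 10436)
    immnd = Immnd(last_count=10436)

    # The standby sees NCSMDS_DOWN and (without the guard) broadcasts.
    immnd.receive(standby.broadcast("D2ND_DISCARD_NODE", guard=guard))
    if delayed:
        # The active has not seen the down event yet and sends an
        # unrelated fevs message that gets the same number.
        immnd.receive(active.broadcast("D2ND_DISCARD_IMPL"))
    # The active finally handles NCSMDS_DOWN itself.
    immnd.receive(active.broadcast("D2ND_DISCARD_NODE"))
    return immnd


if __name__ == "__main__":
    for delayed, guard in [(False, False), (True, False), (True, True)]:
        node = run(delayed, guard)
        print(delayed, guard, node.delivered, node.discarded)
```

In this model, the simultaneous case only loses a redundant D2ND_DISCARD_NODE, the delayed case loses the D2ND_DISCARD_IMPL, and with the guard enabled nothing is discarded.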
</br>
~~~
Sep 8 11:50:00 SC-2-1 osafimmd[4226]: WA IMMND DOWN on active controller 2
detected at standby immd!! 1. Possible failover
...
Sep 8 11:50:00 SC-2-1 osafimmd[4226]: WA Message count:10437 + 1 != 10437
Sep 8 11:50:00 SC-2-1 osafimmnd[4241]: WA DISCARD DUPLICATE FEVS message:10437
Sep 8 11:50:00 SC-2-1 osafimmnd[4241]: WA Error code 2 returned for message
type 82 - ignoring
~~~
</br>
The logs are attached.