Hi
The original specification for IMCN said that it should be able to handle HA 
state changes. The first increment of IMCN was delivered without any HA 
handling, except that the IMCN process can be told whether to run as active or 
standby when it is started. To handle an HA state change, the IMCN process must 
be terminated and restarted with the correct state. 
In the "IMCN NTF notification protocol" a special notification was defined (see 
Error Report Notification in the OpenSAF Extensions PR document) that is sent if 
there is a risk that any IMM changes may have been missed. This notification is 
sent every time IMCN is started as active. A client that receives such a 
notification has to read all relevant IMM data in order to synchronize 
itself. After that it can rely on notifications from IMCN until the next Error 
Report Notification is received.
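As an illustration of the client side of this protocol, here is a minimal sketch in C. The type and function names (ntf_kind_t, client_state_t, read_all_imm_data, on_notification) are made up for the example and are not part of any OpenSAF API. It also assumes that a change notification arriving before the first full read can be dropped, since the full read will cover it.

```c
#include <stdbool.h>

/* Hypothetical notification kinds; the real protocol encodes these differently. */
typedef enum { NTF_IMM_CHANGE, NTF_ERROR_REPORT } ntf_kind_t;

typedef struct {
    bool in_sync;        /* true once a full IMM read has completed */
    unsigned full_reads; /* number of full IMM reads performed */
    unsigned applied;    /* number of change notifications applied */
} client_state_t;

/* Stand-in for reading all relevant IMM data (assumption: always succeeds). */
static void read_all_imm_data(client_state_t *c)
{
    c->full_reads++;
    c->in_sync = true;
}

/* Handle one notification received from IMCN. */
static void on_notification(client_state_t *c, ntf_kind_t kind)
{
    if (kind == NTF_ERROR_REPORT) {
        /* Changes may have been missed: resynchronize from IMM. */
        c->in_sync = false;
        read_all_imm_data(c);
        return;
    }
    /* Change notifications can only be trusted after a full read. */
    if (c->in_sync)
        c->applied++;
}
```

Since IMCN sends the Error Report Notification every time it starts as active, a client written this way resynchronizes after every IMCN restart.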
The intention was to make IMCN able to change role without being restarted and 
without losing track of IMM changes (ticket #157). However, the first increment 
was considered good enough, so #157 was never implemented.
This means that IMCN can be terminated at any time, even while it is executing 
an IMM callback function. It does not matter if an IMM change is lost and never 
reported. Since IMCN is a process, and it is the process that is terminated, 
there is no problem with memory leaks or resource leaks. IMM will detect that 
the client is gone and will finalize all resources.
In NTF the IMCN surveillance thread will also detect that the IMCN process is 
gone, start a new one and give it the correct state (which may no longer be 
active).
This means that any solution for immediate termination is OK, e.g. handling the 
termination in a separate thread to avoid the described deadlock.
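A rough sketch of what termination in a separate thread could look like, using plain POSIX calls. This is not the ntfd code: the function names (wait_child_exit, terminate_child_thread, terminate_child_async), the 1 second timeout and the 10 ms poll step are all assumptions made for the example. The escalation to SIGABRT mirrors what the trace in the ticket shows.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Poll until the child has been reaped; 0 if it is gone, -1 on timeout. */
static int wait_child_exit(pid_t pid, int timeout_ms)
{
    const struct timespec step = {0, 10 * 1000 * 1000}; /* 10 ms */
    for (int waited = 0; waited <= timeout_ms; waited += 10) {
        pid_t r = waitpid(pid, NULL, WNOHANG);
        if (r == pid)
            return 0; /* child exited and was reaped */
        if (r == -1 && errno != EINTR)
            return 0; /* no such child any more */
        nanosleep(&step, NULL);
    }
    return -1;
}

/* Runs in its own thread, so the caller's main loop is never blocked. */
static void *terminate_child_thread(void *arg)
{
    pid_t pid = *(pid_t *)arg;
    free(arg);
    kill(pid, SIGTERM);
    if (wait_child_exit(pid, 1000) != 0) {
        /* Normal termination failed: escalate to abort. */
        kill(pid, SIGABRT);
        while (waitpid(pid, NULL, 0) == -1 && errno == EINTR) {
        }
    }
    return NULL;
}

/* Start the termination and return at once; *tid can be joined later. */
static int terminate_child_async(pid_t pid, pthread_t *tid)
{
    assert(pid > 0);
    pid_t *arg = malloc(sizeof *arg);
    if (arg == NULL)
        return -1;
    *arg = pid;
    if (pthread_create(tid, NULL, terminate_child_thread, arg) != 0) {
        free(arg);
        return -1;
    }
    return 0;
}
```

The point of the design is that only the helper thread waits for the child, so the main thread stays free to answer the peer while the termination is in progress.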
NOTE:
Since there is no HA handling implemented, there is actually no need to start 
an IMCN process on any node other than the active one. That this is done anyway 
is also a remnant of the original intention.

MUST BE FIXED!
I saw that init_ntfimcn() has been moved from the main initialize() to 
initialize_for_assignment(). This is NOT OK, since initialize_for_assignment() 
is called more than once, meaning that init_ntfimcn() is also called more than 
once. The init_ntfimcn() function starts the IMCN surveillance thread, and that 
cannot be done more than once!
Other init functions that are called from initialize_for_assignment() have an 
"is_initialized" flag in order to prevent initialization more than once. This 
could be a solution to implement for init_ntfimcn() as well.
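Something along these lines is what I have in mind. It is only a sketch: init_ntfimcn_guarded() and start_surveillance_thread() are hypothetical names, the thread start is replaced by a counter so the example stays small, and it assumes the function is always called from the same thread so that no lock is needed around the flag.

```c
#include <stdbool.h>

static bool ntfimcn_is_initialized = false;
static int surveillance_thread_starts = 0;

/* Stand-in for the code that starts the IMCN surveillance thread. */
static int start_surveillance_thread(void)
{
    surveillance_thread_starts++;
    return 0;
}

/* Safe to call on every assignment; only the first call does any work. */
int init_ntfimcn_guarded(void)
{
    if (ntfimcn_is_initialized)
        return 0; /* already done, nothing to start */
    if (start_surveillance_thread() != 0)
        return -1; /* leave the flag unset so a later call can retry */
    ntfimcn_is_initialized = true;
    return 0;
}
```

Setting the flag only after a successful start means a failed first attempt does not block a later one.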

Thanks
Lennart


---

**[tickets:#2219] ntfd: circular dependency with osafntfimcnd**

**Status:** assigned
**Milestone:** 5.0.2
**Created:** Thu Dec 08, 2016 05:14 AM UTC by Gary Lee
**Last Updated:** Mon Dec 12, 2016 09:33 AM UTC
**Owner:** Praveen


A circular dependency can be seen when performing a si-swap of 
safSi=SC-2N,safApp=OpenSAF:

1. Active NTFD is trying to sync with Standby using MBC
2. Standby NTFD is in the process of terminating its local osafntfimcnd. It is 
stuck in timedwait_imcn_exit() and cannot reply to the Active.
3. osafntfimcnd [on standby] is trying to send a notification to Active NTFD

So we have (1) depending on (2) depending on (3) depending on (1)
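
The dependency chain above can be modelled as a small wait-for table. The sketch below is purely illustrative: the enum names and has_wait_cycle() are invented for the example, and each party is assumed to wait on at most one other party.

```c
#include <stdbool.h>

/* The three parties from the list above (hypothetical identifiers). */
enum { ACTIVE_NTFD, STANDBY_NTFD, STANDBY_IMCND, NUM_PARTIES };

/*
 * waits_for[i] is the party that i is blocked on, or -1 if i is not blocked.
 * Returns true if following the chain from 'start' leads back to 'start'.
 */
static bool has_wait_cycle(const int waits_for[], int n, int start)
{
    int cur = start;
    for (int steps = 0; steps < n; steps++) {
        cur = waits_for[cur];
        if (cur < 0)
            return false; /* chain ends at a party that is not blocked */
        if (cur == start)
            return true;  /* back where we started: circular dependency */
    }
    return false;
}
```

With the table {STANDBY_NTFD, STANDBY_IMCND, ACTIVE_NTFD} the function reports a cycle starting from ACTIVE_NTFD. If the standby NTFD is no longer blocked while its osafntfimcnd terminates (entry -1), the chain ends there and no cycle is found.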

This results in a temporary deadlock that dramatically slows down NTFD's 
ability to process its main dispatch loop. Each deadlock only lasts for approx. 
1 second, until mbcsv_mds_send_msg() times out. But since there could be lots of 
MBC messages to send, sometimes osafntfimcnd is killed with SIGABRT, generating 
a coredump. The si-swap operation will also time out.

Steps to reproduce
- Run a loop of ntftest 32
root@SC-1:~# for i in {1..10}; do ntftest 32; done
- On another terminal, keep swapping the 2N OpenSAF SI; a coredump appears 
after a couple of swaps
root@SC-1:~# amf-adm si-swap safSi=SC-2N,safApp=OpenSAF
...
root@SC-1:~# amf-adm si-swap safSi=SC-2N,safApp=OpenSAF

~~~
SC-2 (active)

There are a lot of send failures, each taking approx. 1 second to time out. 
During these 1 second timeouts, NTFD cannot process the main dispatch loop.
    
    Dec  7 11:01:37.531772 osafntfd [452:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg: sending to vdest:e
    Dec  7 11:01:37.531781 osafntfd [452:mbcsv_mds.c:0209] TR send type MDS_SENDTYPE_REDRSP:
    Dec  7 11:01:38.537307 osafntfd [452:mbcsv_mds.c:0247] << mbcsv_mds_send_msg: failure
    Dec  7 11:01:38.537758 osafntfd [452:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg: sending to vdest:e
    Dec  7 11:01:38.537766 osafntfd [452:mbcsv_mds.c:0209] TR send type MDS_SENDTYPE_REDRSP:
    Dec  7 11:01:39.543180 osafntfd [452:mbcsv_mds.c:0247] << mbcsv_mds_send_msg: failure
    Dec  7 11:01:39.543695 osafntfd [452:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg: sending to vdest:e
    Dec  7 11:01:39.543698 osafntfd [452:mbcsv_mds.c:0209] TR send type MDS_SENDTYPE_REDRSP:
    Dec  7 11:01:40.545252 osafntfd [452:mbcsv_mds.c:0247] << mbcsv_mds_send_msg: failure
    Dec  7 11:01:40.545719 osafntfd [452:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg: sending to vdest:e
    Dec  7 11:01:40.545726 osafntfd [452:mbcsv_mds.c:0209] TR send type MDS_SENDTYPE_REDRSP:
    Dec  7 11:01:41.551328 osafntfd [452:mbcsv_mds.c:0247] << mbcsv_mds_send_msg: failure
    Dec  7 11:01:41.551971 osafntfd [452:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg: sending to vdest:e
    Dec  7 11:01:41.551979 osafntfd [452:mbcsv_mds.c:0209] TR send type MDS_SENDTYPE_REDRSP:
    Dec  7 11:01:42.557594 osafntfd [452:mbcsv_mds.c:0247] << mbcsv_mds_send_msg: failure
    Dec  7 11:01:42.558171 osafntfd [452:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg: sending to vdest:e
    Dec  7 11:01:42.558179 osafntfd [452:mbcsv_mds.c:0209] TR send type MDS_SENDTYPE_REDRSP:
    Dec  7 11:01:43.564051 osafntfd [452:mbcsv_mds.c:0247] << mbcsv_mds_send_msg: failure
    Dec  7 11:01:43.564874 osafntfd [452:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg: sending to vdest:e
    Dec  7 11:01:43.564883 osafntfd [452:mbcsv_mds.c:0209] TR send type MDS_SENDTYPE_REDRSP:
    Dec  7 11:01:44.572407 osafntfd [452:mbcsv_mds.c:0247] << mbcsv_mds_send_msg: failure
    Dec  7 11:01:44.573262 osafntfd [452:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg: sending to vdest:e
    Dec  7 11:01:44.573271 osafntfd [452:mbcsv_mds.c:0209] TR send type MDS_SENDTYPE_REDRSP:
    Dec  7 11:01:45.575091 osafntfd [452:mbcsv_mds.c:0247] << mbcsv_mds_send_msg: failure
    Dec  7 11:01:47.083548 osafntfd [452:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg: sending to vdest:e
    
~~~


~~~
SC-1 (standby)

NTFD is trying to terminate osafntfimcnd. While it is doing that, it cannot 
reply to NTFD on SC-2. Meanwhile, osafntfimcnd is sending NTF notifications to 
NTFD on SC-1.
    
    Dec  7 11:01:35.453151 osafntfd [464:ntfs_imcnutil.c:0316] TR handle_state_ntfimcn: Terminating osafntfimcnd process
    Dec  7 11:01:45.474313 osafntfd [464:ntfs_imcnutil.c:0124] TR   Termination timeout
    Dec  7 11:01:45.474375 osafntfd [464:ntfs_imcnutil.c:0130] << wait_imcnproc_termination: rc = -1, retry_cnt = 101
    Dec  7 11:01:45.474387 osafntfd [464:ntfs_imcnutil.c:0168] TR   Normal termination failed. Escalate to abort
    Dec  7 11:01:45.574703 osafntfd [464:ntfs_imcnutil.c:0172] TR   Imcn successfully aborted
    Dec  7 11:01:45.574712 osafntfd [464:ntfs_imcnutil.c:0187] << timedwait_imcn_exit
~~~


