Hi Gary,
Is this the case where a configuration change was made and a switchover was
triggered almost immediately afterwards? In that case the active IMCN was still
processing IMM callbacks and sending notifications for them. Since the
switchover was triggered while IMCN was processing CCB callbacks, IMCN gets
stuck in the saNtfNotificationSend() API because the active NTFS is not
available. As soon as an active NTFS becomes available, IMCN will clear its IMM
callback list and will be able to act on the term signal. But the new active
NTFS will have some switchover-related notifications to checkpoint to the
standby, so the active NTFS will also get stuck.
Attached is the patch (2219_v0.patch) in which NTFS does not terminate IMCN in
the csi_set callback. Instead, it posts a low-priority local event to the NTFS
mailbox for terminating IMCN. Since the standby will no longer get stuck in the
csi_set callback while terminating IMCN, the si-swap operation and other
high-priority messages can complete.
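To illustrate the idea only (this is not the code in the patch), here is a
minimal sketch of a two-level priority mailbox. All names in it (mbx_post,
mbx_recv, on_csi_set_standby, the event types) are hypothetical and are not
the OpenSAF mailbox API. The callback only queues the termination request, so
it returns at once, and the request is picked up after any pending
high-priority events.

```c
#include <stddef.h>

/* Hypothetical priorities and event types; not the names used in the patch. */
enum prio { PRIO_HIGH = 0, PRIO_LOW = 1, PRIO_COUNT = 2 };
enum evt_type { EVT_MBCSV_MSG, EVT_TERMINATE_IMCN };

#define MBX_CAP 16

struct mailbox {
	enum evt_type q[PRIO_COUNT][MBX_CAP]; /* one FIFO ring per priority */
	size_t head[PRIO_COUNT];
	size_t len[PRIO_COUNT];
};

/* Append an event to the FIFO of the given priority. Returns 0 on success,
 * -1 if that FIFO is full. */
int mbx_post(struct mailbox *m, enum prio p, enum evt_type e)
{
	if (m->len[p] == MBX_CAP)
		return -1;
	m->q[p][(m->head[p] + m->len[p]) % MBX_CAP] = e;
	m->len[p]++;
	return 0;
}

/* Pop the oldest event of the highest non-empty priority. Returns 0 on
 * success, -1 if the mailbox is empty. */
int mbx_recv(struct mailbox *m, enum evt_type *out)
{
	for (int p = 0; p < PRIO_COUNT; p++) {
		if (m->len[p] > 0) {
			*out = m->q[p][m->head[p]];
			m->head[p] = (m->head[p] + 1) % MBX_CAP;
			m->len[p]--;
			return 0;
		}
	}
	return -1;
}

/* Sketch of the standby csi_set path: defer the slow termination work to
 * the main loop instead of blocking inside the callback. */
int on_csi_set_standby(struct mailbox *m)
{
	return mbx_post(m, PRIO_LOW, EVT_TERMINATE_IMCN);
}
```

With this ordering, an MBCSv message posted after the termination request is
still received before it, which is the behaviour the patch is aiming for.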
As a long-term solution, I think termination of IMCN must be avoided.
Thanks,
Praveen
Attachments:
- [2219_v0.patch](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/327b0222/a94c/attachment/2219_v0.patch) (3.5 kB; application/octet-stream)
---
**[tickets:#2219] ntfd: circular dependency with osafntfimcnd**
**Status:** assigned
**Milestone:** 5.0.2
**Created:** Thu Dec 08, 2016 05:14 AM UTC by Gary Lee
**Last Updated:** Thu Dec 08, 2016 06:48 AM UTC
**Owner:** Praveen
A circular dependency can be seen when performing a si-swap of
safSi=SC-2N,safApp=OpenSAF:
1. Active NTFD is trying to sync with Standby using MBC
2. Standby NTFD is in the process of terminating its local osafntfimcnd. It is
stuck in timedwait_imcn_exit() and cannot reply to the Active.
3. osafntfimcnd [on standby] is trying to send a notification to Active NTFD
So we have (1) depending on (2) depending on (3) depending on (1)
This results in a temporary deadlock that dramatically slows down NTFD's
ability to process its main dispatch loop. Each deadlock only lasts for approx.
1 second, until mbcsv_mds_send_msg() times out. But since there could be lots
of MBC messages to send, sometimes osafntfimcnd is killed with SIGABRT,
generating a coredump. The si-swap operation will also time out.
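The dependency chain in steps 1 to 3 can be modelled as a small wait-for
graph. The sketch below is illustrative only: the party labels and the
has_wait_cycle() helper are hypothetical and do not exist in OpenSAF. It
assumes each party waits on at most one other party and reports whether
following those edges leads back to the starting party.

```c
#include <stdbool.h>

/* Hypothetical labels for the three parties described in the ticket. */
enum party { ACTIVE_NTFD, STANDBY_NTFD, STANDBY_IMCN, PARTY_COUNT };

/* waits_for[x] is the party that x is blocked on, or -1 if x is not blocked.
 * Returns true if following the chain from 'start' leads back to 'start'. */
bool has_wait_cycle(const int waits_for[PARTY_COUNT], int start)
{
	int cur = start;

	/* A simple cycle through 'start' needs at most PARTY_COUNT hops. */
	for (int hops = 0; hops < PARTY_COUNT; hops++) {
		cur = waits_for[cur];
		if (cur < 0)
			return false; /* chain ends at an unblocked party */
		if (cur == start)
			return true;
	}
	return false;
}
```

Filling the table with (1) Active NTFD waiting on Standby NTFD, (2) Standby
NTFD waiting on osafntfimcnd and (3) osafntfimcnd waiting on Active NTFD gives
a cycle. Marking Standby NTFD as not blocked breaks it.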
~~~
SC-2 (active)
There are a lot of send failures, each taking approx. 1 second to time out.
During these 1 second timeouts, NTFD cannot process the main dispatch loop.
Dec 7 11:01:37.531772 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
Dec 7 11:01:37.531781 osafntfd [452:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Dec 7 11:01:38.537307 osafntfd [452:mbcsv_mds.c:0247] <<
mbcsv_mds_send_msg: failure
Dec 7 11:01:38.537758 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
Dec 7 11:01:38.537766 osafntfd [452:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Dec 7 11:01:39.543180 osafntfd [452:mbcsv_mds.c:0247] <<
mbcsv_mds_send_msg: failure
Dec 7 11:01:39.543695 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
Dec 7 11:01:39.543698 osafntfd [452:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Dec 7 11:01:40.545252 osafntfd [452:mbcsv_mds.c:0247] <<
mbcsv_mds_send_msg: failure
Dec 7 11:01:40.545719 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
Dec 7 11:01:40.545726 osafntfd [452:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Dec 7 11:01:41.551328 osafntfd [452:mbcsv_mds.c:0247] <<
mbcsv_mds_send_msg: failure
Dec 7 11:01:41.551971 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
Dec 7 11:01:41.551979 osafntfd [452:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Dec 7 11:01:42.557594 osafntfd [452:mbcsv_mds.c:0247] <<
mbcsv_mds_send_msg: failure
Dec 7 11:01:42.558171 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
Dec 7 11:01:42.558179 osafntfd [452:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Dec 7 11:01:43.564051 osafntfd [452:mbcsv_mds.c:0247] <<
mbcsv_mds_send_msg: failure
Dec 7 11:01:43.564874 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
Dec 7 11:01:43.564883 osafntfd [452:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Dec 7 11:01:44.572407 osafntfd [452:mbcsv_mds.c:0247] <<
mbcsv_mds_send_msg: failure
Dec 7 11:01:44.573262 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
Dec 7 11:01:44.573271 osafntfd [452:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Dec 7 11:01:45.575091 osafntfd [452:mbcsv_mds.c:0247] <<
mbcsv_mds_send_msg: failure
Dec 7 11:01:47.083548 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
~~~
~~~
SC-1 (standby)
NTFD is trying to terminate osafntfimcnd. While it is doing that, it cannot
reply to NTFD on SC-2. Meanwhile, osafntfimcnd is sending NTF notifications to
NTFD on SC-1.
Dec 7 11:01:35.453151 osafntfd [464:ntfs_imcnutil.c:0316] TR
handle_state_ntfimcn: Terminating osafntfimcnd process
Dec 7 11:01:45.474313 osafntfd [464:ntfs_imcnutil.c:0124] TR Termination
timeout
Dec 7 11:01:45.474375 osafntfd [464:ntfs_imcnutil.c:0130] <<
wait_imcnproc_termination: rc = -1, retry_cnt = 101
Dec 7 11:01:45.474387 osafntfd [464:ntfs_imcnutil.c:0168] TR Normal
termination failed. Escalate to abort
Dec 7 11:01:45.574703 osafntfd [464:ntfs_imcnutil.c:0172] TR Imcn
successfully aborted
Dec 7 11:01:45.574712 osafntfd [464:ntfs_imcnutil.c:0187] <<
timedwait_imcn_exit
~~~
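The standby log shows a normal termination attempt that times out after about
10 seconds (retry_cnt = 101) and is then escalated to an abort. The following
is a minimal sketch of that general pattern, not the code in
ntfs_imcnutil.c. The function names, the choice of SIGTERM for the first
attempt and the polling interval are assumptions. It also shows why the caller
is blocked for the whole wait.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

/* Poll for the child's exit up to max_retries times, sleeping interval_ms
 * between polls. Returns 0 if the child was reaped, -1 on timeout or error.
 * The caller is blocked for up to max_retries * interval_ms. */
int wait_child_exit(pid_t pid, int max_retries, long interval_ms)
{
	struct timespec ts = { interval_ms / 1000,
			       (interval_ms % 1000) * 1000000L };

	for (int i = 0; i < max_retries; i++) {
		pid_t r = waitpid(pid, NULL, WNOHANG);
		if (r == pid)
			return 0; /* child has exited and was reaped */
		if (r < 0)
			return -1; /* waitpid failed */
		nanosleep(&ts, NULL);
	}
	return -1; /* timed out */
}

/* Ask the child to terminate; escalate to SIGABRT if it does not exit in
 * time. Returns 0 after normal termination, 1 after escalation, -1 on
 * failure. */
int terminate_child(pid_t pid, int max_retries, long interval_ms)
{
	if (kill(pid, SIGTERM) != 0)
		return -1;
	if (wait_child_exit(pid, max_retries, interval_ms) == 0)
		return 0;
	if (kill(pid, SIGABRT) != 0)
		return -1;
	return wait_child_exit(pid, max_retries, interval_ms) == 0 ? 1 : -1;
}
```

Calling terminate_child(pid, 100, 100) would block the caller for up to
roughly 10 seconds before escalating, which is in line with the gap between
11:01:35 and 11:01:45 in the log above.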