Just want to share our findings so far; we are not sure they are correct, though.

- The coredump shows that osafntfimcnd[2697] received the changes of CCB 237 and 
was sending an attributeChange notification with numAdditionalInfo=8:
        CcbId = 237
        ...
            lengthAdditionalText = 0, numAdditionalInfo = 8, notificationId = 
0x6603f0,

- CCB 237 was committed at 11:01:35:
2016-12-07 11:01:35 SC-1 osafimmnd[431]: NO Ccb 237 COMMITTED (multiCcbOwner)
2016-12-07 11:01:35 SC-2 osafimmnd[433]: NO Ccb 237 COMMITTED (multiCcbOwner)

- ntfd-SC1 received the QUIESCED CSI callback at:
Dec  7 11:01:34.502019 osafntfd [464:ntfs_amf.c:0175] >> amf_csi_set_callback
Dec  7 11:01:34.502022 osafntfd [464:ntfs_main.c:0271] >> 
initialize_for_assignment: ha_state = 3

- ntfd-SC2 received the ACTIVE CSI callback at:
Dec  7 11:01:35.414545 osafntfd [452:ntfs_amf.c:0175] >> amf_csi_set_callback
Dec  7 11:01:35.414553 osafntfd [452:ntfs_main.c:0271] >> 
initialize_for_assignment: ha_state = 1

- The NTF agent (in osafamfd) received MDS NEW ACTIVE:
Dec  7 11:01:35.415337 osafamfd [486:ntfa_mds.c:0416] T2 MSG from NTFS 
NCSMDS_NEW_ACTIVE/UP
Dec  7 11:01:35.415356 osafamfd [486:ntfa_util.c:1550] TR Standby NTF server 
becomes new Active

So osafntfimcnd[2697] should have sent the attributeChange notification at or 
after 11:01:35.415337.
According to both ntfd traces, all notifications were preserved; there was no 
loss during the switchover. So the question is which notification was sent from 
osafntfimcnd[2697].
The last notification ntfd-SC1 received before the switchover has notifId=537, 
and ntfd-SC2 (the new active) continued receiving notifications starting from 
notifId=538.
The one sent from osafntfimcnd[2697] should have notifId=541, because it was 
the only attributeChange notification with numAdditionalInfo=8 sent after 
11:01:35.415337 and before osafntfimcnd[2697]'s death.
Notification 541 was received at the MDS level:
Dec  7 11:01:40.546538 osafntfd [452:ntfsv_enc_dec.c:0856] >> ntfsv_dec_not_msg

and was processed in the main thread (a delay of ~4 seconds):
Dec  7 11:01:44.572702 osafntfd [452:NtfAdmin.cc:0252] T2 New notification 
received, id: 541

osafntfimcnd[2697] died at:
Dec  7 11:01:45.572387 osafntfd [452:ntfs_mds.c:0920] T8 MDS DOWN dest: 
2020f000002b6, node ID: 2020f, svc_id: 29

So ntfd-SC2 could not send the response back to osafntfimcnd[2697]:
Dec  7 11:01:45.575322 osafntfd [452:ntfs_com.c:0166] >> notfication_result_lib
Dec  7 11:01:45.575336 osafntfd [452:ntfs_com.c:0172] T4 not_id: 541, rv: 1
...
Dec  7 11:01:47.083431 osafntfd [452:ntfs_com.c:0176] TR ntfs_mds_msg_send 
FAILED

There also appears to be a delay in delivering notification 541 from its sender, 
osafntfimcnd[2697], to its receiver, ntfd-SC2, at the MDS level; here it seems 
to be ~5 seconds. The MDS documentation mentions this:

    Message loss minimization: When an MDS receiver performs a controlled 
switchover, it helps to minimize the loss of messages. The MDS provides a 
mechanism for receivers to declare switchovers. Then, the MDS agent of the 
sender starts buffering messages locally until a new “active” receiver appears. 
The buffered messages are then delivered to the new active address.

Another delay that prevented ntfd-SC2's main thread from processing 
notification 541 in time is the circular dependency Gary described.
We can see many mbcsv messages that ntfd-SC1 could not process, because it was 
stuck terminating osafntfimcnd; every mbcsv timeout adds a 1-second delay on 
top for ntfd-SC2.

---

** [tickets:#2219] ntfd: circular dependency with osafntfimcnd**

**Status:** assigned
**Milestone:** 5.0.2
**Created:** Thu Dec 08, 2016 05:14 AM UTC by Gary Lee
**Last Updated:** Fri Dec 09, 2016 07:26 AM UTC
**Owner:** Praveen


A circular dependency can be seen when performing a si-swap of 
safSi=SC-2N,safApp=OpenSAF:

1. Active NTFD is trying to sync with Standby using MBC
2. Standby NTFD is in the process of terminating its local osafntfimcnd. It is 
stuck in timedwait_imcn_exit() and cannot reply to the Active.
3. osafntfimcnd [on standby] is trying to send a notification to Active NTFD

So we have (1) depending on (2) depending on (3) depending on (1)

This results in a temporary deadlock that dramatically slows down NTFD's 
ability to process its main dispatch loop. The deadlock only lasts for approx. 
1 second, when mbcsv_mds_send_msg() times out. But since there could be lots of 
MBC messages to send, sometimes osafntfimcnd is killed with SIGABRT generating 
a coredump. The si-swap operation will also timeout.

Steps to reproduce:
- Run a loop of ntftest 32:
root@SC-1:~# for i in {1..10}; do ntftest 32; done
- In another terminal, keep swapping the 2N OpenSAF SI; a coredump appears 
after a couple of swaps:
root@SC-1:~# amf-adm si-swap safSi=SC-2N,safApp=OpenSAF
...
root@SC-1:~# amf-adm si-swap safSi=SC-2N,safApp=OpenSAF

~~~
SC-2 (active)

There are a lot of send failures, each taking approx. 1 second to time out. 
During these 1-second timeouts, NTFD cannot process the main dispatch loop.
    
    Dec  7 11:01:37.531772 osafntfd [452:mbcsv_mds.c:0185] >> 
mbcsv_mds_send_msg: sending to vdest:e
    Dec  7 11:01:37.531781 osafntfd [452:mbcsv_mds.c:0209] TR send type 
MDS_SENDTYPE_REDRSP:
    Dec  7 11:01:38.537307 osafntfd [452:mbcsv_mds.c:0247] << 
mbcsv_mds_send_msg: failure
    Dec  7 11:01:38.537758 osafntfd [452:mbcsv_mds.c:0185] >> 
mbcsv_mds_send_msg: sending to vdest:e
    Dec  7 11:01:38.537766 osafntfd [452:mbcsv_mds.c:0209] TR send type 
MDS_SENDTYPE_REDRSP:
    Dec  7 11:01:39.543180 osafntfd [452:mbcsv_mds.c:0247] << 
mbcsv_mds_send_msg: failure
    Dec  7 11:01:39.543695 osafntfd [452:mbcsv_mds.c:0185] >> 
mbcsv_mds_send_msg: sending to vdest:e
    Dec  7 11:01:39.543698 osafntfd [452:mbcsv_mds.c:0209] TR send type 
MDS_SENDTYPE_REDRSP:
    Dec  7 11:01:40.545252 osafntfd [452:mbcsv_mds.c:0247] << 
mbcsv_mds_send_msg: failure
    Dec  7 11:01:40.545719 osafntfd [452:mbcsv_mds.c:0185] >> 
mbcsv_mds_send_msg: sending to vdest:e
    Dec  7 11:01:40.545726 osafntfd [452:mbcsv_mds.c:0209] TR send type 
MDS_SENDTYPE_REDRSP:
    Dec  7 11:01:41.551328 osafntfd [452:mbcsv_mds.c:0247] << 
mbcsv_mds_send_msg: failure
    Dec  7 11:01:41.551971 osafntfd [452:mbcsv_mds.c:0185] >> 
mbcsv_mds_send_msg: sending to vdest:e
    Dec  7 11:01:41.551979 osafntfd [452:mbcsv_mds.c:0209] TR send type 
MDS_SENDTYPE_REDRSP:
    Dec  7 11:01:42.557594 osafntfd [452:mbcsv_mds.c:0247] << 
mbcsv_mds_send_msg: failure
    Dec  7 11:01:42.558171 osafntfd [452:mbcsv_mds.c:0185] >> 
mbcsv_mds_send_msg: sending to vdest:e
    Dec  7 11:01:42.558179 osafntfd [452:mbcsv_mds.c:0209] TR send type 
MDS_SENDTYPE_REDRSP:
    Dec  7 11:01:43.564051 osafntfd [452:mbcsv_mds.c:0247] << 
mbcsv_mds_send_msg: failure
    Dec  7 11:01:43.564874 osafntfd [452:mbcsv_mds.c:0185] >> 
mbcsv_mds_send_msg: sending to vdest:e
    Dec  7 11:01:43.564883 osafntfd [452:mbcsv_mds.c:0209] TR send type 
MDS_SENDTYPE_REDRSP:
    Dec  7 11:01:44.572407 osafntfd [452:mbcsv_mds.c:0247] << 
mbcsv_mds_send_msg: failure
    Dec  7 11:01:44.573262 osafntfd [452:mbcsv_mds.c:0185] >> 
mbcsv_mds_send_msg: sending to vdest:e
    Dec  7 11:01:44.573271 osafntfd [452:mbcsv_mds.c:0209] TR send type 
MDS_SENDTYPE_REDRSP:
    Dec  7 11:01:45.575091 osafntfd [452:mbcsv_mds.c:0247] << 
mbcsv_mds_send_msg: failure
    Dec  7 11:01:47.083548 osafntfd [452:mbcsv_mds.c:0185] >> 
mbcsv_mds_send_msg: sending to vdest:e
    
~~~


~~~
SC-1 (standby)

NTFD is trying to terminate osafntfimcnd. While it is doing that, it cannot 
reply to NTFD on SC-2. Meanwhile, osafntfimcnd is sending NTF notifications to 
NTFD on SC-1.
    
    Dec  7 11:01:35.453151 osafntfd [464:ntfs_imcnutil.c:0316] TR 
handle_state_ntfimcn: Terminating osafntfimcnd process
    Dec  7 11:01:45.474313 osafntfd [464:ntfs_imcnutil.c:0124] TR   Termination 
timeout
    Dec  7 11:01:45.474375 osafntfd [464:ntfs_imcnutil.c:0130] << 
wait_imcnproc_termination: rc = -1, retry_cnt = 101
    Dec  7 11:01:45.474387 osafntfd [464:ntfs_imcnutil.c:0168] TR   Normal 
termination failed. Escalate to abort
    Dec  7 11:01:45.574703 osafntfd [464:ntfs_imcnutil.c:0172] TR   Imcn 
successfully aborted
    Dec  7 11:01:45.574712 osafntfd [464:ntfs_imcnutil.c:0187] << 
timedwait_imcn_exit    
~~~


