[tickets] [opensaf:tickets] Re: #3277 ntf: Discarded notifications accumulation causing standby controller reboot during cold sync

Thanh Nguyen via Opensaf-tickets Tue, 17 Aug 2021 09:39:53 -0700

Hello Mohan,

Ack from me.
Best Regards,
Thanh

From: Mohan Kanakam via Opensaf-tickets <[email protected]>
Sent: Thursday, 12 August 2021 12:32 AM
To: [email protected]
Cc: Mohan Kanakam <[email protected]>
Subject: [tickets] [opensaf:tickets] #3277 ntf: Discarded notifications 
accumulation causing standby controller reboot during cold sync

The comments from Thanh are incorporated in the patch attached 
(Discardnotification_v3.patch).

Attachments:

  * Discardnotification_v3.patch<https://sourceforge.net/p/opensaf/tickets/_discuss/thread/a0de58386a/41e8/attachment/Discardnotification_v3.patch>
    (5.0 kB; application/octet-stream)

________________________________

[tickets:#3277]<https://sourceforge.net/p/opensaf/tickets/3277/> ntf: Discarded 
notifications accumulation causing standby controller reboot during cold sync

Status: review
Milestone: 5.21.10
Created: Mon Aug 02, 2021 01:31 PM UTC by Mohan Kanakam
Last Updated: Tue Aug 03, 2021 09:12 AM UTC
Owner: Mohan Kanakam

The Ntf service accumulates a large number of discarded notifications (around 
200,000) and checkpoints them to the Standby Ntf during cold sync while the 
standby is coming up. The Standby Ntf takes more than 40 seconds to process 
them. During this time, the Active Ntf receives a few notifications and 
checkpoints (async updates) their information to the Standby Ntf; each update 
is a synchronous call with a timeout of 1 second. Because the Standby Ntf is 
busy processing the cold sync, it does not process the async updates, and the 
Active Ntf keeps timing out at 1-second intervals, more than 40 times (i.e. 
for more than 40 seconds).
During this time, the Standby Clmd sends an NtfInitialize request to the Active 
Ntf and times out 4 times (40 seconds). Amf then times out for the CSI (CSI 
timeout is 40 seconds) and reboots the upcoming node.
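
The timing described above can be checked with a rough model. This is only a 
sketch, not OpenSAF code: the function name and parameters are hypothetical, 
the 10-second NtfInitialize timeout is inferred from "4 times (40 seconds)", 
and the reboot condition is an assumption that the node is rebooted once the 
standby has been busy for longer than the CSI timeout.

```python
def cold_sync_outcome(cold_sync_s, async_timeout_s=1, init_timeout_s=10,
                      init_attempts=4, csi_timeout_s=40):
    """Rough timing model of the scenario in this ticket (not OpenSAF code)."""
    # Each async update is a synchronous call that gives up after
    # async_timeout_s, so a busy standby yields about one timeout per second.
    async_timeouts = cold_sync_s // async_timeout_s
    # Total time Clmd spends retrying NtfInitialize before it gives up.
    init_wait_s = init_timeout_s * init_attempts
    # Assumption: the upcoming node is rebooted when the standby stays busy
    # for longer than the AMF CSI timeout.
    reboot = cold_sync_s > csi_timeout_s
    return {
        "async_timeouts": async_timeouts,
        "init_wait_s": init_wait_s,
        "reboot": reboot,
    }
```

With a cold sync of 45 seconds the model reports 45 async update timeouts and 
a reboot; with a cold sync of 5 seconds it reports no reboot.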

The root cause is that Ntf loses the down event of a subscriber and never 
removes the subscriber information, so the number of discarded notifications 
grows each time a notification is sent.
The down event can be missed because of low memory in the system, a failure to 
post the down event to the mailbox, etc. We don't know the real root cause of 
the lost event, but discarded notifications can accumulate only in such cases.
We could reproduce it; please see the steps to reproduce.
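
The accumulation can be illustrated with a toy model. This is a sketch under 
stated assumptions, not the Ntf implementation: ToyNtfServer, its methods and 
discarded_after() are hypothetical names, and the model simply assumes that a 
notification for a registered but unreachable subscriber is recorded as 
discarded.

```python
class ToyNtfServer:
    """Toy stand-in for the server-side subscriber bookkeeping (hypothetical)."""

    def __init__(self):
        # client id -> ids of notifications discarded for that client
        self.subscribers = {}

    def subscribe(self, client_id):
        self.subscribers[client_id] = []

    def client_down(self, client_id):
        # Normal path: the down event removes the subscriber and its records.
        self.subscribers.pop(client_id, None)

    def send(self, notification_id, reachable):
        # A subscriber that is still registered but cannot be reached gets
        # one more discarded notification recorded against it.
        for client_id, discarded in self.subscribers.items():
            if client_id not in reachable:
                discarded.append(notification_id)

    def discarded_total(self):
        return sum(len(d) for d in self.subscribers.values())


def discarded_after(sends, down_event_delivered):
    """Kill the only subscriber, then send `sends` notifications."""
    server = ToyNtfServer()
    server.subscribe("subscriber-1")
    if down_event_delivered:
        server.client_down("subscriber-1")
    for n in range(sends):
        server.send(n, reachable=set())  # the subscriber process is gone
    return server.discarded_total()
```

In this model, discarded_after(200000, down_event_delivered=False) leaves 
200000 discarded records to checkpoint, while the same run with the down event 
delivered leaves none.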

Steps to reproduce:
1. Comment out the call to clientRemoveMDS() in the proc_ntfa_updn_mds_msg() 
function in ntfs_evt.c.
2. Subscribe to the ntf service using ntfsubscribe.
3. Send notifications using ntfsend (ntfsend -s 1 --notificationType=0x4000 
--additionalText=TEXT --repeatSends=200000).
4. While ntfsend is running, kill the ntfsubscribe process.
5. Start the standby and look for the discarded notifications in the osafntfd 
trace file.
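
Steps 2 to 4 could be scripted roughly as follows. This is a sketch only: the 
helper names and the sleep durations are hypothetical, the commands and options 
are the ones quoted in the steps, and it assumes ntfsubscribe and ntfsend are on 
PATH on a node built with the change from step 1. Step 5 (starting the standby) 
is still done by hand.

```python
import subprocess
import time


def build_ntfsend_argv(repeat_sends):
    """Build the ntfsend command line quoted in step 3."""
    return [
        "ntfsend",
        "-s", "1",
        "--notificationType=0x4000",
        "--additionalText=TEXT",
        f"--repeatSends={repeat_sends}",
    ]


def run_repro(repeat_sends=200000, settle_s=2.0):
    """Run steps 2 to 4; settle_s is an arbitrary, hypothetical delay."""
    # Step 2: start a subscriber.
    subscriber = subprocess.Popen(["ntfsubscribe"])
    time.sleep(settle_s)
    # Step 3: start sending notifications.
    sender = subprocess.Popen(build_ntfsend_argv(repeat_sends))
    time.sleep(settle_s)
    # Step 4: kill the subscriber while ntfsend is still running.
    subscriber.kill()
    subscriber.wait()
    # Let ntfsend finish and return its exit status.
    return sender.wait()
```

Calling run_repro() on the active controller and then starting the standby 
should leave the system in the state described in step 5.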

________________________________

Sent from sourceforge.net because 
[email protected]<mailto:[email protected]>
 is subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a 
mailing list, you can unsubscribe from the mailing list.


_______________________________________________
Opensaf-tickets mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
