[tickets] [opensaf:tickets] Re: #3280 dtm: loss of TCP connection requires node reboot

Mohan Kanakam via Opensaf-tickets Thu, 09 Sep 2021 07:56:16 -0700

Hi Alex,
Thanks for your help. I did find the option to enable the broadcast after 
initial discovery.
Below are my test observations. Please comment.
Case 1 (SC-1, SC-2):
1) Initially both nodes are connected, as Active and Standby.
2) I drop the packets on port 6700 by adding an iptables rule; both nodes 
then become Active independently.
3) I delete the iptables rule; both nodes then try to connect to each other, 
and both reboot because they detect split brain.


Case 2 (SC-1, PL-3):
1) Initially both nodes are connected, as Active controller and payload.
2) I drop the packets on port 6700 by adding an iptables rule; the payload 
then reboots and the Active controller remains up.

I also observed that all the databases are cleaned up on the controllers. So 
how will reconnecting help? For example, ntfs deletes the NTF agent 
information if the node hosting the NTF agent leaves the cluster. If that node 
rejoins, the NTF agent will not register again, so there is a mismatch in the 
cluster. The same holds true for EDS. If a payload goes down, the controller 
deletes all of its application information; even if the payload reconnects, 
the AMF application will not re-register.

So I am not sure in which case the fix for 2522 will help.


---

**[tickets:#3280] dtm: loss of TCP connection requires node reboot**

**Status:** unassigned
**Milestone:** 5.21.10
**Created:** Fri Aug 27, 2021 11:33 AM UTC by Mohan Kanakam
**Last Updated:** Mon Sep 06, 2021 11:35 PM UTC
**Owner:** Mohan Kanakam


Sometimes we see loss of the TCP connection between payloads, or between a 
controller and payloads, in the cluster.
Example: if we have 2 controllers and 10 payloads (starting from PL-3 to 
PL-10), we see TCP connection loss between PL-4 and PL-5. The connections of 
PL-4 with the other payloads remain established.
We also see connection loss between PL-7 and SC-2, while the connections of 
PL-7 with the other nodes remain established. This results in a PL-7 reboot 
when a controller failover happens, i.e. SC-1 fails and SC-2 takes the Active 
role. PL-7 thinks that there was only a single controller in the cluster, and 
it reboots.

This can be reproduced by adding an iptables rule to drop the packets.
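
For illustration only: the ticket does not give the exact rule that was used, 
so the sketch below is an assumption. It builds the argument list for an 
iptables rule that drops inbound TCP packets on one port, with 6700 (the port 
named in the reply above) as the default. The helper name `drop_rule` and the 
choice of the INPUT chain are hypothetical.

```python
def drop_rule(port=6700, action="add"):
    """Build an iptables command line that drops inbound TCP on `port`.

    action="add" appends the rule (-A); action="delete" removes it (-D).
    The INPUT chain and the tcp match are assumptions, not from the ticket.
    """
    flag = {"add": "-A", "delete": "-D"}[action]
    return ["iptables", flag, "INPUT", "-p", "tcp",
            "--dport", str(port), "-j", "DROP"]
```

The returned list could be passed to `subprocess.run(..., check=True)` as 
root on the node under test: once with "add" to break the connection and once 
with "delete" to restore it.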

So the expected behavior is that dtmd on PL-4/PL-5 retries the connection a 
few times before declaring the node down.
The only drawback of this approach is that it will delay the application 
failover time, or even the controller failover time.
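
A minimal sketch of the retry idea and its cost, written in Python for 
brevity. This is not dtmd code: the function names, the attempt count, the 
interval and the timeout are all hypothetical. It retries a TCP connect a 
bounded number of times and gives a rough estimate of how long a node-down 
declaration (and hence failover) could be delayed.

```python
import socket
import time


def connect_with_retry(host, port, attempts=3, interval=1.0, timeout=1.0,
                       connect=socket.create_connection, sleep=time.sleep):
    """Try to re-establish a TCP connection a bounded number of times.

    Returns the connected socket, or None once every attempt has failed,
    at which point the caller would declare the peer node down.
    `connect` and `sleep` are injectable so the logic can be tested.
    """
    for attempt in range(1, attempts + 1):
        try:
            return connect((host, port), timeout=timeout)
        except OSError:
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts:
                sleep(interval)
    return None


def worst_case_delay(attempts, interval, timeout):
    """Rough upper bound (seconds) added before a node is declared down,
    assuming each attempt blocks for at most `timeout`."""
    return attempts * timeout + (attempts - 1) * interval
```

With these example values (3 attempts, 1 s interval, 1 s timeout) a real node 
failure would be reported roughly 5 s later than without retries, which is 
the failover delay mentioned above.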

Any suggestions on this?

