Hi Mohan,
There is already some code in there that continuously re-sends the
initial ping to all nodes, so that a lost connection is regained.
The code is disabled by default. I think there is a configuration
setting for the interval at which the pings are sent.
Is this what you are looking for?
Alex
On Mon, Sep 6, 2021 at 1:50 PM Mohan Kanakam <
[email protected]> wrote:
> Hi Gary/Minh,
> Any thoughts on this?
> Thanks
> ------------------------------
>
> * [tickets:#3280] <https://sourceforge.net/p/opensaf/tickets/3280/> dtm:
> loss of TCP connection requires node reboot*
>
> *Status:* unassigned
> *Milestone:* 5.21.10
> *Created:* Fri Aug 27, 2021 11:33 AM UTC by Mohan Kanakam
> *Last Updated:* Fri Aug 27, 2021 11:33 AM UTC
> *Owner:* Mohan Kanakam
>
> Sometimes we see loss of the TCP connection between payloads, or between a
> controller and payloads, in the cluster.
> Example: If we have 2 controllers and 10 payloads (starting from PL-3 to
> PL-10), we see TCP connection loss between PL-4 and PL-5. The connection of
> PL-4 with the other payloads remains established.
> We also see connection loss between PL-7 and SC-2, while the connection of
> PL-7 with the other nodes remains established. This results in a PL-7 reboot
> when a controller failover happens, i.e. SC-1 fails and SC-2 takes the Active
> role. PL-7 thinks that there was a single controller in the cluster and it
> reboots.
>
> This can be reproduced by adding an iptables rule to drop the packets.
>
> So, the expected behavior is that dtmd on PL-4/PL-5 retries the connection
> a few times before declaring the node down.
> The only drawback with this approach is that it will delay the application
> failover time or even the controller failover time.
>
> Any suggestions?
> ------------------------------
---
**[tickets:#3280] dtm: loss of TCP connection requires node reboot**
**Status:** unassigned
**Milestone:** 5.21.10
**Created:** Fri Aug 27, 2021 11:33 AM UTC by Mohan Kanakam
**Last Updated:** Mon Sep 06, 2021 11:35 PM UTC
**Owner:** Mohan Kanakam
Sometimes we see loss of the TCP connection between payloads, or between a
controller and payloads, in the cluster.
Example: If we have 2 controllers and 10 payloads (starting from PL-3 to PL-10),
we see TCP connection loss between PL-4 and PL-5. The connection of PL-4 with
the other payloads remains established.
We also see connection loss between PL-7 and SC-2, while the connection of PL-7
with the other nodes remains established. This results in a PL-7 reboot when a
controller failover happens, i.e. SC-1 fails and SC-2 takes the Active role.
PL-7 thinks that there was a single controller in the cluster and it reboots.
This can be reproduced by adding an iptables rule to drop the packets.
So, the expected behavior is that dtmd on PL-4/PL-5 retries the connection a few
times before declaring the node down.
The only drawback with this approach is that it will delay the application
failover time or even the controller failover time.
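To make the trade-off concrete, here is a minimal sketch of the retry idea in Python. It is not dtmd code: the names `connect_with_retry` and `worst_case_delay`, and the `retries`, `timeout` and `interval` parameters, are hypothetical and only illustrate how bounded retries translate into extra delay before a peer is declared down.

```python
import socket
import time


def connect_with_retry(host, port, retries=3, timeout=2.0, interval=1.0,
                       connect=socket.create_connection, sleep=time.sleep):
    """Try to (re)connect up to `retries` times; return the socket or None.

    Hypothetical helper: returning None marks the point at which the
    caller would declare the peer node down.
    """
    for attempt in range(1, retries + 1):
        try:
            return connect((host, port), timeout=timeout)
        except OSError:
            # Pause between attempts, but not after the final one.
            if attempt < retries:
                sleep(interval)
    return None


def worst_case_delay(retries, timeout, interval):
    """Rough estimate of the added delay before a peer is declared down.

    Assumes each attempt fails only after the full timeout and that the
    host resolves to a single address.
    """
    return retries * timeout + (retries - 1) * interval
```

With these illustrative values, `worst_case_delay(3, 2.0, 1.0)` gives 8.0 seconds, which is roughly the time by which failover could be delayed when the peer really is down.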
Any suggestions?
_______________________________________________
Opensaf-tickets mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets