>> As mentioned, why doesn't OpenSAF simply try to re-establish the TCP connection? This could be pegged, and if too many TCP connections need to be re-established within a certain time period, then escalate to a node reboot.
I think this would be a good feature in general, especially when the headless feature is enabled.

Thanks & Best Regards
-Nagendra, +91-9866424860
www.GetHighAvailability.com
Get High Availability Today!
NJ, USA: +1 508-507-6507 | Hyderabad, India: +91 798-992-5293

-----Original Message-----
From: Hoyt, David [mailto:dh...@rbbn.com]
Sent: 22 January 2021 01:22
To: opensaf-users@lists.sourceforge.net
Subject: [users] loss of TCP connection requires node reboot

Hi all,

Is there an option to simply try to re-establish a TCP connection if the TCP socket gets closed for some reason? And is there an option to restart OpenSAF instead of rebooting the node?

I'm investigating an issue where the TCP socket is being closed by a TCP timeout. Our root cause appears to be high CPU load, but the question came up: why reboot the node when all that was required was to re-establish the TCP connection?

Setup: two nodes, each running RHEL 7 with OpenSAF 5.19.10, where node01 is the SC and node02 a PL. Occasionally the PL node goes for a reboot, and we found that prior to this, the SC generates the following log:

Jan 11 19:07:50 node01 osafdtmd[1861]: ER recv() from node 0x2030f failed, errno=110
Jan 11 19:07:50 node01 osafdtmd[1861]: NO Lost contact with 'node02'
Jan 11 19:07:50 node01 osafimmd[2179]: NO MDS event from svc_id 25 (change:4, dest:566312912815899)
Jan 11 19:07:50 node01 osafamfd[2334]: NO Node 'PL-3' left the cluster

The errno=110 logged above is a TCP socket error that maps to "connection timed out".
The hydra feature is enabled, and the corresponding logs on the PL node show that the connection with the SC is lost:

Jan 11 19:07:50 node02 osafdtmd[1738]: NO Lost contact with 'node01'
Jan 11 19:07:50 node02 osafamfnd[2107]: WA AMF director unexpectedly crashed
Jan 11 19:07:50 node02 osafamfnd[2107]: NO Checking 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' for pending messages
Jan 11 19:07:50 node02 osafamfnd[2107]: NO Checking 'safSu=SU1,safSg=app1,safApp=app1' for pending messages
Jan 11 19:07:50 node02 osafimmnd[1819]: WA SC Absence IS allowed:900 IMMD service is DOWN

Since IMMSV_SC_ABSENCE_ALLOWED=900, 15 minutes later the PL node initiates a node reboot:

Jan 11 19:22:50 node02 osafamfnd[2107]: ER AMF director absence timeout
Jan 11 19:22:50 node02 osafamfnd[2107]: Rebooting OpenSAF NodeId = 131855 EE Name = , Reason: AMF director absence timeout, OwnNodeId = 131855, SupervisionTime = 60
Jan 11 19:22:50 node02 opensaf_reboot: Rebooting local node; timeout=60

The node reboots, comes back up, and all is fine. As mentioned, why doesn't OpenSAF simply try to re-establish the TCP connection? This could be pegged, and if too many TCP connections need to be re-established within a certain time period, then escalate to a node reboot.

Regards,
David
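The "pegged" escalation being proposed could be sketched roughly as below. This is purely hypothetical, not anything that exists in OpenSAF; peg_reconnect_failure(), MAX_RECONNECTS, and WINDOW_SECS are placeholder names chosen for illustration:

```c
/* Hypothetical sketch of a pegged-reconnect policy: count reconnect
 * failures within a sliding time window and escalate (e.g. to a node
 * reboot) only when the threshold is exceeded. All names are
 * placeholders, not OpenSAF identifiers. */
#include <stdbool.h>
#include <time.h>

#define MAX_RECONNECTS 5    /* failures tolerated per window */
#define WINDOW_SECS    600  /* 10-minute escalation window   */

static time_t window_start;  /* start of the current window   */
static int failures;         /* failures seen in this window  */

/* Call on each failed reconnect attempt; returns true when the
 * caller should escalate instead of retrying again. */
bool peg_reconnect_failure(time_t now)
{
    if (now - window_start > WINDOW_SECS) {
        window_start = now;  /* window expired: start a fresh one */
        failures = 0;
    }
    return ++failures > MAX_RECONNECTS;
}
```

The idea is that transient socket drops (like the errno=110 case above) would just trigger a retry, while a persistently failing link would still fall through to the existing reboot behavior.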
_______________________________________________
Opensaf-users mailing list
Opensaf-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-users