>> As mentioned, why doesn't OpenSAF simply try to re-establish the TCP
connection?
Reconnection attempts could be counted, and if too many occur within a
certain time period, escalate to a node reboot.

I think this would be a good feature in general, especially when the
headless feature is enabled.
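The counting-and-escalation idea above could be sketched roughly as follows. This is purely an illustration of the proposed behaviour, not OpenSAF code; the threshold and window values are placeholder assumptions:

```python
import time

class ReconnectEscalator:
    """Count TCP reconnect attempts in a sliding window; escalate past a
    threshold. Illustrative sketch only -- the threshold/window values
    are assumptions, not OpenSAF behaviour."""

    def __init__(self, max_reconnects=5, window_secs=600.0, clock=time.monotonic):
        self.max_reconnects = max_reconnects
        self.window_secs = window_secs
        self.clock = clock          # injectable for testing
        self.attempts = []          # timestamps of recent reconnect attempts

    def record_reconnect(self):
        """Record one reconnect attempt; return True once we should escalate."""
        now = self.clock()
        # Drop attempts that have fallen outside the sliding window.
        self.attempts = [t for t in self.attempts if now - t <= self.window_secs]
        self.attempts.append(now)
        return len(self.attempts) > self.max_reconnects

# Example: the 6th reconnect in quick succession trips the escalation.
esc = ReconnectEscalator(max_reconnects=5, window_secs=600.0)
results = [esc.record_reconnect() for _ in range(6)]
print(results[-1])  # True -- time to escalate (e.g. to a node reboot)
```

The real policy decision (reconnect silently vs. escalate) would of course live in osafdtmd's connection handling; this only shows the bookkeeping.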

Thanks & Best Regards
-Nagendra, +91-9866424860

www.GetHighAvailability.com
Get High Availability Today!
NJ, USA: +1 508-507-6507   |   Hyderabad, India: +91 798-992-5293


-----Original Message-----
From: Hoyt, David [mailto:dh...@rbbn.com] 
Sent: 22 January 2021 01:22
To: opensaf-users@lists.sourceforge.net
Subject: [users] loss of TCP connection requires node reboot

Hi all,

Is there an option to simply try to re-establish a TCP connection if the TCP
socket gets closed for some reason?
And is there an option to just restart OpenSAF instead of rebooting the node?

I'm investigating an issue whereby the TCP socket is being closed by a TCP
timeout.
It appears the root cause may be high CPU load, but the question came up:
why reboot the node when all that was required was re-establishing the TCP
connection?


Setup: 2 nodes, each running RHEL 7 with OpenSAF 5.19.10, where node01 is
the SC and node02 is a PL.

Occasionally, the PL node reboots, and we found that prior to this the SC
generates the following log:
Jan 11 19:07:50 node01 osafdtmd[1861]: ER recv() from node 0x2030f failed,
errno=110
Jan 11 19:07:50 node01 osafdtmd[1861]: NO Lost contact with 'node02'
Jan 11 19:07:50 node01 osafimmd[2179]: NO MDS event from svc_id 25
(change:4, dest:566312912815899)
Jan 11 19:07:50 node01 osafamfd[2334]: NO Node 'PL-3' left the cluster


The errno=110 logged above is a TCP socket error that maps to ETIMEDOUT
(connection timed out).
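For reference, the errno-to-name mapping can be checked directly; on Linux, errno 110 is ETIMEDOUT, matching the recv() failure in the osafdtmd log:

```python
import errno
import os

# On Linux, errno 110 is ETIMEDOUT ("Connection timed out"),
# the same value reported by osafdtmd's failed recv().
print(errno.errorcode[110])   # ETIMEDOUT
print(os.strerror(110))       # Connection timed out
```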
The hydra feature is enabled and the corresponding logs on the PL node show
that the connection with the SC is lost:
Jan 11 19:07:50 node02 osafdtmd[1738]: NO Lost contact with 'node01'
Jan 11 19:07:50 node02 osafamfnd[2107]: WA AMF director unexpectedly crashed
Jan 11 19:07:50 node02 osafamfnd[2107]: NO Checking
'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' for pending messages
Jan 11 19:07:50 node02 osafamfnd[2107]: NO Checking
'safSu=SU1,safSg=app1,safApp=app1' for pending messages
Jan 11 19:07:50 node02 osafimmnd[1819]: WA SC Absence IS allowed:900 IMMD
service is DOWN


Since IMMSV_SC_ABSENCE_ALLOWED=900 (seconds), 15 minutes later the PL node
initiates a node reboot:
Jan 11 19:22:50 node02 osafamfnd[2107]: ER AMF director absence timeout
Jan 11 19:22:50 node02 osafamfnd[2107]: Rebooting OpenSAF NodeId = 131855 EE
Name = , Reason: AMF director absence timeout, OwnNodeId = 131855,
SupervisionTime = 60
Jan 11 19:22:50 node02 opensaf_reboot: Rebooting local node; timeout=60
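The behaviour in the logs (director lost at 19:07:50, reboot at 19:22:50) can be modelled as a simple absence timer. This is an illustrative sketch of the observed escalation, not OpenSAF's actual implementation:

```python
class DirectorAbsenceTimer:
    """Model of the SC-absence handling seen in the logs: tolerate the
    AMF director being away for up to `allowed_secs`, then escalate to a
    reboot. Illustrative only -- not OpenSAF's implementation."""

    def __init__(self, allowed_secs=900):
        self.allowed_secs = allowed_secs   # cf. IMMSV_SC_ABSENCE_ALLOWED=900
        self.lost_at = None                # when contact with the SC was lost

    def director_lost(self, now):
        self.lost_at = now

    def director_back(self):
        self.lost_at = None

    def check(self, now):
        """Return 'reboot' once the absence exceeds the allowed window."""
        if self.lost_at is not None and now - self.lost_at >= self.allowed_secs:
            return "reboot"
        return "wait"

timer = DirectorAbsenceTimer(allowed_secs=900)
timer.director_lost(now=0)
print(timer.check(now=899))   # wait
print(timer.check(now=900))   # reboot -- 900 s = 15 min, 19:07:50 -> 19:22:50
```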

The node reboots, comes back up and all is fine.
As mentioned, why doesn't OpenSAF simply try to re-establish the TCP
connection?
Reconnection attempts could be counted, and if too many occur within a
certain time period, escalate to a node reboot.


Regards,
David


-----------------------------------------------------------------------------
Notice: This e-mail together with any attachments may contain information of
Ribbon Communications Inc. that is confidential and/or proprietary for the
sole use of the intended recipient.  Any review, disclosure, reliance or
distribution by others or forwarding without express permission is strictly
prohibited.  If you are not the intended recipient, please notify the sender
immediately and then delete all copies, including any attachments.
-----------------------------------------------------------------------------

_______________________________________________
Opensaf-users mailing list
Opensaf-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-users


