[users] loss of TCP connection requires node reboot

Hoyt, David Thu, 21 Jan 2021 11:52:50 -0800

Hi all,

Is there an option to have OpenSAF try to re-establish a TCP connection if the 
TCP socket gets closed for some reason?
And is there an option to restart only OpenSAF instead of rebooting the node?


I'm investigating an issue where the TCP socket is closed after a TCP timeout.
The root cause may be high CPU usage, but the question came up: why reboot the 
node when all that was required was to re-establish the TCP connection?


Setup: two nodes, each running RHEL 7 and OpenSAF 5.19.10, where node01 is the 
system controller (SC) and node02 is a payload (PL) node.

Occasionally the PL node reboots, and we found that before this happens the SC 
generates the following log:
Jan 11 19:07:50 node01 osafdtmd[1861]: ER recv() from node 0x2030f failed, errno=110
Jan 11 19:07:50 node01 osafdtmd[1861]: NO Lost contact with 'node02'
Jan 11 19:07:50 node01 osafimmd[2179]: NO MDS event from svc_id 25 (change:4, dest:566312912815899)
Jan 11 19:07:50 node01 osafamfd[2334]: NO Node 'PL-3' left the cluster


The errno=110 logged above is a TCP socket error; on Linux it is ETIMEDOUT 
(connection timed out).
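
A quick way to confirm the errno mapping is a minimal Python sketch like the one below. It is not OpenSAF code; `describe_errno` is a hypothetical helper name, and the numeric value 110 is specific to Linux, so the sketch looks the code up by its symbolic constant:

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, human-readable message) for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # On Linux, errno.ETIMEDOUT is 110; other platforms use other numbers.
    print(errno.ETIMEDOUT, describe_errno(errno.ETIMEDOUT))
```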
The hydra feature is enabled and the corresponding logs on the PL node show 
that the connection with the SC is lost:
Jan 11 19:07:50 node02 osafdtmd[1738]: NO Lost contact with 'node01'
Jan 11 19:07:50 node02 osafamfnd[2107]: WA AMF director unexpectedly crashed
Jan 11 19:07:50 node02 osafamfnd[2107]: NO Checking 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' for pending messages
Jan 11 19:07:50 node02 osafamfnd[2107]: NO Checking 'safSu=SU1,safSg=app1,safApp=app1' for pending messages
Jan 11 19:07:50 node02 osafimmnd[1819]: WA SC Absence IS allowed:900 IMMD service is DOWN


Since IMMSV_SC_ABSENCE_ALLOWED=900, the PL node initiates a node reboot 900 
seconds (15 minutes) later:
Jan 11 19:22:50 node02 osafamfnd[2107]: ER AMF director absence timeout
Jan 11 19:22:50 node02 osafamfnd[2107]: Rebooting OpenSAF NodeId = 131855 EE Name = , Reason: AMF director absence timeout, OwnNodeId = 131855, SupervisionTime = 60
Jan 11 19:22:50 node02 opensaf_reboot: Rebooting local node; timeout=60
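
The timing in the logs can be checked with a small sketch. It only does the arithmetic on the log timestamps; `expected_reboot_time` is a hypothetical helper, not anything provided by OpenSAF:

```python
from datetime import datetime, timedelta

# Value of IMMSV_SC_ABSENCE_ALLOWED on these nodes, in seconds.
SC_ABSENCE_ALLOWED_S = 900


def expected_reboot_time(lost_contact, allowed_s=SC_ABSENCE_ALLOWED_S):
    """Time at which the absence timeout expires after contact is lost."""
    return lost_contact + timedelta(seconds=allowed_s)


# Time of the "Lost contact with 'node01'" log on node02.
lost = datetime.strptime("19:07:50", "%H:%M:%S")
print(expected_reboot_time(lost).strftime("%H:%M:%S"))  # 19:22:50
```

The result matches the timestamp of the "AMF director absence timeout" log.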

The node reboots, comes back up, and all is fine.
As mentioned, why doesn't OpenSAF simply try to re-establish the TCP connection?
Reconnects could be counted, and if too many are needed within a certain time 
period, OpenSAF could then escalate to a node reboot.
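
To make the idea concrete, here is a rough sketch of the escalation logic I have in mind. This is purely illustrative and is not how OpenSAF behaves today; the class name, method name, and thresholds are all hypothetical:

```python
from collections import deque


class ReconnectPolicy:
    """Count connection losses in a sliding window; escalate past a limit."""

    def __init__(self, max_reconnects=3, window_s=600.0):
        self.max_reconnects = max_reconnects  # reconnects tolerated per window
        self.window_s = window_s              # window length in seconds
        self._events = deque()                # timestamps of recent losses

    def on_connection_lost(self, now):
        """Record a loss at time `now` (seconds); return the action to take."""
        # Forget losses that have fallen out of the window.
        while self._events and now - self._events[0] > self.window_s:
            self._events.popleft()
        self._events.append(now)
        if len(self._events) > self.max_reconnects:
            return "reboot"
        return "reconnect"


if __name__ == "__main__":
    policy = ReconnectPolicy(max_reconnects=2, window_s=60.0)
    for t in (0.0, 10.0, 20.0, 100.0):
        print(t, policy.on_connection_lost(t))
```

With these example values, the first two losses lead to a reconnect, the third loss within 60 seconds escalates to a reboot, and a loss at t=100 is treated as a fresh reconnect because the earlier ones have aged out.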


Regards,
David


