[users] Question concerning opensaf and TCP keep alive

William R Elliott Mon, 26 Oct 2015 15:57:55 -0700

Hello All,
We are currently using opensaf 4.4.0.  We have a cluster that is running on 
redhat 6 on virtual machines.  For some unknown reason various payloads and 
controllers periodically lose contact with the cluster.  The /var/log/messages 
logs show nothing useful that we can see, apart from the node lost message.

I've been looking at the dtmd code to see if I can get some idea about what to
start looking at to try to figure out what's going on. One of the things I've
been researching is the TCP idle, interval, and probes settings in the
dtmd.conf file. From what I can tell so far, the code indicates these values
are set in the DTM_INTERNODE_CB structure and are used to set attributes on the
socket by calling the setsockopt function. So it seems to me dtmd is relying
on the TCP keep alive functionality to determine if a node is lost. Currently
it looks like the idle time is set to 2 seconds, the interval is 1 second, and
the number of probes is 2. As I understand it, if the socket is idle for 2
seconds, a keep alive probe is sent. If that probe is not acknowledged, a
second probe is sent after the 1 second interval. If the second probe is not
acknowledged either, the lost node message is issued.
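
To make the timing above concrete, here is a small sketch of the arithmetic.
It assumes the usual Linux keep alive behaviour (first probe after the idle
time, one probe per interval, connection dropped once the configured number of
probes has gone unanswered). The function name is made up for illustration and
is not taken from dtmd.

```python
def keepalive_timeline(idle_s, interval_s, probes):
    """Return (probe_send_times, declared_dead_at), in seconds since the
    last traffic on the socket. Assumes standard Linux keep alive timing."""
    # One probe at the end of the idle period, then one per interval.
    probe_times = [idle_s + i * interval_s for i in range(probes)]
    # The connection is given up one interval after the last unanswered probe.
    declared_dead_at = idle_s + probes * interval_s
    return probe_times, declared_dead_at


# Values I believe dtmd.conf is using: idle 2 s, interval 1 s, 2 probes.
probe_times, declared_dead_at = keepalive_timeline(idle_s=2, interval_s=1, probes=2)
print(probe_times, declared_dead_at)
```

If that assumption holds, a peer that stays unreachable for roughly 4 seconds
would be reported lost, which does not leave much margin for a virtual machine
that stalls briefly.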

Since this kind of lower level TCP functionality is new to me, I started
researching TCP keep alive and encountered the following statements concerning
relying on TCP keep alive functionality to tell if communication has been lost:

Do NOT try to use TCP Keepalive to detect TCP socket failure more quickly than
a few minutes. People who try to set it for 5 seconds (or for milliseconds)
invariably cause serious compatibility issues with other products - and
invariably fail to be satisfied.
If you truly require detecting a TCP socket failure in 1 second or less, which
implies your TCP peers normally send data many times per second, then use
non-blocking sockets with the "socket.timeout" exception to detect
when no data had been received in your required time-frame. And if you accept
that a TCP peer quiet for 1 second is bad, then close the socket manually and
attempt recovery directly. Do not use TCP Keepalive for such short-period
detection.

There is also the following forum thread, which has several comments
discouraging reliance on TCP keep alive to determine if a connection is alive:
http://stackoverflow.com/questions/15230922/keepalive-time-cannot-reduce-below-one-minute-in-c
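
For comparison, the application-level approach recommended in the quoted advice
could look roughly like the sketch below. This is only my illustration, not
anything from dtmd; the helper name is hypothetical. Note that in Python the
"socket.timeout" exception is raised by a socket that has a timeout set with
settimeout(), not by a strictly non-blocking one.

```python
import socket


def recv_with_deadline(sock, timeout_s, bufsize=4096):
    """Wait up to timeout_s seconds for data on a connected socket.

    Returns the received bytes, b"" if the peer closed the connection,
    or None if nothing arrived within the deadline.
    """
    sock.settimeout(timeout_s)
    try:
        return sock.recv(bufsize)
    except socket.timeout:
        # The peer was quiet for too long. The caller decides whether to
        # close the socket and start recovery.
        return None
```

A caller that expects a heartbeat every second would treat a None result as a
quiet peer, close the socket itself, and reconnect, rather than waiting for the
kernel's keep alive probes to fail.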

This is the first time I have looked into dtmd and since I don't have the
history and experience, it's possible I have missed something, or
misunderstood the code. So here are my questions:

1) Am I correct that dtmd relies on TCP keep alive to determine if a
connection is alive?

2) Since I don't have access to this environment, nor will I be given access
to it, are there any utilities besides ping, traceroute...
that I can ask the users of this environment to run to help determine what
could be causing the periodic lost nodes? I'm currently looking at tcpdump,
and writing a python script that uses the socket APIs to connect to a
particular port and use keep alive functionality.
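
The script I have in mind is roughly the sketch below. It is Linux only
(TCP_KEEPIDLE and friends are not available everywhere), the function name is
my own, and the host and port are placeholders for whatever listening service
the users can point it at. The default values mirror what I believe dtmd.conf
uses.

```python
import socket
import time


def watch_connection(host, port, idle_s=2, interval_s=1, probes=2):
    """Connect, enable TCP keep alive, and block until the connection ends.

    Returns (seconds_connected, reason).
    """
    sock = socket.create_connection((host, port), timeout=10)
    sock.settimeout(None)  # block in recv(); let keep alive break the wait
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    start = time.monotonic()
    try:
        while True:
            data = sock.recv(4096)
            if not data:
                reason = "closed by peer"
                break
    except OSError as exc:
        # A keep alive failure normally surfaces here as a timed-out error.
        reason = "error: %s" % exc
    finally:
        sock.close()
    return time.monotonic() - start, reason
```

Running this between two of the affected nodes while tcpdump captures the same
port should show whether the keep alive probes go unanswered at the moments the
connection is reported dead.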

Any suggestions would be greatly appreciated.

thanks
