Hello All,
We are currently using OpenSAF 4.4.0.  We have a cluster running on 
Red Hat 6 virtual machines.  For some unknown reason, various payloads and 
controllers periodically lose contact with the cluster.  The /var/log/messages 
logs don't tell us anything useful that we can see, apart from the node lost message.

I've been looking at the dtmd code to see if I can get some idea about what to 
start looking at to try to figure out what's going on.  One of the things I've 
been researching is the TCP idle, interval, and probes settings in the 
dtmd.conf file.  From what I can tell so far, the code indicates these values 
are set in the DTM_INTERNODE_CB structure and are used to set attributes on the 
socket by calling the setsockopt function.  So it seems to me dtmd is relying 
on the TCP keep alive functionality to determine if a node is lost.  Currently 
it looks like the idle time is set to 2 seconds, the interval is 1 second, and 
the number of probes is 2.  Therefore, if the socket is idle for 2 seconds, a 
keep alive probe is sent.  If it is not acknowledged, a second probe is sent 
after the 1 second interval, and if that is also not acknowledged, the lost 
node message is issued.  If I have this right, a silent node would be declared 
lost roughly 4 seconds after the last traffic on the socket.
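
To make the settings above concrete, here is a minimal Python sketch of how 
idle, interval, and probes values are typically applied to a socket with 
setsockopt on Linux.  This is not the dtmd code; the function names 
enable_keepalive and detection_time are my own, and the timing estimate 
assumes the usual Linux behaviour, where the connection is dropped one 
interval after the last unanswered probe.

```python
import socket


def enable_keepalive(sock, idle=2, interval=1, probes=2):
    """Turn on TCP keep-alive and apply the three Linux tuning options.

    The defaults mirror the values described above (idle 2 s, interval 1 s,
    2 probes); they are illustrative, not read from dtmd.conf.
    """
    # Enable keep-alive on the socket itself.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idle time before the first probe is sent (Linux-specific).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between successive unanswered probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Number of unanswered probes before the connection is dropped.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


def detection_time(idle=2, interval=1, probes=2):
    """Approximate seconds until a silent peer is declared dead (assumed Linux timing)."""
    return idle + interval * probes
```

Calling enable_keepalive on a freshly created TCP socket and reading the 
values back with getsockopt is a quick way to confirm what a given host 
accepts.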

Since this kind of lower level TCP functionality is new to me, I started 
researching TCP keep alive and encountered the following statements concerning 
relying on TCP keep alive functionality to tell if communication has been lost:

Do NOT try to use TCP Keepalive to detect TCP socket failure more quickly than 
a few minutes. People who try to set it for 5 seconds (or for milliseconds) 
invariably cause serious compatibility issues with other products - and 
invariably fail to be satisfied.
If you truly require detecting a TCP socket failure in 1 second or less, which 
implies your TCP peers normally send data many times per second, then use 
non-blocking sockets with the "socket.timeout" exception to detect
when no data had been received in your required time-frame. And if you accept 
that a TCP peer quiet for 1 second is bad, then close the socket manually and 
attempt recovery directly. Do not use TCP Keepalive for such short-period 
detection.

There is also the following forum thread, which has several comments 
discouraging reliance on TCP keep alive to determine whether a connection is alive:
http://stackoverflow.com/questions/15230922/keepalive-time-cannot-reduce-below-one-minute-in-c

This is the first time I have looked into dtmd, and since I don't have the 
history and experience, it's possible I have missed something or 
misunderstood the code.  So here are my questions:

1)      Am I correct that dtmd relies on TCP keep alive to determine if a 
connection is alive?

2)      Since I don't have access to the environment I've mentioned, nor will 
I be given access, are there any utilities besides ping, traceroute... 
that I can ask the users of this environment to run to help determine what 
could be causing the periodic lost nodes?  I'm currently looking at tcpdump, 
and at writing a Python script that uses the socket APIs to connect to a 
particular port and use the keep alive functionality.
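
The kind of script I have in mind would look roughly like the sketch below.  
It holds an otherwise idle TCP connection open with the same keep alive 
settings and reports how and when the connection ends.  The function name 
watch_connection, the 60 second default run time, and the choice of target 
host and port are all my own assumptions.  The target would need to be 
something that accepts and holds a connection, such as a simple test 
listener on the other node.

```python
import socket
import time


def watch_connection(host, port, idle=2, interval=1, probes=2, max_seconds=60):
    """Hold an idle TCP connection open and report how it ends.

    host and port are placeholders for a listener on the node under test.
    """
    start = time.time()
    with socket.create_connection((host, port), timeout=5) as sock:
        # Same keep-alive settings as described above (Linux option names).
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
        # Wake up once a second so the loop can stop after max_seconds.
        sock.settimeout(1.0)
        while time.time() - start < max_seconds:
            try:
                data = sock.recv(4096)
            except socket.timeout:
                # Nothing received; the kernel keeps probing in the background.
                continue
            except OSError as exc:
                # A timed-out error here usually means the probes went unanswered.
                return "error after %.1fs: %s" % (time.time() - start, exc)
            if not data:
                # An empty read means the peer closed the connection cleanly.
                return "peer closed after %.1fs" % (time.time() - start)
            # Any data that does arrive is discarded; only liveness matters here.
    return "still connected after %.1fs" % (time.time() - start)
```

Running this alongside tcpdump on the same port should show whether the 
keep alive probes are actually leaving the node and whether the replies are 
being delayed or dropped.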

Any suggestions would be greatly appreciated.


thanks




_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
