Hi William,

Even though there are considerable changes between opensaf 4.4.0 and 
opensaf 4.7.0 in the DTM sockets, their options, and the default buffer 
configuration, etc., a basic issue like `payloads and controllers 
periodically lose contact with the cluster` shouldn't happen even with 
opensaf 4.4.0. We have been using/testing virtual machine networks with 
both the Host-only and Bridged networking adapter options, and we haven't 
faced such an issue.

So can you please share your virtual machines' network adapter 
configuration?
Are you observing the same behavior with opensaf 4.7.0 as well?

-AVM

On 10/27/2015 4:26 AM, William R Elliott wrote:
> Hello All,
> We are currently using opensaf 4.4.0.  We have a cluster that is running on 
> redhat 6 on virtual machines.  For some unknown reason various payloads and 
> controllers periodically lose contact with the cluster.  The /var/log/messages 
> logs don't tell us anything that we can see except the node lost message.
>
> I've been looking at the dtmd code to see if I can get some idea about what 
> to start looking at to try to figure out what's going on.  One of the things 
> I've been researching is the TCP idle, interval, and probes settings in the 
> dtmd.conf file.  From what I can tell so far, the code indicates these values 
> are set in the DTM_INTERNODE_CB structure and are used to set attributes on 
> the socket by calling the setsockopt function.  So it seems to me dtmd is 
> relying on the TCP keep alive functionality to determine if a node is lost.  
> Currently it looks like the idle time is set to 2 seconds, the interval is 1 
> second, and the number of probes is 2.  Therefore, if the socket is idle for 
> 2 seconds, a keep alive probe is sent; if it is not acknowledged, the next 
> probe is sent after a one second interval, and if that is still not 
> acknowledged the lost node message is issued.
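
As a rough check of the timing described above, here is a minimal sketch. It assumes Linux keepalive semantics: the first probe goes out after the idle time, further probes follow one interval apart, and the connection is dropped once the configured number of probes go unanswered. The function name is hypothetical and is not part of dtmd.

```python
def keepalive_detection_seconds(idle, interval, probes):
    """Approximate time from the last traffic until the kernel drops the socket.

    Assumes Linux semantics: wait `idle` seconds, then send `probes` probes
    spaced `interval` seconds apart before giving up.
    """
    return idle + interval * probes


# Values quoted above from dtmd.conf: idle=2 s, interval=1 s, probes=2
print(keepalive_detection_seconds(2, 1, 2))  # prints 4
```

With these settings a peer that stays silent for roughly four seconds, for example a VM paused by the hypervisor, would be enough to trigger the lost node message.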
>
> Since this kind of lower level TCP functionality is new to me, I started 
> researching TCP keep alive and encountered the following statements 
> concerning relying on TCP keep alive functionality to tell if communication 
> has been lost:
>
> Do NOT try to use TCP Keepalive to detect TCP socket failure more quickly 
> than a few minutes. People who try to set it for 5 seconds (or for 
> milliseconds) invariably cause serious compatibility issues with other 
> products - and invariably fail to be satisfied.
> If you truly require detecting a TCP socket failure in 1 second or less, 
> which implies your TCP peers normally send data many times per second, then 
> use non-blocking sockets with the "socket.timeout" exception to detect
> when no data had been received in your required time-frame. And if you accept 
> that a TCP peer quiet for 1 second is bad, then close the socket manually and 
> attempt recovery directly. Do not use TCP Keepalive for such short-period 
> detection.
>
> Or the following link to a forum site that has several comments discouraging 
> relying on TCP keep alive to determine if a connection is alive:
> http://stackoverflow.com/questions/15230922/keepalive-time-cannot-reduce-below-one-minute-in-c
>
> This is the first time I have looked into dtmd and since I don't have the 
> history and experience, it's possible I have missed something or 
> misunderstood the code.  So here are my questions:
>
> 1)      Am I correct that dtmd relies on TCP keep alive to determine if a 
> connection is alive?
>
> 2)      Since I don't have access to, nor will I be given access to, this 
> environment I've mentioned, are there any utilities besides ping, 
> traceroute... that I can ask the users of this environment to run to help 
> determine what could be causing the periodic lost nodes?  I'm currently 
> looking at tcpdump, and writing a python script that uses the socket APIs to 
> connect to a particular port and use keep alive functionality.
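
For the python script mentioned above, a minimal sketch could look like the following. It assumes a Linux host, since `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` are not available on every platform. The function names are hypothetical, and the defaults only mirror the idle/interval/probes values quoted from dtmd.conf. This is not how dtmd itself is implemented.

```python
import socket


def connect_with_keepalive(host, port, idle=2, interval=1, probes=2, timeout=5.0):
    """Connect to host:port and enable TCP keepalive with the given settings."""
    sock = socket.create_connection((host, port), timeout=timeout)
    # Switch back to blocking mode so that keepalive, not a read timeout,
    # is what ends a blocked recv().
    sock.settimeout(None)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)       # seconds idle before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)      # unanswered probes before drop
    return sock


def wait_for_failure(sock):
    """Block until the peer closes or the kernel gives up on keepalive probes."""
    try:
        while True:
            data = sock.recv(4096)
            if not data:
                return "peer closed the connection"
    except OSError as exc:
        # A keepalive failure typically surfaces here as a timed-out connection.
        return "connection lost: %s" % exc
```

The users could run `wait_for_failure(connect_with_keepalive(host, port))` between two of the affected VMs against any TCP listener they control (host and port are placeholders) and log the time at which it returns. Running tcpdump on the same port alongside it should show whether the keepalive probes went unanswered.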
>
> Any suggestions would be greatly appreciated.
>
>
> thanks
>
>
>
>
> ________________________________
> The information transmitted herein is intended only for the person or entity 
> to which it is addressed and may contain confidential, proprietary and/or 
> privileged material. Any review, retransmission, dissemination or other use 
> of, or taking of any action in reliance upon, this information by persons or 
> entities other than the intended recipient is prohibited. If you received 
> this in error, please contact the sender and delete the material from any 
> computer.
> ------------------------------------------------------------------------------
> _______________________________________________
> Opensaf-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/opensaf-users

