We have a similar scenario. One of our payload nodes rebooted, and it took anywhere from a few seconds to a few minutes for the other nodes to detect the node loss. Because the master controller took a few minutes to detect and react to the loss, this caused serious problems and many service units went bad. Is there any way to improve the detection time?
Thank you!

Shu Wang | Senior Analyst | +1(407)708-5117 or x3917 | www.NetCracker.com
Proven Partner to Communications Service Providers

-----Original Message-----
Message: 3
Date: Tue, 14 Apr 2015 09:58:51 +0000
From: Yao Cheng LIANG <[email protected]>
Subject: Re: [users] how long it takes to detect node sudden power loss
To: 'A V Mahesh' <[email protected]>, Mathivanan Naickan Palanivelu <[email protected]>
Cc: "[email protected]" <[email protected]>
Message-ID: <285F6C4AD3FBC04EBAE1D68203EA87F20B037F25@asdag1>
Content-Type: text/plain; charset="windows-1255"

Let me give more info about my setup:

1. I have two nodes, both running as controllers.
2. Besides the OpenSAF services, I have another service unit with three components in it.
3. These components use the Checkpoint service for data synchronization.

My dtmd.conf is as below:

DTM_INI_DIS_TIMEOUT_SECS=5
DTM_TCP_KEEPIDLE_TIME=2
DTM_TCP_KEEPALIVE_INTVL=1
DTM_TCP_KEEPALIVE_PROBES=2

I read the code and found that it uses TCP keepalive to detect failure of the peer node. However, keepalive packets are not sent until the link has been idle for some time. I think the issue is here. Suppose the "standby" node is sending something to the "active" node, and at that moment the "active" node is rebooted. The "standby" node will keep resending until it reaches the maximum number of retries. During this period the link is not idle, so the keepalive mechanism does not start working. This may cause the "standby" node to take a long time to detect the failure of the "active" node.

Thanks.

Ted

From: A V Mahesh [mailto:[email protected]]
Sent: Monday, April 13, 2015 10:06 PM
To: Yao Cheng LIANG; Mathivanan Naickan Palanivelu
Cc: [email protected]
Subject: Re: [users] how long it takes to detect node sudden power loss

Hi,

Un-comment the line below in /etc/opensaf/dtmd.conf to enable osafdtm tracing:

#args="--tracemask=0xffffffff" ------> args="--tracemask=0xffffffff"

Then run `export MDS_LOG_LEVEL=5` on both node consoles before `/etc/init.d/opensafd restart` to get debug MDS logs.

-AVM

On 4/13/2015 11:52 AM, Yao Cheng LIANG wrote:

Dear AVM,

Thanks. But I need to add args="--loglevel=info" to dtmd.conf so that /var/log/opensaf/osafdtm and /var/log/opensaf/mds.log can be seen, right?

Ted

From: A V Mahesh [mailto:[email protected]]
Sent: Monday, April 13, 2015 1:03 PM
To: Yao Cheng LIANG; Mathivanan Naickan Palanivelu
Cc: [email protected]<mailto:[email protected]>
Subject: Re: [users] how long it takes to detect node sudden power loss

Hi Ted,

On 4/10/2015 3:54 PM, Yao Cheng LIANG wrote:
> I rebooted the "standby" node 30 times, and found that twice it took 1~2 minutes for the "active" node to detect it.

Can you please share the following data from both nodes for a case where the "active" node took 1~2 minutes to detect the loss of the standby:

1) #/var/log/opensaf/osafdtm
2) #/var/log/opensaf/mds.log
3) #/var/log/messages (syslog)
4) #top (output at the time of detection)
5) /etc/opensaf/dtmd.conf

-AVM

On 4/10/2015 3:54 PM, Yao Cheng LIANG wrote:

I did some tests recently. I have two controllers; I reboot one and measure how long the second takes to detect the failure of its peer. I rebooted the "standby" node 30 times, and found that twice it took 1~2 minutes for the "active" node to detect it. Could anyone tell me the reason and the solution?

Thanks.

Ted

Sent from Windows Mail

From: Mathivanan Naickan Palanivelu<mailto:[email protected]>
Sent: Thursday, April 9, 2015 7:39 PM
To: Yao Cheng LIANG<mailto:[email protected]>
Cc: [email protected]<mailto:[email protected]>, 'A V Mahesh'<mailto:[email protected]>

I think that since these are TCP keepalive configuration values, the connection loss would be detected immediately in the case of an abrupt power shutdown or cable unplug.

Thanks,
Mathi.

----- [email protected]<mailto:[email protected]> wrote:
> Is there any approach to hasten this detection, because 4 seconds is
> too long for some use cases?
>
> Br,
>
> Ted
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Monday, March 30, 2015 12:29 PM
> To: [email protected]<mailto:[email protected]>
> Subject: Re: [users] how long it takes to detect node sudden power loss
>
> Hi,
>
> >>Does that mean it needs 2 + 2*1 = 4s before the peer can detect the
> node connection loss if I suddenly unplug power supply of one node?
>
> Yes, when the connection goes down (cable disconnected or power supply
> unplugged), the loss of the connection is detected in 4 seconds.
>
> -AVM
>
> On 3/29/2015 7:11 PM, Yao Cheng LIANG wrote:
> > Dear all,
> >
> > If using TCP, the underlying dtms uses TCP keepalive to detect
> > connection loss. If my dtmd.conf is as below:
> >
> > DTM_TCP_KEEPIDLE_TIME=2
> >
> > DTM_TCP_KEEPALIVE_INTVL=1
> >
> > DTM_TCP_KEEPALIVE_PROBES=2
> >
> > Does that mean it needs 2 + 2*1 = 4s before the peer can detect the
> > node connection loss if I suddenly unplug the power supply of one node?
> >
> > Thanks.
> >
> > Ted
> >
> > ------------------------------------------------------------------------------
> > Dive into the World of Parallel Programming The Go Parallel
> > Website, sponsored by Intel and developed in partnership with Slashdot
> > Media, is your hub for all things parallel software development, from
> > weekly thought leadership blogs to news, videos, case studies,
> > tutorials and more. Take a look and join the conversation now.
> > http://goparallel.sourceforge.net/
> > _______________________________________________
> > Opensaf-users mailing list
> > [email protected]<mailto:[email protected]>
> > https://lists.sourceforge.net/lists/listinfo/opensaf-users
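
The 2 + 2*1 = 4s arithmetic discussed in this thread can be sketched as below. This is only a rough illustration, not OpenSAF code: the helper names (`parse_conf`, `idle_detection_seconds`, `apply_keepalive`) are hypothetical, and the mapping of the DTM_* settings onto the Linux socket options TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT is an assumption based on their names and on Ted's reading of the code. The estimate holds only for an idle connection; as Ted points out, while unacknowledged data is still being resent, keepalive probing does not start, so detection can take much longer.

```python
import socket

# Values quoted in this thread's dtmd.conf.
DTMD_CONF = """\
DTM_INI_DIS_TIMEOUT_SECS=5
DTM_TCP_KEEPIDLE_TIME=2
DTM_TCP_KEEPALIVE_INTVL=1
DTM_TCP_KEEPALIVE_PROBES=2
"""


def parse_conf(text):
    """Parse KEY=VALUE lines into a dict of ints; skip blanks, '#' comments
    and non-numeric values. (Hypothetical helper, not part of OpenSAF.)"""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if value.isdigit():
            settings[key.strip()] = int(value)
    return settings


def idle_detection_seconds(settings):
    """Estimate for an IDLE connection only: idle time + interval * probes."""
    idle = settings["DTM_TCP_KEEPIDLE_TIME"]
    interval = settings["DTM_TCP_KEEPALIVE_INTVL"]
    probes = settings["DTM_TCP_KEEPALIVE_PROBES"]
    return idle + interval * probes


def apply_keepalive(sock, settings):
    """Set the assumed equivalent keepalive options on a TCP socket.
    The TCP_KEEP* constants used here are available on Linux."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE,
                    settings["DTM_TCP_KEEPIDLE_TIME"])
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL,
                    settings["DTM_TCP_KEEPALIVE_INTVL"])
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT,
                    settings["DTM_TCP_KEEPALIVE_PROBES"])


if __name__ == "__main__":
    conf = parse_conf(DTMD_CONF)
    print(idle_detection_seconds(conf))  # 2 + 1 * 2 = 4
```

With the values above the estimate is 4 seconds, matching the figure AVM confirmed for a cable pull or power loss on an idle link.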
