Hi,

You could try the fix in this ticket and see if the scenario is the same:
https://sourceforge.net/p/opensaf/tickets/2014/

The patch is in:
https://sourceforge.net/p/opensaf/staging/ci/b30d5e33e50c7eea8cc1730cbe0a0dde572621f0/

Thanks,
Mathi.

> -----Original Message-----
> From: Shu Wang [mailto:[email protected]]
> Sent: Saturday, June 20, 2015 1:50 AM
> To: [email protected]
> Subject: Re: [users] how long it takes to detect node sudden power
>
> We have a similar scenario. One of our payload nodes rebooted, and it took
> anywhere from a few seconds to a few minutes for the other nodes to detect
> the node loss. Because the master controller took a few minutes to detect
> the loss and react to it, this caused serious problems and many service
> units went bad. Is there any way to improve the detection time?
>
> Thank you!
>
> Shu Wang | Senior Analyst | +1(407)708-5117 or x3917|
> www.NetCracker.com Proven Partner to Communications Service Providers
>
> -----Original Message-----
> Message: 3
> Date: Tue, 14 Apr 2015 09:58:51 +0000
> From: Yao Cheng LIANG <[email protected]>
> Subject: Re: [users] how long it takes to detect node sudden power
>         loss
> To: 'A V Mahesh' <[email protected]>, Mathivanan Naickan
>         Palanivelu <[email protected]>
> Cc: "[email protected]"
>         <[email protected]>
> Message-ID: <285F6C4AD3FBC04EBAE1D68203EA87F20B037F25@asdag1>
> Content-Type: text/plain; charset="windows-1255"
>
> Let me give more info about my setup:
>
> 1. I have two nodes, both running as controllers.
> 2. Besides the OpenSAF services, I have another service unit with three
>    components in it.
> 3. These components use the Checkpoint service for data synchronization.
>
> My dtmd.conf is as below:
>
> DTM_INI_DIS_TIMEOUT_SECS=5
> DTM_TCP_KEEPIDLE_TIME=2
> DTM_TCP_KEEPALIVE_INTVL=1
> DTM_TCP_KEEPALIVE_PROBES=2
>
> I read the code and found that it uses TCP keepalive to detect failure of
> the peer node. However, keepalive packets are not sent until the link has
> been idle for some time. I think the issue is here. Suppose the "standby"
> node is sending something to the "active" node, and at that moment the
> "active" node is rebooted; the "standby"
> node will keep sending this until it reaches the maximum number of
> retries. During this period the link is not idle, so the keepalive
> mechanism does not start to work. This may cause the "standby" node to
> take a long time to detect the failure of the "active" node.
>
> Thanks.
>
> Ted
>
> From: A V Mahesh [mailto:[email protected]]
> Sent: Monday, April 13, 2015 10:06 PM
> To: Yao Cheng LIANG; Mathivanan Naickan Palanivelu
> Cc: [email protected]
> Subject: Re: [users] how long it takes to detect node sudden power loss
>
> Hi,
>
> Un-comment the line below in /etc/opensaf/dtmd.conf to enable tracing of
> osafdtm:
>
> #args="--tracemask=0xffffffff" ------> args="--tracemask=0xffffffff"
>
> Then run `export MDS_LOG_LEVEL=5` on both node consoles before
> `/etc/init.d/opensafd restart` to get debug MDS logs.
>
> -AVM
>
> On 4/13/2015 11:52 AM, Yao Cheng LIANG wrote:
> Dear AVM,
>
> Thanks. But I need to add args="--loglevel=info" to dtmd.conf so that
> /var/log/opensaf/osafdtm and /var/log/opensaf/mds.log can be seen, right?
>
> Ted
>
> From: A V Mahesh [mailto:[email protected]]
> Sent: Monday, April 13, 2015 1:03 PM
> To: Yao Cheng LIANG; Mathivanan Naickan Palanivelu
> Cc: [email protected]<mailto:opensaf-
> [email protected]>
> Subject: Re: [users] how long it takes to detect node sudden power loss
>
> Hi Ted,
>
> On 4/10/2015 3:54 PM, Yao Cheng LIANG wrote:
> I rebooted the "standby" node 30 times, and found that twice it took 1~2
> minutes for the "active" node to detect it
>
> Can you please share the following data from both nodes for the case where
> the "active" node took 1~2 minutes to detect the loss of the standby?
>
> 1) #/var/log/opensaf/osafdtm
> 2) #/var/log/opensaf/mds.log
> 3) #/var/log/messages ( syslog )
>
> 4) #top (output at the time of detection)
> 5) /etc/opensaf/dtmd.conf
>
> -AVM
>
> On 4/10/2015 3:54 PM, Yao Cheng LIANG wrote:
> I did some tests recently. I have two controllers, and I reboot one to see
> how long the second takes to detect the failure of its peer.
> I rebooted the "standby" node 30 times, and found that twice it took 1~2
> minutes for the "active" node to detect it. Could anyone tell me the
> reason and the solution?
>
> Thanks.
>
> Ted
>
> From: Mathivanan Naickan Palanivelu<mailto:[email protected]>
> Sent: Thursday, April 9, 2015 7:39 PM
> To: Yao Cheng LIANG<mailto:[email protected]>
> Cc: [email protected]<mailto:opensaf-
> [email protected]>, 'A V
> Mahesh'<mailto:[email protected]>
>
> I think since these are TCP keepalive configuration values, the connection
> loss would be detected immediately in the case of an abrupt power shutdown
> or cable unplug.
>
> Thanks,
> Mathi.
>
> ----- [email protected]<mailto:[email protected]> wrote:
>
> > Is there any approach to hasten this detection, because 4 seconds is
> > too long for some use cases?
> >
> > Br,
> >
> > Ted
> >
> > -----Original Message-----
> > From: A V Mahesh [mailto:[email protected]]
> > Sent: Monday, March 30, 2015 12:29 PM
> > To:
> > [email protected]<mailto:[email protected]
> > orge.net>
> > Subject: Re: [users] how long it takes to detect node sudden power
> > loss
> >
> > Hi,
> >
> > >>Does that mean it needs 2 + 2*1 = 4s before the peer can detect the
> > node connection loss if I suddenly unplug power supply of one node?
> > Yes, when the connection goes down (cable disconnected or power supply
> > unplugged), the loss of the connection is detected in 4 seconds.
> >
> > -AVM
> >
> > On 3/29/2015 7:11 PM, Yao Cheng LIANG wrote:
> > > Dear all,
> > >
> > > If using TCP, the underlying dtm uses TCP keepalive to detect
> > connection loss. If my dtmd.conf is as below:
> > >
> > > DTM_TCP_KEEPIDLE_TIME=2
> > >
> > > DTM_TCP_KEEPALIVE_INTVL=1
> > >
> > > DTM_TCP_KEEPALIVE_PROBES=2
> > >
> > > Does that mean it needs 2 + 2*1 = 4s before the peer can detect the
> > node connection loss if I suddenly unplug power supply of one node?
> > >
> > > Thanks.
> > >
> > > Ted
> > >
> > > ----------------------------------------------------------------------
> > > -------- Dive into the World of Parallel Programming The Go Parallel
> > > Website, sponsored by Intel and developed in partnership with Slashdot
> > > Media, is your hub for all things parallel software development, from
> > > weekly thought leadership blogs to news, videos, case studies,
> > > tutorials and more. Take a look and join the conversation now.
> > > http://goparallel.sourceforge.net/
> > > _______________________________________________
> > > Opensaf-users mailing list
> > > [email protected]<mailto:[email protected]
> > > eforge.net>
> > > https://lists.sourceforge.net/lists/listinfo/opensaf-users
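
A note on the arithmetic discussed in this thread: with the dtmd.conf values quoted above, the idle-link detection time works out to DTM_TCP_KEEPIDLE_TIME + DTM_TCP_KEEPALIVE_PROBES * DTM_TCP_KEEPALIVE_INTVL. The Python sketch below is only an illustration, not OpenSAF code. The function names are made up for this example, and it assumes the three settings map onto the standard Linux socket options TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT. It also shows TCP_USER_TIMEOUT, a Linux socket option that bounds how long sent data may remain unacknowledged, which is the situation Ted describes where the link is never idle. Whether the patch referenced in this thread uses that option is not something this sketch claims.

```python
import socket


def keepalive_detection_secs(idle_secs, interval_secs, probes):
    """Approximate time for TCP keepalive to declare a silent peer dead.

    The first probe is sent after idle_secs without traffic; the connection
    is dropped once `probes` probes, interval_secs apart, go unanswered.
    This only holds when the link is idle (no unacknowledged data pending).
    """
    return idle_secs + probes * interval_secs


def apply_keepalive(sock, idle_secs, interval_secs, probes, user_timeout_ms=None):
    """Hypothetical helper: set keepalive options on a TCP socket.

    The TCP_* constants are Linux names and are only set when the platform
    defines them. user_timeout_ms (an assumption of this sketch, not a
    dtmd.conf setting) caps how long unacknowledged data may be outstanding.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_secs)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_secs)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    if user_timeout_ms is not None and hasattr(socket, "TCP_USER_TIMEOUT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, user_timeout_ms)


if __name__ == "__main__":
    # Values from the dtmd.conf in this thread: idle=2, interval=1, probes=2.
    print(keepalive_detection_secs(2, 1, 2))  # prints 4
```

The 4-second figure applies only to an idle connection. When data is in flight to a dead peer, the kernel's retransmission timers govern detection instead, which is consistent with the occasional 1~2 minute delays reported in the thread.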
