Hi,

You could try the fix in this ticket and see if your scenario is the same:
https://sourceforge.net/p/opensaf/tickets/2014/

The patch is in:
https://sourceforge.net/p/opensaf/staging/ci/b30d5e33e50c7eea8cc1730cbe0a0dde572621f0/

Thanks,
Mathi.


> -----Original Message-----
> From: Shu Wang [mailto:[email protected]]
> Sent: Saturday, June 20, 2015 1:50 AM
> To: [email protected]
> Subject: Re: [users] how long it takes to detect node sudden power loss
> 
> We have a similar scenario. One of our payload nodes rebooted, and it took
> anywhere from a few seconds to a few minutes for the other nodes to detect
> the node loss. Because the master controller took a few minutes to detect
> the loss and react to it, this caused serious problems and many service
> units went bad. Is there any way to improve the detection time?
> 
> Thank you!
> 
> Shu Wang | Senior Analyst | +1(407)708-5117 or x3917|
> www.NetCracker.com Proven Partner to Communications Service Providers
> 
> -----Original Message-----
> Message: 3
> Date: Tue, 14 Apr 2015 09:58:51 +0000
> From: Yao Cheng LIANG <[email protected]>
> Subject: Re: [users] how long it takes to detect node sudden power
>         loss
> To: 'A V Mahesh' <[email protected]>, Mathivanan Naickan
>         Palanivelu      <[email protected]>
> Cc: "[email protected]"
>         <[email protected]>
> Message-ID: <285F6C4AD3FBC04EBAE1D68203EA87F20B037F25@asdag1>
> Content-Type: text/plain; charset="windows-1255"
> 
> Let me give more info about my setup:
> 
> 1. I have two nodes, both running as controllers.
> 2. Besides the OpenSAF services, I have another service unit with three
>    components in it.
> 3. These components use the Checkpoint service for data synchronization.
> 
> 
> My dtmd.conf is as below:
> 
> ...
> DTM_INI_DIS_TIMEOUT_SECS=5
> DTM_TCP_KEEPIDLE_TIME=2
> DTM_TCP_KEEPALIVE_INTVL=1
> DTM_TCP_KEEPALIVE_PROBES=2
> 
> I read the code and found that it uses TCP keepalive to detect failure of the
> peer node. However, keepalive packets are not sent until the link has been
> idle for some time, and I think this is where the issue lies. Suppose the
> "standby" node is sending something to the "active" node and the "active"
> node is rebooted at that moment: the "standby" node will keep retransmitting
> until it reaches the maximum number of retries. During this period the link
> is not idle, so the keepalive mechanism does not start to work. This may
> cause the "standby" node to take a long time to detect failure of the
> "active" node.
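
A minimal Python sketch of the socket options involved in the scenario Ted describes, for Linux only. The keepalive options mirror the dtmd.conf values quoted above. TCP_USER_TIMEOUT is a standard Linux socket option that limits how long transmitted data may remain unacknowledged, so it also covers the case where the link is not idle. Whether the patch linked in this thread relies on that option is an assumption on my part and is not confirmed here; the helper name and its defaults are hypothetical and are not OpenSAF code.

```python
import socket


def configure_failure_detection(sock, idle_s=2, interval_s=1, probes=2,
                                user_timeout_ms=None):
    """Enable keepalive plus a cap on unacknowledged data (Linux only).

    Hypothetical helper for illustration; it is not taken from OpenSAF.
    """
    # Keepalive probes only start once the connection has been idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

    # Assumption: align the unacknowledged-data limit with the keepalive budget.
    if user_timeout_ms is None:
        user_timeout_ms = (idle_s + probes * interval_s) * 1000

    # TCP_USER_TIMEOUT is given in milliseconds. The kernel closes the
    # connection when sent data stays unacknowledged for this long, even
    # while retransmissions are still in progress.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT,
                    user_timeout_ms)
    return user_timeout_ms
```

Called on a freshly created TCP socket with the defaults above, the helper applies a 4000 ms limit; the right value for a real cluster depends on how much transient packet loss the link should tolerate.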
> 
> Thanks.
> 
> Ted
> 
> From: A V Mahesh [mailto:[email protected]]
> Sent: Monday, April 13, 2015 10:06 PM
> To: Yao Cheng LIANG; Mathivanan Naickan Palanivelu
> Cc: [email protected]
> Subject: Re: [users] how long it takes to detect node sudden power loss
> 
> Hi,
> 
> Uncomment the line below in /etc/opensaf/dtmd.conf to enable tracing of
> osafdtm:
> 
> #args="--tracemask=0xffffffff"   ------>  args="--tracemask=0xffffffff"
> 
> Then run `export MDS_LOG_LEVEL=5` on both node consoles before
> `/etc/init.d/opensafd restart` to get debug MDS logs.
> 
> 
> -AVM
> 
> On 4/13/2015 11:52 AM, Yao Cheng LIANG wrote:
> Dear AVM,
> 
> Thanks. But I need to add 'args="--loglevel=info"' to dtmd.conf so that
> /var/log/opensaf/osafdtm and /var/log/opensaf/mds.log can be seen, right?
> 
> Ted
> 
> From: A V Mahesh [mailto:[email protected]]
> Sent: Monday, April 13, 2015 1:03 PM
> To: Yao Cheng LIANG; Mathivanan Naickan Palanivelu
> Cc: [email protected]<mailto:opensaf-
> [email protected]>
> Subject: Re: [users] how long it takes to detect node sudden power loss
> 
> Hi Ted,
> 
> On 4/10/2015 3:54 PM, Yao Cheng LIANG wrote:
> I rebooted the "standby" node 30 times and found that twice it took 1~2
> minutes for the "active" node to detect it
> 
> Can you please share the following data from both nodes for a case where the
> "active" node took 1~2 minutes to detect the loss of the standby?
> 
> 1) #/var/log/opensaf/osafdtm
> 2) #/var/log/opensaf/mds.log
> 3) #/var/log/messages ( syslog )
> 
> 4) #top    (output at the time of detection)
> 5) /etc/opensaf/dtmd.conf
> 
> -AVM
> 
> On 4/10/2015 3:54 PM, Yao Cheng LIANG wrote:
> I did some tests recently. I have two controllers; I reboot one and measure
> how long the second takes to detect the failure of its peer. I rebooted the
> "standby" node 30 times and found that twice it took 1~2 minutes for the
> "active" node to detect it. Could anyone tell me the reason and the
> solution?
> 
> Thanks.
> 
> Ted
> 
> From: Mathivanan Naickan Palanivelu<mailto:[email protected]>
> Sent: Thursday, April 9, 2015 7:39 PM
> To: Yao Cheng LIANG<mailto:[email protected]>
> Cc: [email protected]<mailto:opensaf-
> [email protected]>, 'A V
> Mahesh'<mailto:[email protected]>
> 
> I think since these are TCP keepalive configuration values, the connection
> loss would be detected immediately in the case of an abrupt power shutdown
> or cable unplug.
> 
> Thanks,
> Mathi.
> 
> ----- [email protected]<mailto:[email protected]> wrote:
> 
> > Is there any approach to hasten this detection, because 4 seconds is
> > too long for some use cases?
> >
> > Br,
> >
> > Ted
> >
> > -----Original Message-----
> > From: A V Mahesh [mailto:[email protected]]
> > Sent: Monday, March 30, 2015 12:29 PM
> > To:
> > [email protected]<mailto:[email protected]
> > orge.net>
> > Subject: Re: [users] how long it takes to detect node sudden power
> > loss
> >
> > Hi,
> >
> >  >>Does that mean it needs 2 + 2*1 = 4s before the peer can detect the
> > node connection loss if I suddenly unplug power supply of one node?
> >
> > Yes. When the connection goes down (cable disconnected or power supply
> > unplugged), the peer detects within 4 seconds that the connection has
> > been lost.
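
A small Python sketch of the arithmetic behind the 4 second figure, with a rough contrast for the case where data is in flight when the peer disappears. The first function follows the keepalive settings quoted in this thread. The second assumes the retransmission timeout starts at 200 ms, doubles after each attempt and is capped at 120 s, with 15 retries; those are the Linux defaults as far as I know, and the real figure depends on the measured round-trip time, so treat the result as a hypothetical lower bound. Both function names are illustrative.

```python
def keepalive_detection_seconds(idle_s, interval_s, probes):
    """Worst-case time to declare an idle peer dead via TCP keepalive."""
    # The first probe is sent after idle_s of silence; each unanswered
    # probe is then given interval_s before the next step.
    return idle_s + probes * interval_s


def retransmission_give_up_seconds(retries=15, rto_min_s=0.2, rto_max_s=120.0):
    """Hypothetical lower bound before TCP gives up on unacknowledged data.

    Assumes exponential backoff from rto_min_s, capped at rto_max_s.
    """
    return sum(min(rto_min_s * 2 ** k, rto_max_s) for k in range(retries + 1))


# DTM_TCP_KEEPIDLE_TIME=2, DTM_TCP_KEEPALIVE_INTVL=1, DTM_TCP_KEEPALIVE_PROBES=2
print(keepalive_detection_seconds(2, 1, 2))        # 4
print(round(retransmission_give_up_seconds(), 1))  # 924.6
```

The gap between the two numbers illustrates why detection can take much longer than 4 seconds when the link is not idle at the moment the peer goes away.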
> >
> >   -AVM
> >
> > On 3/29/2015 7:11 PM, Yao Cheng LIANG wrote:
> > > Dear all,
> > >
> > > When using TCP, the underlying dtm uses TCP keepalive to detect
> > > connection loss. If my dtmd.conf is as below:
> > >
> > > DTM_TCP_KEEPIDLE_TIME=2
> > >
> > > DTM_TCP_KEEPALIVE_INTVL=1
> > >
> > > DTM_TCP_KEEPALIVE_PROBES=2
> > >
> > > Does that mean it needs 2 + 2*1 = 4s before the peer can detect the
> > node connection loss if I suddenly unplug power supply of one node?
> > >
> > > Thanks.
> > >
> > > Ted
> > >
> > >
> > ----------------------------------------------------------------------
> > > -------- Dive into the World of Parallel Programming The Go Parallel
> >
> > > Website, sponsored by Intel and developed in partnership with
> > Slashdot
> > > Media, is your hub for all things parallel software development,
> > from
> > > weekly thought leadership blogs to news, videos, case studies,
> > > tutorials and more. Take a look and join the conversation now.
> > > http://goparallel.sourceforge.net/
> > > _______________________________________________
> > > Opensaf-users mailing list
> > > [email protected]<mailto:[email protected]
> > > eforge.net>
> > > https://lists.sourceforge.net/lists/listinfo/opensaf-users
> 

