Yes, we experimented with the tcp_retries2 option, but the solution
we ended up with was to use the TCP_USER_TIMEOUT socket option.
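
For reference, here is a minimal sketch of setting this option on a
socket. It assumes Linux (TCP_USER_TIMEOUT is Linux-specific) and
Python's standard socket module. The helper name set_tcp_user_timeout
and the 4000 ms value are illustrative assumptions, not taken from the
OpenSAF patch. The option bounds how long transmitted data may remain
unacknowledged before the kernel closes the connection, so it also
covers the case where the link is busy retransmitting and keepalive
never starts.

```python
import socket


def set_tcp_user_timeout(sock: socket.socket, timeout_ms: int) -> int:
    """Set TCP_USER_TIMEOUT (in milliseconds) and return the value read back."""
    # The constant is only exposed on platforms that support the option.
    if not hasattr(socket, "TCP_USER_TIMEOUT"):
        raise OSError("TCP_USER_TIMEOUT is not available on this platform")
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, timeout_ms)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT)


if __name__ == "__main__":
    # Hypothetical value: 4000 ms, chosen to match the 4 s budget discussed
    # in this thread. It is an assumption, not an OpenSAF default.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(set_tcp_user_timeout(s, 4000))
```

Unlike the net.ipv4.tcp_retries2 sysctl, which is system-wide, this
option applies only to the socket it is set on.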

regards,

Anders Widell


On 09/15/2016 03:13 PM, Nivrutti Kale wrote:
> Hi,
>
> There is one way to improve the detection time: change the
> "net.ipv4.tcp_retries2" value to 3.
> The default value of "net.ipv4.tcp_retries2" is 15.
>
> Thanks,
> Nivrutti
>
> -----Original Message-----
> From: Mathivanan Naickan Palanivelu [mailto:[email protected]]
> Sent: Thursday, September 15, 2016 6:38 PM
> To: Shu Wang <[email protected]>; [email protected]
> Subject: Re: [users] how long it takes to detect node sudden power
>
> Hi,
>
> You could try the fix in this ticket
> https://sourceforge.net/p/opensaf/tickets/2014/
> and see if the scenario is the same. The patch is in
> https://sourceforge.net/p/opensaf/staging/ci/b30d5e33e50c7eea8cc1730cbe0a0dde572621f0/
>
> Thanks,
> Mathi.
>
>
>> -----Original Message-----
>> From: Shu Wang [mailto:[email protected]]
>> Sent: Saturday, June 20, 2015 1:50 AM
>> To: [email protected]
>> Subject: Re: [users] how long it takes to detect node sudden power
>>
>> We have a similar scenario. One of our payload nodes rebooted, and it
>> took from a few seconds to a few minutes for the other nodes to detect
>> the node loss. Because the master controller took a few minutes to
>> detect the node loss and react to it, this caused serious problems and
>> many service units went bad. Is there any way to improve the detection
>> time?
>>
>> Thank you!
>>
>> Shu Wang | Senior Analyst | +1(407)708-5117 or x3917|
>> www.NetCracker.com Proven Partner to Communications Service Providers
>>
>> -----Original Message-----
>> Message: 3
>> Date: Tue, 14 Apr 2015 09:58:51 +0000
>> From: Yao Cheng LIANG <[email protected]>
>> Subject: Re: [users] how long it takes to detect node sudden power
>>          loss
>> To: 'A V Mahesh' <[email protected]>, Mathivanan Naickan
>>          Palanivelu      <[email protected]>
>> Cc: "[email protected]"
>>          <[email protected]>
>> Message-ID: <285F6C4AD3FBC04EBAE1D68203EA87F20B037F25@asdag1>
>> Content-Type: text/plain; charset="windows-1255"
>>
>> Let me give more info about my setup:
>>
>>
>> 1. I have two nodes, both running as controllers.
>>
>> 2. Besides the OpenSAF services, I have another service unit with
>>    three components in it.
>>
>> 3. These components use the Checkpoint service for data
>>    synchronization.
>>
>>
>>
>> My dtmd.conf is as below:
>>
>> DTM_INI_DIS_TIMEOUT_SECS=5
>> DTM_TCP_KEEPIDLE_TIME=2
>> DTM_TCP_KEEPALIVE_INTVL=1
>> DTM_TCP_KEEPALIVE_PROBES=2
>>
>> I read the code and found that it uses TCP keepalive to detect failure
>> of the peer node. However, keepalive probes are not sent until the
>> connection has been idle for some time. I think the issue is here.
>> Suppose the "standby" node is sending something to the "active" node
>> at the moment the "active" node is rebooted. The "standby" node will
>> keep retransmitting until it reaches the maximum number of retries.
>> During this period the link is not idle, so the keepalive mechanism
>> does not start. This may cause the "standby" node to take a long time
>> to detect the failure of the "active" node.
>>
>> Thanks.
>>
>>
>>
>> Ted
>>
>>
>>
>>
>>
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Monday, April 13, 2015 10:06 PM
>> To: Yao Cheng LIANG; Mathivanan Naickan Palanivelu
>> Cc: [email protected]
>> Subject: Re: [users] how long it takes to detect node sudden power
>> loss
>>
>> Hi,
>>
>> Un-comment the below line to enable trace of osafdtm in
>> /etc/opensaf/dtmd.conf
>>
>> #args="--tracemask=0xffffffff"   ------>  args="--tracemask=0xffffffff"
>>
>> And do `export MDS_LOG_LEVEL=5` on both node consoles before
>> `/etc/init.d/opensafd restart` to get debug MDS logs.
>>
>>
>> -AVM
>>
>> On 4/13/2015 11:52 AM, Yao Cheng LIANG wrote:
>> Dear AVM,
>>
>> Thanks. But I need to add args="--loglevel=info" to dtmd.conf so
>> that /var/log/opensaf/osafdtm and /var/log/opensaf/mds.log can be seen,
>> right?
>>
>> Ted
>>
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Monday, April 13, 2015 1:03 PM
>> To: Yao Cheng LIANG; Mathivanan Naickan Palanivelu
>> Cc: [email protected]
>> Subject: Re: [users] how long it takes to detect node sudden power
>> loss
>>
>> Hi Ted,
>>
>> On 4/10/2015 3:54 PM, Yao Cheng LIANG wrote:
>> I rebooted the "standby" node 30 times, and twice it took 1~2 minutes
>> for the "active" node to detect it
>>
>> Can you please share the following data from both nodes for a case
>> where the "active" node took 1~2 minutes to detect the loss of the
>> standby?
>>
>> 1) #/var/log/opensaf/osafdtm
>> 2) #/var/log/opensaf/mds.log
>> 3) #/var/log/messages ( syslog )
>>
>> 4) #top    (output at the time of detection)
>> 5) /etc/opensaf/dtmd.conf
>>
>> -AVM
>>
>> On 4/10/2015 3:54 PM, Yao Cheng LIANG wrote:
>> I did some tests recently. I have two controllers, and I reboot one
>> to see how long the other takes to detect the failure of its peer. I
>> rebooted the "standby" node 30 times, and twice it took 1~2 minutes
>> for the "active" node to detect it. Could anyone tell me the reason
>> and the solution?
>>
>> Thanks.
>>
>> Ted
>>
>> Sent from Windows Mail
>>
>> From: Mathivanan Naickan Palanivelu <[email protected]>
>> Sent: Thursday, April 9, 2015 7:39 PM
>> To: Yao Cheng LIANG <[email protected]>
>> Cc: [email protected], 'A V Mahesh' <[email protected]>
>>
>> I think that, since these are TCP keepalive configuration values, the
>> connection loss would be detected immediately in the case of an abrupt
>> power shutdown or cable unplug.
>>
>> Thanks,
>> Mathi.
>>
>> ----- [email protected] wrote:
>>
>>> Is there any approach to hasten this detection, because 4 seconds is
>>> too long for some use cases?
>>>
>>> Br,
>>>
>>> Ted
>>>
>>> -----Original Message-----
>>> From: A V Mahesh [mailto:[email protected]]
>>> Sent: Monday, March 30, 2015 12:29 PM
>>> To: [email protected]
>>> Subject: Re: [users] how long it takes to detect node sudden power
>>> loss
>>>
>>> Hi,
>>>
>>>   >>Does that mean it needs 2 + 2*1 = 4s before the peer can detect
>>>   >>the node connection loss if I suddenly unplug power supply of one node?
>>>
>>> Yes. When the connection goes down (cable disconnected or power supply
>>> unplugged), the loss of the connection is detected in 4 seconds.
>>>
>>>    -AVM
>>>
>>> On 3/29/2015 7:11 PM, Yao Cheng LIANG wrote:
>>>> Dear all,
>>>>
>>>> When using TCP, the underlying DTM uses TCP keepalive to detect
>>>> connection loss. If my dtmd.conf is as below:
>>>>
>>>> DTM_TCP_KEEPIDLE_TIME=2
>>>> DTM_TCP_KEEPALIVE_INTVL=1
>>>> DTM_TCP_KEEPALIVE_PROBES=2
>>>>
>>>> Does that mean it needs 2 + 2*1 = 4s before the peer can detect the
>>>> node connection loss if I suddenly unplug the power supply of one node?
>>>>
>>>> Thanks.
>>>>
>>>> Ted
>>>>
>>>>
>>>> _______________________________________________
>>>> Opensaf-users mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users


------------------------------------------------------------------------------
_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
