Yes, we experimented with the tcp_retries2 option, but the solution we ended up with was to use the TCP_USER_TIMEOUT socket option.
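For anyone who wants to try this, here is a minimal sketch of setting that option on a Linux TCP socket. It is only an illustration, not the OpenSAF change itself: the helper name set_user_timeout and the 4000 ms value are assumptions made for the example, and the fallback constant 18 is the Linux option number, used in case the Python build does not export socket.TCP_USER_TIMEOUT.

```python
import socket

# Linux defines TCP_USER_TIMEOUT as option number 18. Fall back to that value
# when this Python build does not export the constant (assumption: Linux host).
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def set_user_timeout(sock, timeout_ms):
    """Hypothetical helper: limit how long sent data may stay unacknowledged.

    Once data has been unacknowledged for timeout_ms milliseconds, the kernel
    closes the connection instead of retrying up to tcp_retries2 times.
    Returns the value the kernel reports back for the option.
    """
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)
    return sock.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # 4000 ms is an example value, chosen to match the 4 s keepalive
        # budget discussed in the quoted thread.
        print("TCP_USER_TIMEOUT (ms):", set_user_timeout(s, 4000))
```

Unlike the net.ipv4.tcp_retries2 sysctl, which applies to every TCP connection on the host, this option is set per socket, so only the connections that need fast failure detection are affected.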
regards,
Anders Widell

On 09/15/2016 03:13 PM, Nivrutti Kale wrote:
> Hi,
>
> There is one way to improve the detection time. You can change the
> "net.ipv4.tcp_retries2" value to 3.
> The default value of "net.ipv4.tcp_retries2" is 15.
>
> Thanks,
> Nivrutti
>
> -----Original Message-----
> From: Mathivanan Naickan Palanivelu [mailto:[email protected]]
> Sent: Thursday, September 15, 2016 6:38 PM
> To: Shu Wang <[email protected]>; [email protected]
> Subject: Re: [users] how long it takes to detect node sudden power
>
> Hi,
>
> You could try the fix in this ticket
> https://urldefense.proofpoint.com/v2/url?u=https-3A__sourceforge.net_p_opensaf_tickets_2014_&d=DQICAg&c=IL_XqQWOjubgfqINi2jTzg&r=8oj2Tn7_JuMy90N67rXExkWsx29-JTWbXUkT3IIi99w&m=DetywC0rOBBSwA5PRfrcpfRXAyGliPduaCiI-fnO-gw&s=gSGrK2pteB9mnPgovHNo3qsOXF0w9s77wt4nUXOHt4o&e=
> and see if the scenario is the same. The patch is in
> https://urldefense.proofpoint.com/v2/url?u=https-3A__sourceforge.net_p_opensaf_staging_ci_b30d5e33e50c7eea8cc1730cbe0a0dde572621f0_&d=DQICAg&c=IL_XqQWOjubgfqINi2jTzg&r=8oj2Tn7_JuMy90N67rXExkWsx29-JTWbXUkT3IIi99w&m=DetywC0rOBBSwA5PRfrcpfRXAyGliPduaCiI-fnO-gw&s=UTa3tlpHkkLFWQGUlegcxS3Y6JFlHiW2Yfx1bCbKcTM&e=
>
> Thanks,
> Mathi.
>
>> -----Original Message-----
>> From: Shu Wang [mailto:[email protected]]
>> Sent: Saturday, June 20, 2015 1:50 AM
>> To: [email protected]
>> Subject: Re: [users] how long it takes to detect node sudden power
>>
>> We have a similar scenario. One of our payload nodes rebooted, and it
>> took anywhere from a few seconds to a few minutes for the other nodes
>> to detect the node loss. Because the master controller took a few
>> minutes to detect and react to the loss, this caused serious problems
>> and many service units went bad. Is there any way to improve the
>> detection time?
>>
>> Thank you!
>>
>> Shu Wang | Senior Analyst | +1(407)708-5117 or x3917|
>> www.NetCracker.com Proven Partner to Communications Service Providers
>>
>> -----Original Message-----
>> Message: 3
>> Date: Tue, 14 Apr 2015 09:58:51 +0000
>> From: Yao Cheng LIANG <[email protected]>
>> Subject: Re: [users] how long it takes to detect node sudden power
>> loss
>> To: 'A V Mahesh' <[email protected]>, Mathivanan Naickan
>> Palanivelu <[email protected]>
>> Cc: "[email protected]"
>> <[email protected]>
>> Message-ID: <285F6C4AD3FBC04EBAE1D68203EA87F20B037F25@asdag1>
>> Content-Type: text/plain; charset="windows-1255"
>>
>> Let me give more info about my setup:
>>
>> 1. I have two nodes, both running as controllers.
>> 2. Besides the OpenSAF services, I have another service unit with
>>    three components in it.
>> 3. These components use the Checkpoint service for data
>>    synchronization.
>>
>> My dtmd.conf is as below:
>>
>> DTM_INI_DIS_TIMEOUT_SECS=5
>> DTM_TCP_KEEPIDLE_TIME=2
>> DTM_TCP_KEEPALIVE_INTVL=1
>> DTM_TCP_KEEPALIVE_PROBES=2
>>
>> I read the code and found that it uses TCP keepalive to detect failure
>> of the peer node. However, keepalive packets are not sent until the
>> link has been idle for some time. I think the issue is here. Suppose
>> the "standby" node is sending something to the "active" node at the
>> moment the "active" node is rebooted. The "standby" node will keep
>> retransmitting until it reaches the maximum number of retries. During
>> this period the link is not idle, so the keepalive mechanism never
>> starts. This may cause the "standby" node to take a long time to
>> detect the failure of the "active" node.
>>
>> Thanks.
>>
>> Ted
>>
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Monday, April 13, 2015 10:06 PM
>> To: Yao Cheng LIANG; Mathivanan Naickan Palanivelu
>> Cc: [email protected]
>> Subject: Re: [users] how long it takes to detect node sudden power
>> loss
>>
>> Hi,
>>
>> Un-comment the line below in /etc/opensaf/dtmd.conf to enable tracing
>> of osafdtm:
>>
>> #args="--tracemask=0xffffffff" ------> args="--tracemask=0xffffffff"
>>
>> And do `export MDS_LOG_LEVEL=5` on both node consoles before
>> `/etc/init.d/opensafd restart` to get debug MDS logs.
>>
>> -AVM
>>
>> On 4/13/2015 11:52 AM, Yao Cheng LIANG wrote:
>> Dear AVM,
>>
>> Thanks. But I need to add `args="--loglevel=info"` to dtmd.conf so
>> that /var/log/opensaf/osafdtm and /var/log/opensaf/mds.log can be
>> seen, right?
>>
>> Ted
>>
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Monday, April 13, 2015 1:03 PM
>> To: Yao Cheng LIANG; Mathivanan Naickan Palanivelu
>> Cc: [email protected]
>> Subject: Re: [users] how long it takes to detect node sudden power
>> loss
>>
>> Hi Ted,
>>
>> On 4/10/2015 3:54 PM, Yao Cheng LIANG wrote:
>> I rebooted the "standby" node 30 times, and found that twice it took
>> 1~2 minutes for the "active" node to detect it
>>
>> Can you please share the following data from both nodes for a case
>> where the "active" node took 1~2 minutes to detect the standby:
>>
>> 1) #/var/log/opensaf/osafdtm
>> 2) #/var/log/opensaf/mds.log
>> 3) #/var/log/messages ( syslog )
>> 4) #top (output at the time of detection)
>> 5) /etc/opensaf/dtmd.conf
>>
>> -AVM
>>
>> On 4/10/2015 3:54 PM, Yao Cheng LIANG wrote:
>> I did some tests recently. I have two controllers, and I rebooted one
>> to see how long the second takes to detect the failure of its peer. I
>> rebooted the "standby" node 30 times, and found that twice it took
>> 1~2 minutes for the "active" node to detect it. Could anyone tell me
>> the reason and the solution?
>>
>> Thanks.
>>
>> Ted
>>
>> From: Mathivanan Naickan Palanivelu<mailto:[email protected]>
>> Sent: Thursday, April 9, 2015 7:39 PM
>> To: Yao Cheng LIANG<mailto:[email protected]>
>> Cc: [email protected], 'A V Mahesh'<mailto:[email protected]>
>>
>> I think since these are TCP keepalive configuration values, the
>> connection loss would be detected immediately in the cases of abrupt
>> power shutdown or cable unplug.
>>
>> Thanks,
>> Mathi.
>>
>> ----- [email protected] wrote:
>>
>>> Is there any approach to hasten this detection, because 4 seconds is
>>> too long for some use cases?
>>>
>>> Br,
>>>
>>> Ted
>>>
>>> -----Original Message-----
>>> From: A V Mahesh [mailto:[email protected]]
>>> Sent: Monday, March 30, 2015 12:29 PM
>>> To: [email protected]
>>> Subject: Re: [users] how long it takes to detect node sudden power
>>> loss
>>>
>>> Hi,
>>>
>>> >>Does that mean it needs 2 + 2*1 = 4s before the peer can detect
>>> the node connection loss if I suddenly unplug power supply of one node?
>>>
>>> Yes, when the connection goes down (cable disconnected or power
>>> supply unplugged), it takes 4 seconds to detect that the connection
>>> has been lost.
>>>
>>> -AVM
>>>
>>> On 3/29/2015 7:11 PM, Yao Cheng LIANG wrote:
>>>> Dear all,
>>>>
>>>> If using TCP, the underlying dtms use TCP keepalive to detect
>>>> connection loss. If my dtmd.conf is as below:
>>>>
>>>> DTM_TCP_KEEPIDLE_TIME=2
>>>> DTM_TCP_KEEPALIVE_INTVL=1
>>>> DTM_TCP_KEEPALIVE_PROBES=2
>>>>
>>>> Does that mean it needs 2 + 2*1 = 4s before the peer can detect
>>>> the node connection loss if I suddenly unplug the power supply of
>>>> one node?
>>>>
>>>> Thanks.
>>>>
>>>> Ted
>>>>
>>>> _______________________________________________
>>>> Opensaf-users mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
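The two detection paths discussed in this thread can be compared with a little arithmetic. The sketch below is an estimate only, not OpenSAF code: the function names are hypothetical, and the retransmission figure assumes the Linux defaults of a 200 ms minimum and 120 s maximum retransmission timeout with pure exponential backoff. The kernel documentation describes the resulting number as a hypothetical lower bound, since the real timeout also depends on the measured round-trip time.

```python
def keepalive_detection_secs(idle, intvl, probes):
    """Idle-link case: wait `idle` seconds, then `probes` unanswered
    keepalive probes spaced `intvl` seconds apart."""
    return idle + intvl * probes


def retransmit_timeout_secs(retries, rto_min=0.2, rto_max=120.0):
    """Busy-link case: estimate how long unacknowledged data is retried
    before the connection is dropped, for a given tcp_retries2 value.

    Assumes the timeout starts at rto_min and doubles after every attempt,
    capped at rto_max (assumed Linux defaults, see the lead-in).
    """
    return sum(min(rto_min * 2 ** i, rto_max) for i in range(retries + 1))


if __name__ == "__main__":
    # dtmd.conf values from the thread: KEEPIDLE=2, INTVL=1, PROBES=2
    print("keepalive, idle link :", keepalive_detection_secs(2, 1, 2), "s")
    for retries in (15, 3):
        est = retransmit_timeout_secs(retries)
        print(f"tcp_retries2={retries:<2} (busy): about {est:.1f} s")
```

Under these assumptions the idle-link case comes out at 4 s, while a busy link falls back on retransmission: roughly 924.6 s with the default tcp_retries2 of 15 and about 3.0 s with a value of 3. That gap is consistent with detection usually being fast but occasionally taking minutes, as reported above.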
