Re: 2 minute gateway startup time due to GEODE-5591
OK, after discussion with Jason and Ryan, PR #2425 is ready. It contains fixes for 3 issues, including skipping the 2-minute timeout.

On Wed, Sep 5, 2018 at 11:03 AM, Udo Kohlmeyer wrote:

> +1
>
> On 9/5/18 10:35, Anthony Baker wrote:
>
> Before this improvement is re-merged I’d like to see:
>
> 1) A test that characterizes the current behavior (e.g. doesn’t wait 2 min when there’s a port conflict)
> 2) A test that demonstrates how the current logic is insufficient
>
> Anthony
>
> On Sep 5, 2018, at 10:20 AM, Nabarun Nag wrote:
>
> GEODE-5591 has been reverted in develop
> ref: 901da27f227a8ce2b7d6b681619782a1accd9330
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon wrote:
>
> +1 for reverting in both places.
>
> I see that there is already an isGatewayReceiver flag in the AcceptorImpl constructor. It's not ideal, but could we use this flag to prevent the 2 minute retry logic from happening if this flag is true?
>
> Ryan
>
> On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <lhughesgodf...@pivotal.io> wrote:
>
> +1 for reverting in both places.
>
> On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith wrote:
>
> +1 for reverting in both places. The current fix is not better, that's why we are reverting it on the release branch!
>
> -Dan
>
> On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett wrote:
>
> I’m not ok with reverting in develop. Revert in 1.7 and modify in develop. We shouldn’t go backwards in develop. The current fix is better than the bug it fixes.
>
> On Sep 5, 2018, at 9:40 AM, Nabarun Nag wrote:
>
> If everyone is okay with it, I will revert that change in develop and then cherry pick it to release/1.7.0 branch. Please do comment.
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith wrote:
>
> +1 to yank it and rework the fix.
>
> Gester's change helps, but it just means that you will sometimes randomly have a 2 minute delay starting up a gateway receiver. I don't think that is a great user experience either.
>
> -Dan
>
> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <bschucha...@pivotal.io> wrote:
>
> Let's yank it
>
> On 9/4/18 5:04 PM, Sean Goller wrote:
>
> If it's to get the release out, I'm fine with reverting. I don't like it, but I'm not willing to die on that hill. :)
>
> -S.
>
> On Tue, Sep 4, 2018 at 4:38 PM Dan Smith wrote:
>
> Splitting this into a separate thread.
>
> I see the issue. The two minute timeout is in the constructor for AcceptorImpl, where it retries to bind for 2 minutes.
>
> That behavior makes sense for CacheServer.start.
>
> But it doesn't make sense for the new logic in GatewayReceiver.start() from GEODE-5591. That code is trying to use CacheServer.start to scan for an available port, trying each port in a range. That free port finding logic really doesn't want to have two minutes of retries for each port. It seems like we need to rework the fix for GEODE-5591.
>
> Does it make sense to hold up the release to rework this fix, or should we just revert it? Have we switched concourse over to using alpine linux, which I think was the original motivation for this fix?
>
> -Dan
>
> On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith wrote:
>
> Why is it waiting at all in this case? Where is this 2 minute timeout coming from?
>
> -Dan
>
> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <sai.boorlaga...@gmail.com> wrote:
>
> So the issue is that it takes longer to start than previous releases? Also, is this wait time only when using Gfsh to create
Re: 2 minute gateway startup time due to GEODE-5591
+1

On 9/5/18 10:35, Anthony Baker wrote:

> Before this improvement is re-merged I’d like to see:
>
> 1) A test that characterizes the current behavior (e.g. doesn’t wait 2 min when there’s a port conflict)
> 2) A test that demonstrates how the current logic is insufficient
>
> Anthony
>
> On Sep 5, 2018, at 10:20 AM, Nabarun Nag wrote:
>
> GEODE-5591 has been reverted in develop
> ref: 901da27f227a8ce2b7d6b681619782a1accd9330
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon wrote:
>
> +1 for reverting in both places.
>
> I see that there is already an isGatewayReceiver flag in the AcceptorImpl constructor. It's not ideal, but could we use this flag to prevent the 2 minute retry logic from happening if this flag is true?
>
> Ryan
>
> On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <lhughesgodf...@pivotal.io> wrote:
>
> +1 for reverting in both places.
>
> On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith wrote:
>
> +1 for reverting in both places. The current fix is not better, that's why we are reverting it on the release branch!
>
> -Dan
>
> On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett wrote:
>
> I’m not ok with reverting in develop. Revert in 1.7 and modify in develop. We shouldn’t go backwards in develop. The current fix is better than the bug it fixes.
>
> On Sep 5, 2018, at 9:40 AM, Nabarun Nag wrote:
>
> If everyone is okay with it, I will revert that change in develop and then cherry pick it to release/1.7.0 branch. Please do comment.
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith wrote:
>
> +1 to yank it and rework the fix.
>
> Gester's change helps, but it just means that you will sometimes randomly have a 2 minute delay starting up a gateway receiver. I don't think that is a great user experience either.
>
> -Dan
>
> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <bschucha...@pivotal.io> wrote:
>
> Let's yank it
>
> On 9/4/18 5:04 PM, Sean Goller wrote:
>
> If it's to get the release out, I'm fine with reverting. I don't like it, but I'm not willing to die on that hill. :)
>
> -S.
>
> On Tue, Sep 4, 2018 at 4:38 PM Dan Smith wrote:
>
> Splitting this into a separate thread.
>
> I see the issue. The two minute timeout is in the constructor for AcceptorImpl, where it retries to bind for 2 minutes.
>
> That behavior makes sense for CacheServer.start.
>
> But it doesn't make sense for the new logic in GatewayReceiver.start() from GEODE-5591. That code is trying to use CacheServer.start to scan for an available port, trying each port in a range. That free port finding logic really doesn't want to have two minutes of retries for each port. It seems like we need to rework the fix for GEODE-5591.
>
> Does it make sense to hold up the release to rework this fix, or should we just revert it? Have we switched concourse over to using alpine linux, which I think was the original motivation for this fix?
>
> -Dan
>
> On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith wrote:
>
> Why is it waiting at all in this case? Where is this 2 minute timeout coming from?
>
> -Dan
>
> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <sai.boorlaga...@gmail.com> wrote:
>
> So the issue is that it takes longer to start than previous releases? Also, is this wait time only when using Gfsh to create gateway-receiver?
>
> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag wrote:
>
> Currently we have a minor issue in the release branch as pointed out by Barry O. We will wait till a resolution is figured out for this issue.
>
> Steps:
> 1. create locator
> 2. start server --name=server1 --server-port=40404
> 3. start server --name=server2 --server-port=40405
> 4. create gateway-receiver --member=server1
> 5. create gateway-receiver --member=server2 `This gets stuck for 2 minutes`
>
> Is the 2 minute wait time acceptable? Should we document it? When we revert GEODE-5591, this issue does not happen.
>
> Regards
> Nabarun Nag
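Dan's diagnosis in the quoted thread above (a bind that retries for two minutes is reasonable for a fixed port, but not when scanning a range) can be illustrated with a minimal sketch. This is not Geode's AcceptorImpl or GatewayReceiver code: the class `BindRetrySketch`, the methods `bind` and `firstFreePort`, and the 100 ms pause are all hypothetical names and values chosen for illustration. Only `java.net.ServerSocket` and `java.net.BindException` are real JDK APIs.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.ServerSocket;
import java.time.Duration;
import java.util.Optional;

// Hypothetical sketch; not taken from the Geode code base.
final class BindRetrySketch {

  // Tries to bind to the port, retrying on BindException until retryBudget is spent.
  // A budget of Duration.ZERO means exactly one attempt.
  static Optional<ServerSocket> bind(int port, Duration retryBudget)
      throws IOException, InterruptedException {
    long deadline = System.nanoTime() + retryBudget.toNanos();
    while (true) {
      try {
        return Optional.of(new ServerSocket(port));
      } catch (BindException e) {
        if (System.nanoTime() - deadline >= 0) {
          return Optional.empty(); // budget exhausted, report failure to the caller
        }
        Thread.sleep(100); // assumed pause between attempts
      }
    }
  }

  // Scans [startPort, endPort] and gives each port a single attempt,
  // so a busy port costs one failed bind rather than a full retry window.
  static Optional<ServerSocket> firstFreePort(int startPort, int endPort)
      throws IOException, InterruptedException {
    for (int port = startPort; port <= endPort; port++) {
      Optional<ServerSocket> socket = bind(port, Duration.ZERO);
      if (socket.isPresent()) {
        return socket;
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) throws Exception {
    // Occupy an ephemeral port, then scan a one-port range containing it.
    try (ServerSocket occupied = new ServerSocket(0)) {
      int port = occupied.getLocalPort();
      long start = System.nanoTime();
      Optional<ServerSocket> result = firstFreePort(port, port);
      long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
      System.out.println("bound=" + result.isPresent() + " elapsedMs=" + elapsedMillis);
    }
  }
}
```

A caller that owns a fixed port (the CacheServer.start case) could pass something like `Duration.ofMinutes(2)` to `bind`, while the range scan passes `Duration.ZERO`. That is roughly the split Ryan's isGatewayReceiver suggestion points at, though how the real fix expresses it is not shown in this thread.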
Re: 2 minute gateway startup time due to GEODE-5591
Thank you. I must have missed that :)

On 9/5/18 10:54, Nabarun Nag wrote:

> @Udo I have mentioned in an earlier mail that it will be reverted in develop and then cherry picked to develop. The release/1.7.0 branch has not been published yet, as it is undergoing preliminary tests before the release candidate is published.
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 10:46 AM Udo Kohlmeyer wrote:
>
> Did we also revert this in 1.7? I assume it has been, but that is not directly stated here.
>
> On 9/5/18 10:20, Nabarun Nag wrote:
>
> GEODE-5591 has been reverted in develop
> ref: 901da27f227a8ce2b7d6b681619782a1accd9330
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon wrote:
>
> +1 for reverting in both places.
>
> I see that there is already an isGatewayReceiver flag in the AcceptorImpl constructor. It's not ideal, but could we use this flag to prevent the 2 minute retry logic from happening if this flag is true?
>
> Ryan
>
> On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <lhughesgodf...@pivotal.io> wrote:
>
> +1 for reverting in both places.
>
> On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith wrote:
>
> +1 for reverting in both places. The current fix is not better, that's why we are reverting it on the release branch!
>
> -Dan
>
> On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett wrote:
>
> I’m not ok with reverting in develop. Revert in 1.7 and modify in develop. We shouldn’t go backwards in develop. The current fix is better than the bug it fixes.
>
> On Sep 5, 2018, at 9:40 AM, Nabarun Nag wrote:
>
> If everyone is okay with it, I will revert that change in develop and then cherry pick it to release/1.7.0 branch. Please do comment.
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith wrote:
>
> +1 to yank it and rework the fix.
>
> Gester's change helps, but it just means that you will sometimes randomly have a 2 minute delay starting up a gateway receiver. I don't think that is a great user experience either.
>
> -Dan
>
> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <bschucha...@pivotal.io> wrote:
>
> Let's yank it
>
> On 9/4/18 5:04 PM, Sean Goller wrote:
>
> If it's to get the release out, I'm fine with reverting. I don't like it, but I'm not willing to die on that hill. :)
>
> -S.
>
> On Tue, Sep 4, 2018 at 4:38 PM Dan Smith wrote:
>
> Splitting this into a separate thread.
>
> I see the issue. The two minute timeout is in the constructor for AcceptorImpl, where it retries to bind for 2 minutes.
>
> That behavior makes sense for CacheServer.start.
>
> But it doesn't make sense for the new logic in GatewayReceiver.start() from GEODE-5591. That code is trying to use CacheServer.start to scan for an available port, trying each port in a range. That free port finding logic really doesn't want to have two minutes of retries for each port. It seems like we need to rework the fix for GEODE-5591.
>
> Does it make sense to hold up the release to rework this fix, or should we just revert it? Have we switched concourse over to using alpine linux, which I think was the original motivation for this fix?
>
> -Dan
>
> On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith wrote:
>
> Why is it waiting at all in this case? Where is this 2 minute timeout coming from?
>
> -Dan
>
> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <sai.boorlaga...@gmail.com> wrote:
>
> So the issue is that it takes longer to start than previous releases? Also, is this wait time only when using Gfsh to create gateway-receiver?
>
> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag wrote:
>
> Currently we have a minor issue in the release branch as pointed out by Barry O. We will wait till a resolution is figured out for this issue.
>
> Steps:
> 1. create locator
> 2. start server --name=server1 --server-port=40404
> 3. start server --name=server2 --server-port=40405
> 4. create gateway-receiver --member=server1
> 5. create gateway-receiver --member=server2 `This gets stuck for 2 minutes`
>
> Is the 2 minute wait time acceptable? Should we document it? When we revert GEODE-5591, this issue does not happen.
>
> Regards
> Nabarun Nag
Re: 2 minute gateway startup time due to GEODE-5591
*correction: cherry picked to release/1.7.0

On Wed, Sep 5, 2018 at 10:54 AM Nabarun Nag wrote:

> @Udo I have mentioned in an earlier mail that it will be reverted in develop and then cherry picked to develop. The release/1.7.0 branch has not been published yet, as it is undergoing preliminary tests before the release candidate is published.
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 10:46 AM Udo Kohlmeyer wrote:
>
> Did we also revert this in 1.7? I assume it has been, but that is not directly stated here.
>
> On 9/5/18 10:20, Nabarun Nag wrote:
>
> GEODE-5591 has been reverted in develop
> ref: 901da27f227a8ce2b7d6b681619782a1accd9330
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon wrote:
>
> +1 for reverting in both places.
>
> I see that there is already an isGatewayReceiver flag in the AcceptorImpl constructor. It's not ideal, but could we use this flag to prevent the 2 minute retry logic from happening if this flag is true?
>
> Ryan
>
> On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <lhughesgodf...@pivotal.io> wrote:
>
> +1 for reverting in both places.
>
> On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith wrote:
>
> +1 for reverting in both places. The current fix is not better, that's why we are reverting it on the release branch!
>
> -Dan
>
> On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett wrote:
>
> I’m not ok with reverting in develop. Revert in 1.7 and modify in develop. We shouldn’t go backwards in develop. The current fix is better than the bug it fixes.
>
> On Sep 5, 2018, at 9:40 AM, Nabarun Nag wrote:
>
> If everyone is okay with it, I will revert that change in develop and then cherry pick it to release/1.7.0 branch. Please do comment.
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith wrote:
>
> +1 to yank it and rework the fix.
>
> Gester's change helps, but it just means that you will sometimes randomly have a 2 minute delay starting up a gateway receiver. I don't think that is a great user experience either.
>
> -Dan
>
> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <bschucha...@pivotal.io> wrote:
>
> Let's yank it
>
> On 9/4/18 5:04 PM, Sean Goller wrote:
>
> If it's to get the release out, I'm fine with reverting. I don't like it, but I'm not willing to die on that hill. :)
>
> -S.
>
> On Tue, Sep 4, 2018 at 4:38 PM Dan Smith wrote:
>
> Splitting this into a separate thread.
>
> I see the issue. The two minute timeout is in the constructor for AcceptorImpl, where it retries to bind for 2 minutes.
>
> That behavior makes sense for CacheServer.start.
>
> But it doesn't make sense for the new logic in GatewayReceiver.start() from GEODE-5591. That code is trying to use CacheServer.start to scan for an available port, trying each port in a range. That free port finding logic really doesn't want to have two minutes of retries for each port. It seems like we need to rework the fix for GEODE-5591.
>
> Does it make sense to hold up the release to rework this fix, or should we just revert it? Have we switched concourse over to using alpine linux, which I think was the original motivation for this fix?
>
> -Dan
>
> On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith wrote:
>
> Why is it waiting at all in this case? Where is this 2 minute timeout coming from?
>
> -Dan
>
> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <sai.boorlaga...@gmail.com> wrote:
>
> So the issue is that it takes longer to start than previous releases? Also, is this wait time only when using Gfsh to create gateway-receiver?
>
> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag wrote:
>
> Currently we have a minor issue in the release branch as pointed out by Barry O. We will wait till a resolution is
Re: 2 minute gateway startup time due to GEODE-5591
The previous fix did not improve anything on the 2-minute timeout.

On Wed, Sep 5, 2018 at 10:52 AM, Anthony Baker wrote:

> Gester,
>
> Clearly the prior implementation had some problems, but except in pathological cases it provided the behavior users expected. That’s why I think we need a characterization test(s) to show exactly what we want the behavior to be. Merging in changes that make the user experience worse in the more common scenarios isn’t a good tradeoff IMO. I see this work as integral to GEODE-5591; it shouldn’t be deferred to a separate ticket.
>
> Anthony
>
> On Sep 5, 2018, at 10:43 AM, Xiaojian Zhou wrote:
>
> The fix intends to resolve 2 issues:
> 1) Change the exception handling (for a Linux version).
> 2) Prevent the random port picking from looping forever. In the old code, for example, if the range only contains one port, random will always pick the same port and it will loop forever. The fix will stop after all available ports in the range are tried. There's a test:
>
> test_ValidateGatewayReceiverAttributes_WrongBindAddress
>
> For the 2-minute wait, it's still possible. The fix did not resolve it (when random() happens to return the same port for different receivers in the same member), but I did not make things worse either.
>
> There's discussion on whether we can reduce the 2-minute timeout to a few seconds. This is definitely another ticket.
>
> Regards
>
> Gester
>
> On Wed, Sep 5, 2018 at 10:35 AM, Anthony Baker wrote:
>
> Before this improvement is re-merged I’d like to see:
>
> 1) A test that characterizes the current behavior (e.g. doesn’t wait 2 min when there’s a port conflict)
> 2) A test that demonstrates how the current logic is insufficient
>
> Anthony
>
> On Sep 5, 2018, at 10:20 AM, Nabarun Nag wrote:
>
> GEODE-5591 has been reverted in develop
> ref: 901da27f227a8ce2b7d6b681619782a1accd9330
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon wrote:
>
> +1 for reverting in both places.
>
> I see that there is already an isGatewayReceiver flag in the AcceptorImpl constructor. It's not ideal, but could we use this flag to prevent the 2 minute retry logic from happening if this flag is true?
>
> Ryan
>
> On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <lhughesgodf...@pivotal.io> wrote:
>
> +1 for reverting in both places.
>
> On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith wrote:
>
> +1 for reverting in both places. The current fix is not better, that's why we are reverting it on the release branch!
>
> -Dan
>
> On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett wrote:
>
> I’m not ok with reverting in develop. Revert in 1.7 and modify in develop. We shouldn’t go backwards in develop. The current fix is better than the bug it fixes.
>
> On Sep 5, 2018, at 9:40 AM, Nabarun Nag wrote:
>
> If everyone is okay with it, I will revert that change in develop and then cherry pick it to release/1.7.0 branch. Please do comment.
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith wrote:
>
> +1 to yank it and rework the fix.
>
> Gester's change helps, but it just means that you will sometimes randomly have a 2 minute delay starting up a gateway receiver. I don't think that is a great user experience either.
>
> -Dan
>
> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <bschucha...@pivotal.io> wrote:
>
> Let's yank it
>
> On 9/4/18 5:04 PM, Sean Goller wrote:
>
> If it's to get the release out, I'm fine with reverting. I don't like it, but I'm not willing to die on that hill. :)
>
> -S.
>
> On Tue, Sep 4, 2018 at 4:38 PM Dan Smith wrote:
>
> Splitting this into a separate thread.
>
> I see the issue. The two minute timeout is in the constructor for AcceptorImpl, where it retries to bind for 2 minutes.
>
> That behavior makes sense for CacheServer.start.
>
> But it doesn't make sense for the new logic in GatewayReceiver.start() from GEODE-5591. That code is trying to use CacheServer.start to scan for an available port, trying
Re: 2 minute gateway startup time due to GEODE-5591
@Udo I have mentioned in an earlier mail that it will be reverted in develop and then cherry picked to develop. The release/1.7.0 branch has not been published yet, as it is undergoing preliminary tests before the release candidate is published.

Regards
Nabarun Nag

On Wed, Sep 5, 2018 at 10:46 AM Udo Kohlmeyer wrote:

> Did we also revert this in 1.7? I assume it has been, but that is not directly stated here.
>
> On 9/5/18 10:20, Nabarun Nag wrote:
>
> GEODE-5591 has been reverted in develop
> ref: 901da27f227a8ce2b7d6b681619782a1accd9330
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon wrote:
>
> +1 for reverting in both places.
>
> I see that there is already an isGatewayReceiver flag in the AcceptorImpl constructor. It's not ideal, but could we use this flag to prevent the 2 minute retry logic from happening if this flag is true?
>
> Ryan
>
> On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <lhughesgodf...@pivotal.io> wrote:
>
> +1 for reverting in both places.
>
> On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith wrote:
>
> +1 for reverting in both places. The current fix is not better, that's why we are reverting it on the release branch!
>
> -Dan
>
> On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett wrote:
>
> I’m not ok with reverting in develop. Revert in 1.7 and modify in develop. We shouldn’t go backwards in develop. The current fix is better than the bug it fixes.
>
> On Sep 5, 2018, at 9:40 AM, Nabarun Nag wrote:
>
> If everyone is okay with it, I will revert that change in develop and then cherry pick it to release/1.7.0 branch. Please do comment.
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith wrote:
>
> +1 to yank it and rework the fix.
>
> Gester's change helps, but it just means that you will sometimes randomly have a 2 minute delay starting up a gateway receiver. I don't think that is a great user experience either.
>
> -Dan
>
> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <bschucha...@pivotal.io> wrote:
>
> Let's yank it
>
> On 9/4/18 5:04 PM, Sean Goller wrote:
>
> If it's to get the release out, I'm fine with reverting. I don't like it, but I'm not willing to die on that hill. :)
>
> -S.
>
> On Tue, Sep 4, 2018 at 4:38 PM Dan Smith wrote:
>
> Splitting this into a separate thread.
>
> I see the issue. The two minute timeout is in the constructor for AcceptorImpl, where it retries to bind for 2 minutes.
>
> That behavior makes sense for CacheServer.start.
>
> But it doesn't make sense for the new logic in GatewayReceiver.start() from GEODE-5591. That code is trying to use CacheServer.start to scan for an available port, trying each port in a range. That free port finding logic really doesn't want to have two minutes of retries for each port. It seems like we need to rework the fix for GEODE-5591.
>
> Does it make sense to hold up the release to rework this fix, or should we just revert it? Have we switched concourse over to using alpine linux, which I think was the original motivation for this fix?
>
> -Dan
>
> On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith wrote:
>
> Why is it waiting at all in this case? Where is this 2 minute timeout coming from?
>
> -Dan
>
> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <sai.boorlaga...@gmail.com> wrote:
>
> So the issue is that it takes longer to start than previous releases? Also, is this wait time only when using Gfsh to create gateway-receiver?
>
> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag wrote:
>
> Currently we have a minor issue in the release branch as pointed out by Barry O. We will wait till a resolution is figured out for this issue.
>
> Steps:
> 1. create locator
> 2. start server --name=server1 --server-port=40404
> 3. start server --name=server2 --server-port=40405
> 4. create gateway-receiver
Re: 2 minute gateway startup time due to GEODE-5591
Gester,

Clearly the prior implementation had some problems, but except in pathological cases it provided the behavior users expected. That’s why I think we need a characterization test(s) to show exactly what we want the behavior to be. Merging in changes that make the user experience worse in the more common scenarios isn’t a good tradeoff IMO. I see this work as integral to GEODE-5591; it shouldn’t be deferred to a separate ticket.

Anthony

On Sep 5, 2018, at 10:43 AM, Xiaojian Zhou wrote:

> The fix intends to resolve 2 issues:
> 1) Change the exception handling (for a Linux version).
> 2) Prevent the random port picking from looping forever. In the old code, for example, if the range only contains one port, random will always pick the same port and it will loop forever. The fix will stop after all available ports in the range are tried. There's a test:
>
> test_ValidateGatewayReceiverAttributes_WrongBindAddress
>
> For the 2-minute wait, it's still possible. The fix did not resolve it (when random() happens to return the same port for different receivers in the same member), but I did not make things worse either.
>
> There's discussion on whether we can reduce the 2-minute timeout to a few seconds. This is definitely another ticket.
>
> Regards
>
> Gester
>
> On Wed, Sep 5, 2018 at 10:35 AM, Anthony Baker wrote:
>
> Before this improvement is re-merged I’d like to see:
>
> 1) A test that characterizes the current behavior (e.g. doesn’t wait 2 min when there’s a port conflict)
> 2) A test that demonstrates how the current logic is insufficient
>
> Anthony
>
> On Sep 5, 2018, at 10:20 AM, Nabarun Nag wrote:
>
> GEODE-5591 has been reverted in develop
> ref: 901da27f227a8ce2b7d6b681619782a1accd9330
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon wrote:
>
> +1 for reverting in both places.
>
> I see that there is already an isGatewayReceiver flag in the AcceptorImpl constructor. It's not ideal, but could we use this flag to prevent the 2 minute retry logic from happening if this flag is true?
>
> Ryan
>
> On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <lhughesgodf...@pivotal.io> wrote:
>
> +1 for reverting in both places.
>
> On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith wrote:
>
> +1 for reverting in both places. The current fix is not better, that's why we are reverting it on the release branch!
>
> -Dan
>
> On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett wrote:
>
> I’m not ok with reverting in develop. Revert in 1.7 and modify in develop. We shouldn’t go backwards in develop. The current fix is better than the bug it fixes.
>
> On Sep 5, 2018, at 9:40 AM, Nabarun Nag wrote:
>
> If everyone is okay with it, I will revert that change in develop and then cherry pick it to release/1.7.0 branch. Please do comment.
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith wrote:
>
> +1 to yank it and rework the fix.
>
> Gester's change helps, but it just means that you will sometimes randomly have a 2 minute delay starting up a gateway receiver. I don't think that is a great user experience either.
>
> -Dan
>
> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <bschucha...@pivotal.io> wrote:
>
> Let's yank it
>
> On 9/4/18 5:04 PM, Sean Goller wrote:
>
> If it's to get the release out, I'm fine with reverting. I don't like it, but I'm not willing to die on that hill. :)
>
> -S.
>
> On Tue, Sep 4, 2018 at 4:38 PM Dan Smith wrote:
>
> Splitting this into a separate thread.
>
> I see the issue. The two minute timeout is in the constructor for AcceptorImpl, where it retries to bind for 2 minutes.
>
> That behavior makes sense for CacheServer.start.
>
> But it doesn't make sense for the new logic in GatewayReceiver.start() from GEODE-5591. That code is trying to use CacheServer.start to scan for an available port, trying each port in a range. That free port finding logic really doesn't want to have two minutes of retries for each port. It seems like we need to rework the fix for GEODE-5591.
>
> Does it make sense to hold up the release to rework this fix, or should we
Re: 2 minute gateway startup time due to GEODE-5591
Did we also revert this in 1.7? I assume it has been, but that is not directly stated here.

On 9/5/18 10:20, Nabarun Nag wrote:

> GEODE-5591 has been reverted in develop
> ref: 901da27f227a8ce2b7d6b681619782a1accd9330
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon wrote:
>
> +1 for reverting in both places.
>
> I see that there is already an isGatewayReceiver flag in the AcceptorImpl constructor. It's not ideal, but could we use this flag to prevent the 2 minute retry logic from happening if this flag is true?
>
> Ryan
>
> On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <lhughesgodf...@pivotal.io> wrote:
>
> +1 for reverting in both places.
>
> On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith wrote:
>
> +1 for reverting in both places. The current fix is not better, that's why we are reverting it on the release branch!
>
> -Dan
>
> On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett wrote:
>
> I’m not ok with reverting in develop. Revert in 1.7 and modify in develop. We shouldn’t go backwards in develop. The current fix is better than the bug it fixes.
>
> On Sep 5, 2018, at 9:40 AM, Nabarun Nag wrote:
>
> If everyone is okay with it, I will revert that change in develop and then cherry pick it to release/1.7.0 branch. Please do comment.
>
> Regards
> Nabarun Nag
>
> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith wrote:
>
> +1 to yank it and rework the fix.
>
> Gester's change helps, but it just means that you will sometimes randomly have a 2 minute delay starting up a gateway receiver. I don't think that is a great user experience either.
>
> -Dan
>
> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <bschucha...@pivotal.io> wrote:
>
> Let's yank it
>
> On 9/4/18 5:04 PM, Sean Goller wrote:
>
> If it's to get the release out, I'm fine with reverting. I don't like it, but I'm not willing to die on that hill. :)
>
> -S.
>
> On Tue, Sep 4, 2018 at 4:38 PM Dan Smith wrote:
>
> Splitting this into a separate thread.
>
> I see the issue. The two minute timeout is in the constructor for AcceptorImpl, where it retries to bind for 2 minutes.
>
> That behavior makes sense for CacheServer.start.
>
> But it doesn't make sense for the new logic in GatewayReceiver.start() from GEODE-5591. That code is trying to use CacheServer.start to scan for an available port, trying each port in a range. That free port finding logic really doesn't want to have two minutes of retries for each port. It seems like we need to rework the fix for GEODE-5591.
>
> Does it make sense to hold up the release to rework this fix, or should we just revert it? Have we switched concourse over to using alpine linux, which I think was the original motivation for this fix?
>
> -Dan
>
> On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith wrote:
>
> Why is it waiting at all in this case? Where is this 2 minute timeout coming from?
>
> -Dan
>
> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <sai.boorlaga...@gmail.com> wrote:
>
> So the issue is that it takes longer to start than previous releases? Also, is this wait time only when using Gfsh to create gateway-receiver?
>
> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag wrote:
>
> Currently we have a minor issue in the release branch as pointed out by Barry O. We will wait till a resolution is figured out for this issue.
>
> Steps:
> 1. create locator
> 2. start server --name=server1 --server-port=40404
> 3. start server --name=server2 --server-port=40405
> 4. create gateway-receiver --member=server1
> 5. create gateway-receiver --member=server2 `This gets stuck for 2 minutes`
>
> Is the 2 minute wait time acceptable? Should we document it? When we revert GEODE-5591, this issue does not happen.
>
> Regards
> Nabarun Nag
Re: 2 minute gateway startup time due to GEODE-5591
The fix is intended to resolve two issues:

1) Change the exception handling (for a Linux version).
2) Prevent the random port picking from looping forever. In the old code, for example, if the range contains only one port, random will always pick the same port and it will loop forever. The fix stops after all available ports in the range have been tried.

There's a test, test_ValidateGatewayReceiverAttributes_WrongBindAddress.

The 2-minute wait is still possible. The fix did not resolve it (it happens when random() returns the same port for different receivers in the same member), but I did not make things worse either.

There's discussion on whether we can reduce the 2-minute timeout to a few seconds. This is definitely another ticket.

Regards
Gester

On Wed, Sep 5, 2018 at 10:35 AM, Anthony Baker wrote:
> Before this improvement is re-merged I’d like to see:
>
> 1) A test that characterizes the current behavior (e.g. doesn’t wait 2 min
> when there’s a port conflict)
> 2) A test that demonstrates how the current logic is insufficient
>
> Anthony
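The second point in Gester's message (stop once every port in the range has been tried, instead of drawing random ports indefinitely) can be sketched as follows. This is an illustration in Python under assumed names, not the code from the PR: `pick_port` and the `try_bind` callback are hypothetical.

```python
import random


def pick_port(try_bind, low, high, rng=random):
    """Try every port in [low, high] at most once, in random order.

    An unbounded "draw a random port until one binds" loop never ends
    when the range holds a single busy port; shuffling the candidates
    keeps the random order but guarantees termination.
    """
    candidates = list(range(low, high + 1))
    rng.shuffle(candidates)
    for port in candidates:
        if try_bind(port):
            return port
    return None  # range exhausted: report failure instead of looping


# A one-port range whose only port is busy now ends with None.
print(pick_port(lambda port: False, 5000, 5000))  # None
```

Shuffling rather than scanning in ascending order keeps the original intent of spreading receivers across the range.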
Re: 2 minute gateway startup time due to GEODE-5591
Before this improvement is re-merged I’d like to see:

1) A test that characterizes the current behavior (e.g. doesn’t wait 2 min when there’s a port conflict)
2) A test that demonstrates how the current logic is insufficient

Anthony

> On Sep 5, 2018, at 10:20 AM, Nabarun Nag wrote:
>
> GEODE-5591 has been reverted in develop
> ref: 901da27f227a8ce2b7d6b681619782a1accd9330
>
> Regards
> Nabarun Nag
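One possible shape for the first test Anthony asks for, sketched in Python with plain sockets rather than Geode's test framework: occupy a port, attempt to start a second listener on it, and measure how long the failure takes. `start_receiver` is a hypothetical stand-in that makes a single bind attempt; a real characterization test would call the actual receiver start path instead.

```python
import socket
import time


def start_receiver(port, host="127.0.0.1"):
    """Hypothetical stand-in for starting a receiver: one bind attempt."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


def seconds_to_fail_on_conflict():
    """Occupy a port, start a second receiver on it, time the failure."""
    holder = start_receiver(0)  # port 0 asks the OS for any free port
    port = holder.getsockname()[1]
    started = time.monotonic()
    try:
        start_receiver(port).close()
        return None  # the second start unexpectedly succeeded
    except OSError:
        return time.monotonic() - started
    finally:
        holder.close()
```

A test built on this would assert that the returned duration is a few seconds at most, which would fail against a start path that retries the bind for two minutes.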
Re: 2 minute gateway startup time due to GEODE-5591
Well, I found it's already reverted. But I don't think we have to. After discussing with Jason, I worked out a new fix. It keeps the exception-handling intention of the previous GEODE-5591 change and improves how the port is assigned. The port is now checked for availability first, so it will also resolve the 2-minute timeout issue on retry (or at least will not make things worse).

On Wed, Sep 5, 2018 at 10:14 AM, Ryan McMahon wrote:
> +1 for reverting in both places.
>
> I see that there is already an isGatewayReceiver flag in the AcceptorImpl
> constructor. It's not ideal, but could we use this flag to prevent the 2
> minute retry logic from happening if this flag is true?
>
> Ryan
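The "check the port before assigning it" idea in Gester's message can be illustrated with a short probe. This is a sketch in Python under assumed names (`is_port_available`, `first_available`), not the code from the new fix: it binds a throwaway socket to the candidate port and only hands the port on if that bind succeeds.

```python
import socket


def is_port_available(port, host="127.0.0.1"):
    """Probe a port by binding to it briefly; True if the bind succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def first_available(ports, host="127.0.0.1"):
    """Return the first port that passes the probe, or None."""
    for port in ports:
        if is_port_available(port, host):
            return port
    return None
```

A probe like this narrows the window but does not close it: another process can still take the port between the probe and the real bind, which fits the hedge in the message that the change should at least not make things worse.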
Re: 2 minute gateway startup time due to GEODE-5591
GEODE-5591 has been reverted in develop
ref: 901da27f227a8ce2b7d6b681619782a1accd9330

Regards
Nabarun Nag

On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon wrote:
> +1 for reverting in both places.
>
> I see that there is already an isGatewayReceiver flag in the AcceptorImpl
> constructor. It's not ideal, but could we use this flag to prevent the 2
> minute retry logic from happening if this flag is true?
>
> Ryan
Re: 2 minute gateway startup time due to GEODE-5591
+1 for reverting in both places.

I see that there is already an isGatewayReceiver flag in the AcceptorImpl constructor. It's not ideal, but could we use this flag to prevent the 2 minute retry logic from happening if this flag is true?

Ryan

On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <lhughesgodf...@pivotal.io> wrote:
> +1 for reverting in both places.
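Ryan's suggestion amounts to letting an existing flag choose the retry budget. A minimal Python sketch of that idea follows; the `Acceptor` class, its parameters and the one-second retry interval are hypothetical and are not AcceptorImpl's real signature. The thread only states that an isGatewayReceiver flag exists and that the retry lasts two minutes.

```python
import time

DEFAULT_BIND_RETRY_SECONDS = 120.0  # the two-minute retry from the thread


class Acceptor:
    """Hypothetical stand-in for an acceptor; not Geode's AcceptorImpl."""

    def __init__(self, port, try_bind, is_gateway_receiver=False,
                 sleep=time.sleep, monotonic=time.monotonic):
        # A gateway receiver scanning a port range gets no retry window,
        # so a busy port fails on the first attempt.
        window = 0.0 if is_gateway_receiver else DEFAULT_BIND_RETRY_SECONDS
        deadline = monotonic() + window
        while not try_bind(port):
            if monotonic() >= deadline:
                raise OSError(f"port {port} is in use")
            sleep(1.0)
        self.port = port
```

The `sleep` and `monotonic` parameters exist only so the behavior can be exercised with a fake clock instead of waiting two real minutes.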
Re: 2 minute gateway startup time due to GEODE-5591
+1 for reverting in both places.

On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith wrote:
> +1 for reverting in both places. The current fix is not better, that's why
> we are reverting it on the release branch!
>
> -Dan
Re: 2 minute gateway startup time due to GEODE-5591
+1 for reverting in both places. The current fix is not better, that's why we are reverting it on the release branch!

-Dan

On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett wrote:
> I’m not ok with reverting in develop. Revert in 1.7 and modify in develop.
> We shouldn’t go backwards in develop. The current fix is better than the
> bug it fixes.
Re: 2 minute gateway startup time due to GEODE-5591
+1 to reverting in 1.7 and leaving the fix on develop.

On Wed, Sep 5, 2018 at 9:47 AM Jacob Barrett wrote:
> I’m not ok with reverting in develop. Revert in 1.7 and modify in develop.
> We shouldn’t go backwards in develop. The current fix is better than the
> bug it fixes.
Re: 2 minute gateway startup time due to GEODE-5591
+1 to yank it and rework the fix.

Gester's change helps, but it just means that you will sometimes randomly have a 2 minute delay starting up a gateway receiver. I don't think that is a great user experience either.

-Dan

On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt wrote:
> Let's yank it
Re: 2 minute gateway startup time due to GEODE-5591
Let's yank it

On 9/4/18 5:04 PM, Sean Goller wrote:
> If it's to get the release out, I'm fine with reverting. I don't like it,
> but I'm not willing to die on that hill. :)
>
> -S.
Re: 2 minute gateway startup time due to GEODE-5591
As downstream consumers of Geode, we do not want to be exposed to this. Please revert and fix on develop.

Also, could we add a test case to guard us against this in the future?

Thanks,
*Pulkit Chandra*

On Wed, Sep 5, 2018 at 1:07 AM Xiaojian Zhou wrote:
> Yes. The current fix lets each gateway receiver (in hydra tests there are
> a lot of them) compete for port 5500. Only one member will win; all other
> members will time out after 2 minutes. Then they go on to compete for
> port 5501. Again, only one member will win.
>
> In that case, if there are 5 receivers, it will take 10 minutes to start
> all the receivers.
>
> So I enhanced the current fix (see the diff attached) to let each receiver
> pick a random port to start from. If a receiver fails on a port, only that
> receiver tries currPort++. If it reaches endPort, it continues from
> startPort until it reaches its random port again.
>
> Improving the 2-minute timeout is definitely a separate issue.
>
> Regards
> Gester
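One way to express the guard Pulkit asks for is a helper that fails when an action takes longer than a stated limit. The sketch below uses only the JDK; `StartupTimeGuard` and `completesWithin` are hypothetical names, not part of Geode's test framework, and in a real Geode test the action would presumably be the start of the second gateway receiver.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical test helper (an assumption for illustration, not Geode code).
final class StartupTimeGuard {

  /** Runs the action on a worker thread and fails if it has not finished within the limit. */
  static <T> T completesWithin(Duration limit, Callable<T> action) throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<T> future = executor.submit(action);
      try {
        return future.get(limit.toMillis(), TimeUnit.MILLISECONDS);
      } catch (TimeoutException e) {
        future.cancel(true); // interrupt the slow action
        throw new AssertionError("did not finish within " + limit, e);
      }
    } finally {
      executor.shutdownNow();
    }
  }
}
```

A test could then wrap the receiver start in `completesWithin(Duration.ofSeconds(30), ...)`; the 30 second limit is an arbitrary example, chosen only to be well below the 2 minutes reported in this thread.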
Re: 2 minute gateway startup time due to GEODE-5591
Yes. The current fix lets each gateway receiver (in hydra tests there are a lot of them) compete for port 5500. Only one member will win; all other members will time out after 2 minutes. Then they go on to compete for port 5501. Again, only one member will win.

In that case, if there are 5 receivers, it will take 10 minutes to start all the receivers.

So I enhanced the current fix (see the diff attached) to let each receiver pick a random port to start from. If a receiver fails on a port, only that receiver tries currPort++. If it reaches endPort, it continues from startPort until it reaches its random port again.

Improving the 2-minute timeout is definitely a separate issue.

Regards
Gester

On Tue, Sep 4, 2018 at 4:38 PM, Dan Smith wrote:
> Splitting this into a separate thread.
>
> I see the issue. The two minute timeout is in the constructor for
> AcceptorImpl, where it retries to bind for 2 minutes.
>
> That behavior makes sense for CacheServer.start.
>
> But it doesn't make sense for the new logic in GatewayReceiver.start() from
> GEODE-5591. That code is trying to use CacheServer.start to scan for an
> available port, trying each port in a range. That free port finding logic
> really doesn't want to have two minutes of retries for each port. It seems
> like we need to rework the fix for GEODE-5591.
>
> Does it make sense to hold up the release to rework this fix, or should we
> just revert it? Have we switched concourse over to using alpine linux,
> which I think was the original motivation for this fix?
>
> -Dan

diff --git a/geode-wan/src/distributedTest/java/org/apache/geode/internal/cache/wan/WANTestBase.java b/geode-wan/src/distributedTest/java/org/apache/geode/internal/cache/wan/WANTestBase.java
index a09194209..e13e7ec78 100644
--- a/geode-wan/src/distributedTest/java/org/apache/geode/internal/cache/wan/WANTestBase.java
+++ b/geode-wan/src/distributedTest/java/org/apache/geode/internal/cache/wan/WANTestBase.java
@@ -2020,7 +2020,7 @@ public class WANTestBase extends DistributedTestCase {
     GatewayReceiver receiver = fact.create();
     assertThatThrownBy(receiver::start)
         .isInstanceOf(GatewayReceiverException.class)
-        .hasMessageContaining("No available free port found in the given range");
+        .hasMessageContaining("Failed to create server socket on");
   }

   public static int createReceiverWithSSL(int locPort) {
diff --git a/geode-wan/src/integrationTest/java/org/apache/geode/internal/cache/wan/misc/WANConfigurationJUnitTest.java b/geode-wan/src/integrationTest/java/org/apache/geode/internal/cache/wan/misc/WANConfigurationJUnitTest.java
index 038b759ae..ccd9503e6 100644
--- a/geode-wan/src/integrationTest/java/org/apache/geode/internal/cache/wan/misc/WANConfigurationJUnitTest.java
+++ b/geode-wan/src/integrationTest/java/org/apache/geode/internal/cache/wan/misc/WANConfigurationJUnitTest.java
@@ -448,7 +448,8 @@ public class WANConfigurationJUnitTest {
     GatewayReceiver receiver = fact.create();
-    assertThatThrownBy(() -> receiver.start()).isInstanceOf(GatewayReceiverException.class);
+    assertThatThrownBy(() -> receiver.start()).isInstanceOf(GatewayReceiverException.class)
+        .hasMessageContaining("Failed to create server socket on");
   }

   @Test
diff --git a/geode-wan/src/main/java/org/apache/geode/internal/cache/wan/GatewayReceiverImpl.java b/geode-wan/src/main/java/org/apache/geode/internal/cache/wan/GatewayReceiverImpl.java
index cd2702991..786b354a4 100644
--- a/geode-wan/src/main/java/org/apache/geode/internal/cache/wan/GatewayReceiverImpl.java
+++ b/geode-wan/src/main/java/org/apache/geode/internal/cache/wan/GatewayReceiverImpl.java
@@ -26,6 +26,7 @@ import org.apache.geode.cache.wan.GatewayReceiver;
 import org.apache.geode.cache.wan.GatewayTransportFilter;
 import
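The scan order Gester describes (random first port, then currPort++, wrapping from endPort back to startPort until the first port comes up again) can be sketched roughly as follows. This only illustrates the visiting order and is not the code in the attached diff; `PortScanOrder`, `candidatePorts`, and `randomFirstPort` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper (an assumption for illustration, not Geode code).
final class PortScanOrder {

  /** Ports in the order one receiver would try them: firstPort up to endPort, then wrapping to startPort. */
  static List<Integer> candidatePorts(int startPort, int endPort, int firstPort) {
    if (startPort > endPort || firstPort < startPort || firstPort > endPort) {
      throw new IllegalArgumentException("invalid port range or first port");
    }
    int size = endPort - startPort + 1;
    List<Integer> order = new ArrayList<>(size);
    for (int i = 0; i < size; i++) {
      // Offset from startPort, advanced by i and wrapped at the end of the range.
      order.add(startPort + (firstPort - startPort + i) % size);
    }
    return order;
  }

  /** Picks the random first port, inclusive of both ends of the range. */
  static int randomFirstPort(int startPort, int endPort) {
    return ThreadLocalRandom.current().nextInt(startPort, endPort + 1);
  }
}
```

For the range 5500 to 5504 with first port 5503, `candidatePorts` gives 5503, 5504, 5500, 5501, 5502. Because each receiver starts at a different point, they no longer all contend for 5500 first, though a collision is still possible, which is the random 2 minute delay raised elsewhere in this thread.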
Re: 2 minute gateway startup time due to GEODE-5591
Revert it on the release branch and fix it on develop.

> On Sep 4, 2018, at 5:13 PM, Sean Goller wrote:
>
> It affects us on any linux platform that doesn't use glibc. It's not worth
> holding up the release for. It's been this way for 20 years, right? ;)
>
> Revert it.
Re: 2 minute gateway startup time due to GEODE-5591
We should fix this for the release.

-Anil.

On Tue, Sep 4, 2018 at 5:09 PM Udo Kohlmeyer wrote:
> Imo (and I'm coming in cold)... We are NOT officially supporting Alpine
> linux (yet), which is the basis for this ticket, maybe push this to a
> later release?
>
> I prefer us getting out the fixes we have and release a more optimal
> version of GEODE-5591 later.
>
> IF this is a bug that will affect us on EVERY linux distro, then we
> should fix, otherwise, I vote to push it to 1.8
>
> --Udo
Re: 2 minute gateway startup time due to GEODE-5591
It affects us on any linux platform that doesn't use glibc. It's not worth holding up the release for. It's been this way for 20 years, right? ;)

Revert it.

On Tue, Sep 4, 2018 at 5:09 PM Udo Kohlmeyer wrote:
> Imo (and I'm coming in cold)... We are NOT officially supporting Alpine
> linux (yet), which is the basis for this ticket, maybe push this to a
> later release?
>
> I prefer us getting out the fixes we have and release a more optimal
> version of GEODE-5591 later.
>
> IF this is a bug that will affect us on EVERY linux distro, then we
> should fix, otherwise, I vote to push it to 1.8
>
> --Udo
Re: 2 minute gateway startup time due to GEODE-5591
Imo (and I'm coming in cold)... We are NOT officially supporting Alpine linux (yet), which is the basis for this ticket, maybe push this to a later release?

I prefer us getting out the fixes we have and release a more optimal version of GEODE-5591 later.

IF this is a bug that will affect us on EVERY linux distro, then we should fix, otherwise, I vote to push it to 1.8

--Udo

On 9/4/18 16:38, Dan Smith wrote:
> Splitting this into a separate thread.
>
> I see the issue. The two minute timeout is in the constructor for
> AcceptorImpl, where it retries to bind for 2 minutes.
>
> That behavior makes sense for CacheServer.start.
>
> But it doesn't make sense for the new logic in GatewayReceiver.start() from
> GEODE-5591. That code is trying to use CacheServer.start to scan for an
> available port, trying each port in a range. That free port finding logic
> really doesn't want to have two minutes of retries for each port. It seems
> like we need to rework the fix for GEODE-5591.
>
> Does it make sense to hold up the release to rework this fix, or should we
> just revert it? Have we switched concourse over to using alpine linux,
> which I think was the original motivation for this fix?
>
> -Dan
Re: 2 minute gateway startup time due to GEODE-5591
If it's to get the release out, I'm fine with reverting. I don't like it, but I'm not willing to die on that hill. :)

-S.

On Tue, Sep 4, 2018 at 4:38 PM Dan Smith wrote:
> Splitting this into a separate thread.
>
> I see the issue. The two minute timeout is in the constructor for
> AcceptorImpl, where it retries to bind for 2 minutes.
>
> That behavior makes sense for CacheServer.start.
>
> But it doesn't make sense for the new logic in GatewayReceiver.start() from
> GEODE-5591. That code is trying to use CacheServer.start to scan for an
> available port, trying each port in a range. That free port finding logic
> really doesn't want to have two minutes of retries for each port. It seems
> like we need to rework the fix for GEODE-5591.
>
> Does it make sense to hold up the release to rework this fix, or should we
> just revert it? Have we switched concourse over to using alpine linux,
> which I think was the original motivation for this fix?
>
> -Dan
2 minute gateway startup time due to GEODE-5591
Splitting this into a separate thread.

I see the issue. The two minute timeout is in the constructor for AcceptorImpl, where it retries to bind for 2 minutes.

That behavior makes sense for CacheServer.start.

But it doesn't make sense for the new logic in GatewayReceiver.start() from GEODE-5591. That code is trying to use CacheServer.start to scan for an available port, trying each port in a range. That free port finding logic really doesn't want to have two minutes of retries for each port. It seems like we need to rework the fix for GEODE-5591.

Does it make sense to hold up the release to rework this fix, or should we just revert it? Have we switched concourse over to using alpine linux, which I think was the original motivation for this fix?

-Dan

On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith wrote:
> Why is it waiting at all in this case? Where is this 2 minute timeout
> coming from?
>
> -Dan
>
> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda wrote:
>
>> So the issue is that it takes longer to start than previous releases?
>> Also, is this wait time only when using Gfsh to create gateway-receiver?
>>
>> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag wrote:
>>
>> > Currently we have a minor issue in the release branch as pointed out by
>> > Barry O.
>> > We will wait till a resolution is figured out for this issue.
>> >
>> > Steps:
>> > 1. create locator
>> > 2. start server --name=server1 --server-port=40404
>> > 3. start server --name=server2 --server-port=40405
>> > 4. create gateway-receiver --member=server1
>> > 5. create gateway-receiver --member=server2 `This gets stuck for 2 minutes`
>> >
>> > Is the 2 minute wait time acceptable? Should we document it? When we revert
>> > GEODE-5591, this issue does not happen.
>> >
>> > Regards
>> > Nabarun Nag
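Dan's point is that a free-port scan wants exactly one bind attempt per port, with no retry window. A minimal sketch of that behavior using only `java.net` is below. It is not Geode's AcceptorImpl or GatewayReceiverImpl, and `FailFastPortScan` and `bindFirstFree` are hypothetical names.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical helper (an assumption for illustration, not Geode code).
final class FailFastPortScan {

  /**
   * Returns a socket bound to the first free port in [startPort, endPort].
   * Each port gets a single bind attempt; a failed bind moves on at once.
   */
  static ServerSocket bindFirstFree(int startPort, int endPort) throws IOException {
    for (int port = startPort; port <= endPort; port++) {
      ServerSocket socket = new ServerSocket(); // created unbound
      try {
        socket.bind(new InetSocketAddress(port));
        return socket;
      } catch (IOException e) {
        socket.close(); // port is in use or not bindable; try the next one
      }
    }
    throw new IOException("No free port in range " + startPort + "-" + endPort);
  }
}
```

With this shape, the second receiver in the reproduction steps would skip an occupied port immediately instead of waiting out a retry loop on it. Whether the eventual rework takes this form is not stated in this thread.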