Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Xiaojian Zhou
OK, after discussion with Jason and Ryan, PR #2425 is ready. It contains
fixes for 3 issues, including skipping the 2-minute timeout.

On Wed, Sep 5, 2018 at 11:03 AM, Udo Kohlmeyer  wrote:

> +1

Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Udo Kohlmeyer

+1


On 9/5/18 10:35, Anthony Baker wrote:

Before this improvement is re-merged I’d like to see:

1) A test that characterizes the current behavior (e.g. doesn’t wait 2 min when 
there’s a port conflict)
2) A test that demonstrates how the current logic is insufficient

Anthony



On Sep 5, 2018, at 10:20 AM, Nabarun Nag  wrote:

GEODE-5591 has been reverted in develop
ref: 901da27f227a8ce2b7d6b681619782a1accd9330

Regards
Nabarun Nag

On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon  wrote:


+1 for reverting in both places.

I see that there is already an isGatewayReceiver flag in the AcceptorImpl
constructor.  It's not ideal, but could we use this flag to prevent the 2
minute retry logic from happening if this flag is true?

Ryan

On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <
lhughesgodf...@pivotal.io> wrote:


+1 for reverting in both places.

On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith  wrote:


+1 for reverting in both places. The current fix is not better, that's why
we are reverting it on the release branch!

-Dan

On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett wrote:

I’m not ok with reverting in develop. Revert in 1.7 and modify in develop.
We shouldn’t go backwards in develop. The current fix is better than the
bug it fixes.


On Sep 5, 2018, at 9:40 AM, Nabarun Nag  wrote:

If everyone is okay with it, I will revert that change in develop and then
cherry pick it to release/1.7.0 branch.
Please do comment.

Regards
Nabarun Nag



On Wed, Sep 5, 2018 at 9:30 AM Dan Smith wrote:

+1 to yank it and rework the fix.

Gester's change helps, but it just means that you will sometimes randomly
have a 2 minute delay starting up a gateway receiver. I don't think that is
a great user experience either.

-Dan

On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <bschucha...@pivotal.io> wrote:


Let's yank it




On 9/4/18 5:04 PM, Sean Goller wrote:

If it's to get the release out, I'm fine with reverting. I don't like it,
but I'm not willing to die on that hill. :)

-S.

On Tue, Sep 4, 2018 at 4:38 PM Dan Smith wrote:

Splitting this into a separate thread.

I see the issue. The two minute timeout is in the constructor for
AcceptorImpl, where it retries to bind for 2 minutes.

That behavior makes sense for CacheServer.start.

But it doesn't make sense for the new logic in GatewayReceiver.start() from
GEODE-5591. That code is trying to use CacheServer.start to scan for an
available port, trying each port in a range. That free port finding logic
really doesn't want to have two minutes of retries for each port. It seems
like we need to rework the fix for GEODE-5591.

Does it make sense to hold up the release to rework this fix, or should we
just revert it? Have we switched concourse over to using alpine linux,
which I think was the original motivation for this fix?

-Dan
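
Editorial sketch of the distinction described in the message above: a bind that retries for a fixed window suits a server start, while a port scan wants a single attempt per port. This is a minimal illustration under stated assumptions, not Geode's AcceptorImpl code. The class RetryingBinder, the method bindWithRetries, and the tryBind supplier (standing in for the real socket bind) are all hypothetical names.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not Geode's AcceptorImpl: retries a bind attempt until a deadline.
final class RetryingBinder {

  // Calls tryBind until it succeeds or retryWindow has elapsed.
  // A zero window means exactly one attempt, which is what a port scan wants.
  static boolean bindWithRetries(BooleanSupplier tryBind, Duration retryWindow, Duration pause) {
    long deadline = System.nanoTime() + retryWindow.toNanos();
    while (true) {
      if (tryBind.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false; // window exhausted, report the conflict to the caller
      }
      try {
        Thread.sleep(pause.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  public static void main(String[] args) {
    // Stands in for a port that is already taken (assumption for the demo).
    BooleanSupplier alwaysBusy = () -> false;
    // Scan-style call: zero retry window, so the conflict is reported immediately
    // and the caller can move on to the next port in the range.
    boolean bound = bindWithRetries(alwaysBusy, Duration.ZERO, Duration.ofMillis(100));
    System.out.println("bound: " + bound);
  }
}
```

With a window of Duration.ofMinutes(2) the same call would keep retrying a busy port for the full two minutes, which is the delay reported in this thread when that behavior is applied to every port in a scan.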

On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith wrote:

Why is it waiting at all in this case? Where is this 2 minute timeout
coming from?

-Dan

On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <sai.boorlaga...@gmail.com> wrote:

So the issue is that it takes longer to start than previous releases?

Also, is this wait time only when using Gfsh to create
gateway-receiver?

On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag wrote:

Currently we have a minor issue in the release branch as pointed out by
Barry O.

We will wait till a resolution is figured out for this issue.

Steps:
1. create locator
2. start server --name=server1 --server-port=40404
3. start server --name=server2 --server-port=40405
4. create gateway-receiver --member=server1
5. create gateway-receiver --member=server2 `This gets stuck for 2 minutes`

Is the 2 minute wait time acceptable? Should we document it? When we revert
GEODE-5591, this issue does not happen.

Regards
Nabarun Nag






Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Udo Kohlmeyer

Thank you. I must have missed that :)


On 9/5/18 10:54, Nabarun Nag wrote:

@Udo I have mentioned in an earlier mail that it will be reverted in
develop and then cherry picked to develop. release/1.7.0 branch has not
been published yet, as it is undergoing preliminary tests before the release
candidate is published.

Regards
Nabarun Nag

Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Nabarun Nag
*correction: cherry picked to release/1.7.0

On Wed, Sep 5, 2018 at 10:54 AM Nabarun Nag  wrote:

> @Udo I have mentioned in an earlier mail that it will be reverted in
> develop and then cherry picked to develop. release/1.7.0 branch has not
> been published yet, as it is undergoing preliminary tests before the release
> candidate is published.
>
> Regards
> Nabarun Nag

Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Xiaojian Zhou
The previous fix did not improve anything on the 2-minute-timeout.

On Wed, Sep 5, 2018 at 10:52 AM, Anthony Baker  wrote:

> Gester,
>
> Clearly the prior implementation had some problems, but except in
> pathological cases it provided the behavior users expected.  That’s why I
> think we need a characterization test(s) to show exactly what we want the
> behavior to be.  Merging in changes that make the user experience worse in
> the more common scenarios isn’t a good tradeoff IMO.  I see this work as
> integral to GEODE-5591 and shouldn’t be deferred to a separate ticket.
>
> Anthony

Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Nabarun Nag
@Udo I have mentioned in an earlier mail that it will be reverted in
develop and then cherry picked to develop. release/1.7.0 branch has not
been published yet, as it is undergoing preliminary tests before the release
candidate is published.

Regards
Nabarun Nag

On Wed, Sep 5, 2018 at 10:46 AM Udo Kohlmeyer  wrote:

> Did we also revert this in 1.7? I assume it has, but not directly stated
> here.

Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Anthony Baker
Gester,

Clearly the prior implementation had some problems, but except in pathological 
cases it provided the behavior users expected.  That’s why I think we need a 
characterization test(s) to show exactly what we want the behavior to be.  
Merging in changes that make the user experience worse in the more common 
scenarios isn’t a good tradeoff IMO.  I see this work as integral to GEODE-5591 
and shouldn’t be deferred to a separate ticket.

Anthony


> On Sep 5, 2018, at 10:43 AM, Xiaojian Zhou  wrote:
> 
> The fix intends to resolve 2 issues:
> 1) change the exception handling (for a linux version).
> 2) prevent the random port selection from looping forever. In the old code, for
> example, if the range only contains one port, random will always pick the
> same port and it will loop forever. The fix will stop after all available
> ports in the range are tried. There's a test
> 
> test_ValidateGatewayReceiverAttributes_WrongBindAddress
> 
> 
> For 2-minute-wait, it's still possible. The fix did not resolve it
> (when random() happened to return same port for different receiver in
> the same member), but I did not make things worse either.
> 
> 
> There's discussion on whether we can reduce the 2-minute-timeout to a few
> seconds. This is definitely another ticket.
> 
> Regards
> 
> Gester

Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Udo Kohlmeyer
Did we also revert this in 1.7? I assume it has, but not directly stated 
here.



On 9/5/18 10:20, Nabarun Nag wrote:

GEODE-5591 has been reverted in develop
ref: 901da27f227a8ce2b7d6b681619782a1accd9330

Regards
Nabarun Nag

On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon  wrote:


+1 for reverting in both places.

I see that there is already an isGatewayReceiver flag in the AcceptorImpl
constructor.  It's not ideal, but could we use this flag to prevent the 2
minute retry logic for happening if this flag is true?

Ryan

On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <
lhughesgodf...@pivotal.io> wrote:


+1 for reverting in both places.

On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith  wrote:


+1 for reverting in both places. The current fix is not better, that's

why

we are reverting it on the release branch!

-Dan

On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett 

wrote:

I’m not ok with reverting in develop. Revert in 1.7 and modify in

develop.

We shouldn’t go backwards in develop. The current fix is better than

the

bug it fixes.


On Sep 5, 2018, at 9:40 AM, Nabarun Nag  wrote:

If everyone is okay with it, I will revert that change in develop

and

then

cherry pick it to release/1.7.0 branch.
Please do comment.

Regards
Nabarun Nag



On Wed, Sep 5, 2018 at 9:30 AM Dan Smith 

wrote:

+1 to yank it and rework the fix.

Gester's change helps, but it just means that you will sometimes

randomly

have a 2 minute delay starting up a gateway receiver. I don't

think

that is

a great user experience either.

-Dan

On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <

bschucha...@pivotal.io>

wrote:


Let's yank it




On 9/4/18 5:04 PM, Sean Goller wrote:

If it's to get the release out, I'm fine with reverting. I don't

like

it,

but I'm not willing to die on that hill. :)

-S.

On Tue, Sep 4, 2018 at 4:38 PM Dan Smith wrote:

Splitting this into a separate thread.

I see the issue. The two minute timeout is in the constructor for
AcceptorImpl, where it retries to bind for 2 minutes.

That behavior makes sense for CacheServer.start.

But it doesn't make sense for the new logic in GatewayReceiver.start() from
GEODE-5591. That code is trying to use CacheServer.start to scan for an
available port, trying each port in a range. That free port finding logic
really doesn't want to have two minutes of retries for each port. It seems
like we need to rework the fix for GEODE-5591.

Does it make sense to hold up the release to rework this fix, or should we
just revert it? Have we switched concourse over to using alpine linux,
which I think was the original motivation for this fix?

-Dan

On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith wrote:

Why is it waiting at all in this case? Where is this 2 minute timeout
coming from?

-Dan

On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <sai.boorlaga...@gmail.com>
wrote:

So the issue is that it takes longer to start than previous releases?
Also, is this wait time only when using Gfsh to create
gateway-receiver?

On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag wrote:

Currently we have a minor issue in the release branch as pointed out by
Barry O.
We will wait till a resolution is figured out for this issue.

Steps:
1. create locator
2. start server --name=server1 --server-port=40404
3. start server --name=server2 --server-port=40405
4. create gateway-receiver --member=server1
5. create gateway-receiver --member=server2 `This gets stuck for 2 minutes`

Is the 2 minute wait time acceptable? Should we document it? When we revert
GEODE-5591, this issue does not happen.

Regards
Nabarun Nag






Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Xiaojian Zhou
The fix is intended to resolve 2 issues:
1) change the exception handling (for a Linux version).
2) prevent the random port selection from looping forever. In the old code, for
example, if the range contains only one port, random will always pick the
same port and it will loop forever. The fix stops after all available
ports in the range have been tried. There's a test

test_ValidateGatewayReceiverAttributes_WrongBindAddress
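
To illustrate point 2, here is a minimal sketch (not the code from the PR)
of visiting every port in the range exactly once, in random order, so the
scan always terminates. PortRangeScan, candidatePorts and firstBindable are
hypothetical names made up for this example; only java.net.ServerSocket and
Collections.shuffle are real APIs, and the real fix goes through the cache
server rather than a raw socket.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not Geode code: each port in [start, end] is tried once.
final class PortRangeScan {

  // Returns every port in the inclusive range, in random order, with no repeats.
  static List<Integer> candidatePorts(int start, int end, Random random) {
    List<Integer> ports = new ArrayList<>();
    for (int port = start; port <= end; port++) {
      ports.add(port);
    }
    Collections.shuffle(ports, random);
    return ports;
  }

  // Tries each candidate once; returns the bound socket, or null if none is free.
  static ServerSocket firstBindable(int start, int end, Random random) {
    for (int port : candidatePorts(start, end, random)) {
      try {
        return new ServerSocket(port);
      } catch (IOException e) {
        // Port is in use or cannot be bound; move on to the next candidate.
      }
    }
    return null; // the whole range was tried, so the loop cannot spin forever
  }
}
```

With a one-port range, the list has a single entry, so a failed bind ends
the scan instead of picking the same port again.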


The 2-minute wait is still possible. The fix did not resolve it
(when random() happens to return the same port for different receivers in
the same member), but I did not make things worse either.


There's discussion about whether we can reduce the 2-minute timeout to a few
seconds. That is definitely another ticket.

Regards

Gester


On Wed, Sep 5, 2018 at 10:35 AM, Anthony Baker  wrote:

> Before this improvement is re-merged I’d like to see:
>
> 1) A test that characterizes the current behavior (e.g. doesn’t wait 2 min
> when there’s a port conflict)
> 2) A test that demonstrates how the current logic is insufficient
>
> Anthony
>
>
> > On Sep 5, 2018, at 10:20 AM, Nabarun Nag  wrote:
> >
> > GEODE-5591 has been reverted in develop
> > ref: 901da27f227a8ce2b7d6b681619782a1accd9330
> >
> > Regards
> > Nabarun Nag
> >
> > On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon 
> wrote:
> >
> >> +1 for reverting in both places.
> >>
> >> I see that there is already an isGatewayReceiver flag in the
> AcceptorImpl
> >> constructor.  It's not ideal, but could we use this flag to prevent the
> 2
> >> minute retry logic for happening if this flag is true?
> >>
> >> Ryan
> >>
> >> On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <
> >> lhughesgodf...@pivotal.io> wrote:
> >>
> >>> +1 for reverting in both places.
> >>>
> >>> On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith  wrote:
> >>>
>  +1 for reverting in both places. The current fix is not better, that's
> >>> why
>  we are reverting it on the release branch!
> 
>  -Dan
> 
>  On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett 
> >>> wrote:
> 
> > I’m not ok with reverting in develop. Revert in 1.7 and modify in
>  develop.
> > We shouldn’t go backwards in develop. The current fix is better than
> >>> the
> > bug it fixes.
> >
> >> On Sep 5, 2018, at 9:40 AM, Nabarun Nag  wrote:
> >>
> >> If everyone is okay with it, I will revert that change in develop
> >> and
> > then
> >> cherry pick it to release/1.7.0 branch.
> >> Please do comment.
> >>
> >> Regards
> >> Nabarun Nag
> >>
> >>
> >>> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith 
> >> wrote:
> >>>
> >>> +1 to yank it and rework the fix.
> >>>
> >>> Gester's change helps, but it just means that you will sometimes
> > randomly
> >>> have a 2 minute delay starting up a gateway receiver. I don't
> >> think
> > that is
> >>> a great user experience either.
> >>>
> >>> -Dan
> >>>
> >>> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <
> > bschucha...@pivotal.io>
> >>> wrote:
> >>>
>  Let's yank it
> 
> 
> 
> > On 9/4/18 5:04 PM, Sean Goller wrote:
> >
> > If it's to get the release out, I'm fine with reverting. I don't
>  like
> >>> it,
> > but I'm not willing to die on that hill. :)
> >
> > -S.
> >
> > On Tue, Sep 4, 2018 at 4:38 PM Dan Smith 
> >>> wrote:
> >
> > Spitting this into a separate thread.
> >>
> >> I see the issue. The two minute timeout is the constructor for
> >> AcceptorImpl, where it retries to bind for 2 minutes.
> >>
> >> That behavior makes sense for CacheServer.start.
> >>
> >> But it doesn't make sense for the new logic in
> > GatewayReceiver.start()
> >> from
> >> GEODE-5591. That code is trying to use CacheServer.start to
> >> scan
>  for
> > an
> >> available port, trying each port in a range. That free port
> >>> finding
> >>> logic
> >> really doesn't want to have two minutes of retries for each
> >> port.
>  It
> >> seems
> >> like we need to rework the fix for GEODE-5591.
> >>
> >> Does it make sense to hold up the release to rework this fix,
> >> or
> > should
> >> we
> >> just revert it? Have we switched concourse over to using alpine
> > linux,
> >> which I think was the original motivation for this fix?
> >>
> >> -Dan
> >>
> >> On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith 
>  wrote:
> >>
> >> Why is it waiting at all in this case? Where is this 2 minute
>  timeout
> >>> coming from?
> >>>
> >>> -Dan
> >>>
> >>> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <
> >>>
> >> sai.boorlaga...@gmail.com
> >>
> >>> wrote:
>  So the 

Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Anthony Baker
Before this improvement is re-merged I’d like to see:

1) A test that characterizes the current behavior (e.g. doesn’t wait 2 min when 
there’s a port conflict)
2) A test that demonstrates how the current logic is insufficient

Anthony


> On Sep 5, 2018, at 10:20 AM, Nabarun Nag  wrote:
> 
> GEODE-5591 has been reverted in develop
> ref: 901da27f227a8ce2b7d6b681619782a1accd9330
> 
> Regards
> Nabarun Nag
> 
> On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon  wrote:
> 
>> +1 for reverting in both places.
>> 
>> I see that there is already an isGatewayReceiver flag in the AcceptorImpl
>> constructor.  It's not ideal, but could we use this flag to prevent the 2
>> minute retry logic for happening if this flag is true?
>> 
>> Ryan
>> 
>> On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <
>> lhughesgodf...@pivotal.io> wrote:
>> 
>>> +1 for reverting in both places.
>>> 
>>> On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith  wrote:
>>> 
 +1 for reverting in both places. The current fix is not better, that's
>>> why
 we are reverting it on the release branch!
 
 -Dan
 
 On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett 
>>> wrote:
 
> I’m not ok with reverting in develop. Revert in 1.7 and modify in
 develop.
> We shouldn’t go backwards in develop. The current fix is better than
>>> the
> bug it fixes.
> 
>> On Sep 5, 2018, at 9:40 AM, Nabarun Nag  wrote:
>> 
>> If everyone is okay with it, I will revert that change in develop
>> and
> then
>> cherry pick it to release/1.7.0 branch.
>> Please do comment.
>> 
>> Regards
>> Nabarun Nag
>> 
>> 
>>> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith 
>> wrote:
>>> 
>>> +1 to yank it and rework the fix.
>>> 
>>> Gester's change helps, but it just means that you will sometimes
> randomly
>>> have a 2 minute delay starting up a gateway receiver. I don't
>> think
> that is
>>> a great user experience either.
>>> 
>>> -Dan
>>> 
>>> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <
> bschucha...@pivotal.io>
>>> wrote:
>>> 
 Let's yank it
 
 
 
> On 9/4/18 5:04 PM, Sean Goller wrote:
> 
> If it's to get the release out, I'm fine with reverting. I don't
 like
>>> it,
> but I'm not willing to die on that hill. :)
> 
> -S.
> 
> On Tue, Sep 4, 2018 at 4:38 PM Dan Smith 
>>> wrote:
> 
> Spitting this into a separate thread.
>> 
>> I see the issue. The two minute timeout is the constructor for
>> AcceptorImpl, where it retries to bind for 2 minutes.
>> 
>> That behavior makes sense for CacheServer.start.
>> 
>> But it doesn't make sense for the new logic in
> GatewayReceiver.start()
>> from
>> GEODE-5591. That code is trying to use CacheServer.start to
>> scan
 for
> an
>> available port, trying each port in a range. That free port
>>> finding
>>> logic
>> really doesn't want to have two minutes of retries for each
>> port.
 It
>> seems
>> like we need to rework the fix for GEODE-5591.
>> 
>> Does it make sense to hold up the release to rework this fix,
>> or
> should
>> we
>> just revert it? Have we switched concourse over to using alpine
> linux,
>> which I think was the original motivation for this fix?
>> 
>> -Dan
>> 
>> On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith 
 wrote:
>> 
>> Why is it waiting at all in this case? Where is this 2 minute
 timeout
>>> coming from?
>>> 
>>> -Dan
>>> 
>>> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <
>>> 
>> sai.boorlaga...@gmail.com
>> 
>>> wrote:
 So the issue is that it takes longer to start than previous
> releases?
 Also, is this wait time only when using Gfsh to create
 gateway-receiver?
 
 On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag 
> wrote:
 
 Currently we have a minor issue in the release branch as
>>> pointed
> out
> 
 by
>> 
>>> Barry O.
> We will wait till a resolution is figured out for this
>> issue.
> 
> Steps:
> 1. create locator
> 2. start server --name=server1 --server-port=40404
> 3. start server --name=server2 --server-port=40405
> 4. create gateway-receiver --member=server1
> 5. create gateway-receiver --member=server2 `This gets stuck
 for 2
> 
 minutes`
 
> Is the 2 minute wait time acceptable? Should we document it?
 When
> we
> 
 revert
 

Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Xiaojian Zhou
Well, I found it's already reverted.

But I think we don't have to.

After discussing with Jason, I worked out a new fix. It keeps the previous
5591 change's intention for exception handling and improves how the port is
assigned.

The port is now checked for availability, so it will also resolve the
2-minute timeout issue for the retry. (Or at least it will not make things
worse.)
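
As a rough sketch of what "checked if available" could look like, assuming
a plain java.net.ServerSocket probe (the actual change may well use an
existing Geode utility instead), something like the following would do.
PortProbe and isPortAvailable are made-up names for illustration only.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical helper, not Geode code.
final class PortProbe {

  // Returns true if a listening socket could be bound to the port just now.
  // The probe socket is closed again before returning.
  static boolean isPortAvailable(int port) {
    try (ServerSocket probe = new ServerSocket(port)) {
      return true;
    } catch (IOException e) {
      // Bind failed (typically "address already in use").
      return false;
    }
  }
}
```

A probe like this is inherently racy: another process can grab the port
between the check and the real bind, so the caller still has to handle a
bind failure. It only makes the slow retry path much less likely.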

On Wed, Sep 5, 2018 at 10:14 AM, Ryan McMahon  wrote:

> +1 for reverting in both places.
>
> I see that there is already an isGatewayReceiver flag in the AcceptorImpl
> constructor.  It's not ideal, but could we use this flag to prevent the 2
> minute retry logic for happening if this flag is true?
>
> Ryan
>
> On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <
> lhughesgodf...@pivotal.io> wrote:
>
> > +1 for reverting in both places.
> >
> > On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith  wrote:
> >
> > > +1 for reverting in both places. The current fix is not better, that's
> > why
> > > we are reverting it on the release branch!
> > >
> > > -Dan
> > >
> > > On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett 
> > wrote:
> > >
> > > > I’m not ok with reverting in develop. Revert in 1.7 and modify in
> > > develop.
> > > > We shouldn’t go backwards in develop. The current fix is better than
> > the
> > > > bug it fixes.
> > > >
> > > > > On Sep 5, 2018, at 9:40 AM, Nabarun Nag  wrote:
> > > > >
> > > > > If everyone is okay with it, I will revert that change in develop
> and
> > > > then
> > > > > cherry pick it to release/1.7.0 branch.
> > > > > Please do comment.
> > > > >
> > > > > Regards
> > > > > Nabarun Nag
> > > > >
> > > > >
> > > > >> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith 
> wrote:
> > > > >>
> > > > >> +1 to yank it and rework the fix.
> > > > >>
> > > > >> Gester's change helps, but it just means that you will sometimes
> > > > randomly
> > > > >> have a 2 minute delay starting up a gateway receiver. I don't
> think
> > > > that is
> > > > >> a great user experience either.
> > > > >>
> > > > >> -Dan
> > > > >>
> > > > >> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <
> > > > bschucha...@pivotal.io>
> > > > >> wrote:
> > > > >>
> > > > >>> Let's yank it
> > > > >>>
> > > > >>>
> > > > >>>
> > > >  On 9/4/18 5:04 PM, Sean Goller wrote:
> > > > 
> > > >  If it's to get the release out, I'm fine with reverting. I don't
> > > like
> > > > >> it,
> > > >  but I'm not willing to die on that hill. :)
> > > > 
> > > >  -S.
> > > > 
> > > >  On Tue, Sep 4, 2018 at 4:38 PM Dan Smith 
> > wrote:
> > > > 
> > > >  Spitting this into a separate thread.
> > > > >
> > > > > I see the issue. The two minute timeout is the constructor for
> > > > > AcceptorImpl, where it retries to bind for 2 minutes.
> > > > >
> > > > > That behavior makes sense for CacheServer.start.
> > > > >
> > > > > But it doesn't make sense for the new logic in
> > > > GatewayReceiver.start()
> > > > > from
> > > > > GEODE-5591. That code is trying to use CacheServer.start to
> scan
> > > for
> > > > an
> > > > > available port, trying each port in a range. That free port
> > finding
> > > > >> logic
> > > > > really doesn't want to have two minutes of retries for each
> port.
> > > It
> > > > > seems
> > > > > like we need to rework the fix for GEODE-5591.
> > > > >
> > > > > Does it make sense to hold up the release to rework this fix,
> or
> > > > should
> > > > > we
> > > > > just revert it? Have we switched concourse over to using alpine
> > > > linux,
> > > > > which I think was the original motivation for this fix?
> > > > >
> > > > > -Dan
> > > > >
> > > > > On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith 
> > > wrote:
> > > > >
> > > > > Why is it waiting at all in this case? Where is this 2 minute
> > > timeout
> > > > >> coming from?
> > > > >>
> > > > >> -Dan
> > > > >>
> > > > >> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <
> > > > >>
> > > > > sai.boorlaga...@gmail.com
> > > > >
> > > > >> wrote:
> > > > >>> So the issue is that it takes longer to start than previous
> > > > releases?
> > > > >>> Also, is this wait time only when using Gfsh to create
> > > > >>> gateway-receiver?
> > > > >>>
> > > > >>> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag 
> > > > wrote:
> > > > >>>
> > > > >>> Currently we have a minor issue in the release branch as
> > pointed
> > > > out
> > > > 
> > > > >>> by
> > > > >
> > > > >> Barry O.
> > > >  We will wait till a resolution is figured out for this
> issue.
> > > > 
> > > >  Steps:
> > > >  1. create locator
> > > >  2. start server --name=server1 --server-port=40404
> > > >  3. start server --name=server2 --server-port=40405
> > > >  4. create gateway-receiver --member=server1
> > > >  5. create gateway-receiver --member=server2 `This gets stuck
> > > for 2
> > 

Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Nabarun Nag
GEODE-5591 has been reverted in develop
ref: 901da27f227a8ce2b7d6b681619782a1accd9330

Regards
Nabarun Nag

On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon  wrote:

> +1 for reverting in both places.
>
> I see that there is already an isGatewayReceiver flag in the AcceptorImpl
> constructor.  It's not ideal, but could we use this flag to prevent the 2
> minute retry logic for happening if this flag is true?
>
> Ryan
>
> On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <
> lhughesgodf...@pivotal.io> wrote:
>
> > +1 for reverting in both places.
> >
> > On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith  wrote:
> >
> > > +1 for reverting in both places. The current fix is not better, that's
> > why
> > > we are reverting it on the release branch!
> > >
> > > -Dan
> > >
> > > On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett 
> > wrote:
> > >
> > > > I’m not ok with reverting in develop. Revert in 1.7 and modify in
> > > develop.
> > > > We shouldn’t go backwards in develop. The current fix is better than
> > the
> > > > bug it fixes.
> > > >
> > > > > On Sep 5, 2018, at 9:40 AM, Nabarun Nag  wrote:
> > > > >
> > > > > If everyone is okay with it, I will revert that change in develop
> and
> > > > then
> > > > > cherry pick it to release/1.7.0 branch.
> > > > > Please do comment.
> > > > >
> > > > > Regards
> > > > > Nabarun Nag
> > > > >
> > > > >
> > > > >> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith 
> wrote:
> > > > >>
> > > > >> +1 to yank it and rework the fix.
> > > > >>
> > > > >> Gester's change helps, but it just means that you will sometimes
> > > > randomly
> > > > >> have a 2 minute delay starting up a gateway receiver. I don't
> think
> > > > that is
> > > > >> a great user experience either.
> > > > >>
> > > > >> -Dan
> > > > >>
> > > > >> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <
> > > > bschucha...@pivotal.io>
> > > > >> wrote:
> > > > >>
> > > > >>> Let's yank it
> > > > >>>
> > > > >>>
> > > > >>>
> > > >  On 9/4/18 5:04 PM, Sean Goller wrote:
> > > > 
> > > >  If it's to get the release out, I'm fine with reverting. I don't
> > > like
> > > > >> it,
> > > >  but I'm not willing to die on that hill. :)
> > > > 
> > > >  -S.
> > > > 
> > > >  On Tue, Sep 4, 2018 at 4:38 PM Dan Smith 
> > wrote:
> > > > 
> > > >  Spitting this into a separate thread.
> > > > >
> > > > > I see the issue. The two minute timeout is the constructor for
> > > > > AcceptorImpl, where it retries to bind for 2 minutes.
> > > > >
> > > > > That behavior makes sense for CacheServer.start.
> > > > >
> > > > > But it doesn't make sense for the new logic in
> > > > GatewayReceiver.start()
> > > > > from
> > > > > GEODE-5591. That code is trying to use CacheServer.start to
> scan
> > > for
> > > > an
> > > > > available port, trying each port in a range. That free port
> > finding
> > > > >> logic
> > > > > really doesn't want to have two minutes of retries for each
> port.
> > > It
> > > > > seems
> > > > > like we need to rework the fix for GEODE-5591.
> > > > >
> > > > > Does it make sense to hold up the release to rework this fix,
> or
> > > > should
> > > > > we
> > > > > just revert it? Have we switched concourse over to using alpine
> > > > linux,
> > > > > which I think was the original motivation for this fix?
> > > > >
> > > > > -Dan
> > > > >
> > > > > On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith 
> > > wrote:
> > > > >
> > > > > Why is it waiting at all in this case? Where is this 2 minute
> > > timeout
> > > > >> coming from?
> > > > >>
> > > > >> -Dan
> > > > >>
> > > > >> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <
> > > > >>
> > > > > sai.boorlaga...@gmail.com
> > > > >
> > > > >> wrote:
> > > > >>> So the issue is that it takes longer to start than previous
> > > > releases?
> > > > >>> Also, is this wait time only when using Gfsh to create
> > > > >>> gateway-receiver?
> > > > >>>
> > > > >>> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag 
> > > > wrote:
> > > > >>>
> > > > >>> Currently we have a minor issue in the release branch as
> > pointed
> > > > out
> > > > 
> > > > >>> by
> > > > >
> > > > >> Barry O.
> > > >  We will wait till a resolution is figured out for this
> issue.
> > > > 
> > > >  Steps:
> > > >  1. create locator
> > > >  2. start server --name=server1 --server-port=40404
> > > >  3. start server --name=server2 --server-port=40405
> > > >  4. create gateway-receiver --member=server1
> > > >  5. create gateway-receiver --member=server2 `This gets stuck
> > > for 2
> > > > 
> > > > >>> minutes`
> > > > >>>
> > > >  Is the 2 minute wait time acceptable? Should we document it?
> > > When
> > > > we
> > > > 
> > > > >>> revert
> > > > >>>
> > > >  GEODE-5591, this issue 

Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Ryan McMahon
+1 for reverting in both places.

I see that there is already an isGatewayReceiver flag in the AcceptorImpl
constructor.  It's not ideal, but could we use this flag to prevent the 2
minute retry logic from happening if this flag is true?
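
Roughly what I have in mind, as a standalone sketch rather than the real
AcceptorImpl code: RetryingBinder, failFast, retryWindowMillis and
pauseMillis are all hypothetical names, with failFast standing in for the
isGatewayReceiver flag.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not Geode code.
final class RetryingBinder {

  // Binds to the port, retrying until retryWindowMillis has elapsed.
  // When failFast is true (e.g. a gateway receiver scanning a port range),
  // the first bind failure is rethrown immediately with no retry.
  static ServerSocket bind(int port, boolean failFast, long retryWindowMillis, long pauseMillis)
      throws IOException, InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(retryWindowMillis);
    while (true) {
      try {
        return new ServerSocket(port);
      } catch (IOException e) {
        if (failFast || System.nanoTime() - deadline >= 0) {
          throw e;
        }
        Thread.sleep(pauseMillis); // wait before the next bind attempt
      }
    }
  }
}
```

A plain cache server would call this with failFast=false and keep today's
behavior, while the receiver's port scan would pass true and move on to the
next port right away.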

Ryan

On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <
lhughesgodf...@pivotal.io> wrote:

> +1 for reverting in both places.
>
> On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith  wrote:
>
> > +1 for reverting in both places. The current fix is not better, that's
> why
> > we are reverting it on the release branch!
> >
> > -Dan
> >
> > On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett 
> wrote:
> >
> > > I’m not ok with reverting in develop. Revert in 1.7 and modify in
> > develop.
> > > We shouldn’t go backwards in develop. The current fix is better than
> the
> > > bug it fixes.
> > >
> > > > On Sep 5, 2018, at 9:40 AM, Nabarun Nag  wrote:
> > > >
> > > > If everyone is okay with it, I will revert that change in develop and
> > > then
> > > > cherry pick it to release/1.7.0 branch.
> > > > Please do comment.
> > > >
> > > > Regards
> > > > Nabarun Nag
> > > >
> > > >
> > > >> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith  wrote:
> > > >>
> > > >> +1 to yank it and rework the fix.
> > > >>
> > > >> Gester's change helps, but it just means that you will sometimes
> > > randomly
> > > >> have a 2 minute delay starting up a gateway receiver. I don't think
> > > that is
> > > >> a great user experience either.
> > > >>
> > > >> -Dan
> > > >>
> > > >> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <
> > > bschucha...@pivotal.io>
> > > >> wrote:
> > > >>
> > > >>> Let's yank it
> > > >>>
> > > >>>
> > > >>>
> > >  On 9/4/18 5:04 PM, Sean Goller wrote:
> > > 
> > >  If it's to get the release out, I'm fine with reverting. I don't
> > like
> > > >> it,
> > >  but I'm not willing to die on that hill. :)
> > > 
> > >  -S.
> > > 
> > >  On Tue, Sep 4, 2018 at 4:38 PM Dan Smith 
> wrote:
> > > 
> > >  Spitting this into a separate thread.
> > > >
> > > > I see the issue. The two minute timeout is the constructor for
> > > > AcceptorImpl, where it retries to bind for 2 minutes.
> > > >
> > > > That behavior makes sense for CacheServer.start.
> > > >
> > > > But it doesn't make sense for the new logic in
> > > GatewayReceiver.start()
> > > > from
> > > > GEODE-5591. That code is trying to use CacheServer.start to scan
> > for
> > > an
> > > > available port, trying each port in a range. That free port
> finding
> > > >> logic
> > > > really doesn't want to have two minutes of retries for each port.
> > It
> > > > seems
> > > > like we need to rework the fix for GEODE-5591.
> > > >
> > > > Does it make sense to hold up the release to rework this fix, or
> > > should
> > > > we
> > > > just revert it? Have we switched concourse over to using alpine
> > > linux,
> > > > which I think was the original motivation for this fix?
> > > >
> > > > -Dan
> > > >
> > > > On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith 
> > wrote:
> > > >
> > > > Why is it waiting at all in this case? Where is this 2 minute
> > timeout
> > > >> coming from?
> > > >>
> > > >> -Dan
> > > >>
> > > >> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <
> > > >>
> > > > sai.boorlaga...@gmail.com
> > > >
> > > >> wrote:
> > > >>> So the issue is that it takes longer to start than previous
> > > releases?
> > > >>> Also, is this wait time only when using Gfsh to create
> > > >>> gateway-receiver?
> > > >>>
> > > >>> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag 
> > > wrote:
> > > >>>
> > > >>> Currently we have a minor issue in the release branch as
> pointed
> > > out
> > > 
> > > >>> by
> > > >
> > > >> Barry O.
> > >  We will wait till a resolution is figured out for this issue.
> > > 
> > >  Steps:
> > >  1. create locator
> > >  2. start server --name=server1 --server-port=40404
> > >  3. start server --name=server2 --server-port=40405
> > >  4. create gateway-receiver --member=server1
> > >  5. create gateway-receiver --member=server2 `This gets stuck
> > for 2
> > > 
> > > >>> minutes`
> > > >>>
> > >  Is the 2 minute wait time acceptable? Should we document it?
> > When
> > > we
> > > 
> > > >>> revert
> > > >>>
> > >  GEODE-5591, this issue does not happen.
> > > 
> > >  Regards
> > >  Nabarun Nag
> > > 
> > > 
> > > >>>
> > > >>
> > >
> >
>


Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Lynn Hughes-Godfrey
+1 for reverting in both places.

On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith  wrote:

> +1 for reverting in both places. The current fix is not better, that's why
> we are reverting it on the release branch!
>
> -Dan
>
> On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett  wrote:
>
> > I’m not ok with reverting in develop. Revert in 1.7 and modify in
> develop.
> > We shouldn’t go backwards in develop. The current fix is better than the
> > bug it fixes.
> >
> > > On Sep 5, 2018, at 9:40 AM, Nabarun Nag  wrote:
> > >
> > > If everyone is okay with it, I will revert that change in develop and
> > then
> > > cherry pick it to release/1.7.0 branch.
> > > Please do comment.
> > >
> > > Regards
> > > Nabarun Nag
> > >
> > >
> > >> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith  wrote:
> > >>
> > >> +1 to yank it and rework the fix.
> > >>
> > >> Gester's change helps, but it just means that you will sometimes
> > randomly
> > >> have a 2 minute delay starting up a gateway receiver. I don't think
> > that is
> > >> a great user experience either.
> > >>
> > >> -Dan
> > >>
> > >> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <
> > bschucha...@pivotal.io>
> > >> wrote:
> > >>
> > >>> Let's yank it
> > >>>
> > >>>
> > >>>
> >  On 9/4/18 5:04 PM, Sean Goller wrote:
> > 
> >  If it's to get the release out, I'm fine with reverting. I don't
> like
> > >> it,
> >  but I'm not willing to die on that hill. :)
> > 
> >  -S.
> > 
> >  On Tue, Sep 4, 2018 at 4:38 PM Dan Smith  wrote:
> > 
> >  Spitting this into a separate thread.
> > >
> > > I see the issue. The two minute timeout is the constructor for
> > > AcceptorImpl, where it retries to bind for 2 minutes.
> > >
> > > That behavior makes sense for CacheServer.start.
> > >
> > > But it doesn't make sense for the new logic in
> > GatewayReceiver.start()
> > > from
> > > GEODE-5591. That code is trying to use CacheServer.start to scan
> for
> > an
> > > available port, trying each port in a range. That free port finding
> > >> logic
> > > really doesn't want to have two minutes of retries for each port.
> It
> > > seems
> > > like we need to rework the fix for GEODE-5591.
> > >
> > > Does it make sense to hold up the release to rework this fix, or
> > should
> > > we
> > > just revert it? Have we switched concourse over to using alpine
> > linux,
> > > which I think was the original motivation for this fix?
> > >
> > > -Dan
> > >
> > > On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith 
> wrote:
> > >
> > > Why is it waiting at all in this case? Where is this 2 minute
> timeout
> > >> coming from?
> > >>
> > >> -Dan
> > >>
> > >> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <
> > >>
> > > sai.boorlaga...@gmail.com
> > >
> > >> wrote:
> > >>> So the issue is that it takes longer to start than previous
> > releases?
> > >>> Also, is this wait time only when using Gfsh to create
> > >>> gateway-receiver?
> > >>>
> > >>> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag 
> > wrote:
> > >>>
> > >>> Currently we have a minor issue in the release branch as pointed
> > out
> > 
> > >>> by
> > >
> > >> Barry O.
> >  We will wait till a resolution is figured out for this issue.
> > 
> >  Steps:
> >  1. create locator
> >  2. start server --name=server1 --server-port=40404
> >  3. start server --name=server2 --server-port=40405
> >  4. create gateway-receiver --member=server1
> >  5. create gateway-receiver --member=server2 `This gets stuck
> for 2
> > 
> > >>> minutes`
> > >>>
> >  Is the 2 minute wait time acceptable? Should we document it?
> When
> > we
> > 
> > >>> revert
> > >>>
> >  GEODE-5591, this issue does not happen.
> > 
> >  Regards
> >  Nabarun Nag
> > 
> > 
> > >>>
> > >>
> >
>


Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Dan Smith
+1 for reverting in both places. The current fix is not better, that's why
we are reverting it on the release branch!

-Dan

On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett  wrote:

> I’m not ok with reverting in develop. Revert in 1.7 and modify in develop.
> We shouldn’t go backwards in develop. The current fix is better than the
> bug it fixes.
>
> > On Sep 5, 2018, at 9:40 AM, Nabarun Nag  wrote:
> >
> > If everyone is okay with it, I will revert that change in develop and
> then
> > cherry pick it to release/1.7.0 branch.
> > Please do comment.
> >
> > Regards
> > Nabarun Nag
> >
> >
> >> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith  wrote:
> >>
> >> +1 to yank it and rework the fix.
> >>
> >> Gester's change helps, but it just means that you will sometimes
> randomly
> >> have a 2 minute delay starting up a gateway receiver. I don't think
> that is
> >> a great user experience either.
> >>
> >> -Dan
> >>
> >> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <
> bschucha...@pivotal.io>
> >> wrote:
> >>
> >>> Let's yank it
> >>>
> >>>
> >>>
>  On 9/4/18 5:04 PM, Sean Goller wrote:
> 
>  If it's to get the release out, I'm fine with reverting. I don't like
> >> it,
>  but I'm not willing to die on that hill. :)
> 
>  -S.
> 
>  On Tue, Sep 4, 2018 at 4:38 PM Dan Smith  wrote:
> 
>  Spitting this into a separate thread.
> >
> > I see the issue. The two minute timeout is the constructor for
> > AcceptorImpl, where it retries to bind for 2 minutes.
> >
> > That behavior makes sense for CacheServer.start.
> >
> > But it doesn't make sense for the new logic in
> GatewayReceiver.start()
> > from
> > GEODE-5591. That code is trying to use CacheServer.start to scan for
> an
> > available port, trying each port in a range. That free port finding
> >> logic
> > really doesn't want to have two minutes of retries for each port. It
> > seems
> > like we need to rework the fix for GEODE-5591.
> >
> > Does it make sense to hold up the release to rework this fix, or
> should
> > we
> > just revert it? Have we switched concourse over to using alpine
> linux,
> > which I think was the original motivation for this fix?
> >
> > -Dan
> >
> > On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith  wrote:
> >
> > Why is it waiting at all in this case? Where is this 2 minute timeout
> >> coming from?
> >>
> >> -Dan
> >>
> >> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <
> >>
> > sai.boorlaga...@gmail.com
> >
> >> wrote:
> >>> So the issue is that it takes longer to start than previous
> releases?
> >>> Also, is this wait time only when using Gfsh to create
> >>> gateway-receiver?
> >>>
> >>> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag 
> wrote:
> >>>
> >>> Currently we have a minor issue in the release branch as pointed
> out
> 
> >>> by
> >
> >> Barry O.
>  We will wait till a resolution is figured out for this issue.
> 
>  Steps:
>  1. create locator
>  2. start server --name=server1 --server-port=40404
>  3. start server --name=server2 --server-port=40405
>  4. create gateway-receiver --member=server1
>  5. create gateway-receiver --member=server2 `This gets stuck for 2
> 
> >>> minutes`
> >>>
>  Is the 2 minute wait time acceptable? Should we document it? When
> we
> 
> >>> revert
> >>>
>  GEODE-5591, this issue does not happen.
> 
>  Regards
>  Nabarun Nag
> 
> 
> >>>
> >>
>


Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Sai Boorlagadda
+1 to reverting in 1.7 and leaving the fix on develop.

On Wed, Sep 5, 2018 at 9:47 AM Jacob Barrett  wrote:

> I’m not ok with reverting in develop. Revert in 1.7 and modify in develop.
> We shouldn’t go backwards in develop. The current fix is better than the
> bug it fixes.
>
> > On Sep 5, 2018, at 9:40 AM, Nabarun Nag  wrote:
> >
> > If everyone is okay with it, I will revert that change in develop and
> then
> > cherry pick it to release/1.7.0 branch.
> > Please do comment.
> >
> > Regards
> > Nabarun Nag
> >
> >
> >> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith  wrote:
> >>
> >> +1 to yank it and rework the fix.
> >>
> >> Gester's change helps, but it just means that you will sometimes
> randomly
> >> have a 2 minute delay starting up a gateway receiver. I don't think
> that is
> >> a great user experience either.
> >>
> >> -Dan
> >>
> >> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <
> bschucha...@pivotal.io>
> >> wrote:
> >>
> >>> Let's yank it
> >>>
> >>>
> >>>
>  On 9/4/18 5:04 PM, Sean Goller wrote:
> 
>  If it's to get the release out, I'm fine with reverting. I don't like
> >> it,
>  but I'm not willing to die on that hill. :)
> 
>  -S.
> 
>  On Tue, Sep 4, 2018 at 4:38 PM Dan Smith  wrote:
> 
>  Splitting this into a separate thread.
> >
> > I see the issue. The two minute timeout is the constructor for
> > AcceptorImpl, where it retries to bind for 2 minutes.
> >
> > That behavior makes sense for CacheServer.start.
> >
> > But it doesn't make sense for the new logic in
> GatewayReceiver.start()
> > from
> > GEODE-5591. That code is trying to use CacheServer.start to scan for
> an
> > available port, trying each port in a range. That free port finding
> >> logic
> > really doesn't want to have two minutes of retries for each port. It
> > seems
> > like we need to rework the fix for GEODE-5591.
> >
> > Does it make sense to hold up the release to rework this fix, or
> should
> > we
> > just revert it? Have we switched concourse over to using alpine
> linux,
> > which I think was the original motivation for this fix?
> >
> > -Dan
> >
> > On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith  wrote:
> >
> > Why is it waiting at all in this case? Where is this 2 minute timeout
> >> coming from?
> >>
> >> -Dan
> >>
> >> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <
> >>
> > sai.boorlaga...@gmail.com
> >
> >> wrote:
> >>> So the issue is that it takes longer to start than previous
> releases?
> >>> Also, is this wait time only when using Gfsh to create
> >>> gateway-receiver?
> >>>
> >>> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag 
> wrote:
> >>>
> >>> Currently we have a minor issue in the release branch as pointed
> out
> 
> >>> by
> >
> >> Barry O.
>  We will wait till a resolution is figured out for this issue.
> 
>  Steps:
>  1. create locator
>  2. start server --name=server1 --server-port=40404
>  3. start server --name=server2 --server-port=40405
>  4. create gateway-receiver --member=server1
>  5. create gateway-receiver --member=server2 `This gets stuck for 2
> 
> >>> minutes`
> >>>
>  Is the 2 minute wait time acceptable? Should we document it? When
> we
> 
> >>> revert
> >>>
>  GEODE-5591, this issue does not happen.
> 
>  Regards
>  Nabarun Nag
> 
> 
> >>>
> >>
>


Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Dan Smith
+1 to yank it and rework the fix.

Gester's change helps, but it just means that you will sometimes randomly
have a 2 minute delay starting up a gateway receiver. I don't think that is
a great user experience either.

-Dan

On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt 
wrote:

> Let's yank it
>
>
>
> On 9/4/18 5:04 PM, Sean Goller wrote:
>
>> If it's to get the release out, I'm fine with reverting. I don't like it,
>> but I'm not willing to die on that hill. :)
>>
>> -S.
>>
>> On Tue, Sep 4, 2018 at 4:38 PM Dan Smith  wrote:
>>
>> Splitting this into a separate thread.
>>>
>>> I see the issue. The two minute timeout is the constructor for
>>> AcceptorImpl, where it retries to bind for 2 minutes.
>>>
>>> That behavior makes sense for CacheServer.start.
>>>
>>> But it doesn't make sense for the new logic in GatewayReceiver.start()
>>> from
>>> GEODE-5591. That code is trying to use CacheServer.start to scan for an
>>> available port, trying each port in a range. That free port finding logic
>>> really doesn't want to have two minutes of retries for each port. It
>>> seems
>>> like we need to rework the fix for GEODE-5591.
>>>
>>> Does it make sense to hold up the release to rework this fix, or should
>>> we
>>> just revert it? Have we switched concourse over to using alpine linux,
>>> which I think was the original motivation for this fix?
>>>
>>> -Dan
>>>
>>> On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith  wrote:
>>>
>>> Why is it waiting at all in this case? Where is this 2 minute timeout
 coming from?

 -Dan

 On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <

>>> sai.boorlaga...@gmail.com
>>>
 wrote:
> So the issue is that it takes longer to start than previous releases?
> Also, is this wait time only when using Gfsh to create
> gateway-receiver?
>
> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag  wrote:
>
> Currently we have a minor issue in the release branch as pointed out
>>
> by
>>>
 Barry O.
>> We will wait till a resolution is figured out for this issue.
>>
>> Steps:
>> 1. create locator
>> 2. start server --name=server1 --server-port=40404
>> 3. start server --name=server2 --server-port=40405
>> 4. create gateway-receiver --member=server1
>> 5. create gateway-receiver --member=server2 `This gets stuck for 2
>>
> minutes`
>
>> Is the 2 minute wait time acceptable? Should we document it? When we
>>
> revert
>
>> GEODE-5591, this issue does not happen.
>>
>> Regards
>> Nabarun Nag
>>
>>
>


Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Bruce Schuchardt

Let's yank it


On 9/4/18 5:04 PM, Sean Goller wrote:

If it's to get the release out, I'm fine with reverting. I don't like it,
but I'm not willing to die on that hill. :)

-S.

On Tue, Sep 4, 2018 at 4:38 PM Dan Smith  wrote:


Splitting this into a separate thread.

I see the issue. The two minute timeout is the constructor for
AcceptorImpl, where it retries to bind for 2 minutes.

That behavior makes sense for CacheServer.start.

But it doesn't make sense for the new logic in GatewayReceiver.start() from
GEODE-5591. That code is trying to use CacheServer.start to scan for an
available port, trying each port in a range. That free port finding logic
really doesn't want to have two minutes of retries for each port. It seems
like we need to rework the fix for GEODE-5591.

Does it make sense to hold up the release to rework this fix, or should we
just revert it? Have we switched concourse over to using alpine linux,
which I think was the original motivation for this fix?

-Dan

On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith  wrote:


Why is it waiting at all in this case? Where is this 2 minute timeout
coming from?

-Dan

On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <

sai.boorlaga...@gmail.com

wrote:
So the issue is that it takes longer to start than previous releases?
Also, is this wait time only when using Gfsh to create gateway-receiver?

On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag  wrote:


Currently we have a minor issue in the release branch as pointed out

by

Barry O.
We will wait till a resolution is figured out for this issue.

Steps:
1. create locator
2. start server --name=server1 --server-port=40404
3. start server --name=server2 --server-port=40405
4. create gateway-receiver --member=server1
5. create gateway-receiver --member=server2 `This gets stuck for 2

minutes`

Is the 2 minute wait time acceptable? Should we document it? When we

revert

GEODE-5591, this issue does not happen.

Regards
Nabarun Nag





Re: 2 minute gateway startup time due to GEODE-5591

2018-09-05 Thread Pulkit Chandra
As downstream consumers of Geode, we do not want to be exposed to this.
Please revert and fix on develop. Also, could we put a test case to guard
us against this in future?

Thanks,

*Pulkit Chandra*



On Wed, Sep 5, 2018 at 1:07 AM Xiaojian Zhou  wrote:

> Yes. With the current fix, every gateway receiver (in hydra tests there
> are a lot of them) competes for port 5500. Only one member wins; all the
> other members time out after 2 minutes. Then they compete for port 5501,
> and again only one member wins.
>
> In that case, if there are 5 receivers, it will take 10 minutes to start
> all the receivers.
>
> So I enhanced the current fix (see the diff attached) to let each receiver
> pick a random port to start from. If its bind fails, only that receiver
> tries currPort++. On reaching endPort it continues from startPort, until
> it reaches its random starting port again.
>
> Improving the 2-minute timeout is definitely a separate issue.
>
> Regards
> Gester
>
> On Tue, Sep 4, 2018 at 4:38 PM, Dan Smith  wrote:
>
>> Splitting this into a separate thread.
>>
>> I see the issue. The two minute timeout is the constructor for
>> AcceptorImpl, where it retries to bind for 2 minutes.
>>
>> That behavior makes sense for CacheServer.start.
>>
>> But it doesn't make sense for the new logic in GatewayReceiver.start()
>> from
>> GEODE-5591. That code is trying to use CacheServer.start to scan for an
>> available port, trying each port in a range. That free port finding logic
>> really doesn't want to have two minutes of retries for each port. It seems
>> like we need to rework the fix for GEODE-5591.
>>
>> Does it make sense to hold up the release to rework this fix, or should we
>> just revert it? Have we switched concourse over to using alpine linux,
>> which I think was the original motivation for this fix?
>>
>> -Dan
>>
>> On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith  wrote:
>>
>> > Why is it waiting at all in this case? Where is this 2 minute timeout
>> > coming from?
>> >
>> > -Dan
>> >
>> > On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <
>> sai.boorlaga...@gmail.com
>> > > wrote:
>> >
>> >> So the issue is that it takes longer to start than previous releases?
>> >> Also, is this wait time only when using Gfsh to create
>> gateway-receiver?
>> >>
>> >> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag  wrote:
>> >>
>> >> > Currently we have a minor issue in the release branch as pointed out
>> by
>> >> > Barry O.
>> >> > We will wait till a resolution is figured out for this issue.
>> >> >
>> >> > Steps:
>> >> > 1. create locator
>> >> > 2. start server --name=server1 --server-port=40404
>> >> > 3. start server --name=server2 --server-port=40405
>> >> > 4. create gateway-receiver --member=server1
>> >> > 5. create gateway-receiver --member=server2 `This gets stuck for 2
>> >> minutes`
>> >> >
>> >> > Is the 2 minute wait time acceptable? Should we document it? When we
>> >> revert
>> >> > GEODE-5591, this issue does not happen.
>> >> >
>> >> > Regards
>> >> > Nabarun Nag
>> >> >
>> >>
>> >
>>
>
>


Re: 2 minute gateway startup time due to GEODE-5591

2018-09-04 Thread Xiaojian Zhou
Yes. With the current fix, every gateway receiver (in hydra tests there
are a lot of them) competes for port 5500. Only one member wins; all the
other members time out after 2 minutes. Then they compete for port 5501,
and again only one member wins.

In that case, if there are 5 receivers, it will take 10 minutes to start
all the receivers.

So I enhanced the current fix (see the diff attached) to let each receiver
pick a random port to start from. If its bind fails, only that receiver
tries currPort++. On reaching endPort it continues from startPort, until
it reaches its random starting port again.
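
The probing order described above can be sketched as follows. This is a
minimal illustration in Java, not the code from the attached diff: the class
PortProbeOrder, the method probeOrder, and the example range 5500-5504 are
hypothetical names and values chosen for the sketch; only startPort, endPort,
and currPort come from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class PortProbeOrder {

  /**
   * Returns every port in [startPort, endPort] in the order a receiver would
   * try them: begin at firstPort, increment, wrap from endPort back to
   * startPort, and stop once firstPort would be reached again.
   */
  public static List<Integer> probeOrder(int startPort, int endPort, int firstPort) {
    if (startPort > endPort || firstPort < startPort || firstPort > endPort) {
      throw new IllegalArgumentException("firstPort must lie inside [startPort, endPort]");
    }
    List<Integer> order = new ArrayList<>();
    int currPort = firstPort;
    do {
      order.add(currPort);
      // currPort++ with wrap-around at the end of the range
      currPort = (currPort == endPort) ? startPort : currPort + 1;
    } while (currPort != firstPort);
    return order;
  }

  public static void main(String[] args) {
    int startPort = 5500; // hypothetical range, for illustration only
    int endPort = 5504;
    // Each receiver picks its own random starting point inside the range.
    int firstPort = startPort + new Random().nextInt(endPort - startPort + 1);
    System.out.println(probeOrder(startPort, endPort, firstPort));
  }
}
```

With a range of 5500 to 5504 and a first port of 5503, the order is 5503,
5504, 5500, 5501, 5502, so receivers that pick different starting ports do
not collide on their first attempt.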

Improving the 2-minute timeout is definitely a separate issue.

Regards
Gester

On Tue, Sep 4, 2018 at 4:38 PM, Dan Smith  wrote:

> Splitting this into a separate thread.
>
> I see the issue. The two minute timeout is the constructor for
> AcceptorImpl, where it retries to bind for 2 minutes.
>
> That behavior makes sense for CacheServer.start.
>
> But it doesn't make sense for the new logic in GatewayReceiver.start() from
> GEODE-5591. That code is trying to use CacheServer.start to scan for an
> available port, trying each port in a range. That free port finding logic
> really doesn't want to have two minutes of retries for each port. It seems
> like we need to rework the fix for GEODE-5591.
>
> Does it make sense to hold up the release to rework this fix, or should we
> just revert it? Have we switched concourse over to using alpine linux,
> which I think was the original motivation for this fix?
>
> -Dan
>
> On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith  wrote:
>
> > Why is it waiting at all in this case? Where is this 2 minute timeout
> > coming from?
> >
> > -Dan
> >
> > On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <
> sai.boorlaga...@gmail.com
> > > wrote:
> >
> >> So the issue is that it takes longer to start than previous releases?
> >> Also, is this wait time only when using Gfsh to create gateway-receiver?
> >>
> >> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag  wrote:
> >>
> >> > Currently we have a minor issue in the release branch as pointed out
> by
> >> > Barry O.
> >> > We will wait till a resolution is figured out for this issue.
> >> >
> >> > Steps:
> >> > 1. create locator
> >> > 2. start server --name=server1 --server-port=40404
> >> > 3. start server --name=server2 --server-port=40405
> >> > 4. create gateway-receiver --member=server1
> >> > 5. create gateway-receiver --member=server2 `This gets stuck for 2
> >> minutes`
> >> >
> >> > Is the 2 minute wait time acceptable? Should we document it? When we
> >> revert
> >> > GEODE-5591, this issue does not happen.
> >> >
> >> > Regards
> >> > Nabarun Nag
> >> >
> >>
> >
>
diff --git a/geode-wan/src/distributedTest/java/org/apache/geode/internal/cache/wan/WANTestBase.java b/geode-wan/src/distributedTest/java/org/apache/geode/internal/cache/wan/WANTestBase.java
index a09194209..e13e7ec78 100644
--- a/geode-wan/src/distributedTest/java/org/apache/geode/internal/cache/wan/WANTestBase.java
+++ b/geode-wan/src/distributedTest/java/org/apache/geode/internal/cache/wan/WANTestBase.java
@@ -2020,7 +2020,7 @@ public class WANTestBase extends DistributedTestCase {
 GatewayReceiver receiver = fact.create();
 assertThatThrownBy(receiver::start)
 .isInstanceOf(GatewayReceiverException.class)
-.hasMessageContaining("No available free port found in the given range");
+.hasMessageContaining("Failed to create server socket on");
   }
 
   public static int createReceiverWithSSL(int locPort) {
diff --git a/geode-wan/src/integrationTest/java/org/apache/geode/internal/cache/wan/misc/WANConfigurationJUnitTest.java b/geode-wan/src/integrationTest/java/org/apache/geode/internal/cache/wan/misc/WANConfigurationJUnitTest.java
index 038b759ae..ccd9503e6 100644
--- a/geode-wan/src/integrationTest/java/org/apache/geode/internal/cache/wan/misc/WANConfigurationJUnitTest.java
+++ b/geode-wan/src/integrationTest/java/org/apache/geode/internal/cache/wan/misc/WANConfigurationJUnitTest.java
@@ -448,7 +448,8 @@ public class WANConfigurationJUnitTest {
 
 
 GatewayReceiver receiver = fact.create();
-assertThatThrownBy(() -> receiver.start()).isInstanceOf(GatewayReceiverException.class);
+assertThatThrownBy(() -> receiver.start()).isInstanceOf(GatewayReceiverException.class)
+.hasMessageContaining("Failed to create server socket on");
   }
 
   @Test
diff --git a/geode-wan/src/main/java/org/apache/geode/internal/cache/wan/GatewayReceiverImpl.java b/geode-wan/src/main/java/org/apache/geode/internal/cache/wan/GatewayReceiverImpl.java
index cd2702991..786b354a4 100644
--- a/geode-wan/src/main/java/org/apache/geode/internal/cache/wan/GatewayReceiverImpl.java
+++ b/geode-wan/src/main/java/org/apache/geode/internal/cache/wan/GatewayReceiverImpl.java
@@ -26,6 +26,7 @@ import org.apache.geode.cache.wan.GatewayReceiver;
 import org.apache.geode.cache.wan.GatewayTransportFilter;
 import 

Re: 2 minute gateway startup time due to GEODE-5591

2018-09-04 Thread Jacob Barrett
Revert it on the release branch and fix it on develop.

> On Sep 4, 2018, at 5:13 PM, Sean Goller  wrote:
> 
> It affects us on any linux platform that doesn't use glibc. It's not worth
> holding up the release for. It's been this way for 20 years, right? ;)
> 
> Revert it.
> 
>> On Tue, Sep 4, 2018 at 5:09 PM Udo Kohlmeyer  wrote:
>> 
>> Imo (and I'm coming in cold)... We are NOT officially supporting Alpine
>> linux (yet), which is the basis for this ticket, maybe push this to a
>> later release?
>> 
>> I prefer us getting out the fixes we have and release a more optimal
>> version of GEODE-5591 later.
>> 
>> IF this is a bug that will affect us on EVERY linux distro, then we
>> should fix, otherwise, I vote to push it to 1.8
>> 
>> --Udo
>> 
>> 
>>> On 9/4/18 16:38, Dan Smith wrote:
>>> Splitting this into a separate thread.
>>> 
>>> I see the issue. The two minute timeout is the constructor for
>>> AcceptorImpl, where it retries to bind for 2 minutes.
>>> 
>>> That behavior makes sense for CacheServer.start.
>>> 
>>> But it doesn't make sense for the new logic in GatewayReceiver.start()
>> from
>>> GEODE-5591. That code is trying to use CacheServer.start to scan for an
>>> available port, trying each port in a range. That free port finding logic
>>> really doesn't want to have two minutes of retries for each port. It
>> seems
>>> like we need to rework the fix for GEODE-5591.
>>> 
>>> Does it make sense to hold up the release to rework this fix, or should
>> we
>>> just revert it? Have we switched concourse over to using alpine linux,
>>> which I think was the original motivation for this fix?
>>> 
>>> -Dan
>>> 
 On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith  wrote:
 
 Why is it waiting at all in this case? Where is this 2 minute timeout
 coming from?
 
 -Dan
 
 On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <
>> sai.boorlaga...@gmail.com
> wrote:
> So the issue is that it takes longer to start than previous releases?
> Also, is this wait time only when using Gfsh to create
>> gateway-receiver?
> 
>> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag  wrote:
>> 
>> Currently we have a minor issue in the release branch as pointed out
>> by
>> Barry O.
>> We will wait till a resolution is figured out for this issue.
>> 
>> Steps:
>> 1. create locator
>> 2. start server --name=server1 --server-port=40404
>> 3. start server --name=server2 --server-port=40405
>> 4. create gateway-receiver --member=server1
>> 5. create gateway-receiver --member=server2 `This gets stuck for 2
> minutes`
>> Is the 2 minute wait time acceptable? Should we document it? When we
> revert
>> GEODE-5591, this issue does not happen.
>> 
>> Regards
>> Nabarun Nag
>> 
>> 
>> 


Re: 2 minute gateway startup time due to GEODE-5591

2018-09-04 Thread Anilkumar Gingade
We should fix this for the release.
-Anil.


On Tue, Sep 4, 2018 at 5:09 PM Udo Kohlmeyer  wrote:

> Imo (and I'm coming in cold)... We are NOT officially supporting Alpine
> linux (yet), which is the basis for this ticket, maybe push this to a
> later release?
>
> I prefer us getting out the fixes we have and release a more optimal
> version of GEODE-5591 later.
>
> IF this is a bug that will affect us on EVERY linux distro, then we
> should fix, otherwise, I vote to push it to 1.8
>
> --Udo
>
>
> On 9/4/18 16:38, Dan Smith wrote:
> > Splitting this into a separate thread.
> >
> > I see the issue. The two minute timeout is the constructor for
> > AcceptorImpl, where it retries to bind for 2 minutes.
> >
> > That behavior makes sense for CacheServer.start.
> >
> > But it doesn't make sense for the new logic in GatewayReceiver.start()
> from
> > GEODE-5591. That code is trying to use CacheServer.start to scan for an
> > available port, trying each port in a range. That free port finding logic
> > really doesn't want to have two minutes of retries for each port. It
> seems
> > like we need to rework the fix for GEODE-5591.
> >
> > Does it make sense to hold up the release to rework this fix, or should
> we
> > just revert it? Have we switched concourse over to using alpine linux,
> > which I think was the original motivation for this fix?
> >
> > -Dan
> >
> > On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith  wrote:
> >
> >> Why is it waiting at all in this case? Where is this 2 minute timeout
> >> coming from?
> >>
> >> -Dan
> >>
> >> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <
> sai.boorlaga...@gmail.com
> >>> wrote:
> >>> So the issue is that it takes longer to start than previous releases?
> >>> Also, is this wait time only when using Gfsh to create
> gateway-receiver?
> >>>
> >>> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag  wrote:
> >>>
>  Currently we have a minor issue in the release branch as pointed out
> by
>  Barry O.
>  We will wait till a resolution is figured out for this issue.
> 
>  Steps:
>  1. create locator
>  2. start server --name=server1 --server-port=40404
>  3. start server --name=server2 --server-port=40405
>  4. create gateway-receiver --member=server1
>  5. create gateway-receiver --member=server2 `This gets stuck for 2
> >>> minutes`
>  Is the 2 minute wait time acceptable? Should we document it? When we
> >>> revert
>  GEODE-5591, this issue does not happen.
> 
>  Regards
>  Nabarun Nag
> 
>
>


Re: 2 minute gateway startup time due to GEODE-5591

2018-09-04 Thread Sean Goller
It affects us on any linux platform that doesn't use glibc. It's not worth
holding up the release for. It's been this way for 20 years, right? ;)

Revert it.

On Tue, Sep 4, 2018 at 5:09 PM Udo Kohlmeyer  wrote:

> Imo (and I'm coming in cold)... We are NOT officially supporting Alpine
> linux (yet), which is the basis for this ticket, maybe push this to a
> later release?
>
> I prefer us getting out the fixes we have and release a more optimal
> version of GEODE-5591 later.
>
> IF this is a bug that will affect us on EVERY linux distro, then we
> should fix, otherwise, I vote to push it to 1.8
>
> --Udo
>
>
> On 9/4/18 16:38, Dan Smith wrote:
> > Splitting this into a separate thread.
> >
> > I see the issue. The two minute timeout is the constructor for
> > AcceptorImpl, where it retries to bind for 2 minutes.
> >
> > That behavior makes sense for CacheServer.start.
> >
> > But it doesn't make sense for the new logic in GatewayReceiver.start()
> from
> > GEODE-5591. That code is trying to use CacheServer.start to scan for an
> > available port, trying each port in a range. That free port finding logic
> > really doesn't want to have two minutes of retries for each port. It
> seems
> > like we need to rework the fix for GEODE-5591.
> >
> > Does it make sense to hold up the release to rework this fix, or should
> we
> > just revert it? Have we switched concourse over to using alpine linux,
> > which I think was the original motivation for this fix?
> >
> > -Dan
> >
> > On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith  wrote:
> >
> >> Why is it waiting at all in this case? Where is this 2 minute timeout
> >> coming from?
> >>
> >> -Dan
> >>
> >> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <
> sai.boorlaga...@gmail.com
> >>> wrote:
> >>> So the issue is that it takes longer to start than previous releases?
> >>> Also, is this wait time only when using Gfsh to create
> gateway-receiver?
> >>>
> >>> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag  wrote:
> >>>
>  Currently we have a minor issue in the release branch as pointed out
> by
>  Barry O.
>  We will wait till a resolution is figured out for this issue.
> 
>  Steps:
>  1. create locator
>  2. start server --name=server1 --server-port=40404
>  3. start server --name=server2 --server-port=40405
>  4. create gateway-receiver --member=server1
>  5. create gateway-receiver --member=server2 `This gets stuck for 2
> >>> minutes`
>  Is the 2 minute wait time acceptable? Should we document it? When we
> >>> revert
>  GEODE-5591, this issue does not happen.
> 
>  Regards
>  Nabarun Nag
> 
>
>


Re: 2 minute gateway startup time due to GEODE-5591

2018-09-04 Thread Udo Kohlmeyer
Imo (and I'm coming in cold)... We are NOT officially supporting Alpine
Linux (yet), which is the basis for this ticket, so maybe push this to a
later release?

I prefer us getting out the fixes we have and releasing a more optimal
version of GEODE-5591 later.

IF this is a bug that will affect us on EVERY Linux distro, then we
should fix it; otherwise, I vote to push it to 1.8


--Udo


On 9/4/18 16:38, Dan Smith wrote:

Splitting this into a separate thread.

I see the issue. The two minute timeout is the constructor for
AcceptorImpl, where it retries to bind for 2 minutes.

That behavior makes sense for CacheServer.start.

But it doesn't make sense for the new logic in GatewayReceiver.start() from
GEODE-5591. That code is trying to use CacheServer.start to scan for an
available port, trying each port in a range. That free port finding logic
really doesn't want to have two minutes of retries for each port. It seems
like we need to rework the fix for GEODE-5591.

Does it make sense to hold up the release to rework this fix, or should we
just revert it? Have we switched concourse over to using alpine linux,
which I think was the original motivation for this fix?

-Dan

On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith  wrote:


Why is it waiting at all in this case? Where is this 2 minute timeout
coming from?

-Dan

On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda 
wrote:
So the issue is that it takes longer to start than previous releases?
Also, is this wait time only when using Gfsh to create gateway-receiver?

On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag  wrote:


Currently we have a minor issue in the release branch as pointed out by
Barry O.
We will wait till a resolution is figured out for this issue.

Steps:
1. create locator
2. start server --name=server1 --server-port=40404
3. start server --name=server2 --server-port=40405
4. create gateway-receiver --member=server1
5. create gateway-receiver --member=server2 `This gets stuck for 2

minutes`

Is the 2 minute wait time acceptable? Should we document it? When we

revert

GEODE-5591, this issue does not happen.

Regards
Nabarun Nag





Re: 2 minute gateway startup time due to GEODE-5591

2018-09-04 Thread Sean Goller
If it's to get the release out, I'm fine with reverting. I don't like it,
but I'm not willing to die on that hill. :)

-S.

On Tue, Sep 4, 2018 at 4:38 PM Dan Smith  wrote:

> Splitting this into a separate thread.
>
> I see the issue. The two minute timeout is the constructor for
> AcceptorImpl, where it retries to bind for 2 minutes.
>
> That behavior makes sense for CacheServer.start.
>
> But it doesn't make sense for the new logic in GatewayReceiver.start() from
> GEODE-5591. That code is trying to use CacheServer.start to scan for an
> available port, trying each port in a range. That free port finding logic
> really doesn't want to have two minutes of retries for each port. It seems
> like we need to rework the fix for GEODE-5591.
>
> Does it make sense to hold up the release to rework this fix, or should we
> just revert it? Have we switched concourse over to using alpine linux,
> which I think was the original motivation for this fix?
>
> -Dan
>
> On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith  wrote:
>
> > Why is it waiting at all in this case? Where is this 2 minute timeout
> > coming from?
> >
> > -Dan
> >
> > On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda <
> sai.boorlaga...@gmail.com
> > > wrote:
> >
> >> So the issue is that it takes longer to start than previous releases?
> >> Also, is this wait time only when using Gfsh to create gateway-receiver?
> >>
> >> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag  wrote:
> >>
> >> > Currently we have a minor issue in the release branch as pointed out
> by
> >> > Barry O.
> >> > We will wait till a resolution is figured out for this issue.
> >> >
> >> > Steps:
> >> > 1. create locator
> >> > 2. start server --name=server1 --server-port=40404
> >> > 3. start server --name=server2 --server-port=40405
> >> > 4. create gateway-receiver --member=server1
> >> > 5. create gateway-receiver --member=server2 `This gets stuck for 2
> >> minutes`
> >> >
> >> > Is the 2 minute wait time acceptable? Should we document it? When we
> >> revert
> >> > GEODE-5591, this issue does not happen.
> >> >
> >> > Regards
> >> > Nabarun Nag
> >> >
> >>
> >
>


2 minute gateway startup time due to GEODE-5591

2018-09-04 Thread Dan Smith
Splitting this into a separate thread.

I see the issue. The two minute timeout is in the constructor for
AcceptorImpl, which retries the bind for 2 minutes.

That behavior makes sense for CacheServer.start.

But it doesn't make sense for the new logic in GatewayReceiver.start() from
GEODE-5591. That code is trying to use CacheServer.start to scan for an
available port, trying each port in a range. That free port finding logic
really doesn't want to have two minutes of retries for each port. It seems
like we need to rework the fix for GEODE-5591.
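
One way to picture port-scanning logic that does not retry per port is the
sketch below. It is an illustration under stated assumptions, not Geode's
implementation: FreePortScan, firstFreePort, and tryBindOnce are hypothetical
names, the range 5500-5510 is made up, and the bind attempt uses a plain
java.net.ServerSocket rather than AcceptorImpl.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.util.OptionalInt;
import java.util.function.IntPredicate;

public class FreePortScan {

  /**
   * Makes exactly one bind attempt on the port and releases it again.
   * There is no retry loop, so a port that is in use fails immediately.
   */
  public static boolean tryBindOnce(int port) {
    try (ServerSocket socket = new ServerSocket(port)) {
      return true;
    } catch (IOException e) {
      return false; // port is taken (or not bindable); do not wait for it
    }
  }

  /**
   * Walks the range once, giving each port a single attempt, and returns the
   * first port for which the attempt succeeds.
   */
  public static OptionalInt firstFreePort(int startPort, int endPort, IntPredicate singleAttempt) {
    for (int port = startPort; port <= endPort; port++) {
      if (singleAttempt.test(port)) {
        return OptionalInt.of(port);
      }
    }
    return OptionalInt.empty();
  }

  public static void main(String[] args) {
    OptionalInt port = firstFreePort(5500, 5510, FreePortScan::tryBindOnce);
    System.out.println(port.isPresent() ? "free port: " + port.getAsInt() : "no free port in range");
  }
}
```

Because tryBindOnce releases the socket after probing, another process could
still take the port before it is used; a real fix would presumably keep the
successfully bound socket instead of probing and rebinding.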

Does it make sense to hold up the release to rework this fix, or should we
just revert it? Have we switched Concourse over to using Alpine Linux,
which I think was the original motivation for this fix?

-Dan

On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith  wrote:

> Why is it waiting at all in this case? Where is this 2 minute timeout
> coming from?
>
> -Dan
>
> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda wrote:
>
>> So the issue is that it takes longer to start than previous releases?
>> Also, is this wait time only when using Gfsh to create gateway-receiver?
>>
>> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag  wrote:
>>
>> > Currently we have a minor issue in the release branch as pointed out by
>> > Barry O.
>> > We will wait till a resolution is figured out for this issue.
>> >
>> > Steps:
>> > 1. create locator
>> > 2. start server --name=server1 --server-port=40404
>> > 3. start server --name=server2 --server-port=40405
>> > 4. create gateway-receiver --member=server1
>> > 5. create gateway-receiver --member=server2 `This gets stuck for 2
>> minutes`
>> >
>> > Is the 2 minute wait time acceptable? Should we document it? When we
>> revert
>> > GEODE-5591, this issue does not happen.
>> >
>> > Regards
>> > Nabarun Nag
>> >
>>
>