Re: Proposal to include GEODE-7088 and GEODE-7089 in 1.10.0

2019-08-26 Thread Ryan McMahon
The cherry-pick for GEODE-7088 is clean so I didn't open a PR for that
one.  The cherry-pick for GEODE-7089 required manual merging due to several
unrelated stats changes added to develop recently.  The PR to merge that
one into release/1.10.0 is here:
https://github.com/apache/geode/pull/3976

The original PR for this change when it was added to develop is here:
https://github.com/apache/geode/pull/3929

Thanks all for reading and considering.

Ryan

On Mon, Aug 26, 2019 at 4:22 PM Udo Kohlmeyer  wrote:

> Thank you Ryan,
>
> +1 for inclusion
>
> On 8/26/19 3:33 PM, Ryan McMahon wrote:
> > Udo,
> >
> > Here are inline answers to your questions:
> >
> > *Is this an existing issue?*
> >
> > Short answer - yes, but it has never been in a release version of Geode.
> > The leak was introduced as part of some changes to address handling
> > multiple concurrent registration requests for a given client on a single
> > server.  The issue is only seen if client registration fails which is not
> > particularly common, which is why we are only seeing it now after months
> of
> > testing.  The commit for that was introduced here on April 30th.
> >
> https://github.com/apache/geode/commit/bc2a2fa5af374cfedfba4dc1abe6cbc2a7b719c8
> > The ConcurrentModificationException issue (which ultimately causes the
> > registration to fail) was introduced on April 22nd here:
> >
> https://github.com/apache/geode/commit/afc311c04f6908a8f725834cdf2c49ce6e902b3f
> >
> >
> > *Why is it more critical to squeeze it into an existing (almost
> > release) version of Apache Geode?*
> >
> > Not sure I totally understand this question, but it is critical because
> the
> > leak can cause servers to crash due to OOM.  Again, because the problems
> > were introduced in late April and we haven't released Geode since then,
> so
> > I think it is very important to merge these fixes into 1.10.0 before we
> > release.
> >
> >
> >
> > *What guarantees do we have that this fix makes the application more
> stable
> > compared to adding another hidden issue, which we will discover in
> another
> > few weeks from now?*
> >
> > I added numerous tests for this scenario to ensure that the leak would
> > never happen regardless of the cause of a failed client registration.
> > There obviously is no way to 100% guarantee that there will be no issues
> > that arise from the fixes themselves, but our existing test coverage plus
> > the new tests I wrote should give us the confidence we need.
> >
> > Thanks,
> > Ryan
> >
> > On Mon, Aug 26, 2019 at 3:17 PM Udo Kohlmeyer  wrote:
> >
> >> In order to better understand this request:
> >>
> >> Is this an existing issue?
> >>
> >> Why is it more critical to squeeze it into an existing (almost release)
> >> version of Apache Geode?
> >>
> >> What guarantees do we have that this fix makes the application more
> >> stable compared to adding another hidden issue, which we will discover
> >> in another few weeks from now?
> >>
> >>
> >> --Udo
> >>
> >> On 8/26/19 3:10 PM, Ryan McMahon wrote:
> >>> Hi all,
> >>>
> >>> I would like to propose cherry-picking GEODE-7088 and GEODE-7089 to the
> >>> 1.10.0 release branch.  The two JIRAs are related to the same root
> >> problem,
> >>> which I would classify as critical.  We discovered a case where a
> failed
> >>> client registration could lead to a memory leak in a server, eventually
> >>> causing the server to crash due to lack of memory.
> >>>
> >>> The issue is instigated by a ConcurrentModificationException due to
> >>> iteration of a non-thread safe collection while it is being mutated
> >>> (GEODE-7088).  This exception occurs when the client's queue image is
> >> being
> >>> copied from one server to the next during client registration, and it
> >>> causes the client's registration to fail.  The client would likely
> >> succeed
> >>> if it retried registration with that same server, but if it registers
> >> with
> >>> a different server, we end up leaking events to the client's
> registration
> >>> queue on the original server (GEODE-7089).
> >>>
> >>> The fix for GEODE-7088 is to use thread-safe collections for interested
> >>> clients in client update messages.  The fix for GEODE-7089 is to always
> >>> drain and remove the registration queue regardless of success or
> failure.
> >>> Together, these fixes prevent the failed registrations and memory leak.
> >>>
> >>> The SHAs for the fixes and tests in develop are:
> >>>
> >>> GEODE-7088
> >>> - 174af1d23fb7e09eb2bc2fa55479df854850fadb
> >>> - 5bb753a8f4ff2886acd8e62d6f51fea58e37881d
> >>>
> >>> GEODE-7089
> >>> - 5d0153ad4adb1612a1083673f98b1982819a6589
> >>>
> >>> This proposal is to cherry-pick these commits to 1.10.0 release branch.
> >>>
> >>> Thanks,
> >>> Ryan McMahon
> >>>
>


Re: [DISCUSS] Pulling the current proposed 1.10 release until we can agree on develop being stable

2019-08-26 Thread Owen Nichols
Udo, it sounds like you would like to modify the release process to start the 
stabilization work on develop (rather than cutting the branch as the first 
step).  I would love to hear community opinion on this proposal.

For 1.10, in the time we have cherry-picked 11 fixes to release/1.10.0, develop 
has seen over 90 commits.  I believe due process has been followed in bringing 
only critical fixes to stabilize the release branch.  Getting 1.10.0.RC1 into 
the hands of the community is the most important next step.  If there are no 
further critical fix proposals, Dick and I would like to proceed with the RC1 
this week.

-Owen

> On Aug 26, 2019, at 4:06 PM, Udo Kohlmeyer  wrote:
> 
> Hi there Apache Geode devs,
> 
> It has been some weeks since the proposed 1.10 release was cut. We've gone 
> through a few cycles where we keep on submitting "please include ticket 
> GEODE-XXX" because it is critical and will break the system. WHICH in reality 
> tells me that current develop is broken and unstable.
> 
> I'm going to suggest that we abandon the current 1.10 release branch. I 
> cannot shake the feeling that our continuous cherry picking into a branch 
> will result in either the branch becoming unmaintainable, given we have only 
> select fixes in the branch OR we end up with a branch that is more stable 
> than our current development branch, which would result in us having to 
> rebase the develop branch onto the 1.10 branch.
> 
> I propose that once we see the pipeline is clean and green for a solid we 
> then again attempt to cut 1.10 branch.
> 
> We CANNOT continue adding to a branch in order to stabilize it.. It just 
> means the branch we cut from is unstable. If we cannot cut a branch from 
> develop without having to have weeks of stabilization cycles, then our main 
> branch is broken...
> 
> Either way, not a good spot to be in.
> 
> Thoughts?
> 
> --Udo
> 



Re: Proposal to include GEODE-7088 and GEODE-7089 in 1.10.0

2019-08-26 Thread Udo Kohlmeyer

Thank you Ryan,

+1 for inclusion

On 8/26/19 3:33 PM, Ryan McMahon wrote:

Udo,

Here are inline answers to your questions:

*Is this an existing issue?*

Short answer - yes, but it has never been in a release version of Geode.
The leak was introduced as part of some changes to address handling
multiple concurrent registration requests for a given client on a single
server.  The issue is only seen if client registration fails which is not
particularly common, which is why we are only seeing it now after months of
testing.  The commit for that was introduced here on April 30th.
https://github.com/apache/geode/commit/bc2a2fa5af374cfedfba4dc1abe6cbc2a7b719c8
The ConcurrentModificationException issue (which ultimately causes the
registration to fail) was introduced on April 22nd here:
https://github.com/apache/geode/commit/afc311c04f6908a8f725834cdf2c49ce6e902b3f


*Why is it more critical to squeeze it into an existing (almost
release) version of Apache Geode?*

Not sure I totally understand this question, but it is critical because the
leak can cause servers to crash due to OOM.  Again, because the problems
were introduced in late April and we haven't released Geode since then, so
I think it is very important to merge these fixes into 1.10.0 before we
release.



*What guarantees do we have that this fix makes the application more stable
compared to adding another hidden issue, which we will discover in another
few weeks from now?*

I added numerous tests for this scenario to ensure that the leak would
never happen regardless of the cause of a failed client registration.
There obviously is no way to 100% guarantee that there will be no issues
that arise from the fixes themselves, but our existing test coverage plus
the new tests I wrote should give us the confidence we need.

Thanks,
Ryan

On Mon, Aug 26, 2019 at 3:17 PM Udo Kohlmeyer  wrote:


In order to better understand this request:

Is this an existing issue?

Why is it more critical to squeeze it into an existing (almost release)
version of Apache Geode?

What guarantees do we have that this fix makes the application more
stable compared to adding another hidden issue, which we will discover
in another few weeks from now?


--Udo

On 8/26/19 3:10 PM, Ryan McMahon wrote:

Hi all,

I would like to propose cherry-picking GEODE-7088 and GEODE-7089 to the
1.10.0 release branch.  The two JIRAs are related to the same root

problem,

which I would classify as critical.  We discovered a case where a failed
client registration could lead to a memory leak in a server, eventually
causing the server to crash due to lack of memory.

The issue is instigated by a ConcurrentModificationException due to
iteration of a non-thread safe collection while it is being mutated
(GEODE-7088).  This exception occurs when the client's queue image is

being

copied from one server to the next during client registration, and it
causes the client's registration to fail.  The client would likely

succeed

if it retried registration with that same server, but if it registers

with

a different server, we end up leaking events to the client's registration
queue on the original server (GEODE-7089).

The fix for GEODE-7088 is to use thread-safe collections for interested
clients in client update messages.  The fix for GEODE-7089 is to always
drain and remove the registration queue regardless of success or failure.
Together, these fixes prevent the failed registrations and memory leak.

The SHAs for the fixes and tests in develop are:

GEODE-7088
- 174af1d23fb7e09eb2bc2fa55479df854850fadb
- 5bb753a8f4ff2886acd8e62d6f51fea58e37881d

GEODE-7089
- 5d0153ad4adb1612a1083673f98b1982819a6589

This proposal is to cherry-pick these commits to 1.10.0 release branch.

Thanks,
Ryan McMahon



[DISCUSS] Pulling the current proposed 1.10 release until we can agree on develop being stable

2019-08-26 Thread Udo Kohlmeyer

Hi there Apache Geode devs,

It has been some weeks since the proposed 1.10 release was cut. We've 
gone through a few cycles where we keep on submitting "please include 
ticket GEODE-XXX" because it is critical and will break the system. 
WHICH in reality tells me that current develop is broken and unstable.


I'm going to suggest that we abandon the current 1.10 release branch. I 
cannot shake the feeling that our continuous cherry picking into a 
branch will result in either the branch becoming unmaintainable, given 
we have only select fixes in the branch OR we end up with a branch that 
is more stable than our current development branch, which would result 
in us having to rebase the develop branch onto the 1.10 branch.


I propose that once we see the pipeline is clean and green for a solid 
we then again attempt to cut 1.10 branch.


We CANNOT continue adding to a branch in order to stabilize it.. It just 
means the branch we cut from is unstable. If we cannot cut a branch from 
develop without having to have weeks of stabilization cycles, then our 
main branch is broken...


Either way, not a good spot to be in.

Thoughts?

--Udo



Re: Proposal to include GEODE-7088 and GEODE-7089 in 1.10.0

2019-08-26 Thread Nabarun Nag
+1

This will be a good inclusion in Apache Geode 1.10.0 release.

Regards
Naba

On Mon, Aug 26, 2019 at 3:36 PM Jacob Barrett  wrote:

> +1
>
> Thanks for the details!
>
> > On Aug 26, 2019, at 3:33 PM, Ryan McMahon  wrote:
> >
> > Udo,
> >
> > Here are inline answers to your questions:
> >
> > *Is this an existing issue?*
> >
> > Short answer - yes, but it has never been in a release version of Geode.
> > The leak was introduced as part of some changes to address handling
> > multiple concurrent registration requests for a given client on a single
> > server.  The issue is only seen if client registration fails which is not
> > particularly common, which is why we are only seeing it now after months
> of
> > testing.  The commit for that was introduced here on April 30th.
> >
> https://github.com/apache/geode/commit/bc2a2fa5af374cfedfba4dc1abe6cbc2a7b719c8
> > The ConcurrentModificationException issue (which ultimately causes the
> > registration to fail) was introduced on April 22nd here:
> >
> https://github.com/apache/geode/commit/afc311c04f6908a8f725834cdf2c49ce6e902b3f
> >
> >
> > *Why is it more critical to squeeze it into an existing (almost
> > release) version of Apache Geode?*
> >
> > Not sure I totally understand this question, but it is critical because
> the
> > leak can cause servers to crash due to OOM.  Again, because the problems
> > were introduced in late April and we haven't released Geode since then,
> so
> > I think it is very important to merge these fixes into 1.10.0 before we
> > release.
> >
> >
> >
> > *What guarantees do we have that this fix makes the application more
> stable
> > compared to adding another hidden issue, which we will discover in
> another
> > few weeks from now?*
> >
> > I added numerous tests for this scenario to ensure that the leak would
> > never happen regardless of the cause of a failed client registration.
> > There obviously is no way to 100% guarantee that there will be no issues
> > that arise from the fixes themselves, but our existing test coverage plus
> > the new tests I wrote should give us the confidence we need.
> >
> > Thanks,
> > Ryan
> >
> > On Mon, Aug 26, 2019 at 3:17 PM Udo Kohlmeyer  wrote:
> >
> >> In order to better understand this request:
> >>
> >> Is this an existing issue?
> >>
> >> Why is it more critical to squeeze it into an existing (almost release)
> >> version of Apache Geode?
> >>
> >> What guarantees do we have that this fix makes the application more
> >> stable compared to adding another hidden issue, which we will discover
> >> in another few weeks from now?
> >>
> >>
> >> --Udo
> >>
> >> On 8/26/19 3:10 PM, Ryan McMahon wrote:
> >>> Hi all,
> >>>
> >>> I would like to propose cherry-picking GEODE-7088 and GEODE-7089 to the
> >>> 1.10.0 release branch.  The two JIRAs are related to the same root
> >> problem,
> >>> which I would classify as critical.  We discovered a case where a
> failed
> >>> client registration could lead to a memory leak in a server, eventually
> >>> causing the server to crash due to lack of memory.
> >>>
> >>> The issue is instigated by a ConcurrentModificationException due to
> >>> iteration of a non-thread safe collection while it is being mutated
> >>> (GEODE-7088).  This exception occurs when the client's queue image is
> >> being
> >>> copied from one server to the next during client registration, and it
> >>> causes the client's registration to fail.  The client would likely
> >> succeed
> >>> if it retried registration with that same server, but if it registers
> >> with
> >>> a different server, we end up leaking events to the client's
> registration
> >>> queue on the original server (GEODE-7089).
> >>>
> >>> The fix for GEODE-7088 is to use thread-safe collections for interested
> >>> clients in client update messages.  The fix for GEODE-7089 is to always
> >>> drain and remove the registration queue regardless of success or
> failure.
> >>> Together, these fixes prevent the failed registrations and memory leak.
> >>>
> >>> The SHAs for the fixes and tests in develop are:
> >>>
> >>> GEODE-7088
> >>> - 174af1d23fb7e09eb2bc2fa55479df854850fadb
> >>> - 5bb753a8f4ff2886acd8e62d6f51fea58e37881d
> >>>
> >>> GEODE-7089
> >>> - 5d0153ad4adb1612a1083673f98b1982819a6589
> >>>
> >>> This proposal is to cherry-pick these commits to 1.10.0 release branch.
> >>>
> >>> Thanks,
> >>> Ryan McMahon
> >>>
> >>
>
>


Re: Proposal to include GEODE-7088 and GEODE-7089 in 1.10.0

2019-08-26 Thread Jacob Barrett
+1

Thanks for the details!

> On Aug 26, 2019, at 3:33 PM, Ryan McMahon  wrote:
> 
> Udo,
> 
> Here are inline answers to your questions:
> 
> *Is this an existing issue?*
> 
> Short answer - yes, but it has never been in a release version of Geode.
> The leak was introduced as part of some changes to address handling
> multiple concurrent registration requests for a given client on a single
> server.  The issue is only seen if client registration fails which is not
> particularly common, which is why we are only seeing it now after months of
> testing.  The commit for that was introduced here on April 30th.
> https://github.com/apache/geode/commit/bc2a2fa5af374cfedfba4dc1abe6cbc2a7b719c8
> The ConcurrentModificationException issue (which ultimately causes the
> registration to fail) was introduced on April 22nd here:
> https://github.com/apache/geode/commit/afc311c04f6908a8f725834cdf2c49ce6e902b3f
> 
> 
> *Why is it more critical to squeeze it into an existing (almost
> release) version of Apache Geode?*
> 
> Not sure I totally understand this question, but it is critical because the
> leak can cause servers to crash due to OOM.  Again, because the problems
> were introduced in late April and we haven't released Geode since then, so
> I think it is very important to merge these fixes into 1.10.0 before we
> release.
> 
> 
> 
> *What guarantees do we have that this fix makes the application more stable
> compared to adding another hidden issue, which we will discover in another
> few weeks from now?*
> 
> I added numerous tests for this scenario to ensure that the leak would
> never happen regardless of the cause of a failed client registration.
> There obviously is no way to 100% guarantee that there will be no issues
> that arise from the fixes themselves, but our existing test coverage plus
> the new tests I wrote should give us the confidence we need.
> 
> Thanks,
> Ryan
> 
> On Mon, Aug 26, 2019 at 3:17 PM Udo Kohlmeyer  wrote:
> 
>> In order to better understand this request:
>> 
>> Is this an existing issue?
>> 
>> Why is it more critical to squeeze it into an existing (almost release)
>> version of Apache Geode?
>> 
>> What guarantees do we have that this fix makes the application more
>> stable compared to adding another hidden issue, which we will discover
>> in another few weeks from now?
>> 
>> 
>> --Udo
>> 
>> On 8/26/19 3:10 PM, Ryan McMahon wrote:
>>> Hi all,
>>> 
>>> I would like to propose cherry-picking GEODE-7088 and GEODE-7089 to the
>>> 1.10.0 release branch.  The two JIRAs are related to the same root
>> problem,
>>> which I would classify as critical.  We discovered a case where a failed
>>> client registration could lead to a memory leak in a server, eventually
>>> causing the server to crash due to lack of memory.
>>> 
>>> The issue is instigated by a ConcurrentModificationException due to
>>> iteration of a non-thread safe collection while it is being mutated
>>> (GEODE-7088).  This exception occurs when the client's queue image is
>> being
>>> copied from one server to the next during client registration, and it
>>> causes the client's registration to fail.  The client would likely
>> succeed
>>> if it retried registration with that same server, but if it registers
>> with
>>> a different server, we end up leaking events to the client's registration
>>> queue on the original server (GEODE-7089).
>>> 
>>> The fix for GEODE-7088 is to use thread-safe collections for interested
>>> clients in client update messages.  The fix for GEODE-7089 is to always
>>> drain and remove the registration queue regardless of success or failure.
>>> Together, these fixes prevent the failed registrations and memory leak.
>>> 
>>> The SHAs for the fixes and tests in develop are:
>>> 
>>> GEODE-7088
>>> - 174af1d23fb7e09eb2bc2fa55479df854850fadb
>>> - 5bb753a8f4ff2886acd8e62d6f51fea58e37881d
>>> 
>>> GEODE-7089
>>> - 5d0153ad4adb1612a1083673f98b1982819a6589
>>> 
>>> This proposal is to cherry-pick these commits to 1.10.0 release branch.
>>> 
>>> Thanks,
>>> Ryan McMahon
>>> 
>> 



Re: Proposal to include GEODE-7088 and GEODE-7089 in 1.10.0

2019-08-26 Thread Ryan McMahon
Udo,

Here are inline answers to your questions:

*Is this an existing issue?*

Short answer - yes, but it has never been in a release version of Geode.
The leak was introduced as part of some changes to address handling
multiple concurrent registration requests for a given client on a single
server.  The issue is only seen if client registration fails which is not
particularly common, which is why we are only seeing it now after months of
testing.  The commit for that was introduced here on April 30th.
https://github.com/apache/geode/commit/bc2a2fa5af374cfedfba4dc1abe6cbc2a7b719c8
The ConcurrentModificationException issue (which ultimately causes the
registration to fail) was introduced on April 22nd here:
https://github.com/apache/geode/commit/afc311c04f6908a8f725834cdf2c49ce6e902b3f


*Why is it more critical to squeeze it into an existing (almost
release) version of Apache Geode?*

Not sure I totally understand this question, but it is critical because the
leak can cause servers to crash due to OOM.  Again, because the problems
were introduced in late April and we haven't released Geode since then, so
I think it is very important to merge these fixes into 1.10.0 before we
release.



*What guarantees do we have that this fix makes the application more stable
compared to adding another hidden issue, which we will discover in another
few weeks from now?*

I added numerous tests for this scenario to ensure that the leak would
never happen regardless of the cause of a failed client registration.
There obviously is no way to 100% guarantee that there will be no issues
that arise from the fixes themselves, but our existing test coverage plus
the new tests I wrote should give us the confidence we need.

Thanks,
Ryan

On Mon, Aug 26, 2019 at 3:17 PM Udo Kohlmeyer  wrote:

> In order to better understand this request:
>
> Is this an existing issue?
>
> Why is it more critical to squeeze it into an existing (almost release)
> version of Apache Geode?
>
> What guarantees do we have that this fix makes the application more
> stable compared to adding another hidden issue, which we will discover
> in another few weeks from now?
>
>
> --Udo
>
> On 8/26/19 3:10 PM, Ryan McMahon wrote:
> > Hi all,
> >
> > I would like to propose cherry-picking GEODE-7088 and GEODE-7089 to the
> > 1.10.0 release branch.  The two JIRAs are related to the same root
> problem,
> > which I would classify as critical.  We discovered a case where a failed
> > client registration could lead to a memory leak in a server, eventually
> > causing the server to crash due to lack of memory.
> >
> > The issue is instigated by a ConcurrentModificationException due to
> > iteration of a non-thread safe collection while it is being mutated
> > (GEODE-7088).  This exception occurs when the client's queue image is
> being
> > copied from one server to the next during client registration, and it
> > causes the client's registration to fail.  The client would likely
> succeed
> > if it retried registration with that same server, but if it registers
> with
> > a different server, we end up leaking events to the client's registration
> > queue on the original server (GEODE-7089).
> >
> > The fix for GEODE-7088 is to use thread-safe collections for interested
> > clients in client update messages.  The fix for GEODE-7089 is to always
> > drain and remove the registration queue regardless of success or failure.
> > Together, these fixes prevent the failed registrations and memory leak.
> >
> > The SHAs for the fixes and tests in develop are:
> >
> > GEODE-7088
> > - 174af1d23fb7e09eb2bc2fa55479df854850fadb
> > - 5bb753a8f4ff2886acd8e62d6f51fea58e37881d
> >
> > GEODE-7089
> > - 5d0153ad4adb1612a1083673f98b1982819a6589
> >
> > This proposal is to cherry-pick these commits to 1.10.0 release branch.
> >
> > Thanks,
> > Ryan McMahon
> >
>


Re: Proposal to include GEODE-7088 and GEODE-7089 in 1.10.0

2019-08-26 Thread Owen Nichols
Hi Ryan, thank you for bringing your concern.

Geode's release process dictates a time-based schedule 
 to cut 
release branches.  The release/1.10.0 
 branch was already cut 
over 3 week ago, but the “critical fixes” rule does allow critical fixes to be 
brought to the release branch by proposal on the dev list, as you have done 
here.

If there is consensus from the Geode community that your proposed change 
satisfies the “critical fixes” rule, I will be happy to bring it to the 1.10.0 
release branch.

Due to the complexity of this change, could please open a PR against 
release/1.10.0 containing the exact changes you want to bring?

Regards
- Owen

> On Aug 26, 2019, at 3:10 PM, Ryan McMahon  wrote:
> 
> Hi all,
> 
> I would like to propose cherry-picking GEODE-7088 and GEODE-7089 to the
> 1.10.0 release branch.  The two JIRAs are related to the same root problem,
> which I would classify as critical.  We discovered a case where a failed
> client registration could lead to a memory leak in a server, eventually
> causing the server to crash due to lack of memory.
> 
> The issue is instigated by a ConcurrentModificationException due to
> iteration of a non-thread safe collection while it is being mutated
> (GEODE-7088).  This exception occurs when the client's queue image is being
> copied from one server to the next during client registration, and it
> causes the client's registration to fail.  The client would likely succeed
> if it retried registration with that same server, but if it registers with
> a different server, we end up leaking events to the client's registration
> queue on the original server (GEODE-7089).
> 
> The fix for GEODE-7088 is to use thread-safe collections for interested
> clients in client update messages.  The fix for GEODE-7089 is to always
> drain and remove the registration queue regardless of success or failure.
> Together, these fixes prevent the failed registrations and memory leak.
> 
> The SHAs for the fixes and tests in develop are:
> 
> GEODE-7088
> - 174af1d23fb7e09eb2bc2fa55479df854850fadb
> - 5bb753a8f4ff2886acd8e62d6f51fea58e37881d
> 
> GEODE-7089
> - 5d0153ad4adb1612a1083673f98b1982819a6589
> 
> This proposal is to cherry-pick these commits to 1.10.0 release branch.
> 
> Thanks,
> Ryan McMahon



Re: Proposal to include GEODE-7088 and GEODE-7089 in 1.10.0

2019-08-26 Thread Udo Kohlmeyer

In order to better understand this request:

Is this an existing issue?

Why is it more critical to squeeze it into an existing (almost release) 
version of Apache Geode?


What guarantees do we have that this fix makes the application more 
stable compared to adding another hidden issue, which we will discover 
in another few weeks from now?



--Udo

On 8/26/19 3:10 PM, Ryan McMahon wrote:

Hi all,

I would like to propose cherry-picking GEODE-7088 and GEODE-7089 to the
1.10.0 release branch.  The two JIRAs are related to the same root problem,
which I would classify as critical.  We discovered a case where a failed
client registration could lead to a memory leak in a server, eventually
causing the server to crash due to lack of memory.

The issue is instigated by a ConcurrentModificationException due to
iteration of a non-thread safe collection while it is being mutated
(GEODE-7088).  This exception occurs when the client's queue image is being
copied from one server to the next during client registration, and it
causes the client's registration to fail.  The client would likely succeed
if it retried registration with that same server, but if it registers with
a different server, we end up leaking events to the client's registration
queue on the original server (GEODE-7089).

The fix for GEODE-7088 is to use thread-safe collections for interested
clients in client update messages.  The fix for GEODE-7089 is to always
drain and remove the registration queue regardless of success or failure.
Together, these fixes prevent the failed registrations and memory leak.

The SHAs for the fixes and tests in develop are:

GEODE-7088
- 174af1d23fb7e09eb2bc2fa55479df854850fadb
- 5bb753a8f4ff2886acd8e62d6f51fea58e37881d

GEODE-7089
- 5d0153ad4adb1612a1083673f98b1982819a6589

This proposal is to cherry-pick these commits to 1.10.0 release branch.

Thanks,
Ryan McMahon



Proposal to include GEODE-7088 and GEODE-7089 in 1.10.0

2019-08-26 Thread Ryan McMahon
Hi all,

I would like to propose cherry-picking GEODE-7088 and GEODE-7089 to the
1.10.0 release branch.  The two JIRAs are related to the same root problem,
which I would classify as critical.  We discovered a case where a failed
client registration could lead to a memory leak in a server, eventually
causing the server to crash due to lack of memory.

The issue is instigated by a ConcurrentModificationException due to
iteration of a non-thread safe collection while it is being mutated
(GEODE-7088).  This exception occurs when the client's queue image is being
copied from one server to the next during client registration, and it
causes the client's registration to fail.  The client would likely succeed
if it retried registration with that same server, but if it registers with
a different server, we end up leaking events to the client's registration
queue on the original server (GEODE-7089).

The fix for GEODE-7088 is to use thread-safe collections for interested
clients in client update messages.  The fix for GEODE-7089 is to always
drain and remove the registration queue regardless of success or failure.
Together, these fixes prevent the failed registrations and memory leak.

The SHAs for the fixes and tests in develop are:

GEODE-7088
- 174af1d23fb7e09eb2bc2fa55479df854850fadb
- 5bb753a8f4ff2886acd8e62d6f51fea58e37881d

GEODE-7089
- 5d0153ad4adb1612a1083673f98b1982819a6589

This proposal is to cherry-pick these commits to 1.10.0 release branch.

Thanks,
Ryan McMahon


Re: Propose including GEODE-7085 in 1.10

2019-08-26 Thread Owen Nichols
There appears to be consensus that this is a critical fix.

The following commits have been brought into support/1.10.0 
 as the critical fix for 
GEODE-7085 :

git cherry-pick -x f58710116db1cd8c509b59a43ffa050a073234d7 

git cherry-pick -x f17931bf541fc0255112f713931388d9ee0bbc30 


GEODE-7085  has been marked 
as 'resolved in' 1.10.0.

Regards,
-Owen


> On Aug 26, 2019, at 10:50 AM, Eric Shu  wrote:
> 
> +1
> 
> On Mon, Aug 26, 2019 at 10:40 AM Alexander Murmann 
> wrote:
> 
>> +1
>> 
>> While it's not new, it's critical
>> 
>> On Mon, Aug 26, 2019 at 10:38 AM Juan José Ramos 
>> wrote:
>> 
>>> +1
>>> 
>>> On Mon, Aug 26, 2019 at 6:36 PM Dan Smith  wrote:
>>> 
 Hi,
 
 I'd like to propose including the fixes for GEODE-7085 into 1.10 (SHA's
 below). This is not a new issue, but it does result not being able to
 recover from disk without this fix if a cluster has more than 2 billion
 updates to a single bucket from a single member.
 
 SHAs: f587101  ,  f17931bf
 
 Thanks,
 -Dan
 
>>> 
>>> 
>>> --
>>> Juan José Ramos Cassella
>>> Senior Software Engineer
>>> Email: jra...@pivotal.io
>>> 
>> 



Re: Propose including GEODE-7085 in 1.10

2019-08-26 Thread Eric Shu
+1

On Mon, Aug 26, 2019 at 10:40 AM Alexander Murmann 
wrote:

> +1
>
> While it's not new, it's critical
>
> On Mon, Aug 26, 2019 at 10:38 AM Juan José Ramos 
> wrote:
>
> > +1
> >
> > On Mon, Aug 26, 2019 at 6:36 PM Dan Smith  wrote:
> >
> > > Hi,
> > >
> > > I'd like to propose including the fixes for GEODE-7085 into 1.10 (SHA's
> > > below). This is not a new issue, but it does result not being able to
> > > recover from disk without this fix if a cluster has more than 2 billion
> > > updates to a single bucket from a single member.
> > >
> > > SHAs: f587101  ,  f17931bf
> > >
> > > Thanks,
> > > -Dan
> > >
> >
> >
> > --
> > Juan José Ramos Cassella
> > Senior Software Engineer
> > Email: jra...@pivotal.io
> >
>


Re: Propose including GEODE-7085 in 1.10

2019-08-26 Thread Juan José Ramos
+1

On Mon, Aug 26, 2019 at 6:36 PM Dan Smith  wrote:

> Hi,
>
> I'd like to propose including the fixes for GEODE-7085 into 1.10 (SHA's
> below). This is not a new issue, but it does result not being able to
> recover from disk without this fix if a cluster has more than 2 billion
> updates to a single bucket from a single member.
>
> SHAs: f587101  ,  f17931bf
>
> Thanks,
> -Dan
>


-- 
Juan José Ramos Cassella
Senior Software Engineer
Email: jra...@pivotal.io


Propose including GEODE-7085 in 1.10

2019-08-26 Thread Dan Smith
Hi,

I'd like to propose including the fixes for GEODE-7085 into 1.10 (SHA's
below). This is not a new issue, but it does result not being able to
recover from disk without this fix if a cluster has more than 2 billion
updates to a single bucket from a single member.

SHAs: f587101  ,  f17931bf

Thanks,
-Dan


RE: Travis-ci & geode-native repo

2019-08-26 Thread Alberto Bustamante Reyes
that was fast: I contacted Travis support and they increased the timeout for us 


De: Alberto Bustamante Reyes 
Enviado: lunes, 26 de agosto de 2019 13:03
Para: dev@geode.apache.org 
Asunto: RE: Travis-ci & geode-native repo

Thanks for the answer Jake (sorry for the late answer, I was on holidays).

Do you know if Travis is offering that support from free?

The task of moving from Travis to Concourse sounds interesting, is there an 
existing ticket about it?

De: Jacob Barrett 
Enviado: miércoles, 7 de agosto de 2019 17:11
Para: dev@geode.apache.org 
Asunto: Re: Travis-ci & geode-native repo

We worked with Travis support to increase our timeout on the backend. As you 
can see from the Travis report it takes a long time to build. It doesn’t run 
any of the integration tests either. We use it as a litmus test on PRs mostly.

There is a future goal to role the build into the same pipeline as the java 
sources and do binary releases. If you would like to pick up that task it would 
allow you to run a full CI of your own too.

-jake



> On Aug 7, 2019, at 7:42 AM, Alberto Bustamante Reyes 
>  wrote:
>
> Hi,
>
> I have a question about the CI of the Geode C++ client. I would like to set 
> up Travis on a fork of the geode-native repo. I thought that the only 
> requirements to do so was to grant permissions to the repo, but our travis 
> tasks are failing due to the execution timeout is 50 minutes.
>
> I dont see in the travis.yaml any configuration about timeouts and I cannot 
> find any option to change that in Travis page. Could it be that geode-native 
> repo is not using the free version of Travis so it has a longer timeout? I 
> have seen that tasks takes aprox 1 hour and a half.
>
> Thanks!
>
> Alberto B.


RE: Travis-ci & geode-native repo

2019-08-26 Thread Alberto Bustamante Reyes
Thanks for the answer Jake (sorry for the late answer, I was on holidays).

Do you know if Travis is offering that support from free?

The task of moving from Travis to Concourse sounds interesting, is there an 
existing ticket about it?

De: Jacob Barrett 
Enviado: miércoles, 7 de agosto de 2019 17:11
Para: dev@geode.apache.org 
Asunto: Re: Travis-ci & geode-native repo

We worked with Travis support to increase our timeout on the backend. As you 
can see from the Travis report it takes a long time to build. It doesn’t run 
any of the integration tests either. We use it as a litmus test on PRs mostly.

There is a future goal to role the build into the same pipeline as the java 
sources and do binary releases. If you would like to pick up that task it would 
allow you to run a full CI of your own too.

-jake



> On Aug 7, 2019, at 7:42 AM, Alberto Bustamante Reyes 
>  wrote:
>
> Hi,
>
> I have a question about the CI of the Geode C++ client. I would like to set 
> up Travis on a fork of the geode-native repo. I thought that the only 
> requirements to do so was to grant permissions to the repo, but our travis 
> tasks are failing due to the execution timeout is 50 minutes.
>
> I dont see in the travis.yaml any configuration about timeouts and I cannot 
> find any option to change that in Travis page. Could it be that geode-native 
> repo is not using the free version of Travis so it has a longer timeout? I 
> have seen that tasks takes aprox 1 hour and a half.
>
> Thanks!
>
> Alberto B.


Need PR reviews

2019-08-26 Thread Mario Ivanac
Hi Geode dev,

we need review for following PRs:

Jira ticket:

https://issues.apache.org/jira/browse/GEODE-7086
PR:
https://github.com/apache/geode-native/pull/510

Jira ticket: 
https://issues.apache.org/jira/browse/GEODE-7039
PR:
https://github.com/apache/geode/pull/3955

Thanks,
Mario