There appears to be consensus that these are critical fixes.

The following commits have been brought into release/1.10.0 
<https://github.com/apache/geode/tree/release/1.10.0> as the critical fix for 
GEODE-7088 <https://issues.apache.org/jira/browse/GEODE-7088>:

git cherry-pick -x 174af1d23fb7e09eb2bc2fa55479df854850fadb 
<https://github.com/apache/geode/commit/174af1d23fb7e09eb2bc2fa55479df854850fadb>
git cherry-pick -x 5bb753a8f4ff2886acd8e62d6f51fea58e37881d 
<https://github.com/apache/geode/commit/5bb753a8f4ff2886acd8e62d6f51fea58e37881d>

PR 3976 <https://github.com/apache/geode/pull/3976> has been merged to 
release/1.10.0 <https://github.com/apache/geode/tree/release/1.10.0> as the 
critical fix for GEODE-7089 <https://issues.apache.org/jira/browse/GEODE-7089>.

GEODE-7088 <https://issues.apache.org/jira/browse/GEODE-7088> and GEODE-7089 
<https://issues.apache.org/jira/browse/GEODE-7089> have been marked as 
'resolved in' 1.10.0.

-Owen

> On Aug 26, 2019, at 4:22 PM, Udo Kohlmeyer <u...@apache.com> wrote:
> 
> Thank you Ryan,
> 
> +1 for inclusion
> 
> On 8/26/19 3:33 PM, Ryan McMahon wrote:
>> Udo,
>> 
>> Here are inline answers to your questions:
>> 
>> *Is this an existing issue?*
>> 
>> Short answer - yes, but it has never been in a release version of Geode.
>> The leak was introduced as part of some changes to address handling
>> multiple concurrent registration requests for a given client on a single
>> server.  The issue is only seen if client registration fails which is not
>> particularly common, which is why we are only seeing it now after months of
>> testing.  The commit for that was introduced here on April 30th.
>> https://github.com/apache/geode/commit/bc2a2fa5af374cfedfba4dc1abe6cbc2a7b719c8
>> The ConcurrentModificationException issue (which ultimately causes the
>> registration to fail) was introduced on April 22nd here:
>> https://github.com/apache/geode/commit/afc311c04f6908a8f725834cdf2c49ce6e902b3f
>> 
>> 
>> *Why is it more critical to squeeze it into an existing (almost
>> release) version of Apache Geode?*
>> 
>> Not sure I totally understand this question, but it is critical because the
>> leak can cause servers to crash due to OOM.  Again, because the problems
>> were introduced in late April and we haven't released Geode since then, so
>> I think it is very important to merge these fixes into 1.10.0 before we
>> release.
>> 
>> 
>> 
>> *What guarantees do we have that this fix makes the application more stable
>> compared to adding another hidden issue, which we will discover in another
>> few weeks from now?*
>> 
>> I added numerous tests for this scenario to ensure that the leak would
>> never happen regardless of the cause of a failed client registration.
>> There obviously is no way to 100% guarantee that there will be no issues
>> that arise from the fixes themselves, but our existing test coverage plus
>> the new tests I wrote should give us the confidence we need.
>> 
>> Thanks,
>> Ryan
>> 
>> On Mon, Aug 26, 2019 at 3:17 PM Udo Kohlmeyer <u...@apache.com> wrote:
>> 
>>> In order to better understand this request:
>>> 
>>> Is this an existing issue?
>>> 
>>> Why is it more critical to squeeze it into an existing (almost release)
>>> version of Apache Geode?
>>> 
>>> What guarantees do we have that this fix makes the application more
>>> stable compared to adding another hidden issue, which we will discover
>>> in another few weeks from now?
>>> 
>>> 
>>> --Udo
>>> 
>>> On 8/26/19 3:10 PM, Ryan McMahon wrote:
>>>> Hi all,
>>>> 
>>>> I would like to propose cherry-picking GEODE-7088 and GEODE-7089 to the
>>>> 1.10.0 release branch.  The two JIRAs are related to the same root
>>> problem,
>>>> which I would classify as critical.  We discovered a case where a failed
>>>> client registration could lead to a memory leak in a server, eventually
>>>> causing the server to crash due to lack of memory.
>>>> 
>>>> The issue is instigated by a ConcurrentModificationException due to
>>>> iteration of a non-thread safe collection while it is being mutated
>>>> (GEODE-7088).  This exception occurs when the client's queue image is
>>> being
>>>> copied from one server to the next during client registration, and it
>>>> causes the client's registration to fail.  The client would likely
>>> succeed
>>>> if it retried registration with that same server, but if it registers
>>> with
>>>> a different server, we end up leaking events to the client's registration
>>>> queue on the original server (GEODE-7089).
>>>> 
>>>> The fix for GEODE-7088 is to use thread-safe collections for interested
>>>> clients in client update messages.  The fix for GEODE-7089 is to always
>>>> drain and remove the registration queue regardless of success or failure.
>>>> Together, these fixes prevent the failed registrations and memory leak.
>>>> 
>>>> The SHAs for the fixes and tests in develop are:
>>>> 
>>>> GEODE-7088
>>>> - 174af1d23fb7e09eb2bc2fa55479df854850fadb
>>>> - 5bb753a8f4ff2886acd8e62d6f51fea58e37881d
>>>> 
>>>> GEODE-7089
>>>> - 5d0153ad4adb1612a1083673f98b1982819a6589
>>>> 
>>>> This proposal is to cherry-pick these commits to 1.10.0 release branch.
>>>> 
>>>> Thanks,
>>>> Ryan McMahon
>>>> 

Reply via email to