There appears to be consensus that these are critical fixes. The following commits have been brought into release/1.10.0 <https://github.com/apache/geode/tree/release/1.10.0> as the critical fix for GEODE-7088 <https://issues.apache.org/jira/browse/GEODE-7088>:
git cherry-pick -x 174af1d23fb7e09eb2bc2fa55479df854850fadb <https://github.com/apache/geode/commit/174af1d23fb7e09eb2bc2fa55479df854850fadb> git cherry-pick -x 5bb753a8f4ff2886acd8e62d6f51fea58e37881d <https://github.com/apache/geode/commit/5bb753a8f4ff2886acd8e62d6f51fea58e37881d> PR 3976 <https://github.com/apache/geode/pull/3976> has been merged to release/1.10.0 <https://github.com/apache/geode/tree/release/1.10.0> as the critical fix for GEODE-7089 <https://issues.apache.org/jira/browse/GEODE-7089>. GEODE-7088 <https://issues.apache.org/jira/browse/GEODE-7088> and GEODE-7089 <https://issues.apache.org/jira/browse/GEODE-7089> have been marked as 'resolved in' 1.10.0. -Owen > On Aug 26, 2019, at 4:22 PM, Udo Kohlmeyer <u...@apache.com> wrote: > > Thank you Ryan, > > +1 for inclusion > > On 8/26/19 3:33 PM, Ryan McMahon wrote: >> Udo, >> >> Here are inline answers to your questions: >> >> *Is this an existing issue?* >> >> Short answer - yes, but it has never been in a release version of Geode. >> The leak was introduced as part of some changes to address handling >> multiple concurrent registration requests for a given client on a single >> server. The issue is only seen if client registration fails which is not >> particularly common, which is why we are only seeing it now after months of >> testing. The commit for that was introduced here on April 30th. >> https://github.com/apache/geode/commit/bc2a2fa5af374cfedfba4dc1abe6cbc2a7b719c8 >> The ConcurrentModificationException issue (which ultimately causes the >> registration to fail) was introduced on April 22nd here: >> https://github.com/apache/geode/commit/afc311c04f6908a8f725834cdf2c49ce6e902b3f >> >> >> *Why is it more critical to squeeze it into an existing (almost >> release) version of Apache Geode?* >> >> Not sure I totally understand this question, but it is critical because the >> leak can cause servers to crash due to OOM. Again, because the problems >> were introduced in late April and we haven't released Geode since then, so >> I think it is very important to merge these fixes into 1.10.0 before we >> release. >> >> >> >> *What guarantees do we have that this fix makes the application more stable >> compared to adding another hidden issue, which we will discover in another >> few weeks from now?* >> >> I added numerous tests for this scenario to ensure that the leak would >> never happen regardless of the cause of a failed client registration. >> There obviously is no way to 100% guarantee that there will be no issues >> that arise from the fixes themselves, but our existing test coverage plus >> the new tests I wrote should give us the confidence we need. >> >> Thanks, >> Ryan >> >> On Mon, Aug 26, 2019 at 3:17 PM Udo Kohlmeyer <u...@apache.com> wrote: >> >>> In order to better understand this request: >>> >>> Is this an existing issue? >>> >>> Why is it more critical to squeeze it into an existing (almost release) >>> version of Apache Geode? >>> >>> What guarantees do we have that this fix makes the application more >>> stable compared to adding another hidden issue, which we will discover >>> in another few weeks from now? >>> >>> >>> --Udo >>> >>> On 8/26/19 3:10 PM, Ryan McMahon wrote: >>>> Hi all, >>>> >>>> I would like to propose cherry-picking GEODE-7088 and GEODE-7089 to the >>>> 1.10.0 release branch. The two JIRAs are related to the same root >>> problem, >>>> which I would classify as critical. We discovered a case where a failed >>>> client registration could lead to a memory leak in a server, eventually >>>> causing the server to crash due to lack of memory. >>>> >>>> The issue is instigated by a ConcurrentModificationException due to >>>> iteration of a non-thread safe collection while it is being mutated >>>> (GEODE-7088). This exception occurs when the client's queue image is >>> being >>>> copied from one server to the next during client registration, and it >>>> causes the client's registration to fail. The client would likely >>> succeed >>>> if it retried registration with that same server, but if it registers >>> with >>>> a different server, we end up leaking events to the client's registration >>>> queue on the original server (GEODE-7089). >>>> >>>> The fix for GEODE-7088 is to use thread-safe collections for interested >>>> clients in client update messages. The fix for GEODE-7089 is to always >>>> drain and remove the registration queue regardless of success or failure. >>>> Together, these fixes prevent the failed registrations and memory leak. >>>> >>>> The SHAs for the fixes and tests in develop are: >>>> >>>> GEODE-7088 >>>> - 174af1d23fb7e09eb2bc2fa55479df854850fadb >>>> - 5bb753a8f4ff2886acd8e62d6f51fea58e37881d >>>> >>>> GEODE-7089 >>>> - 5d0153ad4adb1612a1083673f98b1982819a6589 >>>> >>>> This proposal is to cherry-pick these commits to 1.10.0 release branch. >>>> >>>> Thanks, >>>> Ryan McMahon >>>>