Hi all,

I am currently fixing a bug <https://issues.apache.org/jira/browse/GEODE-6304> in the HeapMemoryMonitor event tolerance feature, and ran into a design decision that I thought would be better discussed on the Geode dev list.
For those familiar with the feature: we are proposing that the default value of the gemfire.memoryEventTolerance config parameter be changed from 1 to 0, so that state transitions from normal to eviction or critical occur immediately after a single heap-used-bytes reading above threshold. If you are unfamiliar with the feature, read on.

The memory event tolerance feature addresses an issue with some JVM distros that produce sporadic, erroneously high heap-used-bytes readings. It was introduced to work around this problem in the JRockit JVM, but other JVM distros have since been found to be susceptible as well. The feature prevents an "unexpected" transition from the normal state to the eviction or critical state by tolerating up to N (configurable) consecutive heap-used-bytes readings above threshold before changing state; the transition happens only once that tolerance is exceeded (a rough sketch of the logic is in the P.S. below). The current defaults are N = 5 for JRockit and N = 1 for all other JVMs. On a non-JRockit JVM, this permits a single reading above threshold WITHOUT causing a state transition. In other words, by default we absorb one bad outlier heap-used-bytes reading without entering the eviction or critical state.

As part of this bug fix (which involves a failure to reset the tolerance counter under some conditions), we opted to remove the special handling for JRockit, since JRockit is no longer supported. With the JRockit handling gone, we started re-evaluating whether a default of 1 is appropriate for all other JVMs. We are considering changing the default to 0, so that a state transition occurs immediately when a reading above threshold is received. A user running one of the problematic JVMs can then raise the gemfire.memoryEventTolerance config parameter to increase the tolerance. Our concern is that today's default potentially masks bad heap readings without the user ever knowing.

To summarize: changing the default from 1 to 0 is potentially a change in behavior, in that we would no longer mask a single bad heap-used-bytes reading, i.e., no longer permit a single outlier without changing states. Users can then decide whether to configure a non-zero tolerance to address the situation.

Any thoughts on this change in behavior?

Thanks,
Ryan
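P.S. For anyone who wants the counting semantics spelled out, here is a rough sketch. This is a simplified, hypothetical illustration of the behavior described above, not the actual HeapMemoryMonitor code, and the class, field, and method names are made up:

// Hypothetical sketch of the tolerance semantics; not the actual
// HeapMemoryMonitor implementation.
public class ToleranceSketch {
  // Read the system property, e.g. -Dgemfire.memoryEventTolerance=2.
  // The fallback of 0 here reflects the proposed new default.
  private final int tolerance =
      Integer.getInteger("gemfire.memoryEventTolerance", 0);
  private int toleranceCounter = 0;
  private boolean aboveThreshold = false;

  void onHeapUsageEvent(long usedBytes, long thresholdBytes) {
    if (usedBytes > thresholdBytes) {
      // Transition only once more than `tolerance` consecutive readings
      // have been above threshold. With tolerance = 1 (today's
      // non-JRockit default) a single outlier is absorbed; with
      // tolerance = 0 the first above-threshold reading transitions
      // immediately.
      if (++toleranceCounter > tolerance) {
        aboveThreshold = true; // normal -> eviction/critical
      }
    } else {
      // GEODE-6304 is partly about this reset not happening under some
      // conditions.
      toleranceCounter = 0;
      aboveThreshold = false;
    }
  }
}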