Hi all,

I am currently fixing a bug
<https://issues.apache.org/jira/browse/GEODE-6304> with the
HeapMemoryMonitor event tolerance feature, and came across a decision that
I thought would be more appropriate for the Geode dev list.

For those familiar with the feature, we are proposing that the default
gemfire.memoryEventTolerance config parameter value be changed from 1 to 0
so that state transitions from normal to eviction or critical occur
immediately after reading a single heap-used-bytes event above the
threshold.  If you are unfamiliar with the feature, read on.

The memory event tolerance feature addresses issues with some JVM distros
that result in sporadic, erroneously high heap-bytes-used readings.  The
feature was introduced to address this issue in the JRockit JVM, but it has
been found that other JVM distros are susceptible to this problem as well.

The feature prevents an "unexpected" state transition from a normal state
to an eviction or critical state by requiring N (configurable) consecutive
heap-used-bytes events above the threshold before changing states.  The current
default configuration is N = 5 for JRockit and N = 1 for all other JVMs.
In a non-JRockit JVM, this configuration permits a single event above
threshold WITHOUT causing a state transition.  In other words, by default,
we allow for a single bad outlier heap-used-bytes reading without going
into an eviction or critical state.
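Abstracting away the monitor details, the counting behavior described above
can be sketched roughly as follows.  This is a hypothetical illustration of
the idea only, not the actual HeapMemoryMonitor code; the class and method
names are invented:

```java
// Hypothetical sketch of the tolerance-counting idea: require more than
// `tolerance` consecutive above-threshold readings before transitioning.
public class ToleranceCounter {
  private final int tolerance;   // analogue of gemfire.memoryEventTolerance
  private int consecutiveAbove;  // above-threshold readings seen in a row

  public ToleranceCounter(int tolerance) {
    this.tolerance = tolerance;
  }

  /**
   * Feed one heap-used-bytes reading; returns true when the reading should
   * trigger a state transition (normal -> eviction/critical).
   */
  public boolean onReading(boolean aboveThreshold) {
    if (!aboveThreshold) {
      // A normal reading resets the counter (the bug being fixed involved
      // a failure to reset this counter under some conditions).
      consecutiveAbove = 0;
      return false;
    }
    consecutiveAbove++;
    return consecutiveAbove > tolerance;
  }
}
```

With tolerance = 1 (today's non-JRockit default), a single outlier reading
is absorbed and only a second consecutive above-threshold reading causes a
transition; with tolerance = 0 (the proposed default), the very first
above-threshold reading transitions immediately.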

As part of this bug fix (which involves a failure to reset the tolerance
counter under some conditions), we opted to remove the special handling for
JRockit because JRockit is no longer supported.  After removing the JRockit
handling, we started re-evaluating if a default value of 1 is appropriate
for all other JVMs.  We are considering changing the default to 0, so state
transitions would occur immediately if an event above the threshold is
received.  If a user is facing one of these problematic JVMs, they can then
change the gemfire.memoryEventTolerance config parameter to increase the
tolerance.  Our concern is that the default today is potentially masking
bad heap readings without the user ever knowing.
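For example, a user on one of these problematic JVMs could restore a
tolerance of 1 (or raise it further) by setting the system property at
member startup.  A rough sketch via gfsh (server name and value are just
placeholders for illustration):

```shell
# Pass the tolerance as a JVM system property when starting a server.
gfsh start server --name=server1 --J=-Dgemfire.memoryEventTolerance=1
```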

To summarize, changing the default from 1 to 0 would potentially be a
change in behavior in that we would no longer mask a single bad
heap-used-bytes reading, i.e., no longer permit a single outlier without
changing states.  The user can then decide whether to configure a non-zero
tolerance to address the situation.  Any thoughts on this change in
behavior?

Thanks,
Ryan
