keith-turner opened a new issue, #4783:
URL: https://github.com/apache/accumulo/issues/4783
**Describe the bug**
While working on #4781, an issue was encountered that caused tablet servers to
die with an OOME. Changes were made in #4781 to avoid the problem without
understanding it. The cause has since been tracked down to the following:
* FlakyAmpleServerContext and TestAmple were creating a lot of new Hadoop
configuration objects.
* In VolumeManagerImpl, [this cache](https://github.com/apache/accumulo/blob/c1b1781f0e30a4b110946c9429a99e4918d08261/server/base/src/main/java/org/apache/accumulo/server/fs/VolumeManagerImpl.java#L84-L86)
was holding on to all of the Hadoop config objects for up to 24 hours.
This problem does not currently happen when running the integration tests
without the injected test code.
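The retention pattern described above can be illustrated with a minimal sketch. This is not the VolumeManagerImpl code: `FakeConf` is a hypothetical stand-in for a Hadoop configuration object, and the plain static map is an assumed simplification of a cache that only evicts after a long time-to-live.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StaticCacheLeakSketch {
  // Hypothetical stand-in for a configuration object. It uses the default
  // identity-based equals/hashCode, so every new instance is a distinct key.
  public static final class FakeConf {
    final byte[] payload = new byte[1024]; // simulates memory held per config
  }

  // Static cache with strong references to its keys. Assumption for this
  // sketch: nothing is evicted during the run (modeling a long expiry).
  public static final Map<FakeConf, String> CACHE = new ConcurrentHashMap<>();

  public static String lookup(FakeConf conf) {
    return CACHE.computeIfAbsent(conf, c -> "derived-" + System.identityHashCode(c));
  }

  // Creates a fresh config per request, as the injected test code did, and
  // returns the cache size afterwards.
  public static int simulateRequests(int n) {
    for (int i = 0; i < n; i++) {
      lookup(new FakeConf()); // the caller drops its reference right away
    }
    return CACHE.size();
  }

  public static void main(String[] args) {
    System.out.println("entries retained: " + simulateRequests(1000));
  }
}
```

Although each caller discards its config immediately, the static map still references every one of them, so none can be garbage collected until the cache lets go.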
**To Reproduce**
Apply the following changes to c1b1781f0e30a4b110946c9429a99e4918d08261 and
run VolumeFlakyAmpleIT:
```diff
diff --git a/minicluster/src/main/java/org/apache/accumulo/miniclusterImpl/MiniAccumuloClusterImpl.java b/minicluster/src/main/java/org/apache/accumulo/miniclusterImpl/MiniAccumuloClusterImpl.java
index ba0cfe926f..71d995532c 100644
--- a/minicluster/src/main/java/org/apache/accumulo/miniclusterImpl/MiniAccumuloClusterImpl.java
+++ b/minicluster/src/main/java/org/apache/accumulo/miniclusterImpl/MiniAccumuloClusterImpl.java
@@ -449,6 +449,8 @@ public class MiniAccumuloClusterImpl implements AccumuloCluster {
jvmOpts.add("-Dzookeeper.jmx.log4j.disable=true");
}
jvmOpts.add("-Xmx" + config.getMemory(serverType));
+ // When server dies, create a heap dump for analysis
+ jvmOpts.add("-XX:+HeapDumpOnOutOfMemoryError");
if (configOverrides != null && !configOverrides.isEmpty()) {
File siteFile =
Files.createTempFile(config.getConfDir().toPath(), "accumulo", ".properties").toFile();
diff --git a/test/src/main/java/org/apache/accumulo/test/VolumeFlakyAmpleIT.java b/test/src/main/java/org/apache/accumulo/test/VolumeFlakyAmpleIT.java
index 1262c00fa0..9c961f256c 100644
--- a/test/src/main/java/org/apache/accumulo/test/VolumeFlakyAmpleIT.java
+++ b/test/src/main/java/org/apache/accumulo/test/VolumeFlakyAmpleIT.java
@@ -44,8 +44,11 @@ public class VolumeFlakyAmpleIT extends VolumeITBase {
// The regular version of this test creates 100 tablets. However 100 tablets and FlakyAmple
// causing each tablet operation take longer results in longer test runs times. So lower the
// number of tablets to 10 to speed up the test with flaky ample.
+
+ // increase the number of tablets in the test as this will increase the number of conditional
+ // mutations written using FlakyAmple
TreeSet<Text> splits = new TreeSet<>();
- for (int i = 10; i < 100; i += 10) {
+ for (int i = 1; i < 100; i += 1) {
splits.add(new Text(String.format("%06d", i * 100)));
}
return splits;
diff --git a/test/src/main/java/org/apache/accumulo/test/ample/FlakyAmpleServerContext.java b/test/src/main/java/org/apache/accumulo/test/ample/FlakyAmpleServerContext.java
index 209cfbbbb8..15e2f294ac 100644
--- a/test/src/main/java/org/apache/accumulo/test/ample/FlakyAmpleServerContext.java
+++ b/test/src/main/java/org/apache/accumulo/test/ample/FlakyAmpleServerContext.java
@@ -26,8 +26,6 @@ import org.apache.accumulo.core.metadata.schema.Ample;
import org.apache.accumulo.server.ServerContext;
import org.apache.accumulo.test.ample.metadata.TestAmple;
-import com.google.common.base.Suppliers;
-
/**
* A goal of this class is to exercise the lambdas passed to
* {@link org.apache.accumulo.core.metadata.schema.Ample.ConditionalTabletMutator#submit(Ample.RejectionHandler)}.
@@ -44,10 +42,13 @@ public class FlakyAmpleServerContext extends ServerContext {
// seemed to hang around and cause OOME and process death. Did not track down why they were
// hanging around, but decided to avoid creating a new instance of TestAmple each time Ample is
// requested in order to avoid creating those hadoop config objects.
- ampleSupplier = Suppliers.memoize(() -> TestAmple.create(
- this, Map.of(Ample.DataLevel.USER, Ample.DataLevel.USER.metaTable(),
- Ample.DataLevel.METADATA, Ample.DataLevel.METADATA.metaTable()),
- FlakyInterceptor::new));
+
+ // changed this to create a TestAmple object per request for Ample to trigger the OOME
+ ampleSupplier =
+ () -> TestAmple.create(
+ this, Map.of(Ample.DataLevel.USER, Ample.DataLevel.USER.metaTable(),
+ Ample.DataLevel.METADATA, Ample.DataLevel.METADATA.metaTable()),
+ FlakyInterceptor::new);
}
@Override
```
**Expected behavior**
The static cache should not keep objects alive that nothing else references.
The cache assumes that Accumulo code creates only a few Hadoop configuration
objects in the JVM. When that assumption does not hold, the static cache ends
up retaining every one of them. Could try to detect deviations from this
assumption and/or make the cache use weak references if possible.
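As a rough illustration of both ideas, here is a minimal sketch using only the JDK. It is not a proposed patch: `FakeConf`, `WARN_THRESHOLD`, and `exceedsExpectedSize` are hypothetical names, and `java.util.WeakHashMap` stands in for whatever weak-reference support the real cache implementation might offer.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class WeakKeyCacheSketch {
  // Hypothetical stand-in for a configuration object (identity equality).
  public static final class FakeConf {}

  // Hypothetical limit: more distinct configs than this suggests the
  // "only a few config objects" assumption no longer holds.
  public static final int WARN_THRESHOLD = 100;

  private static final AtomicInteger COUNTER = new AtomicInteger();

  // WeakHashMap references its keys weakly, so an entry can be dropped once
  // the key is unreachable elsewhere. The values here must not strongly
  // reference their key, otherwise the entry would never be cleared.
  public static final Map<FakeConf, String> CACHE =
      Collections.synchronizedMap(new WeakHashMap<>());

  public static String lookup(FakeConf conf) {
    return CACHE.computeIfAbsent(conf, c -> "derived-" + COUNTER.incrementAndGet());
  }

  // Detection hook: callers could log a warning when this returns true.
  public static boolean exceedsExpectedSize() {
    return CACHE.size() > WARN_THRESHOLD;
  }

  public static void main(String[] args) {
    for (int i = 0; i < 1000; i++) {
      lookup(new FakeConf()); // key becomes unreachable right after the call
    }
    System.out.println("over threshold before GC: " + exceedsExpectedSize());
    System.gc(); // only a hint; when entries are cleared is up to the JVM
    System.out.println("entries after GC hint: " + CACHE.size());
  }
}
```

The size printed after the GC hint depends on the JVM, so no particular number is guaranteed. The point is only that weakly held keys no longer prevent collection, while the threshold check would surface unexpected growth early.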