keith-turner opened a new issue, #4783:
URL: https://github.com/apache/accumulo/issues/4783

   **Describe the bug**
   
   During work on #4781 an issue was encountered that caused tablet servers to die with an OOME. Changes were made in #4781 to avoid the problem without understanding it. The cause was tracked down to the following.
   
    * FlakyAmpleServerContext + TestAmple were creating a lot of new Hadoop configuration objects
    * In VolumeManagerImpl, [this cache](https://github.com/apache/accumulo/blob/c1b1781f0e30a4b110946c9429a99e4918d08261/server/base/src/main/java/org/apache/accumulo/server/fs/VolumeManagerImpl.java#L84-L86) was hanging on to all of the Hadoop config objects for up to 24 hours (see the sketch below)
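
   To illustrate the retention pattern, here is a minimal, hypothetical sketch (not the actual VolumeManagerImpl code; the Caffeine cache, the FileSystem values, and the getFileSystem helper are assumptions for illustration only): a static cache with strong keys and a 24 hour expiration keeps every distinct Hadoop Configuration it is handed reachable until the entry expires.

   ```java
   import java.io.IOException;
   import java.io.UncheckedIOException;
   import java.time.Duration;

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileSystem;

   import com.github.benmanes.caffeine.cache.Cache;
   import com.github.benmanes.caffeine.cache.Caffeine;

   public class StaticConfigCacheSketch {

     // Hypothetical stand-in for a static, time-expiring cache keyed on Hadoop
     // Configuration instances. Every distinct Configuration used as a key stays
     // strongly reachable from the cache for up to 24 hours, even if nothing else
     // in the JVM references it.
     private static final Cache<Configuration,FileSystem> FS_CACHE =
         Caffeine.newBuilder().expireAfterAccess(Duration.ofHours(24)).build();

     public static FileSystem getFileSystem(Configuration conf) {
       // A caller that builds a brand new Configuration per call (as the injected
       // FlakyAmpleServerContext + TestAmple code did) adds a new key each time,
       // so the cache grows until the heap is exhausted.
       return FS_CACHE.get(conf, c -> {
         try {
           return FileSystem.get(c);
         } catch (IOException e) {
           throw new UncheckedIOException(e);
         }
       });
     }
   }
   ```

   With the test code creating a fresh TestAmple (and with it fresh Hadoop configuration objects) on every Ample request, a cache shaped like this accumulates keys far faster than the 24 hour expiration can drain them.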
   
   This problem does not currently happen without the injected test code when running the integration tests.
   
   **To Reproduce**
   
   Apply the following changes to c1b1781f0e30a4b110946c9429a99e4918d08261 and run VolumeFlakyAmpleIT.
   
   ```diff
   diff --git a/minicluster/src/main/java/org/apache/accumulo/miniclusterImpl/MiniAccumuloClusterImpl.java b/minicluster/src/main/java/org/apache/accumulo/miniclusterImpl/MiniAccumuloClusterImpl.java
   index ba0cfe926f..71d995532c 100644
   --- a/minicluster/src/main/java/org/apache/accumulo/miniclusterImpl/MiniAccumuloClusterImpl.java
   +++ b/minicluster/src/main/java/org/apache/accumulo/miniclusterImpl/MiniAccumuloClusterImpl.java
   @@ -449,6 +449,8 @@ public class MiniAccumuloClusterImpl implements AccumuloCluster {
          jvmOpts.add("-Dzookeeper.jmx.log4j.disable=true");
        }
        jvmOpts.add("-Xmx" + config.getMemory(serverType));
   +    // When server dies, create a heap dump for analysis
   +    jvmOpts.add("-XX:+HeapDumpOnOutOfMemoryError");
        if (configOverrides != null && !configOverrides.isEmpty()) {
          File siteFile =
              Files.createTempFile(config.getConfDir().toPath(), "accumulo", ".properties").toFile();
   diff --git a/test/src/main/java/org/apache/accumulo/test/VolumeFlakyAmpleIT.java b/test/src/main/java/org/apache/accumulo/test/VolumeFlakyAmpleIT.java
   index 1262c00fa0..9c961f256c 100644
   --- a/test/src/main/java/org/apache/accumulo/test/VolumeFlakyAmpleIT.java
   +++ b/test/src/main/java/org/apache/accumulo/test/VolumeFlakyAmpleIT.java
   @@ -44,8 +44,11 @@ public class VolumeFlakyAmpleIT extends VolumeITBase {
        // The regular version of this test creates 100 tablets. However 100 tablets and FlakyAmple
        // causing each tablet operation take longer results in longer test runs times. So lower the
        // number of tablets to 10 to speed up the test with flaky ample.
   +
   +    // increase the number of tablets in the test as this will increase the number of conditional
   +    // mutations written using FlakyAmple
        TreeSet<Text> splits = new TreeSet<>();
   -    for (int i = 10; i < 100; i += 10) {
   +    for (int i = 1; i < 100; i += 1) {
          splits.add(new Text(String.format("%06d", i * 100)));
        }
        return splits;
   diff --git a/test/src/main/java/org/apache/accumulo/test/ample/FlakyAmpleServerContext.java b/test/src/main/java/org/apache/accumulo/test/ample/FlakyAmpleServerContext.java
   index 209cfbbbb8..15e2f294ac 100644
   --- a/test/src/main/java/org/apache/accumulo/test/ample/FlakyAmpleServerContext.java
   +++ b/test/src/main/java/org/apache/accumulo/test/ample/FlakyAmpleServerContext.java
   @@ -26,8 +26,6 @@ import org.apache.accumulo.core.metadata.schema.Ample;
    import org.apache.accumulo.server.ServerContext;
    import org.apache.accumulo.test.ample.metadata.TestAmple;

   -import com.google.common.base.Suppliers;
   -
    /**
     * A goal of this class is to exercise the lambdas passed to
     * {@link org.apache.accumulo.core.metadata.schema.Ample.ConditionalTabletMutator#submit(Ample.RejectionHandler)}.
   @@ -44,10 +44,13 @@ public class FlakyAmpleServerContext extends ServerContext {
        // seemed to hang around and cause OOME and process death. Did not track down why they were
        // hanging around, but decided to avoid creating a new instance of TestAmple each time Ample is
        // requested in order to avoid creating those hadoop config objects.
   -    ampleSupplier = Suppliers.memoize(() -> TestAmple.create(
   -        this, Map.of(Ample.DataLevel.USER, Ample.DataLevel.USER.metaTable(),
   -            Ample.DataLevel.METADATA, Ample.DataLevel.METADATA.metaTable()),
   -        FlakyInterceptor::new));
   +
   +    // changed this to create a TestAmple object per request for Ample to trigger the OOME
   +    ampleSupplier =
   +        () -> TestAmple.create(
   +            this, Map.of(Ample.DataLevel.USER, Ample.DataLevel.USER.metaTable(),
   +                Ample.DataLevel.METADATA, Ample.DataLevel.METADATA.metaTable()),
   +            FlakyInterceptor::new);
      }

      @Override
   ```
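
   For context on why the change in #4781 sidestepped the problem: memoizing the supplier means only one TestAmple (and its Hadoop configuration objects) is ever created, instead of one per Ample request. Below is a minimal standalone sketch of that behavior; the counters and placeholder objects are illustrative and are not Accumulo code.

   ```java
   import java.util.concurrent.atomic.AtomicInteger;

   import com.google.common.base.Supplier;
   import com.google.common.base.Suppliers;

   public class MemoizedSupplierSketch {

     public static void main(String[] args) {
       AtomicInteger perRequestCreations = new AtomicInteger();
       AtomicInteger memoizedCreations = new AtomicInteger();

       // Stand-in for creating a TestAmple on every request: each get() builds a
       // new object (and, in the real code, new Hadoop configuration objects).
       Supplier<Object> perRequest = () -> {
         perRequestCreations.incrementAndGet();
         return new Object();
       };

       // Suppliers.memoize caches the first result, so repeated requests reuse a
       // single instance and only one underlying object is ever created.
       Supplier<Object> memoized = Suppliers.memoize(() -> {
         memoizedCreations.incrementAndGet();
         return new Object();
       });

       for (int i = 0; i < 1000; i++) {
         perRequest.get();
         memoized.get();
       }

       // Prints "1000 1": a thousand objects from the per-request supplier, one
       // from the memoized supplier.
       System.out.println(perRequestCreations.get() + " " + memoizedCreations.get());
     }
   }
   ```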
   
   **Expected behavior**
   
   The static cache should not keep objects alive that nothing else references.
   
   The cache assumes that only a few Hadoop configuration objects are created by Accumulo code in the JVM. When that assumption does not hold, the static cache ends up pinning all of those objects and causes problems. We could try to detect deviations from this assumption and/or make the cache use weak references if possible, as sketched below.
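
   As a rough illustration of those two options (a sketch only, not a proposed patch; the Caffeine calls, the value type, and the warning threshold are assumptions made here), weak keys would let an otherwise unreferenced Configuration be collected, and a simple size check could flag when far more configurations exist than expected:

   ```java
   import java.time.Duration;

   import org.apache.hadoop.conf.Configuration;

   import org.slf4j.Logger;
   import org.slf4j.LoggerFactory;

   import com.github.benmanes.caffeine.cache.Cache;
   import com.github.benmanes.caffeine.cache.Caffeine;

   public class WeakKeyConfigCacheSketch {

     private static final Logger LOG = LoggerFactory.getLogger(WeakKeyConfigCacheSketch.class);

     // Hypothetical threshold: Accumulo code is only expected to create a handful
     // of Configuration objects, so more live keys than this suggests the
     // assumption behind the cache no longer holds.
     private static final long EXPECTED_MAX_CONFIGS = 16;

     // weakKeys() stores the Configuration keys behind weak references, so a
     // Configuration that nothing else references can be garbage collected and
     // its entry evicted, instead of being pinned for the full 24 hour window.
     // Caveat: if the cached value holds a strong reference back to the
     // Configuration (as a FileSystem would), weakValues() or a value type that
     // does not retain the key is also needed for this to help.
     private static final Cache<Configuration,Object> CACHE =
         Caffeine.newBuilder().weakKeys().expireAfterAccess(Duration.ofHours(24)).build();

     public static Object get(Configuration conf) {
       if (CACHE.estimatedSize() > EXPECTED_MAX_CONFIGS) {
         // Detect a deviation from the "few configurations" assumption and make
         // it visible instead of silently growing toward an OOME.
         LOG.warn("Unexpectedly large number of cached Hadoop configurations: {}",
             CACHE.estimatedSize());
       }
       return CACHE.get(conf, c -> new Object() /* placeholder for the real cached value */);
     }
   }
   ```

   Whether weak keys are actually workable depends on what the real cache stores; the detection-only check avoids that question at the cost of only logging a warning.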
   

