[ 
https://issues.apache.org/jira/browse/SOLR-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704507#comment-13704507
 ] 

Dawid Weiss commented on SOLR-5022:
-----------------------------------

It's been both fun and a learning experience debugging this. I have good news 
and bad news:

- the good news is: it's not a memory leak,
- the bad news is:  it's not a memory leak :)

h2. the debugging process

Clearly permgen is one of the most wicked JVM features - it's damn hard to 
figure out what its
content really is (I didn't find a way to dump it from within a running process 
without invoking
the debugging interface, which in turn starts its own threads, etc.).

The way I approached the problem (which may be useful for future reference) is 
as follows:

- I wrote a short aspect that injects itself before any String.intern is called:
{code}
    pointcut targetMethod(): call(String java.lang.String.intern());

    before() : targetMethod()
    {
        final JoinPoint jp = thisJoinPoint;
        System.out.println("String#intern() from: " 
            + jp.getSourceLocation().getWithinType() + " => "
            + jp.getTarget());
    }
{code}

- then I added a Before and After hook (executed before/after each test) that 
dumped memory pools:
{code}
        System.out.println("Memdump#from: " 
            + this.getClass().getName() + " => ");

        for (MemoryPoolMXBean bean : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = bean.getUsage();
            System.out.println(
                String.format(Locale.ENGLISH,
                    "%20s - I:%7.1f U:%7.1f M:%7.1f",
                    bean.getName(),
                    usage.getInit() / (1024 * 1024.0d),
                    usage.getUsed() / (1024 * 1024.0d),
                    usage.getMax()  / (1024 * 1024.0d)));
        }
{code}

- then I ran solr test in one JVM, with the following parameters:
{code}
ant -Dtests.seed=143E6CCF7E42064B 
    -Dtests.leaveTemporary=true 
    -Dtests.jvms=1 
    -Dargs="-javaagent:aspectjweaver.jar -XX:+UseCompressedOops 
-XX:+UseParallelGC -XX:+TraceClassLoading"
    test-core
{code}
I had to modify common-build.xml to include aspectj classpath entries (and the 
aspect itself) because 
I couldn't get it to work by passing -cp via the args parameter (didn't look 
too deeply since it's a hack).

- I again modified common-build.xml and added:
{code}
sysouts="true" jvmoutputaction="pipe,ignore"
{code}
to junit4:junit4 task's attributes so that all output is emitted to temporary 
files under a build folder.

h2. the results

>From the dumped output streams we have the following weave info indicating 
>which methods run String.intern:
{code}
$ grep "String.intern(" junit4-J0-20130710_122632_726.syserr

in Type 'com.ctc.wstx.util.SymbolTable'
in Type 'com.ctc.wstx.util.SymbolTable'
in Type 'com.ctc.wstx.util.InternCache'
in Type 'org.apache.solr.response.JSONWriter'
in Type 'org.apache.lucene.codecs.lucene3x.TermBuffer'
in Type 'org.apache.lucene.codecs.lucene3x.Lucene3xFields$PreTermsEnum'
in Type 'org.apache.solr.common.luke.FieldFlag'
in Type 'org.apache.solr.search.DocSetPerf'
in Type 'org.joda.time.tz.ZoneInfoProvider'
in Type 'org.joda.time.tz.DateTimeZoneBuilder$PrecalculatedZone'
in Type 'org.joda.time.tz.DateTimeZoneBuilder$Recurrence'
in Type 'org.joda.time.chrono.GJLocaleSymbols'
in Type 'org.apache.solr.request.TestWriterPerf'
{code}

These indeed intern a lot of strings but they're typically the same so they 
don't amount to the growth of permgen.
This in turn is very steady over the runtime of the test JVM:
{code}
$ egrep -o -e "PS Perm Gen[^%]+" junit4-J0-20130710_122632_726.sysout

PS Perm Gen - I:   20.8 U:   15.9 M:   82.0
PS Perm Gen - I:   20.8 U:   16.1 M:   82.0
PS Perm Gen - I:   20.8 U:   34.6 M:   82.0
PS Perm Gen - I:   20.8 U:   37.7 M:   82.0
PS Perm Gen - I:   20.8 U:   37.7 M:   82.0
PS Perm Gen - I:   20.8 U:   37.9 M:   82.0
PS Perm Gen - I:   20.8 U:   37.9 M:   82.0
PS Perm Gen - I:   20.8 U:   38.0 M:   82.0
...
PS Perm Gen - I:   20.8 U:   77.3 M:   82.0
PS Perm Gen - I:   20.8 U:   77.4 M:   82.0
PS Perm Gen - I:   20.8 U:   77.4 M:   82.0
PS Perm Gen - I:   20.8 U:   77.4 M:   82.0
PS Perm Gen - I:   20.8 U:   77.4 M:   82.0
{code}

I stands for "initial", U for "used", M for "maximum". So you can see that the 
permgen is nearly-exhausted in this run 
(it didn't OOM though). Out of curiosity I checked for class loading markers -- 
classes are loaded throughout the whole run,
because each test loads different fragments of the code. So even at the end of 
the run you get things like:
{code}
Memdump#from: org.apache.solr.update.processor.ParsingFieldUpdateProcessorsTest 
=> 
          Code Cache - I:    2.4 U:   27.0 M:   48.0
       PS Eden Space - I:   62.9 U:   68.7 M:  167.9
   PS Survivor Space - I:   10.4 U:    0.8 M:    0.8
          PS Old Gen - I:  167.5 U:   97.8 M:  341.4
         PS Perm Gen - I:   20.8 U:   72.7 M:   82.0
[Loaded org.joda.time.ReadWritableInstant from 
file:/C:/Work/lucene-solr-svn/branch_4x/solr/core/lib/joda-time-2.2.jar]
[Loaded org.joda.time.ReadWritableDateTime from 
file:/C:/Work/lucene-solr-svn/branch_4x/solr/core/lib/joda-time-2.2.jar]
[Loaded org.joda.time.MutableDateTime from 
file:/C:/Work/lucene-solr-svn/branch_4x/solr/core/lib/joda-time-2.2.jar]
[Loaded org.joda.time.field.AbstractReadableInstantFieldProperty from 
file:/C:/Work/lucene-solr-svn/branch_4x/solr/core/lib/joda-time-2.2.jar]
[Loaded org.joda.time.MutableDateTime$Property from 
file:/C:/Work/lucene-solr-svn/branch_4x/solr/core/lib/joda-time-2.2.jar]
[Loaded org.joda.time.chrono.GJLocaleSymbols from 
file:/C:/Work/lucene-solr-svn/branch_4x/solr/core/lib/joda-time-2.2.jar]
Memdump#from: org.apache.solr.update.processor.ParsingFieldUpdateProcessorsTest 
=> 
{code}

It seems like the problem leading to the permgen is just the huge number of 
classes being loaded under a single class loader (and these
classes cannot be unloaded because they're either cross-referenced or something 
else is holding on to them).

h2. verifying the class-number hipothesis

It was interesting to answer the question: how much permgen space would it take 
to load all these classes *without running tests*? I wrote
a small utility that parses the output log with class loading information:
{code}
...
[Loaded org.apache.lucene.index.DocTermOrds from 
file:/C:/Work/lucene-solr-svn/branch_4x/lucene/build/core/classes/java/]
[Loaded org.apache.lucene.search.FieldCacheImpl$DocTermOrdsCache from 
file:/C:/Work/lucene-solr-svn/branch_4x/lucene/build/core/classes/java/]
[Loaded org.apache.lucene.search.FieldCacheImpl$DocsWithFieldCache from 
file:/C:/Work/lucene-solr-svn/branch_4x/lucene/build/core/classes/java/]
[Loaded org.apache.lucene.search.FieldCache$2 from 
file:/C:/Work/lucene-solr-svn/branch_4x/lucene/build/core/classes/java/] 
...
{code}
and turns it into a custom URLClassLoader with the URLs that appear in those 
entries. Then the tool attempts to load all the referenced classes (and run 
initializers) 
but does not do anything else. It also dumps the permgen state every 100 
classes. The results are as follows:
{code}
# 10240 classes from 61 sources.
    0 -           Code Cache - I:    2.4 U:    0.6 M:   48.0
    0 -        PS Eden Space - I:   62.9 U:   54.1 M: 1319.1
    0 -    PS Survivor Space - I:   10.4 U:    2.6 M:   10.4
    0 -           PS Old Gen - I:  167.5 U:    0.0 M: 2680.0
    0 -          PS Perm Gen - I:   20.8 U:    4.1 M:   82.0
...
 1400 -           Code Cache - I:    2.4 U:    0.7 M:   48.0
 1400 -        PS Eden Space - I:   62.9 U:   18.8 M: 1319.1
 1400 -    PS Survivor Space - I:   10.4 U:    3.5 M:   10.4
 1400 -           PS Old Gen - I:  167.5 U:    0.0 M: 2680.0
 1400 -          PS Perm Gen - I:   20.8 U:   12.0 M:   82.0
...
 6200 -           Code Cache - I:    2.4 U:    1.3 M:   48.0
 6200 -        PS Eden Space - I:   62.9 U:   33.3 M: 1319.1
 6200 -    PS Survivor Space - I:   10.4 U:   10.4 M:   10.4
 6200 -           PS Old Gen - I:  167.5 U:   10.7 M: 2680.0
 6200 -          PS Perm Gen - I:   20.8 U:   45.6 M:   82.0
...
10239 -           Code Cache - I:    2.4 U:    1.5 M:   48.0
10239 -        PS Eden Space - I:   62.9 U:    4.8 M: 1319.1
10239 -    PS Survivor Space - I:   10.4 U:   10.4 M:   10.4
10239 -           PS Old Gen - I:  167.5 U:   21.7 M: 2680.0
10239 -          PS Perm Gen - I:   20.8 U:   71.5 M:   82.0
{code}

which, if you forgot already, very nicely matches the result acquired from the 
real test run (classes plus
interned strings):
{code}
Memdump#from: org.apache.solr.util.FileUtilsTest => 
          Code Cache - I:    2.4 U:   24.3 M:   48.0
       PS Eden Space - I:   62.9 U:   39.6 M:  166.5
   PS Survivor Space - I:   10.4 U:    1.1 M:    2.1
          PS Old Gen - I:  167.5 U:  173.9 M:  341.4
         PS Perm Gen - I:   20.8 U:   77.4 M:   82.0
{code}

I repeated the above results with JDK 1.7 (64 bit) and the required permgen 
space is smaller:
{code}
10239 -           Code Cache - I:    2.4 U:    1.3 M:   48.0
10239 -        PS Eden Space - I:   62.9 U:   97.5 M: 1319.1
10239 -    PS Survivor Space - I:   10.4 U:   10.4 M:   10.4
10239 -           PS Old Gen - I:  167.5 U:   27.5 M: 2680.0
10239 -          PS Perm Gen - I:   20.8 U:   59.9 M:   82.0
{code}

which may be a hint why we're seing the problem only on 1.6 -- we're running 
very close to the limit and 1.6
is less space-conservative.

I also ran it with jrockit (for fun):
{code}
10239 -              Nursery - I:   -0.0 U:   13.0 M: 2918.4
10239 -            Old Space - I:   64.0 U:   62.0 M: 3072.0
10239 -         Class Memory - I:    0.5 U:   68.7 M:   -0.0
10239 -    ClassBlock Memory - I:    0.5 U:    4.0 M:   -0.0
{code}

and with J9:
{code}
10239 -        class storage - I:    0.0 U:   41.3 M:   -0.0
10239 -       JIT code cache - I:    0.0 U:    8.0 M:   -0.0
10239 -       JIT data cache - I:    0.0 U:    0.3 M:   -0.0
10239 - miscellaneous non-heap storage - I:    0.0 U:    0.0 M:   -0.0
10239 -            Java heap - I:    4.0 U:   38.3 M:  512.0
{code}

h2. conclusions

So it's the number of classes that is the core of the problem. The workarounds 
in the order of difficulty:
- increase max permgen for hotspot (other JVMs should be able to do it 
dynamically),
- split solr core tests into multiple ant sub-calls so that they don't run in a 
single JVM,
- change the runner to support running tests in isolation (for example max-N 
tests per JVM, then relaunch)
- probably a lot more options here, depending on your current creativity levels 
:)

                
> PermGen exhausted test failures on Jenkins.
> -------------------------------------------
>
>                 Key: SOLR-5022
>                 URL: https://issues.apache.org/jira/browse/SOLR-5022
>             Project: Solr
>          Issue Type: Test
>          Components: Tests
>            Reporter: Mark Miller
>             Fix For: 5.0, 4.4
>
>         Attachments: intern-count-win.txt, SOLR-5022.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to