[
https://issues.apache.org/jira/browse/SOLR-5628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869802#comment-13869802
]
Hoss Man commented on SOLR-5628:
--------------------------------
Comments from McCandless based on his investigation...
{quote}
OK, this seems to be a test bug: somehow, the main thread is removing
the entire index directory from the filesystem, before another thread
calls IW.close.
I can see the test opens 5 IW instances, and then things run fine, and
then the 5 IWs are closed in sequence in a thread, but in the main
thread is returning (randomized testing prints "OK 20.1s |
TestDistribDocBasedVersion.testDistribSearch") before the last 2 IWs
are closed.
So somewhere in the main thread, something is concurrently removing
the index directory before the last 2 IWs are closed.
I tried passing -Dtests.leaveTemporary=true and the test still fails,
so I don't think it's test-framework that's removing the index dir; I
think it's something somewhere in Solr's test infrastructure?
Probably there is just a missing call somewhere to cleanly shutdown
all cores before cleaning up?
{quote}
{quote}
OK, indeed, I tried passing -Dsolr.test.leavedatadir=true (thanks
Rob!) and that makes the test pass.
So somewhere this test just needs to properly close things before
removing filesystem directories.
{quote}
However, when i attempted to add {{-Dsolr.test.leavedatadir=true
-Dtests.leaveTemporary=true}} I could still reproduce the failure.
Meanwhile, the specific behavior of the failure seems to have changed slightly
between r1556643 and r1557798
Here's the failure on branch_4x as of r1556643 ...
{noformat}
[junit4] > Throwable #1:
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught
exception in thread: Thread[id=137, name=Lucene Merge Thread #0,
state=RUNNABLE, group=TGRP-TestDistribDocBasedVersion]
[junit4] > Caused by: org.apache.lucene.index.MergePolicy$MergeException:
java.lang.AssertionError
[junit4] > at
__randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
[junit4] > Caused by: java.lang.AssertionError
[junit4] > at
org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3807)
[junit4] > at
org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3791)
[junit4] > at
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3644)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)Throwable
#2: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an
uncaught exception in thread: Thread[id=138, name=Lucene Merge Thread #0,
state=RUNNABLE, group=TGRP-TestDistribDocBasedVersion]
[junit4] > Caused by: org.apache.lucene.index.MergePolicy$MergeException:
java.lang.AssertionError
[junit4] > at
__randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
[junit4] > Caused by: java.lang.AssertionError
[junit4] > at
org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3807)
[junit4] > at
org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3791)
[junit4] > at
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3644)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
[junit4] Completed in 29.76s, 1 test, 2 errors <<< FAILURES!
[junit4]
[junit4]
[junit4] Tests with failures:
[junit4] - org.apache.solr.cloud.TestDistribDocBasedVersion (suite)
[junit4]
[junit4]
[junit4] JVM J0: 0.41 .. 31.03 = 30.62s
[junit4] Execution time total: 31 seconds
[junit4] Tests summary: 1 suite, 1 test, 2 suite-level errors
{noformat}
... but if you then svn update to r1557798...
{noformat}
[junit4] 2> 167814 T9 ccr.ThreadLeakControl.checkThreadLeaks SEVERE 2
threads leaked from SUITE scope at
org.apache.solr.cloud.TestDistribDocBasedVersion:
[junit4] 2> 1) Thread[id=113, name=searcherExecutor-42-thread-1,
state=WAITING, group=TGRP-TestDistribDocBasedVersion]
[junit4] 2> at sun.misc.Unsafe.park(Native Method)
[junit4] 2> at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
[junit4] 2> at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
[junit4] 2> at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
[junit4] 2> at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
[junit4] 2> at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
[junit4] 2> at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[junit4] 2> at java.lang.Thread.run(Thread.java:701)
[junit4] 2> 2) Thread[id=93, name=searcherExecutor-32-thread-1,
state=WAITING, group=TGRP-TestDistribDocBasedVersion]
[junit4] 2> at sun.misc.Unsafe.park(Native Method)
[junit4] 2> at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
[junit4] 2> at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
[junit4] 2> at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
[junit4] 2> at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
[junit4] 2> at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
[junit4] 2> at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[junit4] 2> at java.lang.Thread.run(Thread.java:701)
...
[junit4] ERROR 0.00s | TestDistribDocBasedVersion (suite) <<<
[junit4] > Throwable #1: java.lang.AssertionError: ERROR:
SolrIndexSearcher opens=19 closes=17
[junit4] > at
__randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)
[junit4] > at
org.apache.solr.SolrTestCaseJ4.endTrackingSearchers(SolrTestCaseJ4.java:335)
[junit4] > at
org.apache.solr.SolrTestCaseJ4.afterClass(SolrTestCaseJ4.java:139)
[junit4] > at java.lang.Thread.run(Thread.java:701)Throwable #2:
com.carrotsearch.randomizedtesting.ThreadLeakError: 2 threads leaked from SUITE
scope at org.apache.solr.cloud.TestDistribDocBasedVersion:
[junit4] > 1) Thread[id=113, name=searcherExecutor-42-thread-1,
state=WAITING, group=TGRP-TestDistribDocBasedVersion]
[junit4] > at sun.misc.Unsafe.park(Native Method)
[junit4] > at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
[junit4] > at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
[junit4] > at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
[junit4] > at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
[junit4] > at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
[junit4] > at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[junit4] > at java.lang.Thread.run(Thread.java:701)
[junit4] > 2) Thread[id=93, name=searcherExecutor-32-thread-1,
state=WAITING, group=TGRP-TestDistribDocBasedVersion]
[junit4] > at sun.misc.Unsafe.park(Native Method)
[junit4] > at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
[junit4] > at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
[junit4] > at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
[junit4] > at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
[junit4] > at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
[junit4] > at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[junit4] > at java.lang.Thread.run(Thread.java:701)
[junit4] > at
__randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)Throwable #3:
com.carrotsearch.randomizedtesting.ThreadLeakError: There are still zombie
threads that couldn't be terminated:
[junit4] > 1) Thread[id=113, name=searcherExecutor-42-thread-1,
state=WAITING, group=TGRP-TestDistribDocBasedVersion]
[junit4] > at sun.misc.Unsafe.park(Native Method)
[junit4] > at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
[junit4] > at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
[junit4] > at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
[junit4] > at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
[junit4] > at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
[junit4] > at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[junit4] > at java.lang.Thread.run(Thread.java:701)
[junit4] > 2) Thread[id=93, name=searcherExecutor-32-thread-1,
state=WAITING, group=TGRP-TestDistribDocBasedVersion]
[junit4] > at sun.misc.Unsafe.park(Native Method)
[junit4] > at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
[junit4] > at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
[junit4] > at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
[junit4] > at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
[junit4] > at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
[junit4] > at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[junit4] > at java.lang.Thread.run(Thread.java:701)
[junit4] > at
__randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)Throwable #4:
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught
exception in thread: Thread[id=137, name=Lucene Merge Thread #0,
state=RUNNABLE, group=TGRP-TestDistribDocBasedVersion]
[junit4] > Caused by: org.apache.lucene.index.MergePolicy$MergeException:
java.lang.AssertionError
[junit4] > at
__randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
[junit4] > Caused by: java.lang.AssertionError
[junit4] > at
org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3817)
[junit4] > at
org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3801)
[junit4] > at
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3654)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)Throwable
#5: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an
uncaught exception in thread: Thread[id=138, name=Lucene Merge Thread #0,
state=RUNNABLE, group=TGRP-TestDistribDocBasedVersion]
[junit4] > Caused by: org.apache.lucene.index.MergePolicy$MergeException:
java.lang.AssertionError
[junit4] > at
__randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
[junit4] > Caused by: java.lang.AssertionError
[junit4] > at
org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3817)
[junit4] > at
org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3801)
[junit4] > at
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3654)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
[junit4] Completed in 171.33s, 1 test, 1 failure, 4 errors <<< FAILURES!
[junit4]
[junit4]
[junit4] Tests with failures:
[junit4] - org.apache.solr.cloud.TestDistribDocBasedVersion (suite)
[junit4]
[junit4]
[junit4] JVM J0: 0.41 .. 172.56 = 172.15s
[junit4] Execution time total: 2 minutes 52 seconds
[junit4] Tests summary: 1 suite, 1 test, 5 suite-level errors
{noformat}
The key question being: what changed on 4x between these two versions that
causes these zombie threads on top of the existing problem?
> Cloud test harness manifesting reproducible failures in
> TestDistribDocBasedVersion
> ----------------------------------------------------------------------------------
>
> Key: SOLR-5628
> URL: https://issues.apache.org/jira/browse/SOLR-5628
> Project: Solr
> Issue Type: Bug
> Reporter: Hoss Man
>
> Jenkins uncovered a test seed that causes a reproducible IndexWriter
> assertion failure in TestDistribDocBasedVersion on the 4x branch.
> McCandless helped dig in and believe that something in the way the solr test
> framework is setup is causing the test to delete the index dirs before the
> IndexWriter is being closed. Meanwhile, it appears that recent changes to 4x
> have caused the nature of the failure to change, so that now -- in addition
> to the IndexWriter assertion failure -- the test cleanup also stalls out and
> the test runner has to terminate some stalled threads.
> details to following in comment, but here's the reproduce line...
> {noformat}
> ant test -Dtestcase=TestDistribDocBasedVersion -Dtests.seed=791402573DC76F3C
> -Dtests.multiplier=3 -Dtests.slow=true
> -Dtests.locale=ar_IQ -Dtests.timezone=Antarctica/Rothera
> -Dtests.file.encoding=US-ASCII
> {noformat}
> And the mail thread regarding this...
> https://mail-archives.apache.org/mod_mbox/lucene-dev/201401.mbox/%3Calpine.DEB.2.02.1401100930260.20275@frisbee%3E
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]