[ 
https://issues.apache.org/jira/browse/SOLR-5628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869802#comment-13869802
 ] 

Hoss Man commented on SOLR-5628:
--------------------------------

Comments from McCandless based on his investigation...

{quote}
OK, this seems to be a test bug: somehow, the main thread is removing
the entire index directory from the filesystem, before another thread
calls IW.close.

I can see the test opens 5 IW instances, and then things run fine, and
then the 5 IWs are closed in sequence in a thread, but in the main
thread is returning (randomized testing prints "OK      20.1s |
TestDistribDocBasedVersion.testDistribSearch") before the last 2 IWs
are closed.

So somewhere in the main thread, something is concurrently removing
the index directory before the last 2 IWs are closed.

I tried passing -Dtests.leaveTemporary=true and the test still fails,
so I don't think it's test-framework that's removing the index dir; I
think it's something somewhere in Solr's test infrastructure?

Probably there is just a missing call somewhere to cleanly shutdown
all cores before cleaning up?
{quote}
{quote}
OK, indeed, I tried passing -Dsolr.test.leavedatadir=true (thanks
Rob!) and that makes the test pass.

So somewhere this test just needs to properly close things before
removing filesystem directories.
{quote}

However, when i attempted to add {{-Dsolr.test.leavedatadir=true 
-Dtests.leaveTemporary=true}} I could still reproduce the failure.

Meanwhile, the specific behavior of the failure seems to have changed slightly 
between  r1556643 and r1557798

Here's the failure on branch_4x as of r1556643 ...

{noformat}
   [junit4]    > Throwable #1: 
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught 
exception in thread: Thread[id=137, name=Lucene Merge Thread #0, 
state=RUNNABLE, group=TGRP-TestDistribDocBasedVersion]
   [junit4]    > Caused by: org.apache.lucene.index.MergePolicy$MergeException: 
java.lang.AssertionError
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)
   [junit4]    >        at 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
   [junit4]    >        at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
   [junit4]    > Caused by: java.lang.AssertionError
   [junit4]    >        at 
org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3807)
   [junit4]    >        at 
org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3791)
   [junit4]    >        at 
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3644)
   [junit4]    >        at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
   [junit4]    >        at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)Throwable
 #2: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an 
uncaught exception in thread: Thread[id=138, name=Lucene Merge Thread #0, 
state=RUNNABLE, group=TGRP-TestDistribDocBasedVersion]
   [junit4]    > Caused by: org.apache.lucene.index.MergePolicy$MergeException: 
java.lang.AssertionError
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)
   [junit4]    >        at 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
   [junit4]    >        at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
   [junit4]    > Caused by: java.lang.AssertionError
   [junit4]    >        at 
org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3807)
   [junit4]    >        at 
org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3791)
   [junit4]    >        at 
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3644)
   [junit4]    >        at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
   [junit4]    >        at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
   [junit4] Completed in 29.76s, 1 test, 2 errors <<< FAILURES!
   [junit4] 
   [junit4] 
   [junit4] Tests with failures:
   [junit4]   - org.apache.solr.cloud.TestDistribDocBasedVersion (suite)
   [junit4] 
   [junit4] 
   [junit4] JVM J0:     0.41 ..    31.03 =    30.62s
   [junit4] Execution time total: 31 seconds
   [junit4] Tests summary: 1 suite, 1 test, 2 suite-level errors
{noformat}

... but if you then svn update to r1557798...

{noformat}
   [junit4]   2> 167814 T9 ccr.ThreadLeakControl.checkThreadLeaks SEVERE 2 
threads leaked from SUITE scope at 
org.apache.solr.cloud.TestDistribDocBasedVersion: 
   [junit4]   2>           1) Thread[id=113, name=searcherExecutor-42-thread-1, 
state=WAITING, group=TGRP-TestDistribDocBasedVersion]
   [junit4]   2>                at sun.misc.Unsafe.park(Native Method)
   [junit4]   2>                at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
   [junit4]   2>                at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
   [junit4]   2>                at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
   [junit4]   2>                at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
   [junit4]   2>                at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
   [junit4]   2>                at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   [junit4]   2>                at java.lang.Thread.run(Thread.java:701)
   [junit4]   2>           2) Thread[id=93, name=searcherExecutor-32-thread-1, 
state=WAITING, group=TGRP-TestDistribDocBasedVersion]
   [junit4]   2>                at sun.misc.Unsafe.park(Native Method)
   [junit4]   2>                at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
   [junit4]   2>                at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
   [junit4]   2>                at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
   [junit4]   2>                at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
   [junit4]   2>                at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
   [junit4]   2>                at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   [junit4]   2>                at java.lang.Thread.run(Thread.java:701)

...

   [junit4] ERROR   0.00s | TestDistribDocBasedVersion (suite) <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: ERROR: 
SolrIndexSearcher opens=19 closes=17
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)
   [junit4]    >        at 
org.apache.solr.SolrTestCaseJ4.endTrackingSearchers(SolrTestCaseJ4.java:335)
   [junit4]    >        at 
org.apache.solr.SolrTestCaseJ4.afterClass(SolrTestCaseJ4.java:139)
   [junit4]    >        at java.lang.Thread.run(Thread.java:701)Throwable #2: 
com.carrotsearch.randomizedtesting.ThreadLeakError: 2 threads leaked from SUITE 
scope at org.apache.solr.cloud.TestDistribDocBasedVersion: 
   [junit4]    >    1) Thread[id=113, name=searcherExecutor-42-thread-1, 
state=WAITING, group=TGRP-TestDistribDocBasedVersion]
   [junit4]    >         at sun.misc.Unsafe.park(Native Method)
   [junit4]    >         at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
   [junit4]    >         at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
   [junit4]    >         at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
   [junit4]    >         at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
   [junit4]    >         at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
   [junit4]    >         at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   [junit4]    >         at java.lang.Thread.run(Thread.java:701)
   [junit4]    >    2) Thread[id=93, name=searcherExecutor-32-thread-1, 
state=WAITING, group=TGRP-TestDistribDocBasedVersion]
   [junit4]    >         at sun.misc.Unsafe.park(Native Method)
   [junit4]    >         at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
   [junit4]    >         at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
   [junit4]    >         at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
   [junit4]    >         at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
   [junit4]    >         at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
   [junit4]    >         at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   [junit4]    >         at java.lang.Thread.run(Thread.java:701)
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)Throwable #3: 
com.carrotsearch.randomizedtesting.ThreadLeakError: There are still zombie 
threads that couldn't be terminated:
   [junit4]    >    1) Thread[id=113, name=searcherExecutor-42-thread-1, 
state=WAITING, group=TGRP-TestDistribDocBasedVersion]
   [junit4]    >         at sun.misc.Unsafe.park(Native Method)
   [junit4]    >         at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
   [junit4]    >         at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
   [junit4]    >         at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
   [junit4]    >         at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
   [junit4]    >         at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
   [junit4]    >         at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   [junit4]    >         at java.lang.Thread.run(Thread.java:701)
   [junit4]    >    2) Thread[id=93, name=searcherExecutor-32-thread-1, 
state=WAITING, group=TGRP-TestDistribDocBasedVersion]
   [junit4]    >         at sun.misc.Unsafe.park(Native Method)
   [junit4]    >         at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
   [junit4]    >         at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
   [junit4]    >         at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
   [junit4]    >         at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
   [junit4]    >         at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
   [junit4]    >         at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   [junit4]    >         at java.lang.Thread.run(Thread.java:701)
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)Throwable #4: 
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught 
exception in thread: Thread[id=137, name=Lucene Merge Thread #0, 
state=RUNNABLE, group=TGRP-TestDistribDocBasedVersion]
   [junit4]    > Caused by: org.apache.lucene.index.MergePolicy$MergeException: 
java.lang.AssertionError
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)
   [junit4]    >        at 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
   [junit4]    >        at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
   [junit4]    > Caused by: java.lang.AssertionError
   [junit4]    >        at 
org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3817)
   [junit4]    >        at 
org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3801)
   [junit4]    >        at 
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3654)
   [junit4]    >        at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
   [junit4]    >        at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)Throwable
 #5: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an 
uncaught exception in thread: Thread[id=138, name=Lucene Merge Thread #0, 
state=RUNNABLE, group=TGRP-TestDistribDocBasedVersion]
   [junit4]    > Caused by: org.apache.lucene.index.MergePolicy$MergeException: 
java.lang.AssertionError
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)
   [junit4]    >        at 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
   [junit4]    >        at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
   [junit4]    > Caused by: java.lang.AssertionError
   [junit4]    >        at 
org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3817)
   [junit4]    >        at 
org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3801)
   [junit4]    >        at 
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3654)
   [junit4]    >        at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
   [junit4]    >        at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
   [junit4] Completed in 171.33s, 1 test, 1 failure, 4 errors <<< FAILURES!
   [junit4] 
   [junit4] 
   [junit4] Tests with failures:
   [junit4]   - org.apache.solr.cloud.TestDistribDocBasedVersion (suite)
   [junit4] 
   [junit4] 
   [junit4] JVM J0:     0.41 ..   172.56 =   172.15s
   [junit4] Execution time total: 2 minutes 52 seconds
   [junit4] Tests summary: 1 suite, 1 test, 5 suite-level errors

{noformat}

The key question being: what changed on 4x between these two versions that 
causes these zombie threads on top of the existing problem?


> Cloud test harness manifesting reproducible failures in 
> TestDistribDocBasedVersion
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-5628
>                 URL: https://issues.apache.org/jira/browse/SOLR-5628
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>
> Jenkins uncovered a test seed that causes a reproducible IndexWriter 
> assertion failure in TestDistribDocBasedVersion on the 4x branch.  
> McCandless helped dig in and believe that something in the way the solr test 
> framework is setup is causing the test to delete the index dirs before the 
> IndexWriter is being closed.  Meanwhile, it appears that recent changes to 4x 
> have caused the nature of the failure to change, so that now -- in addition 
> to the IndexWriter assertion failure -- the test cleanup also stalls out and 
> the test runner has to terminate some stalled threads.
> details to following in comment, but here's the reproduce line...
> {noformat}
> ant test  -Dtestcase=TestDistribDocBasedVersion -Dtests.seed=791402573DC76F3C 
> -Dtests.multiplier=3 -Dtests.slow=true
> -Dtests.locale=ar_IQ -Dtests.timezone=Antarctica/Rothera 
> -Dtests.file.encoding=US-ASCII
> {noformat}
> And the mail thread regarding this...
> https://mail-archives.apache.org/mod_mbox/lucene-dev/201401.mbox/%3Calpine.DEB.2.02.1401100930260.20275@frisbee%3E



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to