OK, in working on SOLR-4196 I'm exercising opening/closing cores as never before. I have a little stress program that does about the worst thing possible: it essentially opens and closes a core for every request. It has a bunch of query and update threads running simultaneously, each of which picks a random core and does a query or update. I've got a bunch of code that, I think, prevents any attempt to open _or_ close a core while it is being either opened or closed by another thread (but I'm verifying).
It runs fine for a couple of hours, then hits a race condition. I was able to get a stack trace (see below). CloserThread.run(CoreContainer.java:1920) (second thread below) is, indeed, new code. The stress-test program is updating cores (which may not be loaded) like crazy and doing queries on other random cores. It's perfectly possible to be updating a core that's in the process of being closed; I was counting on the ref counting to make this OK... The cores are transient in a limited cache, so they come and go. It looks like I'm trying to close a core at the same time an update has come in, but I'm not sure whether this is something that should be prevented by the new code or is an underlying problem.

So a couple of questions:

1> SOLR-4196 has a whole series of improvements that even let us get here. Running the stress test program against current trunk barfs before having time to hit this condition, so the current state is an improvement. What do you think about me checking 4196 in and opening a separate JIRA for this issue?

2> Any suggestions on what direction to go next? If it's something easy, I can just fold it into this patch.

3> Am I just going about things bass-ackwards? Not an unusual state of affairs unfortunately.....

NOTE: The current patch for SOLR-4196 isn't the one running with this code; there are a couple more things I want to change. Mostly I'm asking if someone familiar with the code where the race is encountered has a quick fix....
Thanks,
Erick

Found one Java-level deadlock:
=============================
"commitScheduler-122579-thread-1":
  waiting to lock monitor 7f87c3076d38 (object 78b379a28, a org.apache.solr.update.DefaultSolrCoreState),
  which is held by "Thread-15"
"Thread-15":
  waiting for ownable synchronizer 765e84638, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "commitScheduler-122579-thread-1"

Java stack information for the threads listed above:
===================================================
"commitScheduler-122579-thread-1":
        at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:82)
        - waiting to lock <78b379a28> (a org.apache.solr.update.DefaultSolrCoreState)
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1354)
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:573)
        - locked <76aa46f58> (a java.lang.Object)
        at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:680)
"Thread-15":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <765e84638> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)
        at org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:680)
        at org.apache.solr.update.DefaultSolrCoreState.closeIndexWriter(DefaultSolrCoreState.java:68)
        - locked <78b379a28> (a org.apache.solr.update.DefaultSolrCoreState)
        at org.apache.solr.update.DefaultSolrCoreState.close(DefaultSolrCoreState.java:289)
        - locked <78b379a28> (a org.apache.solr.update.DefaultSolrCoreState)
        at org.apache.solr.update.SolrCoreState.decrefSolrCoreState(SolrCoreState.java:68)
        - locked <78b379a28> (a org.apache.solr.update.DefaultSolrCoreState)
        at org.apache.solr.core.SolrCore.close(SolrCore.java:975)
        at org.apache.solr.core.CloserThread.run(CoreContainer.java:1920)

Found 1 deadlock.
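
For anyone skimming the dump: the commit thread holds the ReentrantLock and wants the DefaultSolrCoreState monitor, while the closer thread holds that monitor and wants the ReentrantLock, so the two take the same pair of locks in opposite order. Below is a minimal standalone sketch of that shape. It is not Solr code and not a proposed fix: the class LockOrderSketch, the fields coreState and commitLock, and the method runInvertedOrder are all hypothetical names made up for illustration. The closer uses a timed tryLock only so the demo terminates instead of hanging the way the real threads did.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderSketch {
    // Hypothetical stand-ins: coreState plays the DefaultSolrCoreState monitor,
    // commitLock plays the ReentrantLock that closeWriter blocks on in the trace.
    private final Object coreState = new Object();
    private final ReentrantLock commitLock = new ReentrantLock();

    /**
     * Runs a "commit" thread (commitLock first, then coreState) against a
     * "closer" thread (coreState first, then commitLock). Returns whether the
     * closer obtained commitLock within timeoutMs.
     */
    public boolean runInvertedOrder(long timeoutMs) throws InterruptedException {
        // Makes sure each thread holds its first lock before asking for the second.
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        boolean[] closerGotCommitLock = new boolean[1];

        Thread commit = new Thread(() -> {
            commitLock.lock();
            try {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);
                synchronized (coreState) {
                    // Entered only after the closer gives up and releases coreState.
                }
            } finally {
                commitLock.unlock();
            }
        }, "commit-sketch");

        Thread closer = new Thread(() -> {
            synchronized (coreState) {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);
                try {
                    // A plain lock() here would complete the cycle and never return.
                    if (commitLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                        closerGotCommitLock[0] = true;
                        commitLock.unlock();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "closer-sketch");

        commit.start();
        closer.start();
        commit.join();
        closer.join();
        return closerGotCommitLock[0];
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        boolean acquired = new LockOrderSketch().runInvertedOrder(200);
        System.out.println("closer acquired commitLock: " + acquired);
    }
}
```

In this sketch the closer can never get commitLock, because the commit thread cannot release it until the closer lets go of coreState. Whether the real answer is to make both paths take the locks in the same order, or to keep close from racing a commit at all, is exactly the open question above.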
