*Hi Members of Lucene/Solr Developer Team,*
I am new to the Solr source code, so if I am wrong about this, I apologize; but
I hope it attracts your attention.
I am 100% sure I found a bug in the Solr code, in CoreContainer.java.
The bug we encountered looks like this:
-------------------------------------------------------------------------------------------------------------------------
   1. Set transientSize=2.
   2. Create 6 transient cores, all of them with loadOnStartup=true.
   3. Restart Solr.
Then you will find 2 cores loaded, say core5 and core6. They work fine.
But you cannot access core1, core2, core3, or core4. There is no failure and
no error message; you just have to keep waiting… forever.
What makes things even funnier: if you open JConsole, you will find all 6
cores' metrics there, while in a normal situation there should only be metrics
for two cores.
--------------------------------------------------------------------------------------------------------------------------
Reason
Reading the log, you will find that core1, core2, core3, and core4 are reported
as closing, but actually they are not.
In the Solr code, you will find that core1-4 are merely removed from the list in
the core container; the actual closing operation is never done, because the
closer thread is hung!
I will point out the code relevant to that bug below.
How to bypass it
There may be plenty of ways to bypass this bug. Here is one:
   1. Make sure there is at least one transient core (let's call it
"SaviorCore") with loadOnStartup=false.
   2. When the situation described above happens, send a luke/mbean/any other
request that loads "SaviorCore". Then you will find in JConsole that core1,
core2, core3, and core4 are actually closed.
BUG Description
I will now point out the bug in the Solr code.
When a transient core is closed, three things should happen:
   1. Remove the core from the cache, which the code does correctly.
   2. Put the core in the pending-close queue, which the code does correctly.
   3. Send a signal to the CloserThread so it actually closes the core, which
the code *fails* to do reliably.
It is a classic producer/consumer scenario. The authors were too confident and
ignored the BlockingQueue in the Java API designed by the great Doug Lea.
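Just for illustration, here is a minimal, self-contained sketch of that
producer/consumer pattern built on a BlockingQueue (the PendingCloseDemo class
and Item type are invented for this example; they are not Solr classes).
Because take() blocks until an element is available, a hand-off can never be
lost no matter when the producer runs:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PendingCloseDemo {
  // Stand-in for a core that is waiting to be closed.
  static final class Item {
    final String name;
    Item(String name) { this.name = name; }
  }

  public static void main(String[] args) throws InterruptedException {
    BlockingQueue<Item> pendingCloses = new LinkedBlockingQueue<>();

    // Consumer, playing the role of the CloserThread: take() parks until an
    // element arrives, so there is no wait()/notifyAll() pairing to get wrong.
    Thread closer = new Thread(() -> {
      try {
        while (true) {
          Item removeMe = pendingCloses.take();
          System.out.println("closing " + removeMe.name);
        }
      } catch (InterruptedException e) {
        // shutdown requested
      }
    });
    closer.start();

    // Producer, playing the role of the cache eviction: even if the consumer
    // is busy or has not started waiting yet, the element sits in the queue.
    for (int i = 1; i <= 6; i++) {
      pendingCloses.put(new Item("core" + i));
    }

    Thread.sleep(500);  // give the demo consumer time to drain the queue
    closer.interrupt(); // end the demo
  }
}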
Code Relevant to the Solr Bug
There is a CloserThread in CoreContainer.java. The hanging problem lies in the
modify lock and the wait().
while (!container.isShutDown()) {
  synchronized (solrCores.getModifyLock()) { // need this so we can wait and be awoken.
    try {
      solrCores.getModifyLock().wait();
    } catch (InterruptedException e) {
      // Well, if we've been told to stop, we will. Otherwise, continue on and
      // check to see if there are any cores to close.
    }
  }
  for (SolrCore removeMe = solrCores.getCoreToClose();
       removeMe != null && !container.isShutDown();
       removeMe = solrCores.getCoreToClose()) {
    try {
      removeMe.close();
    } finally {
      solrCores.removeFromPendingOps(removeMe.getName());
    }
  }
}
This is how SolrCores.java removes the eldest transient core from the cache.
protected void allocateLazyCores(final int cacheSize, final SolrResourceLoader loader) {
  if (cacheSize != Integer.MAX_VALUE) {
    CoreContainer.log.info("Allocating transient cache for {} transient cores", cacheSize);
    transientCores = new LinkedHashMap<String, SolrCore>(cacheSize, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, SolrCore> eldest) {
        if (size() > cacheSize) {
          synchronized (modifyLock) {
            SolrCore coreToClose = eldest.getValue();
            logger.info("Closing transient core [{}]", coreToClose.getName());
            pendingCloses.add(coreToClose); // Essentially just queue this core up for closing.
            modifyLock.notifyAll(); // Wakes up closer thread too
          }
          return true;
        }
        return false;
      }
    };
  }
}
The notifyAll() signal may be lost if the CloserThread is not waiting at that
moment, for example because it is busy closing other cores. So when the
CloserThread comes back to wait(), it does not know there is another core
awaiting closing.
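To make the lost wakeup concrete, here is a small self-contained sketch of the
textbook remedy: check the condition before calling wait(), inside the same
synchronized block, so a notification that fired earlier is never needed. The
names (GuardedWaitDemo, modifyLock, pendingCloses) only mimic the Solr ones for
illustration; this is not a patch to CoreContainer.java:

import java.util.ArrayDeque;
import java.util.Queue;

public class GuardedWaitDemo {
  private final Object modifyLock = new Object();
  private final Queue<String> pendingCloses = new ArrayDeque<>();

  // Consumer: the wait() is guarded by the queue-empty check, so if a
  // notification fired before we got here, we simply see the non-empty
  // queue and skip the wait instead of sleeping forever.
  void closerLoop() throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      String removeMe;
      synchronized (modifyLock) {
        while (pendingCloses.isEmpty()) {
          modifyLock.wait();
        }
        removeMe = pendingCloses.poll();
      }
      System.out.println("closing " + removeMe);
    }
  }

  // Producer: queue a core for closing and wake the closer thread.
  void queueForClose(String coreName) {
    synchronized (modifyLock) {
      pendingCloses.add(coreName);
      modifyLock.notifyAll();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    GuardedWaitDemo demo = new GuardedWaitDemo();
    Thread closer = new Thread(() -> {
      try {
        demo.closerLoop();
      } catch (InterruptedException ignored) {
        // demo shutdown
      }
    });
    closer.start();
    for (int i = 1; i <= 6; i++) {
      demo.queueForClose("core" + i);
    }
    Thread.sleep(500);
    closer.interrupt();
  }
}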
My attempt
I modified the Solr code to confirm my conclusion. I gave the wait a timeout:
log.info("Edwin's Log: We are trying to fix the startup problem");
solrCores.getModifyLock().wait(*1000*);
log.info("end waiting.");
I built the jar and tested it. With the timed wait, the closer thread wakes up
every second and re-checks for cores to close even if a notification was
missed. *Situation gone, problem solved.*
Regards,
Junhao Li
20150612