[
https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914633#comment-16914633
]
Erick Erickson commented on SOLR-13709:
---------------------------------------
[~hossman] I'm hopping in the wayback machine. That said:
AFAICT, that comment in SolrCores.getCoreDescriptor is totally bogus and has
been since at least 2015.
There are various lists that are maintained so multiple threads can open, close
and reload cores etc. modifyLock is mostly used to coordinate multiple threads
making changes to the lists, _not_ to deal with the underlying operations. So
you're right, there is no blocking being done.
That said, getCoreDescriptor shouldn't be sensitive to whether the core is
loaded or not, it should be solely about bookkeeping _descriptors_. But it's
not. Over in CoreContainer, after all the cores have been discovered, there's
this code:
{code}
if (cd.isTransient() || !cd.isLoadOnStartup()) {
  solrCores.addCoreDescriptor(cd);
} else if (asyncSolrCoreLoad) {
  solrCores.markCoreAsLoading(cd);
}
if (cd.isLoadOnStartup()) {
  futures.add(coreLoadExecutor.submit(() -> {
{code}
Eventually, if isLoadOnStartup is true, the descriptor does get added to the
core descriptor list as part of the core creation process. So some descriptors
are available as soon as discovery finishes and others only once their core has
been created, and that may be where the race condition is coming from.
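To make that gap concrete, here is a minimal Java sketch. It is not Solr code:
Desc, register, lookup, onDiscovered and onCoreCreated are hypothetical
stand-ins for CoreDescriptor, SolrCores.addCoreDescriptor,
SolrCores.getCoreDescriptor, the discovery-time branch quoted above, and core
creation. It assumes, as described above, that a load-on-startup descriptor is
only registered once its core is created, and it leaves out the
markCoreAsLoading branch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DescriptorGapSketch {
    // Hypothetical stand-in for CoreDescriptor; only the flags used here.
    static final class Desc {
        final String name;
        final boolean isTransient;
        final boolean loadOnStartup;

        Desc(String name, boolean isTransient, boolean loadOnStartup) {
            this.name = name;
            this.isTransient = isTransient;
            this.loadOnStartup = loadOnStartup;
        }
    }

    private final Map<String, Desc> descriptors = new LinkedHashMap<>();
    // Guards the bookkeeping map only; it never waits for a core to load.
    private final Object modifyLock = new Object();

    void register(Desc cd) {
        synchronized (modifyLock) {
            descriptors.put(cd.name, cd);
        }
    }

    // Returns null immediately when the descriptor is not registered (no blocking).
    Desc lookup(String name) {
        synchronized (modifyLock) {
            return descriptors.get(name);
        }
    }

    // Mirrors the quoted branch: only transient or lazy cores are registered here.
    void onDiscovered(Desc cd) {
        if (cd.isTransient || !cd.loadOnStartup) {
            register(cd);
        }
    }

    // Stand-in for core creation, which registers the descriptor as a side effect.
    void onCoreCreated(Desc cd) {
        register(cd);
    }

    public static void main(String[] args) {
        DescriptorGapSketch cores = new DescriptorGapSketch();
        Desc lazy = new Desc("lazy", false, false);
        Desc eager = new Desc("eager", false, true);
        cores.onDiscovered(lazy);
        cores.onDiscovered(eager);
        System.out.println(cores.lookup("lazy") != null);  // true
        System.out.println(cores.lookup("eager") != null); // false: the window for a race
        cores.onCoreCreated(eager);
        System.out.println(cores.lookup("eager") != null); // true
    }
}
```

In this sketch, any caller that looks up "eager" between discovery and core
creation gets null, which is the kind of window a reload request could fall
into.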
I'll play around a bit with what happens if we just add all the descriptors to
the internal lists before any cores are loaded; that seems like the right thing
to do. After all, the lazily-loaded cores peacefully exist with a descriptor
but no loaded core, and "the right thing" happens when the core is referenced,
so I believe it should be OK.
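A rough Java sketch of that idea, under stated assumptions: this is not the
actual CoreContainer code, and Desc, registerAll, loadStartupCores and the two
maps are hypothetical names invented for illustration. It only shows the
ordering being proposed: every descriptor is registered first, and only then
are the load-on-startup cores submitted to an executor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class EagerDescriptorSketch {
    // Hypothetical stand-in for CoreDescriptor.
    static final class Desc {
        final String name;
        final boolean loadOnStartup;

        Desc(String name, boolean loadOnStartup) {
            this.name = name;
            this.loadOnStartup = loadOnStartup;
        }
    }

    final Map<String, Desc> descriptors = new ConcurrentHashMap<>();
    final Map<String, String> loadedCores = new ConcurrentHashMap<>();

    // Step 1: pure bookkeeping, done for every core before any load starts.
    void registerAll(List<Desc> discovered) {
        for (Desc cd : discovered) {
            descriptors.put(cd.name, cd);
        }
    }

    // Step 2: load only the load-on-startup cores; lazy cores keep just a descriptor.
    void loadStartupCores(List<Desc> discovered) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (Desc cd : discovered) {
                if (cd.loadOnStartup) {
                    futures.add(pool.submit(() -> {
                        // Placeholder for the real (slow) core creation.
                        loadedCores.put(cd.name, "core:" + cd.name);
                    }));
                }
            }
            for (Future<?> f : futures) {
                f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        EagerDescriptorSketch s = new EagerDescriptorSketch();
        List<Desc> found = Arrays.asList(new Desc("a", true), new Desc("b", false));
        s.registerAll(found);
        // Descriptor is known before the core exists, as with lazily-loaded cores.
        System.out.println(s.descriptors.containsKey("a") && !s.loadedCores.containsKey("a"));
        s.loadStartupCores(found);
        System.out.println(s.loadedCores.keySet());
    }
}
```

With this ordering a descriptor lookup never depends on how far core loading
has progressed, so nothing has to block or pick a timeout.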
Of course I have some fears that something else will pop out, but blocking on
core load in getCoreDescriptor seems dangerous too: it risks long-to-infinite
delays if someone happens to ask for a core that is simply not there and never
will be. And any timeout we choose will be wrong.
I'll assign this to myself for the nonce. If this doesn't break anything (and
I'll beast several tests a lot over the weekend), then maybe we can circle back
next week to see if any proposed changes make sense.
How often do you see this failure? I'll put an e-mail filter in place for
"Unable to reload core" to collect some history on its frequency, so we can
have some confidence it actually gets fixed if I can come up with some code.
Thanks for sleuthing this!
> Race condition on core reload while core is still loading?
> ----------------------------------------------------------
>
> Key: SOLR-13709
> URL: https://issues.apache.org/jira/browse/SOLR-13709
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Priority: Major
> Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
>
>
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that
> there may be a race condition when attempting to re-load a SolrCore while the
> core is currently in the process of (re)loading that can leave the SolrCore
> in an unusable state.
> Details to follow...