[
https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921469#comment-16921469
]
Erick Erickson commented on SOLR-13709:
---------------------------------------
Making some progress on this; at least I'm getting local beasting to fail. It
fails 2 out of 1,000 runs, so it takes 8-10 hours to try something. So far,
CoreContainer.load() is completing properly and there are still failures. I
still think blocking until CoreContainer.load() is complete is a good
idea.
What I _think_ I'm seeing now is the following sequence:
- CoreContainer.load() completes successfully
- a core create operation is initiated (this happens relatively frequently in
tests of course)
- SolrCores.getCoreDescriptor is called before the core creation is complete;
the coreDescriptor list gets updated fairly late in the core creation process.
Relatively early in the core creation process though, the core is added to
pendingCoreOps, a list of cores that are in transition. My latest hypothesis is
that it's during this interval that SolrCores.getCoreDescriptor is called and
returns null. I have some debugging logging in place to test, and a loop in
place to wait until a core moves out of pendingCoreOps before returning from
SolrCores.getCoreDescriptor.
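To make the idea concrete, here is a minimal sketch of that wait loop. It is
not the actual SolrCores code: PendingAwareRegistry, beginCreate and
finishCreate are hypothetical names, and descriptors are plain strings to keep
it self-contained.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for SolrCores; illustrates the wait only.
class PendingAwareRegistry {
    private final Object lock = new Object();
    // Cores that are in transition (being created, reloaded, ...).
    private final Set<String> pendingCoreOps = new HashSet<>();
    // Updated late in core creation, which is what opens the race.
    private final Map<String, String> descriptors = new HashMap<>();

    // Called early in core creation.
    void beginCreate(String name) {
        synchronized (lock) {
            pendingCoreOps.add(name);
        }
    }

    // Called when creation ends; descriptor is null if creation failed.
    void finishCreate(String name, String descriptor) {
        synchronized (lock) {
            if (descriptor != null) {
                descriptors.put(name, descriptor);
            }
            pendingCoreOps.remove(name);
            lock.notifyAll(); // wake any caller waiting on this core
        }
    }

    // Waits while the core is in transition instead of returning null.
    String getCoreDescriptor(String name) {
        synchronized (lock) {
            while (pendingCoreOps.contains(name)) {
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break; // give up waiting, return whatever is there
                }
            }
            return descriptors.get(name);
        }
    }
}
```

A core that was never put in pendingCoreOps returns immediately, so only
lookups that land inside the creation interval pay for the wait.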
I think there's still a small window between the time
CoreContainer.create(core) is called from some client and the time the entry
gets _in_ the pendingCoreOps list. First I'll check whether pendingCoreOps
occasionally has an entry for a core whose descriptor is being asked for, then
see if I can close that window.
The other thing I'm seeing is that failures happen in several places and have
several different stack traces. I think one that I saw was from metrics,
another from update, etc. All are fairly consistent with my proposed steps, but
then my other three hypotheses have been too.
I'll be traveling Thursday and Friday, then the week after is Activate so this
may languish if I can't get some closure by Sunday.
I still have a problem with the fact that the ".system" collection is regularly
asked for in SolrCores.getCoreDescriptor, even when it's never going to be
there. Anything I do in getCoreDescriptor that waits is susceptible to waiting
on an event that'll never occur. Of course I can time-limit the wait, but the
".system" example suggests that there may be other cases like it.
Waiting while any asked-for core is in pendingCoreOps is fine since that
condition will end as soon as the core is loaded (or fails).
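As a rough sketch of the time-limited variant, assuming the same kind of
registry as a stand-in for SolrCores (BoundedWaitRegistry, its methods and the
timeoutMs parameter are all hypothetical, not Solr API):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for SolrCores with a bounded wait.
class BoundedWaitRegistry {
    private final Object lock = new Object();
    private final Set<String> pendingCoreOps = new HashSet<>();
    private final Map<String, String> descriptors = new HashMap<>();

    void beginCreate(String name) {
        synchronized (lock) {
            pendingCoreOps.add(name);
        }
    }

    // descriptor is null if creation failed.
    void finishCreate(String name, String descriptor) {
        synchronized (lock) {
            if (descriptor != null) {
                descriptors.put(name, descriptor);
            }
            pendingCoreOps.remove(name);
            lock.notifyAll();
        }
    }

    // Waits only while the core is pending, and never longer than timeoutMs.
    // A core that is never going to exist (e.g. ".system") is not pending,
    // so the lookup returns null right away.
    String getCoreDescriptor(String name, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        synchronized (lock) {
            while (pendingCoreOps.contains(name)) {
                long remainingMs =
                    TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs <= 0) {
                    break; // timed out; wait(0) would block forever
                }
                try {
                    lock.wait(remainingMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return descriptors.get(name);
        }
    }
}
```

The timeout here is only a safety net; the pendingCoreOps check is what keeps
the wait tied to an event that is known to end.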
> Race condition on core reload while core is still loading?
> ----------------------------------------------------------
>
> Key: SOLR-13709
> URL: https://issues.apache.org/jira/browse/SOLR-13709
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Assignee: Erick Erickson
> Priority: Major
> Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
>
>
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that
> there may be a race condition when attempting to re-load a SolrCore while the
> core is currently in the process of (re)loading that can leave the SolrCore
> in an unusable state.
> Details to follow...