[ 
https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921469#comment-16921469
 ] 

Erick Erickson commented on SOLR-13709:
---------------------------------------

Making some progress on this; at least I'm getting local beasting to fail. It 
fails in only 2 out of 1,000 runs, so it takes 8-10 hours to try something. So 
far, CoreContainer.load() is completing properly and there are still failures. 
I still think blocking until CoreContainer.load() is complete is a good 
idea.

What I _think_ I'm seeing now is the following sequence:
 - CoreContainer.load() completes successfully
 - a core create operation is initiated (this happens relatively frequently in 
tests of course)
 - SolrCores.getCoreDescriptor is called before the core creation is complete; 
the coreDescriptor list gets updated fairly late in the core creation process.

Relatively early in the core creation process though, the core is added to 
pendingCoreOps, a list of cores that are in transition. My latest hypothesis is 
that it's during this interval that SolrCores.getCoreDescriptor is called and 
returns null. I have some debugging logging in place to test, and a loop in 
place to wait until a core moves out of pendingCoreOps before returning from 
SolrCores.getCoreDescriptor.
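
A minimal sketch of the wait-while-pending idea described above, not the 
actual Solr code. PendingAwareRegistry and its lock, pendingOps, descriptors, 
beginCreate and finishCreate members are hypothetical names made up for 
illustration; only the general shape (mark the core pending early, publish the 
descriptor late, block lookups in between) comes from the description above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for illustration only; this is NOT Solr's SolrCores.
class PendingAwareRegistry {
    private final Object lock = new Object();
    private final Set<String> pendingOps = new HashSet<>();
    private final Map<String, String> descriptors = new HashMap<>();

    // Early in core creation: mark the core as being in transition.
    void beginCreate(String name) {
        synchronized (lock) {
            pendingOps.add(name);
        }
    }

    // Late in core creation: publish the descriptor (pass null if creation
    // failed), clear the pending marker, and wake any waiting lookups.
    void finishCreate(String name, String descriptor) {
        synchronized (lock) {
            if (descriptor != null) {
                descriptors.put(name, descriptor);
            }
            pendingOps.remove(name);
            lock.notifyAll();
        }
    }

    // Lookup that blocks only while the requested core is pending. A name
    // that was never pending (e.g. ".system") returns null immediately.
    String getDescriptor(String name) {
        synchronized (lock) {
            while (pendingOps.contains(name)) {
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null; // give up if the caller is interrupted
                }
            }
            return descriptors.get(name);
        }
    }
}
```

Note that this sketch does nothing about a lookup that arrives before 
beginCreate has run for that core; such a lookup still returns null.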

There's still a small window, I think, between the time 
CoreContainer.create(core) is called from some client and the time the entry 
gets _in_ the pendingCoreOps list. First I'll check whether pendingCoreOps 
occasionally has an entry for a core whose descriptor is being asked for, then 
see if I can close that window.

The other thing I'm seeing is that failures happen in several places and have 
several different stack traces. I think one that I saw was from metrics, 
another from update, etc. All are fairly consistent with my proposed steps, but 
then so were my other three hypotheses.

I'll be traveling Thursday and Friday, then the week after is Activate so this 
may languish if I can't get some closure by Sunday.

I still have a problem with the fact that the ".system" collection is regularly 
asked for in SolrCores.getCoreDescriptor, even when it's never going to be there. 
Anything I do in getCoreDescriptor that waits is susceptible to waiting on an 
event that'll never occur. Of course I can time-limit the wait, but the example 
of asking for the ".system" core just means that there may be another case. 
Waiting while any asked-for core is in pendingCoreOps is fine since that 
condition will end as soon as the core is loaded (or fails).
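
A rough sketch of what a time-limited wait could look like, again not the 
actual Solr code. BoundedWaitRegistry, its members, and the timeoutMs 
parameter are hypothetical names chosen for illustration. The lookup waits 
only while the requested name is pending and gives up once the deadline 
passes, so a request for a core that will never exist cannot block.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for illustration only; this is NOT Solr's SolrCores.
class BoundedWaitRegistry {
    private final Object lock = new Object();
    private final Set<String> pendingOps = new HashSet<>();
    private final Map<String, String> descriptors = new HashMap<>();

    // Early in core creation: mark the core as being in transition.
    void beginCreate(String name) {
        synchronized (lock) {
            pendingOps.add(name);
        }
    }

    // Late in core creation: publish the descriptor (null on failure),
    // clear the pending marker, and wake any waiting lookups.
    void finishCreate(String name, String descriptor) {
        synchronized (lock) {
            if (descriptor != null) {
                descriptors.put(name, descriptor);
            }
            pendingOps.remove(name);
            lock.notifyAll();
        }
    }

    // Waits at most timeoutMs while the core is pending, then returns
    // whatever descriptor is registered at that point (possibly null).
    String getDescriptor(String name, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        synchronized (lock) {
            while (pendingOps.contains(name)) {
                long remainingMs =
                    TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs <= 0) {
                    break; // deadline passed; wait(0) would block forever
                }
                try {
                    lock.wait(remainingMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break; // stop waiting if the caller is interrupted
                }
            }
            return descriptors.get(name);
        }
    }
}
```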

> Race condition on core reload while core is still loading?
> ----------------------------------------------------------
>
>                 Key: SOLR-13709
>                 URL: https://issues.apache.org/jira/browse/SOLR-13709
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
>
>
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that 
> there may be a race condition when attempting to re-load a SolrCore while the 
> core is currently in the process of (re)loading that can leave the SolrCore 
> in an unusable state.
> Details to follow...



