[ 
https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914948#comment-16914948
 ] 

Erick Erickson commented on SOLR-13709:
---------------------------------------

At a quick glance, that doc hasn't been accurate since 2015, so I don't trust 
it in the least. Also, in some testing I was doing last night, there were many 
(apparently) legitimate cases where getCoreDescriptor is called and returns 
null, so blocking forever would stop at least the tests cold. This happens 
particularly when looking for things like the ".system" collection. The comment 
is totally bogus; I'll change it if I can figure out a fix.

Your hypothesis is that CoreContainer.load() is on one thread and the watcher 
is on another, right? And loading, which could easily take a long time if there 
are a lot of cores (especially if only a limited number of threads are loading 
them), isn't done yet, hence the race.

Off the top of my head, it'd be OK to block until CoreContainer.load() is 
finished. The {{status}} is there specifically so a transient plugin can detect 
this state; there's no reason we can't use it in other places. At that point, 
all core _descriptors_ will be available to getCoreDescriptor, whether or not 
the core is actually loaded (i.e. transient or lazy). In that case null should 
not be returned from getCoreDescriptor. I'll give that a whirl.
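
To make the idea concrete, here is a minimal sketch of "block until load is 
finished, then look up the descriptor". It is not Solr code: DescriptorRegistry, 
register, markLoadFinished and getDescriptor are hypothetical names, and 
descriptors are plain strings. The wait is bounded so a caller can never hang 
forever.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the container; NOT Solr's actual CoreContainer.
class DescriptorRegistry {
    private final Map<String, String> descriptors = new ConcurrentHashMap<>();
    // Opened exactly once, when the (hypothetical) load phase completes.
    private final CountDownLatch loadFinished = new CountDownLatch(1);

    // Called by the loading thread for each core it discovers.
    void register(String coreName, String descriptor) {
        descriptors.put(coreName, descriptor);
    }

    // Called once by the loading thread when loading is done.
    void markLoadFinished() {
        loadFinished.countDown();
    }

    // Waits up to timeoutMs for loading to finish, then looks up the descriptor.
    // Returns null only if the core is genuinely unknown after loading.
    String getDescriptor(String coreName, long timeoutMs) {
        try {
            if (!loadFinished.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException(
                    "load did not finish within " + timeoutMs + " ms");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting for load", e);
        }
        return descriptors.get(coreName);
    }
}
```

With this shape, a null return after the wait means "no such core" rather than 
"not loaded yet". It does not address descriptors that become visible only some 
time after a core is created.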

But there's one other thing that occurred to me. When a core is created there's 
a period during which the core descriptor is not available to getCoreDescriptor 
for an indeterminate amount of time. Do you think that'd also be a problem?

I'll try blocking until CoreContainer.load() is finished and add some logging 
in both cases to see whether we actually hit the state where 
CoreContainer.load() isn't finished, we can't find the descriptor, and it isn't 
the .system collection, which seems to be requested a lot.

It'd actually be easier to debug if we can fail in this case. Is there an easy 
way for Solr code to know whether it's being run from a test? I'd like 
getCoreDescriptor to throw an error _only when testing_ for a while if it gets 
into this situation. I'd make this JIRA a blocker in that case so we'd be sure 
to clean that up before release.
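
For illustration, a rough sketch of "throw only when testing, log otherwise". 
Everything here is an assumption: MissingDescriptorPolicy, check and the system 
property name example.failOnMissingDescriptor are made up for this example and 
are not existing Solr or test-framework names. A test runner would have to set 
the property itself.

```java
// Hypothetical helper; NOT part of Solr. The property name is an assumption.
final class MissingDescriptorPolicy {
    static final String STRICT_PROPERTY = "example.failOnMissingDescriptor";

    private MissingDescriptorPolicy() {}

    // True when the (hypothetical) property is set to "true", e.g. by a test runner.
    static boolean isStrict() {
        return Boolean.getBoolean(STRICT_PROPERTY);
    }

    // Passes the descriptor through unchanged. If it is null while loading is
    // still in progress and the core is not ".system", this either throws
    // (strict mode) or writes a warning to stderr (normal mode).
    static String check(String coreName, String descriptor, boolean loadFinished) {
        boolean suspicious =
            descriptor == null && !loadFinished && !".system".equals(coreName);
        if (suspicious) {
            String msg = "No descriptor for core '" + coreName
                + "' while load is still in progress";
            if (isStrict()) {
                throw new IllegalStateException(msg);
            }
            System.err.println("WARN " + msg);
        }
        return descriptor;
    }
}
```

Keeping the decision in one place would make it easy to delete the strict path 
before a release.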

> Race condition on core reload while core is still loading?
> ----------------------------------------------------------
>
>                 Key: SOLR-13709
>                 URL: https://issues.apache.org/jira/browse/SOLR-13709
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
>
>
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that 
> there may be a race condition when attempting to re-load a SolrCore while the 
> core is currently in the process of (re)loading that can leave the SolrCore 
> in an unusable state.
> Details to follow...


