[ https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914948#comment-16914948 ]
Erick Erickson commented on SOLR-13709:
---------------------------------------
That doc hasn't been accurate since 2015 on a quick glance, so I don't trust it
in the least. Also, in some testing I was doing last night there are many
(apparently) legitimate times that getCoreDescriptor is called and returns
null, so blocking forever would stop at least the tests cold, particularly
when looking for things like the ".system" collection. The comment is totally
bogus, I'll change it if I can figure out a fix.
Your hypothesis is that CoreContainer.load() is on one thread and the watcher
is on another, right? And loading isn't done yet (it could easily take a long
time if there are a lot of cores, especially with a limited number of threads
loading them), thus the race.
Off the top of my head, it'd be OK to block until CoreContainer.load is
finished. The {{status}} is there specifically so a transient plugin
can detect this state, there's no reason we can't use it in other places. At
that point, all core _descriptors_ will be available to getCoreDescriptor,
whether or not the core is actually loaded (i.e. transient or lazy). In that
case null should not be returned from getCoreDescriptor. I'll give that a whirl.
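To make the blocking idea concrete, here's a rough sketch. The class, method names and the latch are all made up for illustration (this is not the actual CoreContainer code, and the real thing would key off the existing status rather than a new latch). It waits with a timeout rather than forever, so a caller can't hang if load never completes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the descriptor lookup; NOT Solr's CoreContainer.
class DescriptorRegistry {
  // Core name -> descriptor (a plain String here to keep the sketch small).
  private final Map<String, String> descriptors = new ConcurrentHashMap<>();
  // Released exactly once, when the (hypothetical) load phase completes.
  private final CountDownLatch loadFinished = new CountDownLatch(1);

  // Called by the loading thread for every core it discovers.
  void register(String coreName, String descriptor) {
    descriptors.put(coreName, descriptor);
  }

  // Called by the loading thread once all descriptors are registered.
  void markLoadFinished() {
    loadFinished.countDown();
  }

  // Blocks until load has finished (or the timeout expires), then looks up
  // the descriptor. May still return null for cores that do not exist.
  String getDescriptor(String coreName, long timeoutMs) {
    try {
      if (!loadFinished.await(timeoutMs, TimeUnit.MILLISECONDS)) {
        throw new IllegalStateException(
            "load did not finish within " + timeoutMs + " ms");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt flag
      throw new IllegalStateException("interrupted while waiting for load", e);
    }
    return descriptors.get(coreName);
  }
}
```

Once the latch is released, a null return means the descriptor really doesn't exist rather than that load hasn't gotten to it yet.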
But there's one other thing that occurred to me. When a core is created there's
a period of indeterminate length during which the core descriptor is not
available to getCoreDescriptor. Do you think that'd also be a problem?
I'll try blocking until CoreContainer.load is finished and add some logging in
both cases to see if we actually hit the state where CoreContainer.load() isn't
finished, we can't find the descriptor, and it isn't the .system collection,
which seems to be looked up a lot.
It'd actually be easier to debug if we can fail in this case. Is there an easy
way for Solr code to know whether it's being run from a test? I'd like
getCoreDescriptor to throw an error _only when testing_ for a while if it gets
into this situation. I'd make this JIRA a blocker in that case so we'd be sure
to clean that up before release.
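Something along these lines is what I have in mind. The system property name and the class are invented for this sketch; I don't know yet what the right existing hook is, so treat the flag as a placeholder:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a "fail only under test" lookup; NOT Solr code.
class StrictDescriptorLookup {
  // Made-up property name; a test harness would set it to "true".
  static final String STRICT_PROP = "example.strictDescriptorLookup";

  private final Map<String, String> descriptors = new ConcurrentHashMap<>();

  void register(String coreName, String descriptor) {
    descriptors.put(coreName, descriptor);
  }

  // Returns the descriptor or null in normal operation. When the strict
  // flag is set, a missing descriptor for anything other than ".system"
  // throws instead, so the bad state shows up as a test failure.
  String getDescriptor(String coreName) {
    String descriptor = descriptors.get(coreName);
    boolean strict = Boolean.getBoolean(STRICT_PROP);
    if (descriptor == null && strict && !".system".equals(coreName)) {
      throw new IllegalStateException("no descriptor for core " + coreName);
    }
    return descriptor;
  }
}
```

With the property unset the behavior is unchanged (null comes back as it does today), so nothing outside the tests would be affected.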
> Race condition on core reload while core is still loading?
> ----------------------------------------------------------
>
> Key: SOLR-13709
> URL: https://issues.apache.org/jira/browse/SOLR-13709
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Assignee: Erick Erickson
> Priority: Major
> Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
>
>
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that
> there may be a race condition when attempting to re-load a SolrCore while the
> core is currently in the process of (re)loading that can leave the SolrCore
> in an unusable state.
> Details to follow...
--
This message was sent by Atlassian Jira
(v8.3.2#803003)