[jira] [Commented] (SOLR-13709) Race condition on core reload while core is still loading?

Erick Erickson (Jira) Fri, 23 Aug 2019 14:00:30 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914633#comment-16914633
 ]


Erick Erickson commented on SOLR-13709:
---------------------------------------

[~hossman] I'm hopping in the wayback machine. That said:

AFAICT, that comment in SolrCores.getCoreDescriptor is totally bogus and has 
been since at least 2015.

There are various lists that are maintained so multiple threads can open, close 
and reload cores etc. modifyLock is mostly used to coordinate multiple threads 
making changes to the lists, _not_ to deal with the underlying operations. So 
you're right, there is no blocking being done.

That said, getCoreDescriptor shouldn't be sensitive to whether the core is 
loaded or not; it should be solely about bookkeeping _descriptors_. But that's 
not how it behaves. Over in CoreContainer, after all the cores have been 
discovered, there's this code:

{code}
        if (cd.isTransient() || !cd.isLoadOnStartup()) {
          solrCores.addCoreDescriptor(cd);
        } else if (asyncSolrCoreLoad) {
          solrCores.markCoreAsLoading(cd);
        }
        if (cd.isLoadOnStartup()) {
          futures.add(coreLoadExecutor.submit(() -> {
{code}

Eventually, if isLoadOnStartup is true, the descriptor does get added to the 
core descriptor list as part of the core creation process. So some descriptors 
are available before and some after core discovery, and that may be where the 
race condition is coming from.

I'll play around a bit with what happens if we just add all the descriptors to 
the internal lists before any cores are loaded; that seems like the right thing 
to do. After all, the lazily-loaded cores peacefully exist with a descriptor 
but no loaded core, and "the right thing" happens when the core is referenced, 
so I believe it should be OK.
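
To make the idea concrete, here is a minimal sketch of "register every 
descriptor first, load cores second". The classes DescriptorRegistrySketch and 
Descriptor are hypothetical stand-ins invented for illustration; they are not 
Solr's SolrCores or CoreDescriptor, and the real change may look quite 
different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for illustration only; NOT Solr's real bookkeeping.
class DescriptorRegistrySketch {

    /** Minimal descriptor: a name plus the flag that drives startup loading. */
    static final class Descriptor {
        final String name;
        final boolean loadOnStartup;

        Descriptor(String name, boolean loadOnStartup) {
            this.name = name;
            this.loadOnStartup = loadOnStartup;
        }
    }

    private final Map<String, Descriptor> descriptors = new ConcurrentHashMap<>();
    private final Map<String, String> loadedCores = new ConcurrentHashMap<>();

    /** Phase 1: record every discovered descriptor, whatever its load policy. */
    void registerAll(List<Descriptor> discovered) {
        for (Descriptor d : discovered) {
            descriptors.put(d.name, d);
        }
    }

    /** Phase 2: load only the load-on-startup cores; returns the names loaded. */
    List<String> loadStartupCores() {
        List<String> loaded = new ArrayList<>();
        for (Descriptor d : descriptors.values()) {
            if (d.loadOnStartup) {
                loadedCores.put(d.name, "core:" + d.name); // placeholder for a real core
                loaded.add(d.name);
            }
        }
        return loaded;
    }

    /** Pure bookkeeping lookup: does not depend on whether the core is loaded. */
    Descriptor getDescriptor(String name) {
        return descriptors.get(name);
    }

    boolean isLoaded(String name) {
        return loadedCores.containsKey(name);
    }
}
```

In this sketch a descriptor lookup succeeds as soon as discovery has finished, 
whether or not the corresponding core has been loaded yet.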

Of course I have some fears that something else will pop out, but blocking on 
core load in getCoreDescriptor seems dangerous too: it risks long-to-infinite 
delays if someone happens to ask for a core that is simply not there and never 
will be. And any timeout we choose will be wrong.
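
A rough sketch of why the blocking variant worries me. BlockingLookupSketch 
and its methods are hypothetical names made up for this example, not anything 
in Solr; the point is only that a lookup for a core that never shows up always 
burns the full timeout before giving up.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration of a lookup that blocks until a core is loaded.
class BlockingLookupSketch {
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    /** Called when a core finishes loading; wakes up any waiting lookups. */
    void markLoaded(String name, String descriptor) {
        pending.computeIfAbsent(name, n -> new CompletableFuture<>()).complete(descriptor);
    }

    /** Waits up to timeoutMs for the descriptor; returns null if it never arrives. */
    String getBlocking(String name, long timeoutMs) {
        CompletableFuture<String> f =
            pending.computeIfAbsent(name, n -> new CompletableFuture<>());
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return null; // a nonexistent core costs the caller the whole timeout
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

A short timeout gives up on slow-but-valid loads, while a long one stalls every 
request for a core that does not exist.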

I'll assign this to myself for the nonce. If this doesn't break anything (and 
I'll beast several tests a lot over the weekend) then maybe we can circle back 
next week to see if any proposed changes make sense.

How often do you see this failure? I'll put an e-mail filter in place to catch 
"Unable to reload core" and collect some history on how often it happens, so we 
can have some confidence it actually gets fixed if I can come up with some 
code.

Thanks for sleuthing this!



> Race condition on core reload while core is still loading?
> ----------------------------------------------------------
>
>                 Key: SOLR-13709
>                 URL: https://issues.apache.org/jira/browse/SOLR-13709
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
>
>
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that 
> there may be a race condition when attempting to re-load a SolrCore while the 
> core is currently in the process of (re)loading that can leave the SolrCore 
> in an unusable state.
> Details to follow...



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
