[ 
https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921597#comment-16921597
 ] 

Hoss Man commented on SOLR-13709:
---------------------------------

just to be clear, my primary concern when i created this issue was that it was 
evident from the test failure logs that core reloading (and as erick points 
out: potentially other core level ops) could occur in a race condition with the 
core itself loading.

my comments about {{SolrCores.getCoreDescriptor(String)}} and if/when/why/how 
it should block on attempts to access a core by name if/while that core was 
loading were based *solely* on the existing javadocs for that method.

if those javadocs are and have always been wrong, then trying to "fix" that 
method to match the javadocs isn't necessarily the best solution -- especially 
if doing so causes lots of other problems.  we can always just update the 
javadocs, making a note of when/why/how the value may be null, and audit the 
callers to ensure they are accounting for the possibility of null and handling 
that value in whatever way makes the most sense for the situation (throw NPE, 
throw a diff exception, fail a command, etc...)
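
To illustrate the "audit the callers" option, here is a minimal plain-JDK sketch of a caller that treats a null descriptor as "core unknown or still loading" and fails the command explicitly. The {{Registry}} class, its methods, and the {{reload}} helper are hypothetical stand-ins for illustration only; they are not Solr's {{SolrCores}} API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustration only: none of these names are real Solr classes or methods. */
final class NullableDescriptorSketch {

  /** Hypothetical registry standing in for a by-name descriptor lookup. */
  static final class Registry {
    private final Map<String, String> descriptors = new ConcurrentHashMap<>();

    void register(String coreName, String descriptor) {
      descriptors.put(coreName, descriptor);
    }

    /** Returns null if the core is unknown or not registered yet (e.g. still loading). */
    String getDescriptor(String coreName) {
      return descriptors.get(coreName);
    }
  }

  /** A caller that accounts for null instead of dereferencing it blindly. */
  static String reload(Registry registry, String coreName) {
    String descriptor = registry.getDescriptor(coreName);
    if (descriptor == null) {
      // Fail the command with a descriptive error rather than an incidental NPE.
      throw new IllegalStateException(
          "core not available (unknown or still loading): " + coreName);
    }
    return "reloaded " + coreName + " using " + descriptor;
  }
}
```

Whether a given caller should throw, retry, or silently skip would depend on the situation; the sketch only shows the shape of an explicit null check.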

i should point out, i have no idea if a "user level" Core RELOAD (or SWAP or 
UNLOAD) op (ie: something triggered externally via /admin/cores, or via 
overseer) also has this problem, or already accounts for the possibility that a 
core may not yet be loaded -- it may simply be that this particular ZkWatcher 
that is registered by the core to watch the schema is itself broken, and should 
be checking some more explicit state to block and take no action until the core 
is fully loaded.

As far as testing...

[~erickerickson] - it's not really clear to me what/where/how you're currently 
trying to test this? ... as i mentioned, it's kind of a fluke that 
TestSolrCLIRunExample triggered this failure at all, and even when it did it 
didn't really "fail" in a reliable way that was obviously related to this 
specific bug.

I would suggest that a more robust way to test this would be with a more 
targeted non-cloud test, using a custom plugin (search handler, component, 
whatever...) that spins up a background thread to trigger schema updates in ZK 
(so that the problematic watcher which does a core reload on schema changes 
will then fire) and then the custom component should "stall" for some amount of 
time (ideally {{await}}-ing on something instead of an arbitrary sleep, but i 
haven't thought it through enough to know what exact condition it could await 
on) to force a delay in the completion of the SolrCore loading.  Then your 
test just tries to initialize a SolrCore with a config that uses this custom 
plugin, and asserts that the SolrCore initializes fine *AND* that it 
(eventually) picks up the updated schema (via polling on the schema API?)
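
The "stall while a watcher fires" idea can be sketched with plain JDK concurrency primitives. This is a simulation under stated assumptions, not Solr test code: {{FakeCore}}, {{onSchemaChanged}}, {{finishLoading}} and {{runScenario}} are all hypothetical names, and the latch stands in for whatever condition the real plugin would await on.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Plain-JDK simulation; none of these names are Solr APIs. */
final class StalledLoadSketch {

  /** Hypothetical stand-in for a core that applies schema versions. */
  static final class FakeCore {
    private final CountDownLatch loaded = new CountDownLatch(1);
    private final AtomicInteger schemaVersion = new AtomicInteger(0);

    /** Watcher callback: blocks until loading is done, then applies the change. */
    boolean onSchemaChanged(int newVersion, long timeoutMs) throws InterruptedException {
      if (!loaded.await(timeoutMs, TimeUnit.MILLISECONDS)) {
        return false; // core never finished loading; take no action
      }
      schemaVersion.set(newVersion);
      return true;
    }

    void finishLoading() {
      loaded.countDown();
    }

    int schemaVersion() {
      return schemaVersion.get();
    }
  }

  /** Runs the scenario and returns the schema version seen once loading completes. */
  static int runScenario() {
    FakeCore core = new FakeCore();
    CountDownLatch watcherFired = new CountDownLatch(1);

    // Background thread plays both the schema updater and the watcher it triggers.
    Thread updater = new Thread(() -> {
      try {
        watcherFired.countDown();
        core.onSchemaChanged(1, 5000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    updater.start();

    try {
      // The "stall": loading is held open until the watcher has fired.
      if (!watcherFired.await(5, TimeUnit.SECONDS)) {
        throw new IllegalStateException("watcher never fired");
      }
      core.finishLoading();
      updater.join(5000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return core.schemaVersion();
  }

  public static void main(String[] args) {
    System.out.println("schema version after load: " + runScenario());
  }
}
```

Awaiting on a latch rather than sleeping keeps the ordering deterministic: the load cannot complete before the watcher has fired, which is the window the real test would need to force open.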

make sense?

> Race condition on core reload while core is still loading?
> ----------------------------------------------------------
>
>                 Key: SOLR-13709
>                 URL: https://issues.apache.org/jira/browse/SOLR-13709
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
>
>
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that 
> there may be a race condition when attempting to re-load a SolrCore while the 
> core is currently in the process of (re)loading that can leave the SolrCore 
> in an unusable state.
> Details to follow...



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
