[
https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921597#comment-16921597
]
Hoss Man commented on SOLR-13709:
---------------------------------
just to be clear, my primary concern when i created this issue was that it was
evident from the test failure logs that core reloading (and as erick points
out: potentially other core level ops) could occur in a race condition with the
core itself loading.
my comments about {{SolrCores.getCoreDescriptor(String)}} and if/when/why/how
it should block on attempts to access a core by name if/while that core was
loading were based *solely* on the existing javadocs for that method.
if those javadocs are and have always been wrong, then trying to "fix" that
method to match the javadocs isn't necessarily the best solution -- especially
if doing so causes lots of other problems. we can always just update the
javadocs, making a note of when/why/how the value may be null, and audit the
callers to ensure they are accounting for the possibility of null and handling
that value in whatever way makes the most sense for the situation (throw NPE,
throw a diff exception, fail a command, etc...)
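to illustrate the "audit the callers" option, here is a minimal sketch. it uses a hypothetical stand-in registry ({{DescriptorLookupSketch}}, {{getDescriptorOrNull}} and {{requireDescriptor}} are invented names, not the real {{SolrCores}} API) and assumes the documented contract becomes "null if the core is unknown or still loading":

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a core registry; NOT the real SolrCores class.
class DescriptorLookupSketch {
    // Cores that have finished loading, keyed by name.
    // The String value is only a placeholder for a real descriptor object.
    private final Map<String, String> loaded = new ConcurrentHashMap<>();

    void markLoaded(String name, String descriptor) {
        loaded.put(name, descriptor);
    }

    // Assumed contract: returns null if the core is unknown OR still loading.
    String getDescriptorOrNull(String name) {
        return loaded.get(name);
    }

    // One way a caller might account for null: fail the command with a
    // clear message instead of hitting an NPE further down.
    String requireDescriptor(String name) {
        String descriptor = getDescriptorOrNull(name);
        if (descriptor == null) {
            throw new IllegalStateException(
                "core '" + name + "' is unknown or still loading");
        }
        return descriptor;
    }
}
```

which exception (if any) is appropriate would depend on the caller; the point is only that each caller makes the null case explicit.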
i should point out, i have no idea if a "user level" Core RELOAD (or SWAP or
UNLOAD) op (ie: something triggered externally via /admin/cores, or via
overseer) also has this problem, or already accounts for the possibility that a
core may not yet be loaded -- it may simply be that this particular ZkWatcher
that is registered by the core to watch the schema is itself broken, and should be
checking some more explicit state to block and take no action until the core is
fully loaded.
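a rough sketch of what "checking some more explicit state" could look like, purely as an assumption about one possible shape: the watcher callback defers its reload until a "loaded" future completes. all names here ({{DeferredReloadSketch}}, {{onSchemaChanged}}, {{markLoaded}}) are hypothetical and not Solr APIs, and the reload itself is reduced to a counter:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names are illustrative and not Solr APIs.
class DeferredReloadSketch {
    // Completed exactly once, when the core has finished loading.
    private final CompletableFuture<Void> loaded = new CompletableFuture<>();
    // Stands in for "a reload happened"; a real watcher would reload the core.
    private final AtomicInteger reloads = new AtomicInteger();

    // Called by the (hypothetical) schema watcher whenever the schema changes.
    void onSchemaChanged() {
        // If loading has already finished, this runs immediately; otherwise
        // it is queued and runs once markLoaded() completes the future.
        loaded.thenRun(reloads::incrementAndGet);
    }

    // Called once at the end of core loading.
    void markLoaded() {
        loaded.complete(null);
    }

    int reloadCount() {
        return reloads.get();
    }
}
```

in this sketch every schema change seen during loading results in its own reload afterwards; a real fix would probably want to coalesce those into one.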
As far as testing...
[~erickerickson] - it's not really clear to me what/where/how you're currently
trying to test this? ... as i mentioned, it's kind of a fluke that
TestSolrCLIRunExample triggered this failure at all, and even when it did it
didn't really "fail" in a reliable way that was obviously related to this
specific bug.
I would suggest that a more robust way to test this would be with a more
targeted non-cloud test, using a custom plugin (search handler, component,
whatever...) that spins up a background thread to trigger schema updates in ZK
(so that the problematic watcher which does a core reload on schema changes
will then fire) and then the custom component should "stall" for some amount of
time (ideally {{await}}-ing on something instead of an arbitrary sleep, but i
haven't thought it through enough to know what exact condition it could await
on) to force a delay in the completion of the SolrCore loading. Then your
test just tries to initialize a SolrCore with a config that uses this custom
plugin, and asserts that the SolrCore initializes fine *AND* that it
(eventually) picks up the updated schema (via polling on the schema API?)
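the overall shape of that test could be sketched as below, with plain-Java stand-ins instead of Solr or ZK classes. everything here is hypothetical ({{StalledLoadSketch}}, {{loadCore}}, {{waitForSchemaVersion}}), and the condition being awaited ("the background schema update has been issued") is just one assumption, since the right condition is still an open question:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins only; none of these names are Solr APIs.
class StalledLoadSketch {
    // Stands in for the schema version stored in ZK.
    final AtomicInteger schemaVersion = new AtomicInteger(1);
    // Released once the background schema update has been issued.
    final CountDownLatch schemaUpdated = new CountDownLatch(1);
    volatile boolean coreLoaded = false;

    // Stands in for core init with the custom plugin configured.
    void loadCore() {
        // The "plugin" spins up a background thread that changes the schema.
        Thread updater = new Thread(() -> {
            schemaVersion.incrementAndGet();
            schemaUpdated.countDown();
        });
        updater.start();
        try {
            // Stall loading by awaiting a condition (bounded), not by sleeping.
            if (!schemaUpdated.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("schema update never happened");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        }
        coreLoaded = true;
    }

    // Polls until the expected version is visible or the timeout expires.
    boolean waitForSchemaVersion(int expected, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (System.nanoTime() < deadline) {
            if (schemaVersion.get() >= expected) {
                return true;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return schemaVersion.get() >= expected;
    }
}
```

a test built this way would call {{loadCore()}}, assert that {{coreLoaded}} is true, and then assert that {{waitForSchemaVersion(2, ...)}} returns true; in a real test those steps would be SolrCore initialization and polling the schema API.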
make sense?
> Race condition on core reload while core is still loading?
> ----------------------------------------------------------
>
> Key: SOLR-13709
> URL: https://issues.apache.org/jira/browse/SOLR-13709
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Assignee: Erick Erickson
> Priority: Major
> Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
>
>
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that
> there may be a race condition when attempting to re-load a SolrCore while the
> core is currently in the process of (re)loading that can leave the SolrCore
> in an unusable state.
> Details to follow...
--
This message was sent by Atlassian Jira
(v8.3.2#803003)