Mihaly Toth created SOLR-10904:
----------------------------------

             Summary: Unnecessary waiting during failover in case of failed 
core creation
                 Key: SOLR-10904
                 URL: https://issues.apache.org/jira/browse/SOLR-10904
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
    Affects Versions: master (7.0)
            Reporter: Mihaly Toth


The background failover thread checks for bad replicas. When it finds one, it 
tries to create a replacement on another node, then waits for the new replica 
to show up in the cluster state. It waits even if the core creation (which it 
initiated itself) fails.

This situation does not occur on the happy path of failover, because the new 
node was marked as alive. But if the cluster is in an unstable state, a user is 
restarting the new node, or the overseer is overloaded, this extra wait holds 
up the failover thread.

A possible solution:
# wait for the result of the core creation
# proceed to wait for the cluster state change only if the previous step succeeded

In code:
{code}
try {
  Future<Boolean> future = updateExecutor.submit(() ->
      createSolrCore(collection, createUrl, dataDir, ulogDir, coreNodeName,
          coreName, shardId));
  future.get(30000L, TimeUnit.MILLISECONDS);
} catch (InterruptedException | ExecutionException | TimeoutException e) {
  log.error("Error creating core", e);
  return false;
} finally {
  MDC.remove("OverseerAutoReplicaFailoverThread.createUrl");
}
{code}

In that case we could consider moving core creation from the updateExecutor 
into the failover thread itself.
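
The two proposed steps can be sketched outside Solr as a small, self-contained 
example. This is only an illustration under stated assumptions: the class 
FailoverSketch, the method createThenWait, and the two callbacks are 
hypothetical stand-ins for createSolrCore and the cluster-state wait, not 
existing Solr APIs. Unlike the snippet above, the sketch also checks the 
Boolean result of the creation task and restores the interrupt flag.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

public class FailoverSketch {

  /**
   * Hypothetical helper: submits core creation to the executor, waits up to
   * timeoutMs for its result, and calls waitForClusterState only if the
   * creation reported success.
   */
  static boolean createThenWait(ExecutorService executor,
                                Callable<Boolean> createCore,
                                BooleanSupplier waitForClusterState,
                                long timeoutMs) {
    Future<Boolean> future = executor.submit(createCore);
    boolean created;
    try {
      // Step 1: wait for the result of the core creation.
      created = Boolean.TRUE.equals(future.get(timeoutMs, TimeUnit.MILLISECONDS));
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for callers
      future.cancel(true);
      return false;
    } catch (ExecutionException | TimeoutException e) {
      future.cancel(true); // do not leave a stuck creation task running
      return false;
    }
    if (!created) {
      return false; // creation failed: skip the cluster-state wait entirely
    }
    // Step 2: only now wait for the replica to show up in the cluster state.
    return waitForClusterState.getAsBoolean();
  }

  public static void main(String[] args) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      boolean ok = createThenWait(executor, () -> true, () -> true, 1000L);
      boolean failed = createThenWait(executor, () -> false, () -> true, 1000L);
      System.out.println("creation ok -> " + ok);         // prints "creation ok -> true"
      System.out.println("creation failed -> " + failed); // prints "creation failed -> false"
    } finally {
      executor.shutdownNow();
    }
  }
}
```

If core creation were moved into the failover thread, as considered above, the 
executor and the Future would drop out and createCore could be called directly 
before the same success check.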

I can post a patch with these changes if the solution seems appropriate.



