[ 
https://issues.apache.org/jira/browse/SOLR-6086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-6086:
----------------------------------------
    Attachment: SOLR-6086.patch

Here is a patch with the fix and tests.

All in all, this issue can be triggered in the following ways:
# The replica is the only one in the shard, so on startup it becomes leader, 
skips recovery, and becomes active without waiting for warming to complete.
# The replica goes into recovery, discovers that it is the leader, and becomes 
active without waiting for warming to complete.
# Peersync fails but replication reports that there is nothing to replicate, 
so the replica becomes active without waiting for warming to complete. 
This can happen in the following ways:
## Peersync could be skipped if firstTime=false, which can happen if, while 
recovering on startup, we discover that the last operation in the ulog had the 
gap flag set, or if the previous recovery attempt failed and we're retrying 
recovery.
## Peersync could fail if the peersync request failed due to exceptions.

The fix in this patch (based on Tim's patch) ensures that there is a registered 
searcher before we publish any replica as active. Doing it inside 
RecoveryStrategy was not sufficient because that does not cover scenario 1. The 
tests in this patch inject failure into peer sync and simulate wrong index 
fingerprint computation to reproduce the problem. The scenarios that I could 
not reliably simulate were 2 and 3.1, but the fix should cover both of them. 
Tim's patch had a bug where openSearcher was called without 
returnSearcher=true. This can sometimes return a null future if there is 
already an on-deck searcher and no new registration is needed. The correct fix 
is to send returnSearcher=true as well as waitFuture and check for both.
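
For illustration only, here is a minimal, self-contained sketch of the "check for both" idea. It is not the code from the patch: SearcherSource, openSearcher's signature, and WarmingGate are hypothetical stand-ins for the real core API, and the searcher is modelled as a plain Object. The point is that a null future alone does not tell us whether a searcher exists, so the returned searcher has to be checked too.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

/** Hypothetical stand-in for the core; NOT the real Solr API. */
interface SearcherSource {
    /**
     * Assumed behaviour: returns the current searcher (or null if none) when
     * returnSearcher is true, and stores a future in waitFuture[0] only when
     * a new searcher is still being warmed and registered.
     */
    Object openSearcher(boolean returnSearcher, Future<?>[] waitFuture);
}

/** Hypothetical helper that decides whether a replica may be published as active. */
final class WarmingGate {
    private WarmingGate() {}

    /** Returns true only once a searcher is known to be available. */
    static boolean awaitSearcher(SearcherSource core) {
        Future<?>[] waitFuture = new Future<?>[1];
        Object searcher = core.openSearcher(true, waitFuture);

        if (waitFuture[0] != null) {
            try {
                // A registration is pending: block until warming finishes.
                waitFuture[0].get();
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // interrupted: do not publish as active
            } catch (ExecutionException e) {
                return false; // warming failed: do not publish as active
            }
        }
        // Null future: only safe if a searcher was actually returned.
        return searcher != null;
    }

    public static void main(String[] args) {
        // Existing searcher, no new registration needed (null future).
        SearcherSource existing = (ret, wf) -> "searcher";
        // New searcher still warming; here the future is already complete.
        SearcherSource warming = (ret, wf) -> {
            wf[0] = CompletableFuture.completedFuture(null);
            return null;
        };
        // Neither a searcher nor a pending registration.
        SearcherSource none = (ret, wf) -> null;

        System.out.println(awaitSearcher(existing)); // true
        System.out.println(awaitSearcher(warming));  // true
        System.out.println(awaitSearcher(none));     // false
    }
}
```

In this sketch the caller would publish the replica as active only when awaitSearcher returns true. Checking the future alone would treat the first case in main as "no searcher", which is the bug described above.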

> Replica active during Warming
> -----------------------------
>
>                 Key: SOLR-6086
>                 URL: https://issues.apache.org/jira/browse/SOLR-6086
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.6.1, 4.8.1
>            Reporter: Ludovic Boutros
>            Assignee: Shalin Shekhar Mangar
>              Labels: difficulty-medium, impact-medium
>         Attachments: SOLR-6086.patch, SOLR-6086.patch, SOLR-6086.patch, 
> SOLR-6086-temp.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> At least with Solr 4.6.1, replicas are considered active during the warming 
> process.
> This means that if you restart a replica or create a new one, queries will  
> be sent to this replica and will hang until the end of the warming  
> process (if cold searchers are not used).
> You cannot add or restart a node silently anymore.
> I think that the fact that the replica is active is not a bad thing.
> But the HttpShardHandler and the CloudSolrServer classes should take the 
> warming process into account.
> Currently, I have developed a new, very simple component which checks that a 
> searcher is registered.
> I am also developing custom HttpShardHandler and CloudSolrServer classes 
> which will check the warming process in addition to the ACTIVE status in the 
> cluster state.
> This seems to be more of a workaround than a solution, but that's all I can 
> do in this version.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

