[jira] [Comment Edited] (SOLR-6086) Replica active during Warming

2017-07-25 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101141#comment-16101141
 ] 

Shalin Shekhar Mangar edited comment on SOLR-6086 at 7/26/17 4:07 AM:
--

Here is a patch with the fix and tests.

All in all, the following are the ways in which this issue occurs (see the 
control-flow sketch after this list):
# The replica is the only one in the shard -- hence on startup it becomes the 
leader, skips recovery, and becomes active without waiting for warming to 
complete
# The replica goes into recovery, discovers that it is the leader, and becomes 
active without waiting for warming to complete
# Peersync fails but replication reports that there is nothing to replicate, 
so the replica becomes active without waiting for warming to complete. This 
can happen in the following ways:
## peersync is skipped because firstTime=false, which can happen if, while 
recovering on startup, we discover that the last operation in the ulog had the 
gap flag set
## peersync is skipped because firstTime=false after a previous recovery 
attempt failed and we're retrying recovery
## peersync itself fails because the request hit exceptions
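
Paraphrasing the list above as control flow -- a hedged sketch with 
illustrative names, not the actual RecoveryStrategy source:

{code:title=control-flow sketch of the scenarios above|borderStyle=solid}
boolean success = false;
if (firstTime) {
  // firstTime is false when the last ulog operation on startup had the gap
  // flag set (3.1) or when a previous attempt failed and we are retrying (3.2)
  success = peerSync(leader); // the request itself may fail with exceptions (3.3)
}
if (!success) {
  replicate(leader); // may report that there is nothing to replicate
}
publishActive(); // the bug: published without waiting for warming to complete
{code}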

The fix in this patch (based on Tim's patch) ensures that there is a registered 
searcher before we publish any replica as active. Doing it inside 
RecoveryStrategy was not sufficient because that does not cover scenario 1. The 
tests in this patch inject failures into peersync and simulate a wrong index 
fingerprint computation to reproduce the problem. The scenarios that I could 
not reliably simulate were 2 and 3.1, but the fix should cover both of them. 
Tim's patch had a bug where openSearcher was called without returnSearcher=true; 
that can sometimes return a null future when there is already an on-deck 
searcher and no new registration is needed. The correct fix is to pass 
returnSearcher=true as well as a waitFuture, and to check both.
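
For illustration, here is a minimal sketch of that check, assuming SolrCore's 
getSearcher(boolean forceNew, boolean returnSearcher, Future[] waitSearcher) 
overload -- this shows the idea behind the fix, not the committed patch itself:

{code:title=sketch: wait for a registered searcher before publishing ACTIVE|borderStyle=solid}
// With returnSearcher=true the returned reference should never be null, and
// waitSearcher[0], when set, completes only after warming finishes and the
// new searcher is registered -- so we check both.
final Future[] waitSearcher = new Future[1];
RefCounted<SolrIndexSearcher> searcher = core.getSearcher(false, true, waitSearcher);
try {
  if (waitSearcher[0] != null) {
    waitSearcher[0].get(); // block until warming completes and registration happens
  }
} finally {
  searcher.decref(); // release the reference we asked for
}
// only now is it safe to publish the replica as ACTIVE
{code}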


was (Author: shalinmangar):
Here is a patch with the fix and tests.

All in all, the following are the ways in which this issue occurs:
# The replica is the only one in the shard -- hence on startup it becomes the 
leader, skips recovery, and becomes active without waiting for warming to 
complete
# The replica goes into recovery, discovers that it is the leader, and becomes 
active without waiting for warming to complete
# Peersync fails but replication reports that there is nothing to replicate, 
so the replica becomes active without waiting for warming to complete. This 
can happen in the following ways:
## peersync is skipped because firstTime=false, which can happen if, while 
recovering on startup, we discover that the last operation in the ulog had the 
gap flag set, or if the previous recovery attempt failed and we're retrying 
recovery
## peersync itself fails because the request hit exceptions

The fix in this patch (based on Tim's patch) ensures that there is a registered 
searcher before we publish any replica as active. Doing it inside 
RecoveryStrategy was not sufficient because that does not cover scenario 1. The 
tests in this patch inject failures into peersync and simulate a wrong index 
fingerprint computation to reproduce the problem. The scenarios that I could 
not reliably simulate were 2 and 3.1, but the fix should cover both of them. 
Tim's patch had a bug where openSearcher was called without returnSearcher=true; 
that can sometimes return a null future when there is already an on-deck 
searcher and no new registration is needed. The correct fix is to pass 
returnSearcher=true as well as a waitFuture, and to check both.

> Replica active during Warming
> -
>
> Key: SOLR-6086
> URL: https://issues.apache.org/jira/browse/SOLR-6086
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 4.6.1, 4.8.1
>Reporter: Ludovic Boutros
>Assignee: Shalin Shekhar Mangar
>  Labels: difficulty-medium, impact-medium
> Attachments: SOLR-6086.patch, SOLR-6086.patch, SOLR-6086.patch, 
> SOLR-6086-temp.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> At least with Solr 4.6.1, replicas are considered active during the warming 
> process.
> This means that if you restart a replica or create a new one, queries will 
> be sent to this replica and the query will hang until the end of the warming 
> process (if cold searchers are not used).
> You cannot add or restart a node silently anymore.
> I think that the fact that the replica is active is not a bad thing.
> But the HttpShardHandler and the CloudSolrServer class should take the 
> warming process into account.
> Currently, I have developed a new, very simple component which checks that a 
> searcher is registered.
> I am also developing custom HttpShardHandler and CloudSolrServer classes 
> which will check the warming process in addition to the ACTIVE status in the 
> cluster state.
> This seems to be more a workaround than a solution, but that's all I can do 
> in this version.

[jira] [Comment Edited] (SOLR-6086) Replica active during Warming

2014-05-21 Thread ludovic Boutros (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004856#comment-14004856
 ] 

ludovic Boutros edited comment on SOLR-6086 at 5/21/14 4:23 PM:


I checked the differences in the logs and in the code.

The problem occurs when:
- a node is restarted
- Peer Sync fails (no /get handler, for instance -- should it become 
mandatory?)
- the node is already synced (nothing to replicate)

or:

- a node is restarted and it is the leader (I do not know whether this only 
happens with a lone leader...)
- the node is already synced (nothing to replicate)

For the first case, I think this is a side effect of the modification in 
SOLR-4965.

If Peer Sync is successful, an explicit commit is called in the code, 
accompanied by a comment that says:

{code:title=RecoveryStrategy.java|borderStyle=solid}
// force open a new searcher
core.getUpdateHandler().commit(new CommitUpdateCommand(req, false));
{code}

This is not the case if Peer Sync failed.
Just adding this line is enough to correct this issue.
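
For clarity, a hedged sketch of where that line would land on the 
failed-Peer-Sync path (the surrounding names are illustrative assumptions, not 
the exact 4.x source):

{code:title=RecoveryStrategy.java (illustrative sketch)|borderStyle=solid}
boolean syncSuccess = peerSync(leader);
if (!syncSuccess) {
  // fall back to replication; it may find nothing to copy
  replicate(leader);
  // the proposed one-line fix: force open (and register) a new searcher,
  // exactly as the successful-Peer-Sync path already does
  core.getUpdateHandler().commit(new CommitUpdateCommand(req, false));
}
{code}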

Here is a patch with a test which reproduces the problem, along with the 
correction (to be applied to the 4x branch).

I am working on the second case.


was (Author: lboutros):
I checked the differences in the logs and in the code.

The problem occurs when:
- a node is restarted
- Peer Sync fails (no /get handler, for instance -- should it become 
mandatory?)
- the node is already synced (nothing to replicate)

or:

- a node is restarted and it is the leader (I do not know whether this only 
happens with a lone leader...)
- the node is already synced (nothing to replicate)

For the first case, I think this is a side effect of the modification in 
SOLR-4965.

If Peer Sync is successful, an explicit commit is called in the code, 
accompanied by a comment that says:

{code:title=RecoveryStrategy.java|borderStyle=solid}
// force open a new searcher
core.getUpdateHandler().commit(new CommitUpdateCommand(req, false));
{code}

This is not the case if Peer Sync failed.
Just adding this line is enough to correct this issue.

Here is a patch with a test which reproduces the problem, along with the 
correction (to be applied to the 4x branch).

I am working on the second case.

> Replica active during Warming
> -
>
> Key: SOLR-6086
> URL: https://issues.apache.org/jira/browse/SOLR-6086
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 4.6.1, 4.8.1
>Reporter: ludovic Boutros
> Attachments: SOLR-6086.patch
>
>
> At least with Solr 4.6.1, replicas are considered active during the warming 
> process.
> This means that if you restart a replica or create a new one, queries will 
> be sent to this replica and the query will hang until the end of the warming 
> process (if cold searchers are not used).
> You cannot add or restart a node silently anymore.
> I think that the fact that the replica is active is not a bad thing.
> But the HttpShardHandler and the CloudSolrServer class should take the 
> warming process into account.
> Currently, I have developed a new, very simple component which checks that a 
> searcher is registered.
> I am also developing custom HttpShardHandler and CloudSolrServer classes 
> which will check the warming process in addition to the ACTIVE status in the 
> cluster state.
> This seems to be more a workaround than a solution, but that's all I can do 
> in this version.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org