[jira] [Comment Edited] (SOLR-6086) Replica active during Warming
[ https://issues.apache.org/jira/browse/SOLR-6086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101141#comment-16101141 ]

Shalin Shekhar Mangar edited comment on SOLR-6086 at 7/26/17 4:07 AM:
--

Here is a patch with the fix and tests. All in all, the following are the ways this issue can occur:
# The replica is the only one in the shard -- hence on startup it becomes leader, skips recovery, and becomes active without waiting for warming to complete
# The replica goes into recovery, discovers that it is the leader, and becomes active without waiting for warming to complete
# Peersync fails but replication reports that there is nothing to replicate, so the replica becomes active without waiting for warming to complete. This can happen in the following ways:
## peersync could be skipped if firstTime=false, which can happen if, while recovering on startup, we discover that the last operation in the ulog had the flag gap set
## peersync could be skipped if firstTime=false because the previous recovery attempt failed and we're retrying recovery
## peersync could fail if the peersync request failed due to exceptions

The fix in this patch (based on Tim's patch) ensures that there is a registered searcher before we publish any replica as active. Doing it inside RecoveryStrategy was not sufficient because that does not cover scenario 1. The tests in this patch inject failure into peer sync and simulate wrong index fingerprint computation to reproduce the problem. The scenarios that I could not reliably simulate were 2 and 3.1, but the fix should cover both of them.

Tim's patch had a bug where openSearcher was called without returnSearcher=true. This can sometimes return a null future if there is an on-deck searcher already and no new registration is needed. The correct fix is to pass returnSearcher=true as well as a waitFuture and check both (see the sketch below).
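To make the returnSearcher/waitFuture point concrete, here is a minimal sketch of such a check, assuming it runs wherever the replica is about to be published as active. This is not the committed patch: the method name and call site are illustrative. It uses SolrCore's getRegisteredSearcher() and the getSearcher(forceNew, returnSearcher, waitSearcher) overload, which exposes both the flag and the future discussed above.

{code:title=EnsureSearcherSketch.java|borderStyle=solid}
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

import org.apache.solr.core.SolrCore;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.RefCounted;

class EnsureSearcherSketch {
  /** Blocks until a searcher is registered for the core (illustrative helper). */
  static void ensureRegisteredSearcher(SolrCore core)
      throws InterruptedException, ExecutionException {
    // Fast path: a searcher is already registered, safe to publish ACTIVE.
    RefCounted<SolrIndexSearcher> registered = core.getRegisteredSearcher();
    if (registered != null) {
      registered.decref();
      return;
    }
    Future[] waitSearcher = new Future[1];
    // returnSearcher=true is the crucial part: with returnSearcher=false the
    // result can be null when an on-deck searcher already exists and no new
    // registration is needed -- the bug described above. So we ask for the
    // searcher AND the future, and check both.
    RefCounted<SolrIndexSearcher> searcher = core.getSearcher(false, true, waitSearcher);
    try {
      if (waitSearcher[0] != null) {
        // Wait for warming to finish and the new searcher to register.
        waitSearcher[0].get();
      }
    } finally {
      searcher.decref();
    }
  }
}
{code}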
> Replica active during Warming
> -
>
> Key: SOLR-6086
> URL: https://issues.apache.org/jira/browse/SOLR-6086
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.6.1, 4.8.1
> Reporter: Ludovic Boutros
> Assignee: Shalin Shekhar Mangar
> Labels: difficulty-medium, impact-medium
> Attachments: SOLR-6086.patch, SOLR-6086.patch, SOLR-6086.patch, SOLR-6086-temp.patch
>
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> At least with Solr 4.6.1, replicas are considered active during the warming process.
> This means that if you restart a replica or create a new one, queries will be sent to this replica and the query will hang until the end of the warming process (if cold searchers are not used).
> You cannot add or restart a node silently anymore.
> I think that the fact that the replica is active is not a bad thing.
> But the HttpShardHandler and the CloudSolrServer class should take the warming process into account.
> Currently, I have developed a new very simple component which checks that a searcher is registered (sketched below).
> I am also developing custom HttpShardHandler and CloudSolrServer classes which will check the warming process in addition to the ACTIVE status in the cluster state.
> This seems to be more a workaround than a solution, but that's all I can do in this version.
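A minimal sketch of the kind of searcher-check component the reporter describes, assuming a recent SearchComponent API (older 4.x versions also required overriding getSource()); the class name and error handling are illustrative, not the reporter's actual code:

{code:title=WarmingCheckComponent.java|borderStyle=solid}
import java.io.IOException;

import org.apache.solr.common.SolrException;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.RefCounted;

public class WarmingCheckComponent extends SearchComponent {
  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // getRegisteredSearcher() returns null until a searcher has been
    // registered, i.e. while the core is still warming.
    RefCounted<SolrIndexSearcher> registered = rb.req.getCore().getRegisteredSearcher();
    if (registered == null) {
      // Fail fast instead of letting the query hang until warming completes.
      throw new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE,
          "Core is still warming; no registered searcher yet");
    }
    registered.decref();
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // Nothing to do at process time; the check in prepare() is sufficient.
  }

  @Override
  public String getDescription() {
    return "Rejects requests until a searcher is registered";
  }
}
{code}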
[jira] [Comment Edited] (SOLR-6086) Replica active during Warming
[ https://issues.apache.org/jira/browse/SOLR-6086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004856#comment-14004856 ]

Ludovic Boutros edited comment on SOLR-6086 at 5/21/14 4:23 PM:

I checked the differences in the logs and in the code. The problem occurs when:
- a node is restarted
- Peer Sync failed (no /get handler for instance -- should it become mandatory?)
- the node is already synced (nothing to replicate)

or:
- a node is restarted and it is the leader (I do not know if this only happens with a lonely leader...)
- the node is already synced (nothing to replicate)

For the first case, I think this is a side effect of the modification in SOLR-4965. If Peer Sync is successful, an explicit commit is called in the code, and there's a comment which says:

{code:title=RecoveryStrategy.java|borderStyle=solid}
// force open a new searcher
core.getUpdateHandler().commit(new CommitUpdateCommand(req, false));
{code}

This is not the case if Peer Sync failed. Just adding this line is enough to correct the issue (see the sketch below). Here is a patch with a test which reproduces the problem and the correction (to be applied to the 4x branch). I am working on the second case.
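A hedged sketch of that one-line fix in context: the wrapper method, its parameters, and the surrounding control flow are assumptions for illustration, and only the commit call is taken verbatim from the RecoveryStrategy.java snippet above.

{code:title=RecoverySketch.java|borderStyle=solid}
import java.io.IOException;

import org.apache.solr.core.SolrCore;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.update.CommitUpdateCommand;

class RecoverySketch {
  /**
   * Illustrative helper: run this on the failed-PeerSync path as well as the
   * successful one. Before the fix, the forced commit ran only after a
   * successful Peer Sync; when Peer Sync failed and replication found nothing
   * to replicate, no new searcher was opened before the node went active.
   */
  void afterRecoveryAttempt(SolrCore core, SolrQueryRequest req) throws IOException {
    // force open a new searcher (same call as the successful-PeerSync path)
    core.getUpdateHandler().commit(new CommitUpdateCommand(req, false));
  }
}
{code}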