[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15009179#comment-15009179 ]

ASF subversion and git services commented on SOLR-7569:

Commit 1714844 from [~noble.paul] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1714844 ]

SOLR-7569 test failure fix

> Create an API to force a leader election between nodes
>
> Key: SOLR-7569
> URL: https://issues.apache.org/jira/browse/SOLR-7569
> Project: Solr
> Issue Type: New Feature
> Components: SolrCloud
> Reporter: Shalin Shekhar Mangar
> Assignee: Noble Paul
> Labels: difficulty-medium, impact-high
> Fix For: 5.4, Trunk
>
> Attachments: SOLR-7569-testfix.patch, SOLR-7569.patch,
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch,
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch,
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch,
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch,
> SOLR-7569_lir_down_state_test.patch
>
> There are many reasons why Solr will not elect a leader for a shard e.g. all
> replicas' last published state was recovery or due to bugs which cause a
> leader to be marked as 'down'. While the best solution is that they never get
> into this state, we need a manual way to fix this when it does get into this
> state. Right now we can do a series of dance involving bouncing the node
> (since recovery paths between bouncing and REQUESTRECOVERY are different),
> but that is difficult when running a large cluster. Although it is possible
> that such a manual API may lead to some data loss but in some cases, it is
> the only possible option to restore availability.
> This issue proposes to build a new collection API which can be used to force
> replicas into recovering a leader while avoiding data loss on a best effort
> basis.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15009186#comment-15009186 ]

Mike Drob commented on SOLR-7569:

bq. Can we close this now, and create new JIRAs for future enhancements? Mark Miller, Shalin Shekhar Mangar, Noble Paul?

I agree with this.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15009164#comment-15009164 ]

ASF subversion and git services commented on SOLR-7569:

Commit 1714842 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1714842 ]

SOLR-7569 test failure fix
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15009213#comment-15009213 ]

Mark Miller commented on SOLR-7569:

This was only reopened because the test was ignored due to reverting SOLR-7989. With that resolved, this should be fine.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006835#comment-15006835 ]

Ishan Chattopadhyaya commented on SOLR-7569:

Can we close this now, and create new JIRAs for future enhancements? [~mark.mil...@oblivion.ch], [~shalinmangar], [~noble.paul]?
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002251#comment-15002251 ]

Ishan Chattopadhyaya commented on SOLR-7569:

bq. I've taken a crack at making SOLR-7989 work.

Thanks!

bq. Perhaps the last thing the API should do is run through each shard and see if the registered leader is DOWN, and if it is make it ACTIVE (preferably by asking it to publish itself as ACTIVE - we don't want to publish for someone else). If the call waits around to make sure all the leaders come up, this should be simple.

This makes sense. I think this is something that Shalin alluded to (please excuse me if I'm mistaken) when he said, {{1. Leader is live but 'down' -> mark it 'active'}}. The suggestion for the replicas to mark themselves ACTIVE instead of someone else marking them down seems like a good thing to do.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000726#comment-15000726 ]

ASF subversion and git services commented on SOLR-7569:

Commit 1713899 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1713899 ]

SOLR-7989, SOLR-7569: Ignore this test.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000753#comment-15000753 ]

Mark Miller commented on SOLR-7569:

A better approach is probably for this API to deal with a DOWN but valid leader itself. It should only ever happen due to manually screwing up LIR and if this API is messing with LIR, it should also fix the ramifications.

Perhaps the last thing the API should do is run through each shard and see if the registered leader is DOWN, and if it is make it ACTIVE (preferably by asking it to publish itself as ACTIVE - we don't want to publish for someone else). If the call waits around to make sure all the leaders come up, this should be simple.
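The final step Mark Miller describes in the message above (walk each shard, and if its registered leader is DOWN, ask that leader to publish itself as ACTIVE) can be sketched roughly as follows. This is only an illustration under stated assumptions: `DownLeaderFixup`, `State`, `LeaderClient` and `fixDownLeaders` are hypothetical names, the maps stand in for cluster state, and none of this is Solr's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; none of these types are Solr classes.
class DownLeaderFixup {
    enum State { ACTIVE, RECOVERING, DOWN }

    // Stand-in for "ask the leader to publish itself as ACTIVE",
    // so that nobody publishes state on another node's behalf.
    interface LeaderClient {
        void requestPublishActive(String shard, String leaderCoreName);
    }

    // leaders: shard name -> registered leader core name (assumed input)
    // states:  core name  -> last published state (assumed input)
    // Returns the shards whose leader was asked to re-publish itself.
    static List<String> fixDownLeaders(Map<String, String> leaders,
                                       Map<String, State> states,
                                       LeaderClient client) {
        List<String> fixed = new ArrayList<>();
        for (Map.Entry<String, String> entry : leaders.entrySet()) {
            String shard = entry.getKey();
            String leader = entry.getValue();
            // Only a registered leader whose published state is DOWN is touched.
            if (states.get(leader) == State.DOWN) {
                client.requestPublishActive(shard, leader);
                fixed.add(shard);
            }
        }
        return fixed;
    }
}
```

A caller could then wait on the returned shards until their leaders show up as ACTIVE, which is the "waits around" part of the suggestion; that waiting logic is left out of the sketch.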
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000722#comment-15000722 ]

ASF subversion and git services commented on SOLR-7569:

Commit 1713898 from [~markrmil...@gmail.com] in branch 'dev/trunk'
[ https://svn.apache.org/r1713898 ]

SOLR-7989, SOLR-7569: Ignore this test.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000797#comment-15000797 ]

Mark Miller commented on SOLR-7569:

bq. It should only ever happen due to manually screwing up LIR and if this API is messing with LIR

Down the road though, we will want to solve this for SOLR-7034 and SOLR-7065.

I've taken a crack at making SOLR-7989 work.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991913#comment-14991913 ]

Mark Miller commented on SOLR-7569:

I'm kind of split on where it should go. For something simple and brute force like this, CollectionsHandler is probably fine. Either way seems ok. I wouldn't really worry about it being async if it stays in CollectionsHandler.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992322#comment-14992322 ]

ASF subversion and git services commented on SOLR-7569:

Commit 1712854 from [~noble.paul] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1712854 ]

SOLR-7569: A collection API called FORCELEADER when all replicas in a shard are down
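For readers who want to see what invoking the command from the commit above looks like: FORCELEADER is a Collections API action that takes the collection and shard as request parameters. The sketch below only builds the request URL and sends nothing. The base URL, collection name and shard name are placeholders, and `ForceLeaderRequest` and `buildUrl` are illustrative names, not part of Solr or SolrJ.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative helper, not part of Solr: builds a Collections API URL
// for the FORCELEADER action discussed in this issue.
class ForceLeaderRequest {
    // solrBaseUrl is assumed to look like "http://localhost:8983/solr".
    static String buildUrl(String solrBaseUrl, String collection, String shard) {
        return solrBaseUrl + "/admin/collections?action=FORCELEADER"
                + "&collection=" + encode(collection)
                + "&shard=" + encode(shard);
    }

    // Form-encodes a parameter value (requires Java 10 or later).
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Placeholder values; substitute a real host, collection and shard.
        System.out.println(buildUrl("http://localhost:8983/solr", "collection1", "shard1"));
    }
}
```

As the issue description notes, forcing a leader may lose data, so a call like this is a last resort for restoring availability rather than a routine operation.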
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992304#comment-14992304 ]

ASF subversion and git services commented on SOLR-7569:

Commit 1712851 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1712851 ]

SOLR-7569: A collection API called FORCELEADER when all replicas in a shard are down
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992310#comment-14992310 ]

ASF subversion and git services commented on SOLR-7569:

Commit 1712852 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1712852 ]

SOLR-7569: changed message
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989439#comment-14989439 ]

Ishan Chattopadhyaya commented on SOLR-7569:

One downside of not having something like OVERRIDELASTPUBLISHED is that in the test, I couldn't set the last published state to DOWN and check whether FORCELEADER set it back to ACTIVE. In this updated patch with FORCEPREPAREFORLEADERSHIP, the test has no easy way of setting the last published state to DOWN before the API command is called. Not a deal breaker, but just putting it out there. I'm personally fine either way (or if there's another name that is more suitable).
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989269#comment-14989269 ] Noble Paul commented on SOLR-7569: -- Let's not keep the core admin command as OVERRIDELASTPUBLISHED. This means it can be a generic enough API which may be abused by others for other things. Let's not tell others what we are doing internally, and keep the command name opaque. This particular collection admin operation does not really have to go to the overseer; it can be performed by the receiving node itself, because the clearing of the LIR node does not have to be done at the overseer anyway. > Create an API to force a leader election between nodes > -- > > Key: SOLR-7569 > URL: https://issues.apache.org/jira/browse/SOLR-7569 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar > Labels: difficulty-medium, impact-high > Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569_lir_down_state_test.patch > > > There are many reasons why Solr will not elect a leader for a shard e.g. all > replicas' last published state was recovery or due to bugs which cause a > leader to be marked as 'down'. While the best solution is that they never get > into this state, we need a manual way to fix this when it does get into this > state. Right now we can do a series of dance involving bouncing the node > (since recovery paths between bouncing and REQUESTRECOVERY are different), > but that is difficult when running a large cluster. Although it is possible > that such a manual API may lead to some data loss but in some cases, it is > the only possible option to restore availability. 
> This issue proposes to build a new collection API which can be used to force > replicas into recovering a leader while avoiding data loss on a best effort > basis.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980440#comment-14980440 ] Ishan Chattopadhyaya commented on SOLR-7569: bq. It seems like what we really want is to make sure the last published state for each replica does not prevent it from becoming the leader? It seems to me that there's no easy way to set the last published state of a replica without the replicas doing it themselves. Do you think we should be doing that instead of marking them as active? Or do you think that just clearing the LIR is enough? > Create an API to force a leader election between nodes > -- > > Key: SOLR-7569 > URL: https://issues.apache.org/jira/browse/SOLR-7569 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar > Labels: difficulty-medium, impact-high > Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch > > > There are many reasons why Solr will not elect a leader for a shard e.g. all > replicas' last published state was recovery or due to bugs which cause a > leader to be marked as 'down'. While the best solution is that they never get > into this state, we need a manual way to fix this when it does get into this > state. Right now we can do a series of dance involving bouncing the node > (since recovery paths between bouncing and REQUESTRECOVERY are different), > but that is difficult when running a large cluster. Although it is possible > that such a manual API may lead to some data loss but in some cases, it is > the only possible option to restore availability. 
> This issue proposes to build a new collection API which can be used to force > replicas into recovering a leader while avoiding data loss on a best effort > basis.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980581#comment-14980581 ] Mark Miller commented on SOLR-7569: --- There are two main things, I think, that prevent replicas from becoming a leader: their last published state on the clouddescriptor not being ACTIVE, or LIR. I thought we would want to clear LIR and perhaps add an ADMIN command that will set the last published state on the clouddescriptor to ACTIVE for each replica. > Create an API to force a leader election between nodes > -- > > Key: SOLR-7569 > URL: https://issues.apache.org/jira/browse/SOLR-7569 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar > Labels: difficulty-medium, impact-high > Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch > > > There are many reasons why Solr will not elect a leader for a shard e.g. all > replicas' last published state was recovery or due to bugs which cause a > leader to be marked as 'down'. While the best solution is that they never get > into this state, we need a manual way to fix this when it does get into this > state. Right now we can do a series of dance involving bouncing the node > (since recovery paths between bouncing and REQUESTRECOVERY are different), > but that is difficult when running a large cluster. Although it is possible > that such a manual API may lead to some data loss but in some cases, it is > the only possible option to restore availability. > This issue proposes to build a new collection API which can be used to force > replicas into recovering a leader while avoiding data loss on a best effort > basis. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
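As a rough illustration of the two blockers described in the comment above (a last published state other than ACTIVE, and LIR), the following is a minimal sketch in plain Java. It does not use Solr's classes: `LeaderEligibilitySketch`, `Replica`, `canBecomeLeader`, and `removeBlocks` are hypothetical names invented here, and the real election logic involves considerably more than these two checks.

```java
// Hypothetical model for illustration only; these are NOT Solr's classes.
final class LeaderEligibilitySketch {

    // Simplified stand-in for a replica's last published state.
    enum State { ACTIVE, RECOVERING, DOWN }

    // Minimal replica view: the last published state plus an LIR flag.
    static final class Replica {
        State lastPublished;
        boolean inLir;

        Replica(State lastPublished, boolean inLir) {
            this.lastPublished = lastPublished;
            this.inLir = inLir;
        }
    }

    // A replica is blocked if its last published state is not ACTIVE,
    // or if it is marked as being in leader-initiated recovery (LIR).
    static boolean canBecomeLeader(Replica r) {
        return r.lastPublished == State.ACTIVE && !r.inLir;
    }

    // What the discussed command would conceptually do for one replica:
    // clear the LIR marker and set the last published state to ACTIVE.
    static void removeBlocks(Replica r) {
        r.inLir = false;
        r.lastPublished = State.ACTIVE;
    }
}
```

Under these assumptions, a replica created as `new Replica(State.DOWN, true)` is not eligible until `removeBlocks` has been applied to it.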
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971221#comment-14971221 ] Shalin Shekhar Mangar commented on SOLR-7569: - Thanks Ishan but I think you missed the test in your latest patch? Its size has decreased from 36kb to 8kb. > Create an API to force a leader election between nodes > -- > > Key: SOLR-7569 > URL: https://issues.apache.org/jira/browse/SOLR-7569 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar > Labels: difficulty-medium, impact-high > Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch > > > There are many reasons why Solr will not elect a leader for a shard e.g. all > replicas' last published state was recovery or due to bugs which cause a > leader to be marked as 'down'. While the best solution is that they never get > into this state, we need a manual way to fix this when it does get into this > state. Right now we can do a series of dance involving bouncing the node > (since recovery paths between bouncing and REQUESTRECOVERY are different), > but that is difficult when running a large cluster. Although it is possible > that such a manual API may lead to some data loss but in some cases, it is > the only possible option to restore availability. > This issue proposes to build a new collection API which can be used to force > replicas into recovering a leader while avoiding data loss on a best effort > basis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971226#comment-14971226 ] Mark Miller commented on SOLR-7569: --- It seems like what we really want is to make sure the last published state for each replica does not prevent it from becoming the leader? > Create an API to force a leader election between nodes > -- > > Key: SOLR-7569 > URL: https://issues.apache.org/jira/browse/SOLR-7569 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar > Labels: difficulty-medium, impact-high > Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch > > > There are many reasons why Solr will not elect a leader for a shard e.g. all > replicas' last published state was recovery or due to bugs which cause a > leader to be marked as 'down'. While the best solution is that they never get > into this state, we need a manual way to fix this when it does get into this > state. Right now we can do a series of dance involving bouncing the node > (since recovery paths between bouncing and REQUESTRECOVERY are different), > but that is difficult when running a large cluster. Although it is possible > that such a manual API may lead to some data loss but in some cases, it is > the only possible option to restore availability. > This issue proposes to build a new collection API which can be used to force > replicas into recovering a leader while avoiding data loss on a best effort > basis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971223#comment-14971223 ] Mark Miller commented on SOLR-7569: --- bq. // Marking all live nodes as active. Why do we do this manually like this? Shouldn't we allow this to happen naturally? > Create an API to force a leader election between nodes > -- > > Key: SOLR-7569 > URL: https://issues.apache.org/jira/browse/SOLR-7569 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar > Labels: difficulty-medium, impact-high > Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch > > > There are many reasons why Solr will not elect a leader for a shard e.g. all > replicas' last published state was recovery or due to bugs which cause a > leader to be marked as 'down'. While the best solution is that they never get > into this state, we need a manual way to fix this when it does get into this > state. Right now we can do a series of dance involving bouncing the node > (since recovery paths between bouncing and REQUESTRECOVERY are different), > but that is difficult when running a large cluster. Although it is possible > that such a manual API may lead to some data loss but in some cases, it is > the only possible option to restore availability. > This issue proposes to build a new collection API which can be used to force > replicas into recovering a leader while avoiding data loss on a best effort > basis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971406#comment-14971406 ] Shalin Shekhar Mangar commented on SOLR-7569: - bq. It seems like what we really want is to make sure the last published state for each replica does not prevent it from becoming the leader? Do you mean that removing blockers like LIR is enough? > Create an API to force a leader election between nodes > -- > > Key: SOLR-7569 > URL: https://issues.apache.org/jira/browse/SOLR-7569 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar > Labels: difficulty-medium, impact-high > Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch > > > There are many reasons why Solr will not elect a leader for a shard e.g. all > replicas' last published state was recovery or due to bugs which cause a > leader to be marked as 'down'. While the best solution is that they never get > into this state, we need a manual way to fix this when it does get into this > state. Right now we can do a series of dance involving bouncing the node > (since recovery paths between bouncing and REQUESTRECOVERY are different), > but that is difficult when running a large cluster. Although it is possible > that such a manual API may lead to some data loss but in some cases, it is > the only possible option to restore availability. > This issue proposes to build a new collection API which can be used to force > replicas into recovering a leader while avoiding data loss on a best effort > basis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964979#comment-14964979 ] Shalin Shekhar Mangar commented on SOLR-7569: - Thanks Ishan.
# ForceLeaderTest.testReplicasInLIRNoLeader has a 5 second sleep, why? Isn't waitForRecoveriesToFinish() enough?
# Similarly, ForceLeaderTest.testLeaderDown has a 15 second sleep for steady state to be reached? What is this steady state, and is there a better way than waiting for an arbitrary amount of time? In general, Thread.sleep should be avoided as much as possible as a way to reach steady state.
# Can you please add some javadocs on the various test methods describing the scenario that they are testing?
# minor nit - can you use assertEquals when testing equality of state etc. instead of assertTrue? The advantage with assertEquals is that it logs the mismatched values in the exception messages.
# In OverseerCollectionMessageHandler, lirPath can never be null. The lir path should probably be logged in debug rather than INFO.
{code}
// Clear out any LIR state
String lirPath = overseer.getZkController().getLeaderInitiatedRecoveryZnodePath(collection, sliceId);
if (lirPath != null && zkStateReader.getZkClient().exists(lirPath, true)) {
  StringBuilder sb = new StringBuilder();
  zkStateReader.getZkClient().printLayout(lirPath, 4, sb);
  log.info("Cleaning out LIR data, which was: " + sb);
  zkStateReader.getZkClient().clean(lirPath);
}
{code}
# There's no need to send an empty string as the role while publishing the state of the replica.
# minor nit - you can compare enums directly using == instead of .equals
# Referring to the following, what is the thinking behind it? When can this happen? Is there a test which specifically exercises this scenario? It seems like this can interfere with the leader election if the leader election was taking some time?
{code}
// If we still don't have an active leader by now, it maybe possible that the replica at the head of the election queue
// was the leader at some point and never left the queue, but got marked as down. So, if the election queue is not empty,
// and the replica at the head of the queue is live, then mark it as a leader.
{code}
> Create an API to force a leader election between nodes > -- > > Key: SOLR-7569 > URL: https://issues.apache.org/jira/browse/SOLR-7569 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar > Labels: difficulty-medium, impact-high > Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569_lir_down_state_test.patch > > > There are many reasons why Solr will not elect a leader for a shard e.g. all > replicas' last published state was recovery or due to bugs which cause a > leader to be marked as 'down'. While the best solution is that they never get > into this state, we need a manual way to fix this when it does get into this > state. Right now we can do a series of dance involving bouncing the node > (since recovery paths between bouncing and REQUESTRECOVERY are different), > but that is difficult when running a large cluster. Although it is possible > that such a manual API may lead to some data loss but in some cases, it is > the only possible option to restore availability. > This issue proposes to build a new collection API which can be used to force > replicas into recovering a leader while avoiding data loss on a best effort > basis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
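The review point above about avoiding Thread.sleep as a way to reach steady state can be sketched roughly as follows: poll for the condition the test actually cares about, and give up only after a timeout. This is a hypothetical helper written for illustration; `WaitForSketch.waitFor` is an assumed name and is not part of Solr's test framework, and the timeout and poll interval are arbitrary.

```java
import java.util.function.BooleanSupplier;

// Hypothetical test helper, for illustration only.
final class WaitForSketch {

    // Polls `condition` every `pollMs` milliseconds until it returns true
    // or `timeoutMs` milliseconds have elapsed. Returns true if the
    // condition was observed to hold, false on timeout or interruption.
    static boolean waitFor(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A test would then wait on the real condition (for example, a hypothetical `() -> shardHasActiveLeader()`) and finish as soon as it holds, instead of always paying a fixed 5 or 15 second sleep.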
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876938#comment-14876938 ] Ishan Chattopadhyaya commented on SOLR-7569: bq. what happened to the idea of allowing the user to pick the leader as part of the recover shard request? I had a look at your patch for SOLR-6236, I think we can tackle this using that approach. At this point, I'm inclined to keep this patch at this and tackle it separately. Most likely, the system will pick a reasonable leader, and it will sync with other replicas and the shard will be restored. > Create an API to force a leader election between nodes > -- > > Key: SOLR-7569 > URL: https://issues.apache.org/jira/browse/SOLR-7569 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar > Labels: difficulty-medium, impact-high > Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch > > > There are many reasons why Solr will not elect a leader for a shard e.g. all > replicas' last published state was recovery or due to bugs which cause a > leader to be marked as 'down'. While the best solution is that they never get > into this state, we need a manual way to fix this when it does get into this > state. Right now we can do a series of dance involving bouncing the node > (since recovery paths between bouncing and REQUESTRECOVERY are different), > but that is difficult when running a large cluster. Although it is possible > that such a manual API may lead to some data loss but in some cases, it is > the only possible option to restore availability. > This issue proposes to build a new collection API which can be used to force > replicas into recovering a leader while avoiding data loss on a best effort > basis. 
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877089#comment-14877089 ] Mark Miller commented on SOLR-7569: --- I wonder if Recover is the right terminology. It seems so broad and "fix anything" like. Perhaps it should be something close to 'forceleader' - something that is specific about what is happening and gives an idea that you are overriding the system as you are. > Create an API to force a leader election between nodes > -- > > Key: SOLR-7569 > URL: https://issues.apache.org/jira/browse/SOLR-7569 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar > Labels: difficulty-medium, impact-high > Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch > > > There are many reasons why Solr will not elect a leader for a shard e.g. all > replicas' last published state was recovery or due to bugs which cause a > leader to be marked as 'down'. While the best solution is that they never get > into this state, we need a manual way to fix this when it does get into this > state. Right now we can do a series of dance involving bouncing the node > (since recovery paths between bouncing and REQUESTRECOVERY are different), > but that is difficult when running a large cluster. Although it is possible > that such a manual API may lead to some data loss but in some cases, it is > the only possible option to restore availability. > This issue proposes to build a new collection API which can be used to force > replicas into recovering a leader while avoiding data loss on a best effort > basis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877101#comment-14877101 ] Ishan Chattopadhyaya commented on SOLR-7569: I had the same dilemma while naming this. Recover does seem like it will fix things if anything is broken, which can be misleading since at this time we aren't doing anything other than helping fix the LIR state to bring the shard back up. On the other hand, I am not sure about force leader, because we aren't really forcing a leader, but just paving things for an election to happen. I'm really not totally sure either way. How about keeping this as recover shard, documenting this as an advanced API which can potentially cause data loss, and then later add whatever else we need to recover the system from to this API itself? > Create an API to force a leader election between nodes > -- > > Key: SOLR-7569 > URL: https://issues.apache.org/jira/browse/SOLR-7569 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar > Labels: difficulty-medium, impact-high > Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch > > > There are many reasons why Solr will not elect a leader for a shard e.g. all > replicas' last published state was recovery or due to bugs which cause a > leader to be marked as 'down'. While the best solution is that they never get > into this state, we need a manual way to fix this when it does get into this > state. Right now we can do a series of dance involving bouncing the node > (since recovery paths between bouncing and REQUESTRECOVERY are different), > but that is difficult when running a large cluster. 
Although it is possible > that such a manual API may lead to some data loss but in some cases, it is > the only possible option to restore availability. > This issue proposes to build a new collection API which can be used to force > replicas into recovering a leader while avoiding data loss on a best effort > basis.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877114#comment-14877114 ] Mark Miller commented on SOLR-7569: --- That sounds reasonable as long as we have good doc warning about it. > Create an API to force a leader election between nodes > -- > > Key: SOLR-7569 > URL: https://issues.apache.org/jira/browse/SOLR-7569 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar > Labels: difficulty-medium, impact-high > Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch > > > There are many reasons why Solr will not elect a leader for a shard e.g. all > replicas' last published state was recovery or due to bugs which cause a > leader to be marked as 'down'. While the best solution is that they never get > into this state, we need a manual way to fix this when it does get into this > state. Right now we can do a series of dance involving bouncing the node > (since recovery paths between bouncing and REQUESTRECOVERY are different), > but that is difficult when running a large cluster. Although it is possible > that such a manual API may lead to some data loss but in some cases, it is > the only possible option to restore availability. > This issue proposes to build a new collection API which can be used to force > replicas into recovering a leader while avoiding data loss on a best effort > basis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877117#comment-14877117 ] Mark Miller commented on SOLR-7569: --- bq. because we aren't really forcing a leader, but just paving things for an election to happen. I guess it comes down to how you want to think about it. When you use this, it will be because the system is blocking a leader from taking over. By running this API command, you remove the blocks, thus 'forcing' a leader the system would not normally pick - or at least attempting to force a leader the system would not really pick. It depends on whether you want to get bogged down in implementation or design. I think your proposal is fine though. > Create an API to force a leader election between nodes > -- > > Key: SOLR-7569 > URL: https://issues.apache.org/jira/browse/SOLR-7569 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar > Labels: difficulty-medium, impact-high > Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch > > > There are many reasons why Solr will not elect a leader for a shard e.g. all > replicas' last published state was recovery or due to bugs which cause a > leader to be marked as 'down'. While the best solution is that they never get > into this state, we need a manual way to fix this when it does get into this > state. Right now we can do a series of dance involving bouncing the node > (since recovery paths between bouncing and REQUESTRECOVERY are different), > but that is difficult when running a large cluster. Although it is possible > that such a manual API may lead to some data loss but in some cases, it is > the only possible option to restore availability. 
> This issue proposes to build a new collection API which can be used to force > replicas into recovering a leader while avoiding data loss on a best effort > basis.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877156#comment-14877156 ] Timothy Potter commented on SOLR-7569: -- +1 on FORCE_LEADER for the name of the action. bq. was stuck with "address already in use" exception You should have access to the SocketProxy if you need to close it down before trying to restart the original leader's Jetty. If not, we should fix that. bq. I'm inclined to keep this patch at this and tackle it separately sounds good ... may not ever be needed in the wild with the solution you've created here ;-) > Create an API to force a leader election between nodes > -- > > Key: SOLR-7569 > URL: https://issues.apache.org/jira/browse/SOLR-7569 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar > Labels: difficulty-medium, impact-high > Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, > SOLR-7569.patch, SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch > > > There are many reasons why Solr will not elect a leader for a shard e.g. all > replicas' last published state was recovery or due to bugs which cause a > leader to be marked as 'down'. While the best solution is that they never get > into this state, we need a manual way to fix this when it does get into this > state. Right now we can do a series of dance involving bouncing the node > (since recovery paths between bouncing and REQUESTRECOVERY are different), > but that is difficult when running a large cluster. Although it is possible > that such a manual API may lead to some data loss but in some cases, it is > the only possible option to restore availability. > This issue proposes to build a new collection API which can be used to force > replicas into recovering a leader while avoiding data loss on a best effort > basis. 
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740864#comment-14740864 ]

Mark Miller commented on SOLR-7569:
-----------------------------------

bq. what happened to the idea of allowing the user to pick the leader as part of the recover shard request?

As long as it's optional and documented so that users understand the risks, it's probably okay. But I think in most cases the system will beat most users at understanding who should really be the leader.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740684#comment-14740684 ]

Timothy Potter commented on SOLR-7569:
--------------------------------------

Looks good, Ishan. Sorry for the delay getting a review done.

In putNonLeadersIntoLIR, you probably want to wait a little bit before killing the leader after sending doc #2 to give the leader time to put the replicas into LIR; this works quickly on our local workstations but can take a little more time on Jenkins.

I'm also wondering if you should bring the original downed leader back into the mix (the one that got killed in the putNonLeadersIntoLIR method) in the testReplicasInLIRNoLeader test after the new leader is selected and see what state it comes back to. Also, try sending another doc #5 once the Jetty hosting the original leader is back online.

Lastly, what happened to the idea of allowing the user to pick the leader as part of the recover shard request? I read the comments above and agree that just triggering a re-election is preferred, but sometimes us humans actually know which replica is best. It seems reasonable to me to accept an optional parameter that specifies the replica that should be selected. However, if others don't like that idea, then I'm fine with this for now.
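One way to read the "wait a little bit" suggestion for putNonLeadersIntoLIR is to poll for the expected condition with a timeout rather than sleep for a fixed interval, so the test stays fast locally and still tolerates a slow Jenkins run. The sketch below is only an illustration in plain Java: the class TestWaits and the method waitFor are hypothetical names, not part of the patch or of Solr's test framework, and the condition (for example, "the replicas are now in LIR") is supplied by the caller.

```java
import java.util.function.BooleanSupplier;

// Hypothetical test helper; not a class from Solr or from the SOLR-7569 patch.
class TestWaits {
    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition held, false on timeout or interruption.
     */
    static boolean waitFor(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A test could then kill the leader only after waitFor reports that the condition was reached, and fail with a clear message otherwise.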
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729291#comment-14729291 ]

Timothy Potter commented on SOLR-7569:
--------------------------------------

Hi, will dig into this in detail later today, sorry for the delay (been on another project) ...
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727740#comment-14727740 ]

Ishan Chattopadhyaya commented on SOLR-7569:
--------------------------------------------

bq. At a high-level, the issue boils down to giving SolrCloud operators a way to either a) manually force a leader to be elected, or b) set an optional configuration property that triggers the force leader behavior after seeing so many failed recoveries due to no leader.

I think the difference is that in this issue, we're just trying to (manually) clean up the LIR state and mark affected down replicas as active, and hope that normal leader election is initiated and normalcy is restored. Based on an initial glance at SOLR-6236, it seems that the intention there is to actually force one of the replicas to become a leader (either manually or automatically). I am not sure which path we should take, but it seems the approach taken here is less intrusive/safer, if/when it works.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727720#comment-14727720 ]

Mark Miller commented on SOLR-7569:
-----------------------------------

This seems to have some overlap with SOLR-6236 based on the comments. Tim did some work here as well:

{quote}At a high-level, the issue boils down to giving SolrCloud operators a way to either a) manually force a leader to be elected, or b) set an optional configuration property that triggers the force leader behavior after seeing so many failed recoveries due to no leader. So this can be considered an optional availability-over-consistency mode with respect to leader-failover.{quote}
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727549#comment-14727549 ]

Ishan Chattopadhyaya commented on SOLR-7569:
--------------------------------------------

bq. 1. nit - RecoverShardTest has an unused notLeader1 variable

Thanks. Did some refactoring of the test and this has gone away now.

bq. 2. Shouldn't the "Wait for a long time for a steady state" piece of code be before the proxies for the two replicas are reopened? The LIR state will surely be set at indexing time and only if the proxy is closed. Also if you move that wait before the proxy is reopened then you are sure to have the LIR state as 'down'.

This makes sense, I've made the change.

bq. 3. The check for 'numActiveReplicas' and 'numReplicasOnLiveNodes' should be done after force refreshing the cluster state of the cloudClient otherwise spurious failures can happen

I didn't know about this force update of the cluster state; I've now added it.

bq. 4. nit - Why is sendDoc overridden in RecoverShardTest? The minRf is same, just the max retries has been increased and wait between retries has been decreased

The tests were (and still are) taking too long, and reducing the wait from 30sec to 1sec was helpful.

bq. 5. The OCMH.recoverShard() isn't unsetting the leader properly. It should be as simple as:

Thanks, I've cleaned this up.

bq. 6. Can you please write a test to ensure that this API works with 'async' parameter?

TODO.

bq. Leader is live but 'down' -> mark it 'active'

This works now. Added testLeaderDown() method.

bq. Leader itself is in LIR -> delete the LIR node

This should work, since the API method first clears the LIR state. Couldn't add a test for this, since I couldn't simulate this state in a test.

bq. Leader is not live: Replicas are live but 'down' or 'recovering' -> mark them 'active'

This works now. Added testAllReplicasDownNoLeader() method.

bq. Leader is not live: Replicas are live but in LIR -> delete the LIR nodes

This works as of the last patch. The corresponding test is now at testReplicasInLIRNoLeader().

bq. Did you find out why/how that happened? If this is reproducible, can you please create an issue and post the test there?

Added SOLR-7989 for this, will look deeper soon.
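On point 4 above (more retries, a shorter wait between them): the trade-off can be sketched as a bounded retry loop whose retry count and wait are parameters. This is an assumption-laden illustration in plain Java, not the sendDoc override from the patch; RetryingSender and sendWithRetries are hypothetical names, and the attempt callback stands in for whatever actually indexes the document.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper; the real sendDoc in the Solr test base class has a different signature.
class RetryingSender {
    /**
     * Runs the attempt up to maxRetries times, sleeping waitMillis between failures.
     * Returns the 1-based number of the successful attempt, or -1 if all attempts failed.
     */
    static int sendWithRetries(BooleanSupplier attempt, int maxRetries, long waitMillis) {
        for (int tries = 1; tries <= maxRetries; tries++) {
            if (attempt.getAsBoolean()) {
                return tries;
            }
            if (tries < maxRetries) {
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }
}
```

With a 1 second wait instead of 30 seconds, a test can afford more attempts before its total time budget is exceeded.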
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727757#comment-14727757 ]

Mark Miller commented on SOLR-7569:
-----------------------------------

[~thelabdude], what is your impression?
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727760#comment-14727760 ]

Ishan Chattopadhyaya commented on SOLR-7569:
--------------------------------------------

bq. This seems to have some overlap with SOLR-6236 based on the comments.

I think I should've posted the patches there, instead of here. Seems like I'm trying to solve the same problem here (albeit in a different way).
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727759#comment-14727759 ]

Mark Miller commented on SOLR-7569:
-----------------------------------

bq. we're just trying to (manually) clean up the LIR state and mark affected down replicas as active and hope that normal leader election is initiated and normalcy is restored.

I like that approach too, but I want to make sure we consider SOLR-6236.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716934#comment-14716934 ]

Shalin Shekhar Mangar commented on SOLR-7569:
---------------------------------------------

Thanks Ishan! A few comments:

# nit - RecoverShardTest has an unused notLeader1 variable
# Shouldn't the "Wait for a long time for a steady state" piece of code be *before* the proxies for the two replicas are reopened? The LIR state will surely be set at indexing time and only if the proxy is closed. Also if you move that wait before the proxy is reopened then you are sure to have the LIR state as 'down'.
# The check for 'numActiveReplicas' and 'numReplicasOnLiveNodes' should be done after force refreshing the cluster state of the cloudClient, otherwise spurious failures can happen
# nit - Why is sendDoc overridden in RecoverShardTest? The minRf is the same, just the max retries has been increased and the wait between retries has been decreased
# The OCMH.recoverShard() isn't unsetting the leader properly. It should be as simple as:
{code}
// LEADER message that names only the shard and collection (no leader given),
// queued for the Overseer so that the shard's leader is unset.
ZkNodeProps m = new ZkNodeProps(
    Overseer.QUEUE_OPERATION, OverseerAction.LEADER.toLower(),
    ZkStateReader.SHARD_ID_PROP, shardId,
    ZkStateReader.COLLECTION_PROP, collection);
Overseer.getInQueue(zkClient).offer(Utils.toJSON(m));
{code}
# Can you please write a test to ensure that this API works with the 'async' parameter?

I think some simple scenarios are not being taken care of. This command only helps if a LIR node exists, but we can do a bit more:

# Leader is live but 'down' -> mark it 'active'
# Leader itself is in LIR -> delete the LIR node
# Leader is not live:
## Replicas are live but 'down' or 'recovering' -> mark them 'active'
## Replicas are live but in LIR -> delete the LIR nodes

Can you please add some tests exercising each of the above scenarios?

bq. I also tried to mark just one of the replicas as active instead of all the replicas, hoping it will become leader and others would recover from it. However, this resulted in one of the other down replicas becoming leader but still staying down. Looking into why that could be happening; bug?

Did you find out why/how that happened? If this is reproducible, can you please create an issue and post the test there?
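The scenarios listed in the comment above can be read as a small decision table. The following is a minimal sketch of that table in plain Java, assuming a simplified in-memory model: ForceLeaderPlan, Replica, State, and the "delete-lir:" / "mark-active:" action strings are all hypothetical and are not Solr classes or anything from the patch, which works against ZooKeeper and the Overseer queue instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model for illustration; these are not Solr classes.
class ForceLeaderPlan {
    enum State { ACTIVE, DOWN, RECOVERING }

    static final class Replica {
        final String name;
        final boolean leader;      // registered as the shard leader
        final boolean onLiveNode;  // hosting node is live
        final State state;         // last published state
        final boolean inLir;       // a leader-initiated-recovery (LIR) entry exists

        Replica(String name, boolean leader, boolean onLiveNode, State state, boolean inLir) {
            this.name = name;
            this.leader = leader;
            this.onLiveNode = onLiveNode;
            this.state = state;
            this.inLir = inLir;
        }
    }

    /** Returns the clean-up actions for one shard, in the order they would be applied. */
    static List<String> plan(List<Replica> shard) {
        List<String> actions = new ArrayList<>();
        Replica leader = null;
        for (Replica r : shard) {
            if (r.leader) {
                leader = r;
            }
        }
        if (leader != null && leader.onLiveNode) {
            // Leader is live: clear its own LIR entry and mark it active if 'down'.
            addFixes(leader, actions, false);
            return actions;
        }
        // Leader is not live: fix every replica that sits on a live node.
        for (Replica r : shard) {
            if (r.onLiveNode) {
                addFixes(r, actions, true);
            }
        }
        return actions;
    }

    private static void addFixes(Replica r, List<String> actions, boolean includeRecovering) {
        if (r.inLir) {
            actions.add("delete-lir:" + r.name);
        }
        if (r.state == State.DOWN || (includeRecovering && r.state == State.RECOVERING)) {
            actions.add("mark-active:" + r.name);
        }
    }
}
```

Keeping the plan as data separate from its execution would also make each scenario easy to assert on in a unit test before any cluster is started.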
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14709330#comment-14709330 ]

Ishan Chattopadhyaya commented on SOLR-7569:
--------------------------------------------

I just tried to simulate the scenario where all the replicas are in down state due to LIR, and there is no leader. In this state, the leader election queue is empty. So, I am thinking of some way to have the replicas (that are on live nodes) join the leader election. Is there any clean way of doing that, short of a core reload?
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14709403#comment-14709403 ]

Ishan Chattopadhyaya commented on SOLR-7569:
--------------------------------------------

> In this state, the leader election queue is empty.

Ignore that, I was catching that state before the replicas had a chance to rejoin the election. The last assert in the patch is inappropriate.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706723#comment-14706723 ]

Mark Miller commented on SOLR-7569:
-----------------------------------

I don't really like the idea of choosing a leader. It seems to me this feature should force a new election and address the state that prevents someone from becoming leader somehow. You still want the sync stage and the system to pick the best leader though. This should just get you out of the state that is preventing a leader from being elected.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706311#comment-14706311 ]

Varun Thacker commented on SOLR-7569:
-------------------------------------

bq. Pick the next leader: If the leader election queue is not empty and the first replica in the queue is on a live node, choose the replica as the next leader. Otherwise, pick a random replica, which is on a live node, to become the next leader (TODO: we can have the user specify which replica he/she wants as the next leader).

Maybe pick as leader the replica that has the latest commit timestamp?
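The selection rule quoted above, combined with the commit-timestamp suggestion, might look roughly like the sketch below. It is an illustration only, written against a hypothetical in-memory model: NextLeaderPicker, Candidate, and lastCommitMillis are invented names, and how a real implementation would obtain each replica's last commit time is left open here.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical model for illustration; not part of Solr or of the SOLR-7569 patch.
class NextLeaderPicker {
    static final class Candidate {
        final String name;
        final boolean onLiveNode;
        final long lastCommitMillis; // assumed to be known for every replica

        Candidate(String name, boolean onLiveNode, long lastCommitMillis) {
            this.name = name;
            this.onLiveNode = onLiveNode;
            this.lastCommitMillis = lastCommitMillis;
        }
    }

    /**
     * Prefers the head of the election queue when it is on a live node;
     * otherwise falls back to the live replica with the latest commit timestamp
     * (in place of the random choice in the quoted proposal).
     */
    static Optional<Candidate> pick(List<Candidate> electionQueue, List<Candidate> allReplicas) {
        if (!electionQueue.isEmpty() && electionQueue.get(0).onLiveNode) {
            return Optional.of(electionQueue.get(0));
        }
        return allReplicas.stream()
                .filter(c -> c.onLiveNode)
                .max(Comparator.comparingLong((Candidate c) -> c.lastCommitMillis));
    }
}
```

An empty result means no replica is on a live node, in which case there is nothing a forced election could promote.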
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14701387#comment-14701387 ]

Ishan Chattopadhyaya commented on SOLR-7569:

Thanks [~markrmil...@gmail.com] for the pointer to the issues. I agree this is a sledge hammer to undo the effects of bugs, which shouldn't be needed if we go by an improved design. We have observed the effects of these bugs in production clusters of our clients, and this is to help them in such a scenario. Do you think we should continue down this sledge hammer path, parallel to fixing the bugs?
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14701392#comment-14701392 ]

Mark Miller commented on SOLR-7569:
---

Yes, I think having this option is useful in the short term and in the longer term. The system will generally refuse to continue on if it thinks it may have data loss and stopping could allow a user to possibly recover that data. This could act as a way for a user to override that.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14701327#comment-14701327 ]

Mark Miller commented on SOLR-7569:
---

bq. maybe due to bugs?

The current design allows for this. See SOLR-7034 and SOLR-7065 as possible improvement steps. This is kind of a hack solution to a current production problem or to force a leader election even when we know it probably means data loss, those issues are closer to what is supposed to come next in terms of improving the current design. I may have also just seen a bug where LIR info in ZK prevents anyone from becoming the leader even on full restart. SOLR-7065 should address those kinds of bugs if done right.
[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes
[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14701503#comment-14701503 ]

Erick Erickson commented on SOLR-7569:
--

bq: If the chosen leader is not at the head of the leader election queue, have it join the election at the head (similar to what REBALANCELEADERS tries to do).

Be _really_ careful if you are trying to manipulate the leader election stuff, it's very easy to get wrong. Or at least it was last time I looked, perhaps it's changed a lot since then. I'd be glad to chat about what I remember if you'd like.