[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ishan Chattopadhyaya updated SOLR-7569:
---------------------------------------
Attachment: SOLR-7569.patch
Thanks, Shalin, for looking into the patch and for your review.
bq. ForceLeaderTest.testReplicasInLIRNoLeader has a 5 second sleep, why? Isn't
waitForRecoveriesToFinish() enough?
Fixed. This was a leftover from an earlier patch: I meant to rely on
waitForRecoveriesToFinish(), but forgot to remove the 5 second sleep.
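Roughly, the change is the following sketch (the exact waitForRecoveriesToFinish() overload used in the patch may differ; {{collectionName}} and {{cloudClient}} come from the test base class):
{code:java}
// Before: hope that 5 seconds is always enough for recovery to complete
// Thread.sleep(5000);

// After: block until every replica of the collection reports a stable state,
// using the recovery-wait helper from the ZK test base classes
waitForRecoveriesToFinish(collectionName, cloudClient.getZkStateReader(), false);
{code}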
bq. Similarly, ForceLeaderTest.testLeaderDown has a 15 second sleep for steady
state to be reached? What is this steady state, is there a better way than
waiting for an arbitrary amount of time? In general, Thread.sleep should be
avoided as much as possible as a way to reach steady state.
In this case, waiting those 15 seconds lets one of the down replicas become the
leader (while staying down). That is the situation I'm using FORCELEADER to
recover from. Instead of a fixed 15 second sleep, I've added polling with a
wait so the test wakes up earlier when possible, and increased the overall
timeout from 15s to 25s.
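A rough sketch of the polling loop (the condition, interval and names are illustrative, not the exact code in the patch; {{cloudClient}} and {{collectionName}} come from the test setup):
{code:java}
// Poll the cluster state instead of sleeping a fixed 15s; stop as soon as a down
// replica has been registered as the shard leader, and give up after 25s.
long timeoutAt = System.currentTimeMillis() + 25_000; // overall timeout raised from 15s to 25s
Replica leader = null;
while (System.currentTimeMillis() < timeoutAt) {
  ClusterState clusterState = cloudClient.getZkStateReader().getClusterState();
  leader = clusterState.getCollection(collectionName).getSlice("shard1").getLeader();
  if (leader != null && leader.getState() == Replica.State.DOWN) {
    break; // a down replica is now the leader: the state FORCELEADER recovers from
  }
  Thread.sleep(500); // re-check every 500ms instead of one fixed 15s sleep
}
{code}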
bq. Can you please add some javadocs on the various test methods describing the
scenario that they are testing?
Sure, added.
bq. minor nit - can you use assertEquals when testing equality of state etc
instead of assertTrue. The advantage with assertEquals is that it logs the
mismatched values in the exception messages.
Used assertEquals() now.
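For example (illustrative, not the exact assertion in the patch):
{code:java}
// assertTrue fails with a bare AssertionError and no hint about the actual state:
// assertTrue(replica.getState() == Replica.State.ACTIVE);

// assertEquals reports both values on failure, e.g. expected:<active> but was:<down>
assertEquals(Replica.State.ACTIVE, replica.getState());
{code}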
bq. In OverseerCollectionMessageHandler, lirPath can never be null. The lir
path should probably be logged in debug rather than INFO.
Thanks for the pointer; I've removed the null check. I feel this should be INFO
instead of DEBUG, so that if a user says they issued FORCELEADER but nothing
worked, their logs would help us see whether there was any LIR state that got
cleared out. But please feel free to change it if this doesn't make sense.
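Something along these lines (the message and variable names are illustrative, not the exact line in the patch):
{code:java}
// Log at INFO so that a user's logs show whether FORCELEADER found (and cleared)
// any leader-initiated-recovery state for the shard.
log.info("Clearing leader-initiated recovery state at {} : {}", lirPath, lirState);
{code}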
bq. minor nit - you can compare enums directly using == instead of .equals
Fixed.
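E.g. (illustrative):
{code:java}
Replica.State state = replica.getState();
// state.equals(Replica.State.ACTIVE) works, but throws NPE if state is null;
// == on enum constants is null-safe on the left-hand side and reads more clearly.
boolean active = (state == Replica.State.ACTIVE);
{code}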
bq. Referring to the following, what is the thinking behind it? when can this
happen? is there a test which specifically exercises this scenario? seems like
this can interfere with the leader election if the leader election was taking
some time?
I modified the comment text to make it clearer. This covers the situation where
all replicas are (somehow, perhaps due to a bug) down/recovering but not in
LIR, and there is no leader even though the replicas' nodes are live; I don't
know if this ever happens (the LIR case does happen, I know). The
testAllReplicasDownNoLeader test exercises this scenario. This is more or less
the scenario you described (with the one difference that there is no leader
either): {{Leader is not live: Replicas are live but 'down' or 'recovering' ->
mark them 'active'}}.
As you point out, it can indeed interfere with an ongoing leader election. My
thinking was that FORCELEADER is issued only because the election isn't
producing a stable leader, so force-marking the replica at the head of the
election queue as the leader is acceptable. But I defer to your judgement on
whether this is fine, and I can remove that code path from the patch (or feel
free to remove it yourself) if you think it is not right.
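For context, invoking the new action from SolrJ looks roughly like this (a sketch; the action and parameter names follow the current patch and could still change, the collection/shard names are placeholders, and {{cloudClient}} comes from the test base class):
{code:java}
// Issue the proposed FORCELEADER action through the generic collection-admin path.
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("action", "FORCELEADER");
params.set("collection", "collection1"); // placeholder collection name
params.set("shard", "shard1");           // placeholder shard name
QueryRequest request = new QueryRequest(params);
request.setPath("/admin/collections");
cloudClient.request(request);
{code}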
> Create an API to force a leader election between nodes
> ------------------------------------------------------
>
> Key: SOLR-7569
> URL: https://issues.apache.org/jira/browse/SOLR-7569
> Project: Solr
> Issue Type: New Feature
> Components: SolrCloud
> Reporter: Shalin Shekhar Mangar
> Assignee: Shalin Shekhar Mangar
> Labels: difficulty-medium, impact-high
> Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch,
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch,
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch,
> SOLR-7569_lir_down_state_test.patch
>
>
> There are many reasons why Solr will not elect a leader for a shard e.g. all
> replicas' last published state was recovery or due to bugs which cause a
> leader to be marked as 'down'. While the best solution is that they never get
> into this state, we need a manual way to fix this when it does get into this
> state. Right now we can do a dance involving bouncing the node (since the
> recovery paths between bouncing and REQUESTRECOVERY are different), but that
> is difficult when running a large cluster. Such a manual API may lead to some
> data loss, but in some cases it is the only option to restore availability.
> This issue proposes to build a new collection API which can be used to force
> replicas into recovering a leader while avoiding data loss on a best effort
> basis.