[
https://issues.apache.org/jira/browse/SOLR-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973533#comment-13973533
]
Hoss Man commented on SOLR-5991:
--------------------------------
Off the cuff: it sounds like what you'd really want for these types of
use cases is:
1) an "AVOID_RESPONSIBILITY" role which tells a node it should never
participate in elections -- either for shard leader, or for overseer.
2) per-node status info (from /admin/system) about whether this node is the
overseer (SOLR-5823) and/or hosts the leader of any shard
3) a "forceelection" Collection API action (that takes an optional collection
name and shard name - so it can force overseer election, or leader election of
all shards, or leader election of a specific shard)
4) logic in CoreContainer.shutdown() that causes the node to do the following
before finishing a clean shutdown:
* act as if it has the AVOID_RESPONSIBILITY role (w/o updating its actual zk
state) until completion of shutdown
* loop over its current responsibilities and self-trigger the necessary
"forceelection" commands to elect someone else to take its place as
overseer/shard-leader(s)
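
A rough sketch of how points 1, 3 and 4 could fit together, as a small
in-memory model rather than Solr code. Everything here is hypothetical: the
Node and Cluster classes, the shutting_down flag, and the "first eligible
candidate wins" election rule are assumptions made for illustration, and
neither the role nor the forceelection action exists in Solr today.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

AVOID = "AVOID_RESPONSIBILITY"  # role name from the proposal; hypothetical

ShardKey = Tuple[str, str]  # (collection, shard)


@dataclass
class Node:
    name: str
    roles: Set[str] = field(default_factory=set)  # persisted roles (ZK state in the proposal)
    shutting_down: bool = False                    # transient flag, never persisted

    def eligible(self) -> bool:
        # A node may win an election only if it neither holds the role
        # nor is in the middle of a clean shutdown.
        return AVOID not in self.roles and not self.shutting_down


@dataclass
class Cluster:
    nodes: Dict[str, Node]
    overseer: Optional[str] = None
    replicas: Dict[ShardKey, List[str]] = field(default_factory=dict)
    leaders: Dict[ShardKey, str] = field(default_factory=dict)

    def _elect(self, candidates, current):
        # Assumed rule: the first eligible candidate wins; if nobody is
        # eligible, the responsibility stays where it is.
        for name in candidates:
            if self.nodes[name].eligible():
                return name
        return current

    def force_election(self, collection=None, shard=None):
        # No collection: overseer election.
        # Collection only: every shard of that collection.
        # Collection and shard: that one shard.
        if collection is None:
            self.overseer = self._elect(list(self.nodes), self.overseer)
            return
        for key, hosts in self.replicas.items():
            if key[0] == collection and shard in (None, key[1]):
                self.leaders[key] = self._elect(hosts, self.leaders.get(key))


def clean_shutdown(cluster: Cluster, name: str) -> None:
    node = cluster.nodes[name]
    node.shutting_down = True  # behaves like AVOID without touching node.roles
    if cluster.overseer == name:
        cluster.force_election()
    for (collection, shard), leader in list(cluster.leaders.items()):
        if leader == name:
            cluster.force_election(collection, shard)
```

In this model, a node that holds the role is skipped by every election, and a
node that is shutting down is skipped in the same way without its stored
roles being modified.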
So...
* If you just want to reboot one node - you reboot that node, and instead of
just acting like it's dropped off the face of the earth and potentially
triggering elections when the ZK ephemeral nodes vanish, it proactively
encourages an election first.
* If you want to shut down N machines permanently: you assign all of those N
machines the role "AVOID_RESPONSIBILITY" in advance, and then iterate over them
shutting them down. Ones that had no responsibilities to begin with will
shut down fast; nodes that did have responsibilities will shut down slower as
they force elections - but none of the other machines you are about to shut
down will take on those responsibilities.
* If you want to reboot N machines with minimal down time: you can iterate over
your N machines checking their /admin/system response to see if they are the
overseer or a shard leader -- if they are, you trigger the necessary
action=forceelection commands and wait for them to complete. When you are
done, you should be able to shut down/restart all N nodes very quickly, and
then remove the "AVOID_RESPONSIBILITY" role at your leisure.
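
The last scenario could be scripted roughly as below. This is only a sketch:
the shape of the per-node status (the "overseer" and "leader_of" keys) is an
assumed stand-in for whatever /admin/system would report under point 2, and
"forceelection" is the proposed action, not an existing one. The function
only builds the list of requests to issue; sending them and waiting for
completion is left out.

```python
def plan_forced_elections(statuses, restart_nodes):
    """Return the forceelection parameter sets needed before restarting.

    statuses maps a node name to a dict of the assumed form
    {"overseer": bool, "leader_of": [(collection, shard), ...]}.
    """
    actions = []
    for node in restart_nodes:
        status = statuses[node]
        if status["overseer"]:
            # No collection or shard given: force an overseer election.
            actions.append({"action": "forceelection"})
        for collection, shard in status["leader_of"]:
            # Force a leader election for this specific shard.
            actions.append({
                "action": "forceelection",
                "collection": collection,
                "shard": shard,
            })
    return actions
```

Nodes with no responsibilities contribute nothing to the plan, which matches
the expectation that they can be restarted right away.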
> SolrCloud: Add API to move leader off a Solr instance
> -----------------------------------------------------
>
> Key: SOLR-5991
> URL: https://issues.apache.org/jira/browse/SOLR-5991
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.7.1
> Reporter: Rich Mayfield
>
> Common maintenance chores require restarting Solr instances.
> The process of a shutdown becomes a whole lot more reliable if we can
> proactively move any leadership roles off of the Solr instance we are going
> to shut down. The leadership election process then runs immediately.
> I am not sure what the semantics should be (either accomplishes the goal but
> one of these might be best):
> * A call to tell a core to give up leadership (thus the next replica is
> chosen)
> * A call to specify which core should become the leader