[ 
https://issues.apache.org/jira/browse/SOLR-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103645#comment-15103645
 ] 

Shai Erera commented on SOLR-7844:
----------------------------------

[[email protected]] this seems to break upgrading existing 5.x (e.g. 5.3) 
clusters to 5.4, unless I missed a "migration" step. If you do a rolling 
upgrade, i.e. take one of the nodes down, replace its JARs with the 5.4 ones 
and restart the node, you'll see exceptions like this:

{noformat}
org.apache.solr.common.SolrException: Error getting leader from zk for shard shard1
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1034)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:940)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:883)
    at org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:184)
    at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:213)
    at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:696)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:750)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:716)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:623)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:204)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:184)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
    at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:664)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:438)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:222)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:181)
    ...
Caused by: org.apache.solr.common.SolrException: Could not get leader props
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1081)
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1045)
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1001)
    ... 35 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/acg-test-1/leaders/shard1/leader
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
    at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1059)
{noformat}

When the 5.4 nodes come up, they don't find the 
{{/collections/<coll>/leaders/shard1/leader}} path and fail. I am not quite 
sure how to recover from this though, since the cluster has a mixture of 5.3 
and 5.4 nodes. I cannot create {{.../shard1/leader}} since {{.../shard1}} is 
an EPHEMERAL node and therefore can't have child nodes. I am not sure what 
will happen if I delete {{.../shard1}} and recreate it as non-EPHEMERAL: will 
the old 5.3 nodes still work? I also need to ensure that the new 5.4 node 
doesn't become the leader if it wasn't already.

Perhaps a fix would be for 5.4 to fall back to reading the leader info from 
{{.../shard1}}? Then, once the last 5.3 node is down, a 5.4 node will attempt 
to take leadership and recreate the leader path in the 5.4 format? 
Should this have been a zk version change?
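
To make the fallback idea concrete, here is a minimal sketch, not the actual Solr code. It assumes the 5.3 layout keeps the leader props directly in {{/collections/<coll>/leaders/<shard>}} and that 5.4 expects them in the {{leader}} child, as the trace above suggests. The class {{LeaderPropsFallback}}, its method names and the {{Function}}-based reader are hypothetical stand-ins for the real ZooKeeper client; an empty {{Optional}} plays the role of a {{NoNodeException}}.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

public class LeaderPropsFallback {

    // Assumed 5.4 layout: leader props live in a "leader" child znode.
    static String newLeaderPath(String collection, String shard) {
        return "/collections/" + collection + "/leaders/" + shard + "/leader";
    }

    // Assumed 5.3 layout: leader props live in the shard znode itself.
    static String oldLeaderPath(String collection, String shard) {
        return "/collections/" + collection + "/leaders/" + shard;
    }

    // Hypothetical reader: maps a znode path to its data, or empty if the
    // znode does not exist. Tries the new path first, then the old one.
    static Optional<byte[]> readLeaderProps(Function<String, Optional<byte[]>> zkRead,
                                            String collection, String shard) {
        Optional<byte[]> props = zkRead.apply(newLeaderPath(collection, shard));
        if (props.isPresent()) {
            return props;
        }
        return zkRead.apply(oldLeaderPath(collection, shard));
    }

    public static void main(String[] args) {
        // In-memory map standing in for ZooKeeper in a mixed 5.3/5.4 cluster
        // where a 5.3 node is still the leader.
        Map<String, byte[]> znodes = new HashMap<>();
        znodes.put(oldLeaderPath("acg-test-1", "shard1"),
                "props-written-by-5.3".getBytes(StandardCharsets.UTF_8));

        Optional<byte[]> props = readLeaderProps(
                path -> Optional.ofNullable(znodes.get(path)), "acg-test-1", "shard1");
        System.out.println(props.map(b -> new String(b, StandardCharsets.UTF_8))
                .orElse("no leader registered"));
    }
}
```

This only covers the read side. It says nothing about how a 5.4 node should write the leader path while 5.3 nodes are still in the cluster, which is the part I'm unsure about.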

I'd appreciate some guidance here.

> Zookeeper session expiry during shard leader election can cause multiple 
> leaders.
> ---------------------------------------------------------------------------------
>
>                 Key: SOLR-7844
>                 URL: https://issues.apache.org/jira/browse/SOLR-7844
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.10.4
>            Reporter: Mike Roberts
>            Assignee: Mark Miller
>             Fix For: 5.4, Trunk
>
>         Attachments: SOLR-7844-5x.patch, SOLR-7844.patch, SOLR-7844.patch, 
> SOLR-7844.patch, SOLR-7844.patch, SOLR-7844.patch, SOLR-7844.patch, 
> SOLR-7844.patch, SOLR-7844.patch, SOLR-7844.patch
>
>
> If the ZooKeeper session expires for a host during shard leader election, the 
> ephemeral leader_elect nodes are removed. However the threads that were 
> processing the election are still present (and could believe the host won the 
> election). They will then incorrectly create leader nodes once a new 
> ZooKeeper session is established.
> This introduces a subtle race condition that could cause two hosts to become 
> leader.
> Scenario:
> a three machine cluster, all of the machines are restarting at approximately 
> the same time.
> The first machine starts, writes a leader_elect ephemeral node, it's the only 
> candidate in the election so it wins and starts the leadership process. As it 
> knows it has peers, it begins to block waiting for the peers to arrive.
> During this period of blocking[1] the ZK connection drops and the session 
> expires.
> A new ZK session is established, and ElectionContext.cancelElection is 
> called. Then register() is called and a new set of leader_elect ephemeral 
> nodes are created.
> During the period between the ZK session expiring and the new set of 
> leader_elect nodes being created, the second machine starts.
> It creates its leader_elect ephemeral nodes; as there are no other nodes, it 
> wins the election and starts the leadership process. As it's still missing one 
> of its peers, it begins to block waiting for the third machine to join.
> There is now a race between machine1 & machine2, both of whom think they are 
> the leader.
> So far, this isn't too bad, because the machine that loses the race will fail 
> when it tries to create the /collection/name/leader/shard1 node (as it 
> already exists), and will rejoin the election.
> While this is happening, machine3 has started and has queued for leadership 
> behind machine2.
> If the loser of the race is machine2, when it rejoins the election it cancels 
> the current context, deleting its leader_elect ephemeral nodes.
> At this point, machine3 believes it has become leader (the watcher it has on 
> the leader_elect node fires), and it runs the LeaderElector::checkIfIAmLeader 
> method. This method DELETES the current /collection/name/leader/shard1 node, 
> then starts the leadership process (as all three machines are now running, it 
> does not block to wait).
> So, machine1 won the race with machine2 and declared its leadership and 
> created the nodes. However, machine3 has just deleted them, and recreated 
> them for itself. So machine1 and machine3 both believe they are the leader.
> I am thinking that the fix should be to cancel & close all election contexts 
> immediately on reconnect (we do cancel them, however it's run serially which 
> has blocking issues, and just canceling does not cause the wait loop to 
> exit). That election context logic already has checks on the closed flag, so 
> they should exit if they see it has been closed.
> I'm working on a patch for this.



