[ https://issues.apache.org/jira/browse/SOLR-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922711#comment-13922711 ]
Timothy Potter commented on SOLR-5796:
--------------------------------------

I'm still a little concerned about a couple of things: 1) why is the leader
"state" stored in two places in ZooKeeper (/clusterstate.json and
/collections/<coll>/leaders/shard#)? I'm sure there is a good reason for this
but can't see why ;-) 2) if the timeout still occurs (as we don't want to wait
forever), can't the node with the conflict just favor what's in the leader
path, assuming that replica is active and agrees? In other words, instead of
throwing an exception and then just ending up in a "down" state, why can't the
replica seeing the conflict just go with what ZooKeeper says? (Rough sketches
of these ideas follow after the quoted description below.) I'm digging into
leader failover timing / error handling today. Thanks. Tim

> With many collections, leader re-election takes too long when a node dies or
> is rebooted, leading to some shards getting into a "conflicting" state about
> who is the leader.
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-5796
>                 URL: https://issues.apache.org/jira/browse/SOLR-5796
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>         Environment: Found on branch_4x
>            Reporter: Timothy Potter
>            Assignee: Mark Miller
>             Fix For: 4.8, 5.0
>
>         Attachments: SOLR-5796.patch
>
>
> I'm doing some testing with a 4-node SolrCloud cluster against the latest rev
> in branch_4x having many collections, 150 to be exact, each having 4 shards
> with rf=3, so 450 cores per node. Nodes are decent in terms of resources:
> -Xmx6g with 4 CPU - m3.xlarge's in EC2.
> The problem occurs when rebooting one of the nodes, say as part of a rolling
> restart of the cluster. If I kill one node and then wait for an extended
> period of time, such as 3 minutes, then all of the leaders on the downed node
> (roughly 150) have time to fail over to another node in the cluster. When I
> restart the downed node, since the leaders have all failed over successfully,
> the new node starts up and all cores assume the replica role in their
> respective shards. This is goodness and expected.
> However, if I don't wait long enough for the leader failover process to
> complete on the other nodes before restarting the downed node, then some bad
> things happen. Specifically, when the dust settles, many of the previous
> leaders on the node I restarted get stuck in the "conflicting" state seen in
> the ZkController, starting around line 852 in branch_4x:
> {quote}
> 852 while (!leaderUrl.equals(clusterStateLeaderUrl)) {
> 853   if (tries == 60) {
> 854     throw new SolrException(ErrorCode.SERVER_ERROR,
> 855         "There is conflicting information about the leader of shard: "
> 856             + cloudDesc.getShardId() + " our state says:"
> 857             + clusterStateLeaderUrl + " but zookeeper says:" + leaderUrl);
> 858   }
> 859   Thread.sleep(1000);
> 860   tries++;
> 861   clusterStateLeaderUrl = zkStateReader.getLeaderUrl(collection, shardId,
> 862       timeoutms);
> 863   leaderUrl = getLeaderProps(collection, cloudDesc.getShardId(), timeoutms)
> 864       .getCoreUrl();
> 865 }
> {quote}
> As you can see, the code is trying to give this problem a little time to work
> itself out, 1 minute to be exact. Unfortunately, that doesn't seem to be long
> enough for a busy cluster that has many collections.
> Now, one might argue that 450 cores per node is asking too much of Solr;
> however, I think this points to a bigger issue: a node coming back up isn't
> aware that it went down, that leader election for its shards is still running
> on the other nodes, and that that election is just being slow. Moreover, once
> this problem occurs, it's not clear how to fix it besides shutting the node
> down again and waiting for leader failover to complete.
> It's also interesting to me that /clusterstate.json was updated by the
> healthy node taking over the leader role, but the
> /collections/<coll>/leaders/shard# znode was not updated. I added some
> debugging and it seems like the overseer queue is extremely backed up with
> work.
> Maybe the solution here is to just wait longer, but I also want to get some
> feedback from the community on other options. I know there are plans to help
> scale the Overseer (i.e. SOLR-5476), so maybe that helps; I'm trying to add
> more debugging to see if this is really due to overseer backlog (which I
> suspect it is).
> In general, I'm a little confused by the keeping of leader state in multiple
> places in ZK. Is there any background information on why we have leader state
> in /clusterstate.json and in the leader path znode?
> Also, here are some interesting side observations:
> a. If I use rf=2, then this problem doesn't occur, as leader failover happens
> more quickly and there's less overseer work. This may be a red herring, but I
> can consistently reproduce the problem with rf=3 and not with rf=2 ... I
> suppose that is because there are only 300 cores per node versus 450, and
> that's just enough less work for this issue to work itself out.
> b. To support that many cores, I had to set -Xss256k to reduce the thread
> stack size, as Solr uses a lot of threads during startup (the high point was
> around 800).
>
> Might be something we should recommend on the mailing list / wiki somewhere.
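
Since the dual bookkeeping question comes up in both the comment and the
description, here is a minimal sketch of how one might dump both copies of the
leader state straight from ZooKeeper while debugging a stuck shard. It uses the
plain ZooKeeper client rather than Solr's ZkStateReader, and the connection
string, collection name, and shard name are placeholders.

{quote}
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class LeaderStateDump {
  public static void main(String[] args) throws Exception {
    // Placeholder connection/collection/shard values for illustration only.
    String zkHost = "localhost:2181";
    String collection = "collection1";
    String shard = "shard1";

    ZooKeeper zk = new ZooKeeper(zkHost, 30000, event -> {});
    try {
      // Copy #1: the per-shard leader znode under the collection.
      String leaderPath = "/collections/" + collection + "/leaders/" + shard;
      byte[] leaderProps = zk.getData(leaderPath, false, null);
      System.out.println(leaderPath + " => "
          + new String(leaderProps, StandardCharsets.UTF_8));

      // Copy #2: the cluster-wide state, which also marks one replica per
      // shard as the leader.
      byte[] clusterState = zk.getData("/clusterstate.json", false, null);
      System.out.println("/clusterstate.json => "
          + new String(clusterState, StandardCharsets.UTF_8));
    } finally {
      zk.close();
    }
  }
}
{quote}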
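
To make point 2 of the comment concrete, here is a rough sketch (not
ZkController's actual code) of the suggested fallback: when the retry window
expires, adopt the leader recorded in the leader znode if that replica is
live, instead of throwing and ending up "down". The supplier parameters and
the liveness check are stand-ins for the real zkStateReader.getLeaderUrl /
getLeaderProps lookups.

{quote}
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

public class LeaderConflictSketch {
  /**
   * Retry until the leader znode and clusterstate.json agree; if they still
   * disagree after maxTries seconds, fall back to the leader znode when that
   * replica is live, rather than failing the core.
   */
  public static String resolveLeaderUrl(Supplier<String> leaderZnodeUrl,
                                        Supplier<String> clusterStateUrl,
                                        BooleanSupplier leaderReplicaIsLive,
                                        int maxTries)
      throws InterruptedException {
    String fromZnode = leaderZnodeUrl.get();
    String fromClusterState = clusterStateUrl.get();
    int tries = 0;
    while (!fromZnode.equals(fromClusterState)) {
      if (tries++ >= maxTries) {
        if (leaderReplicaIsLive.getAsBoolean()) {
          return fromZnode; // favor the leader path instead of throwing
        }
        throw new IllegalStateException(
            "Conflicting leader info for shard: leader znode says " + fromZnode
                + " but clusterstate.json says " + fromClusterState);
      }
      Thread.sleep(1000);
      fromZnode = leaderZnodeUrl.get();
      fromClusterState = clusterStateUrl.get();
    }
    return fromZnode;
  }
}
{quote}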
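
For the overseer-backlog angle, a one-off check along these lines would show
how far behind the Overseer is, assuming its work queue lives at
/overseer/queue as it does on branch_4x; the connection string is again a
placeholder.

{quote}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class OverseerQueueDepth {
  public static void main(String[] args) throws Exception {
    // Placeholder ZK connection string.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});
    try {
      // Each child znode is one pending state-update message for the Overseer.
      List<String> pending = zk.getChildren("/overseer/queue", false);
      System.out.println("Overseer queue depth: " + pending.size());
    } finally {
      zk.close();
    }
  }
}
{quote}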