[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159525#comment-15159525 ]
Mark Miller commented on SOLR-8697:
-----------------------------------

Note: I have not seen any failures like this in this test in a long, long time, so it may be related to this change. It may just be a test issue, because this retries for something like 60 seconds.

{noformat}
[junit4] ERROR 66.3s J4 | OverseerTest.testShardLeaderChange <<<
[junit4] > Throwable #1: org.apache.solr.common.SolrException: Could not register as the leader because creating the ephemeral registration node in ZooKeeper failed
[junit4] > at __randomizedtesting.SeedInfo.seed([C4618609907C7E14:1A3201FE8AE48BE5]:0)
[junit4] > at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:212)
[junit4] > at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:173)
[junit4] > at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:138)
[junit4] > at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:310)
[junit4] > at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:219)
[junit4] > at org.apache.solr.cloud.OverseerTest$MockZKController.publishState(OverseerTest.java:181)
[junit4] > at org.apache.solr.cloud.OverseerTest.testShardLeaderChange(OverseerTest.java:841)
[junit4] > at java.lang.Thread.run(Thread.java:745)
[junit4] > Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
[junit4] > at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
[junit4] > at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949)
[junit4] > at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
[junit4] > at org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:577)
[junit4] > at org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:574)
[junit4] > at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
[junit4] > at org.apache.solr.common.cloud.SolrZkClient.multi(SolrZkClient.java:574)
[junit4] > at org.apache.solr.cloud.ShardLeaderElectionContextBase$1.execute(ElectionContext.java:195)
[junit4] > at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:49)
[junit4] > at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:42)
[junit4] > at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:178)
[junit4] > ... 45 more
{noformat}

> Fix LeaderElector issues
> ------------------------
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 5.4.1
> Reporter: Scott Blum
> Assignee: Mark Miller
> Labels: patch, reliability, solrcloud
> Fix For: master
>
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will need more scrutiny from knowledgeable folks!
> There are some subtle bugs with the current implementation of LeaderElector,
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud. It should become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become Overseer. The old process's ZK leader elect
> node has not yet disappeared, and the new process fails to set appropriate
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join
> the election quickly.
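
The scenario in the issue description (a restarted process joining the election while the killed process's ephemeral node still exists) can be illustrated with a small in-memory sketch of the usual watch-your-predecessor election pattern. This is not Solr's LeaderElector and it does not use the ZooKeeper client; `ElectionSketch`, `join`, `expire`, and `leader` are hypothetical names invented for illustration, and session expiry is simulated by an explicit call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** In-memory stand-in for a ZooKeeper election directory; not Solr's LeaderElector. */
class ElectionSketch {
    // Sequence number -> owner of that election node, sorted like sequential znodes.
    private final TreeMap<Integer, String> nodes = new TreeMap<>();
    // Watched sequence number -> candidates to re-check when that node is deleted.
    private final TreeMap<Integer, List<Integer>> watchers = new TreeMap<>();
    private int nextSeq = 0;
    private String leader;

    /** Adds a candidate node, then checks whether it leads or must watch a predecessor. */
    int join(String owner) {
        int seq = nextSeq++;
        nodes.put(seq, owner);
        checkIfLeader(seq);
        return seq;
    }

    /** Lowest sequence number wins; any other candidate watches the node directly ahead of it. */
    private void checkIfLeader(int seq) {
        if (!nodes.containsKey(seq)) {
            return; // the candidate itself is gone, nothing to do
        }
        Integer predecessor = nodes.lowerKey(seq);
        if (predecessor == null) {
            leader = nodes.get(seq);
        } else {
            // Skipping this registration is the failure mode described in the issue:
            // the candidate would never learn that the stale node went away.
            watchers.computeIfAbsent(predecessor, k -> new ArrayList<>()).add(seq);
        }
    }

    /** Simulates an ephemeral node disappearing once the dead process's session expires. */
    void expire(int seq) {
        String owner = nodes.remove(seq);
        if (owner != null && owner.equals(leader)) {
            leader = null;
        }
        List<Integer> toNotify = watchers.remove(seq);
        if (toNotify != null) {
            for (int waiting : toNotify) {
                checkIfLeader(waiting);
            }
        }
    }

    String leader() {
        return leader;
    }
}
```

In this sketch, joining as "old-process" and then as "new-process" leaves the old process as leader until `expire` is called on its node, at which point the watch fires and the new process takes over. Without the watch registration, the new process would stay a non-leader indefinitely, which matches the symptom reported above.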