Re: 6.4.0 collection leader election and recovery issues
Thanks Shawn. Yes, I did index some docs after moving to 6.4.0. The release notes did not mention anything about the format being changed, so I thought it would be backward compatible. Yeah, my only recourse is to re-index the data.

Apart from that, I hit weird problems overall with 6.4.0. I was excited about using the unified highlighter, but the ZooKeeper flakiness, the constant Solr disconnections, and the occasional failure to elect a leader for some collections made me roll back. Anyway, thanks for responding promptly; I will be more careful next time.

Thanks

Ravi Kiran Bhaskar

On Thu, Feb 2, 2017 at 9:41 AM, Shawn Heisey wrote:
> On 2/2/2017 7:23 AM, Ravi Solr wrote:
> > When I try to roll back from 6.4.0 to my original version of 6.0.1 it now
> > throws another issue. Now I can't go to 6.4.0 nor can I roll back to 6.0.1.
> >
> > Could not load codec 'Lucene62'. Did you forget to add
> > lucene-backward-codecs.jar?
> >     at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:429)
> >     at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:349)
> >     at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:284)
> >
> > Hope this doesn't cost me dearly. Any ideas at least on how to roll back
> > safely?
>
> This sounds like you did some indexing after the upgrade, or possibly
> some index optimizing, so the parts of the index that were written (or
> merged) by the newer version are now in a format that the older version
> cannot use. Perhaps the merge policy was changed, causing Solr to do
> some automatic merges once it started up. I am not aware of anything in
> Solr that would write new segments without indexing input or a merge
> policy change.
>
> As far as I know, there is no straightforward way to go backwards with
> the index format. If you want to downgrade and don't have a backup of
> your indexes from before the upgrade, you'll probably need to wipe the
> index directory and completely reindex.
>
> Solr will always use the newest default index format for new segments
> when you upgrade. Contrary to many user expectations, setting
> luceneMatchVersion will *NOT* affect the index format, only the behavior
> of components that do field analysis.
>
> Downgrading the index format would involve writing a custom Lucene
> program that changes the active index format to the older version, then
> runs a forceMerge on the index. It would be completely separate from
> Solr, and definitely not straightforward.
>
> Thanks,
> Shawn
Re: 6.4.0 collection leader election and recovery issues
On 2/2/2017 7:23 AM, Ravi Solr wrote:
> When I try to roll back from 6.4.0 to my original version of 6.0.1 it now
> throws another issue. Now I can't go to 6.4.0 nor can I roll back to 6.0.1.
>
> Could not load codec 'Lucene62'. Did you forget to add
> lucene-backward-codecs.jar?
>     at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:429)
>     at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:349)
>     at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:284)
>
> Hope this doesn't cost me dearly. Any ideas at least on how to roll back
> safely?

This sounds like you did some indexing after the upgrade, or possibly some index optimizing, so the parts of the index that were written (or merged) by the newer version are now in a format that the older version cannot use. Perhaps the merge policy was changed, causing Solr to do some automatic merges once it started up. I am not aware of anything in Solr that would write new segments without indexing input or a merge policy change.

As far as I know, there is no straightforward way to go backwards with the index format. If you want to downgrade and don't have a backup of your indexes from before the upgrade, you'll probably need to wipe the index directory and completely reindex.

Solr will always use the newest default index format for new segments when you upgrade. Contrary to many user expectations, setting luceneMatchVersion will *NOT* affect the index format, only the behavior of components that do field analysis.

Downgrading the index format would involve writing a custom Lucene program that changes the active index format to the older version, then runs a forceMerge on the index. It would be completely separate from Solr, and definitely not straightforward.

Thanks,
Shawn
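[Editor's note: as a rough, untested sketch of the kind of standalone Lucene program Shawn describes (and only that; not an endorsed or supported procedure), such a tool would open the core's index directory with an IndexWriter whose codec is pinned to the older format and then run forceMerge so every segment is rewritten. The codec name "Lucene60" (the 6.0.x default) and the analyzer used here are assumptions for illustration, and whether an older codec is even writable with the jars on your classpath is precisely the "definitely not straightforward" part.]

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class DowngradeIndexFormat {
        public static void main(String[] args) throws Exception {
            // args[0] = path to the core's data/index directory; Solr must be stopped first.
            try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]))) {
                IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
                // Pin new segments to the older codec. "Lucene60" (the 6.0.x format) is an
                // assumption; it must be resolvable and writable on the program's classpath.
                cfg.setCodec(Codec.forName("Lucene60"));
                try (IndexWriter writer = new IndexWriter(dir, cfg)) {
                    // Rewrite all existing segments into a single segment using the codec above.
                    writer.forceMerge(1);
                    writer.commit();
                }
            }
        }
    }

[Even if this compiles and runs against a given index, a forceMerge on a large index is expensive, so restoring from a backup or completely reindexing, as Shawn advises, is the safer path.]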
Re: 6.4.0 collection leader election and recovery issues
Thanks Hendrik. I am baffled as to why I did not hit this issue prior to moving to 6.4.0.

On Thu, Feb 2, 2017 at 7:58 AM, Hendrik Haddorp wrote:
> Might be that your overseer queue got overloaded. Similar to what is described here:
> https://support.lucidworks.com/hc/en-us/articles/203959903-Bringing-up-downed-Solr-servers-that-don-t-want-to-come-up
>
> If the overseer queue gets too long you get hit by this:
> https://github.com/Netflix/curator/wiki/Tech-Note-4
>
> Try to request the overseer status (/solr/admin/collections?action=OVERSEERSTATUS).
> If that fails you likely hit that problem. If so, you can also no longer use the
> ZooKeeper command line client. You can now restart all your ZK nodes with an
> increased jute.maxbuffer value. Once ZK is restarted you can use the ZK command
> line client with the same jute.maxbuffer value and check how many entries
> /overseer/queue has in ZK. Normally there should be a few entries, but if you see
> thousands then you should delete them. I used a few lines of Java code for that,
> again setting jute.maxbuffer to the same value. Once cleaned up, restart the Solr
> nodes one by one and keep an eye on the overseer status.
>
> On 02.02.2017 10:52, Ravi Solr wrote:
>> Following up on my previous email, the intermittent server unavailability
>> seems to be linked to the interaction between Solr and Zookeeper. Can
>> somebody help me understand what this error means and how to recover from it.
>>
>> 2017-02-02 09:44:24.648 ERROR (recoveryExecutor-3-thread-16-processing-n:xx.xxx.xxx.xxx:1234_solr x:clicktrack_shard1_replica4 s:shard1 c:clicktrack r:core_node3) [c:clicktrack s:shard1 r:core_node3 x:clicktrack_shard1_replica4] o.a.s.c.RecoveryStrategy Error while trying to recover. core=clicktrack_shard1_replica4:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue/qn-
>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>>     at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
>>     at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
>>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>>     at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
>>     at org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:244)
>>     at org.apache.solr.cloud.ZkController.publish(ZkController.java:1215)
>>     at org.apache.solr.cloud.ZkController.publish(ZkController.java:1128)
>>     at org.apache.solr.cloud.ZkController.publish(ZkController.java:1124)
>>     at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:334)
>>     at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
>>     at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> Thanks
>>
>> Ravi Kiran Bhaskar
>>
>> On Thu, Feb 2, 2017 at 2:27 AM, Ravi Solr wrote:
>>
>>> Hello,
>>> Yesterday I upgraded from 6.0.1 to 6.4.0; it's been a straight 12
>>> hours of debugging!! Can somebody kindly help me out of this misery.
>>>
>>> I have a set of 8 single-shard collections with 3 replicas each. As soon as I
>>> updated the configs and started the servers, one of my collections got stuck
>>> with no leader. I have restarted Solr to no avail; I also tried to force a
>>> leader via the Collections API, but that didn't work either. I also see that,
>>> from time to time, multiple Solr nodes go down all at the same time, and only
>>> a restart resolves the issue.
>>>
>>> The error snippets are shown below
>>>
>>> 2017-02-02 01:43:42.785 ERROR (recoveryExecutor-3-thread-6-processing-n:10.128.159.245:9001_solr x:clicktrack_shard1_replica1 s:shard1 c:clicktrack r:core_node1) [c:clicktrack s:shard1 r:core_node1 x:clicktrack_shard1_replica1] o.a.s.c.RecoveryStrategy Error while trying to recover. core=clicktrack_shard1_replica1:org.apache.solr.common.SolrException:
>>> No registered leader was found after waiting for 4000ms , collection:
>>> clicktrack slice: shard1
Re: 6.4.0 collection leader election and recovery issues
When I try to roll back from 6.4.0 to my original version of 6.0.1 it now throws another issue. Now I can't go to 6.4.0 nor can I roll back to 6.0.1.

Could not load codec 'Lucene62'. Did you forget to add lucene-backward-codecs.jar?
    at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:429)
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:349)
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:284)

Hope this doesn't cost me dearly. Any ideas at least on how to roll back safely?

Thanks

Ravi Kiran Bhaskar

On Thu, Feb 2, 2017 at 4:52 AM, Ravi Solr wrote:
> Following up on my previous email, the intermittent server unavailability
> seems to be linked to the interaction between Solr and Zookeeper. Can
> somebody help me understand what this error means and how to recover from it.
>
> 2017-02-02 09:44:24.648 ERROR (recoveryExecutor-3-thread-16-processing-n:xx.xxx.xxx.xxx:1234_solr x:clicktrack_shard1_replica4 s:shard1 c:clicktrack r:core_node3) [c:clicktrack s:shard1 r:core_node3 x:clicktrack_shard1_replica4] o.a.s.c.RecoveryStrategy Error while trying to recover. core=clicktrack_shard1_replica4:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue/qn-
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>     at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
>     at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>     at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
>     at org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:244)
>     at org.apache.solr.cloud.ZkController.publish(ZkController.java:1215)
>     at org.apache.solr.cloud.ZkController.publish(ZkController.java:1128)
>     at org.apache.solr.cloud.ZkController.publish(ZkController.java:1124)
>     at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:334)
>     at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
>     at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> Thanks
>
> Ravi Kiran Bhaskar
>
> On Thu, Feb 2, 2017 at 2:27 AM, Ravi Solr wrote:
>
>> Hello,
>> Yesterday I upgraded from 6.0.1 to 6.4.0; it's been a straight 12
>> hours of debugging!! Can somebody kindly help me out of this misery.
>>
>> I have a set of 8 single-shard collections with 3 replicas each. As soon as I
>> updated the configs and started the servers, one of my collections got stuck
>> with no leader. I have restarted Solr to no avail; I also tried to force a
>> leader via the Collections API, but that didn't work either. I also see that,
>> from time to time, multiple Solr nodes go down all at the same time, and only
>> a restart resolves the issue.
>>
>> The error snippets are shown below
>>
>> 2017-02-02 01:43:42.785 ERROR (recoveryExecutor-3-thread-6-processing-n:10.128.159.245:9001_solr x:clicktrack_shard1_replica1 s:shard1 c:clicktrack r:core_node1) [c:clicktrack s:shard1 r:core_node1 x:clicktrack_shard1_replica1] o.a.s.c.RecoveryStrategy Error while trying to recover. core=clicktrack_shard1_replica1:org.apache.solr.common.SolrException:
>> No registered leader was found after waiting for 4000ms , collection:
>> clicktrack slice: shard1
>>
>> solr.log.9:2017-02-02 01:43:41.336 INFO (zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/clicktrack/state.json] for collection [clicktrack] has occurred - updating... (live nodes size: [1])
>> solr.log.9:2017-02-02 01:43:42.224 INFO (zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/clicktrack/state.json] for collection [clicktrack] has occurred - updating... (live nodes size: [1])
>> solr.log.9:2017-02-02 01:43:43.767 INFO
Re: 6.4.0 collection leader election and recovery issues
Might be that your overseer queue got overloaded. Similar to what is described here:
https://support.lucidworks.com/hc/en-us/articles/203959903-Bringing-up-downed-Solr-servers-that-don-t-want-to-come-up

If the overseer queue gets too long you get hit by this:
https://github.com/Netflix/curator/wiki/Tech-Note-4

Try to request the overseer status (/solr/admin/collections?action=OVERSEERSTATUS). If that fails you likely hit that problem. If so, you can also no longer use the ZooKeeper command line client. You can now restart all your ZK nodes with an increased jute.maxbuffer value. Once ZK is restarted you can use the ZK command line client with the same jute.maxbuffer value and check how many entries /overseer/queue has in ZK. Normally there should be a few entries, but if you see thousands then you should delete them. I used a few lines of Java code for that, again setting jute.maxbuffer to the same value. Once cleaned up, restart the Solr nodes one by one and keep an eye on the overseer status.

On 02.02.2017 10:52, Ravi Solr wrote:
> Following up on my previous email, the intermittent server unavailability
> seems to be linked to the interaction between Solr and Zookeeper. Can
> somebody help me understand what this error means and how to recover from it.
>
> 2017-02-02 09:44:24.648 ERROR (recoveryExecutor-3-thread-16-processing-n:xx.xxx.xxx.xxx:1234_solr x:clicktrack_shard1_replica4 s:shard1 c:clicktrack r:core_node3) [c:clicktrack s:shard1 r:core_node3 x:clicktrack_shard1_replica4] o.a.s.c.RecoveryStrategy Error while trying to recover. core=clicktrack_shard1_replica4:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue/qn-
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>     at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
>     at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>     at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
>     at org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:244)
>     at org.apache.solr.cloud.ZkController.publish(ZkController.java:1215)
>     at org.apache.solr.cloud.ZkController.publish(ZkController.java:1128)
>     at org.apache.solr.cloud.ZkController.publish(ZkController.java:1124)
>     at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:334)
>     at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
>     at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> Thanks
>
> Ravi Kiran Bhaskar
>
> On Thu, Feb 2, 2017 at 2:27 AM, Ravi Solr wrote:
>
>> Hello,
>> Yesterday I upgraded from 6.0.1 to 6.4.0; it's been a straight 12
>> hours of debugging!! Can somebody kindly help me out of this misery.
>>
>> I have a set of 8 single-shard collections with 3 replicas each. As soon as I
>> updated the configs and started the servers, one of my collections got stuck
>> with no leader. I have restarted Solr to no avail; I also tried to force a
>> leader via the Collections API, but that didn't work either. I also see that,
>> from time to time, multiple Solr nodes go down all at the same time, and only
>> a restart resolves the issue.
>>
>> The error snippets are shown below
>>
>> 2017-02-02 01:43:42.785 ERROR (recoveryExecutor-3-thread-6-processing-n:10.128.159.245:9001_solr x:clicktrack_shard1_replica1 s:shard1 c:clicktrack r:core_node1) [c:clicktrack s:shard1 r:core_node1 x:clicktrack_shard1_replica1] o.a.s.c.RecoveryStrategy Error while trying to recover. core=clicktrack_shard1_replica1:org.apache.solr.common.SolrException:
>> No registered leader was found after waiting for 4000ms , collection:
>> clicktrack slice: shard1
>>
>> solr.log.9:2017-02-02 01:43:41.336 INFO (zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/clicktrack/state.json] for collection [clicktrack] has occurred - updating... (live nodes size: [1])
>> solr.log.9:2017-02-02 01:43:42.224 INFO (zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent
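[Editor's note: for anyone hitting the same wall, here is a minimal sketch of the kind of "few lines of Java code" Hendrik mentions for clearing a runaway /overseer/queue. It is an illustration of the approach, not a supported tool: the connect string, session timeout, and the need to launch the JVM with an enlarged -Djute.maxbuffer (matching the value the ZK servers were restarted with) are placeholder assumptions you would adapt to your own setup.]

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class OverseerQueueCleanup {
        public static void main(String[] args) throws Exception {
            // Run as: java -Djute.maxbuffer=<same large value as on the ZK servers> OverseerQueueCleanup <zkHost>
            // where <zkHost> is something like "zk1:2181,zk2:2181,zk3:2181" (placeholder).
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper(args[0], 30000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();
            try {
                // A healthy cluster normally has only a handful of entries here.
                List<String> entries = zk.getChildren("/overseer/queue", false);
                System.out.println("/overseer/queue currently holds " + entries.size() + " entries");

                // If there are thousands of stale entries, delete them; version -1 matches any version.
                for (String entry : entries) {
                    zk.delete("/overseer/queue/" + entry, -1);
                }
            } finally {
                zk.close();
            }
        }
    }

[After the queue is cleaned up, Hendrik's advice above still applies: restart the Solr nodes one at a time and keep watching OVERSEERSTATUS.]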
Re: 6.4.0 collection leader election and recovery issues
Following up on my previous email, the intermittent server unavailability seems to be linked to the interaction between Solr and Zookeeper. Can somebody help me understand what this error means and how to recover from it.

2017-02-02 09:44:24.648 ERROR (recoveryExecutor-3-thread-16-processing-n:xx.xxx.xxx.xxx:1234_solr x:clicktrack_shard1_replica4 s:shard1 c:clicktrack r:core_node3) [c:clicktrack s:shard1 r:core_node3 x:clicktrack_shard1_replica4] o.a.s.c.RecoveryStrategy Error while trying to recover. core=clicktrack_shard1_replica4:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue/qn-
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
    at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
    at org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:244)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1215)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1128)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1124)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:334)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Thanks

Ravi Kiran Bhaskar

On Thu, Feb 2, 2017 at 2:27 AM, Ravi Solr wrote:
> Hello,
> Yesterday I upgraded from 6.0.1 to 6.4.0; it's been a straight 12
> hours of debugging!! Can somebody kindly help me out of this misery.
>
> I have a set of 8 single-shard collections with 3 replicas each. As soon as I
> updated the configs and started the servers, one of my collections got stuck
> with no leader. I have restarted Solr to no avail; I also tried to force a
> leader via the Collections API, but that didn't work either. I also see that,
> from time to time, multiple Solr nodes go down all at the same time, and only
> a restart resolves the issue.
>
> The error snippets are shown below
>
> 2017-02-02 01:43:42.785 ERROR (recoveryExecutor-3-thread-6-processing-n:10.128.159.245:9001_solr x:clicktrack_shard1_replica1 s:shard1 c:clicktrack r:core_node1) [c:clicktrack s:shard1 r:core_node1 x:clicktrack_shard1_replica1] o.a.s.c.RecoveryStrategy Error while trying to recover.
> core=clicktrack_shard1_replica1:org.apache.solr.common.SolrException:
> No registered leader was found after waiting for 4000ms , collection:
> clicktrack slice: shard1
>
> solr.log.9:2017-02-02 01:43:41.336 INFO (zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/clicktrack/state.json] for collection [clicktrack] has occurred - updating... (live nodes size: [1])
> solr.log.9:2017-02-02 01:43:42.224 INFO (zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/clicktrack/state.json] for collection [clicktrack] has occurred - updating... (live nodes size: [1])
> solr.log.9:2017-02-02 01:43:43.767 INFO (zkCallback-4-thread-23-processing-n:10.128.159.245:9001_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/clicktrack/state.json] for collection [clicktrack] has occurred - updating... (live nodes size: [1])
>
>
> Suspecting the worst, I backed up the index, renamed the collection's data
> folder, and restarted the servers; this time the collection got a proper
> leader. So is my index really corrupted? The Solr UI showed live nodes just
> like the logs, but without any leader. Even with the leader issue somewhat
> alleviated after renaming the data folder and letting Solr create a new data
> folder, my servers did go down a couple of times.
>
> I am not all that well versed with zookeeper... any trick to make zookeeper
> pick
6.4.0 collection leader election and recovery issues
Hello,
     Yesterday I upgraded from 6.0.1 to 6.4.0; it's been a straight 12 hours of debugging!! Can somebody kindly help me out of this misery.

I have a set of 8 single-shard collections with 3 replicas each. As soon as I updated the configs and started the servers, one of my collections got stuck with no leader. I have restarted Solr to no avail; I also tried to force a leader via the Collections API, but that didn't work either. I also see that, from time to time, multiple Solr nodes go down all at the same time, and only a restart resolves the issue.

The error snippets are shown below

2017-02-02 01:43:42.785 ERROR (recoveryExecutor-3-thread-6-processing-n:10.128.159.245:9001_solr x:clicktrack_shard1_replica1 s:shard1 c:clicktrack r:core_node1) [c:clicktrack s:shard1 r:core_node1 x:clicktrack_shard1_replica1] o.a.s.c.RecoveryStrategy Error while trying to recover. core=clicktrack_shard1_replica1:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: clicktrack slice: shard1

solr.log.9:2017-02-02 01:43:41.336 INFO (zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/clicktrack/state.json] for collection [clicktrack] has occurred - updating... (live nodes size: [1])
solr.log.9:2017-02-02 01:43:42.224 INFO (zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/clicktrack/state.json] for collection [clicktrack] has occurred - updating... (live nodes size: [1])
solr.log.9:2017-02-02 01:43:43.767 INFO (zkCallback-4-thread-23-processing-n:10.128.159.245:9001_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/clicktrack/state.json] for collection [clicktrack] has occurred - updating... (live nodes size: [1])

Suspecting the worst, I backed up the index, renamed the collection's data folder, and restarted the servers; this time the collection got a proper leader. So is my index really corrupted? The Solr UI showed live nodes just like the logs, but without any leader. Even with the leader issue somewhat alleviated after renaming the data folder and letting Solr create a new data folder, my servers did go down a couple of times.

I am not all that well versed with zookeeper... any trick to make zookeeper pick a leader and be happy? Did anybody have solr/zookeeper issues with 6.4.0?

Thanks

Ravi Kiran Bhaskar