[
https://issues.apache.org/jira/browse/SOLR-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502672#comment-13502672
]
Werner Maier commented on SOLR-3993:
------------------------------------
To Po: (thanks for your comment)
1) [...]this only one core will be the leader finally. this will take a long
time cause the waitForReplicasComeup() [...]
I don't see a WaitForReplicasComeUp in the logs in that case. Maybe I
misinterpreted the many exceptions in the log.
After a power failure or kill -9 it shows (for me) a recovery loop. Maybe this
loop will end eventually (I stopped watching after some 10 to 15 minutes).
I'll try that again when I have a few minutes.
2) Zookeeper:
I know. But I'm a sysadmin dealing with real hardware (and failures) - not a
programmer who just uses failure-free hardware and proposes preconditions :)
I have seen a ROW of 19" cabinets going down - even though each of them had TWO
redundant power lines and the computing center had an N+2 UPS. So I'm just
trying all possible scenarios that I can imagine - and one of these is a power
outage for ZooKeeper and Solr at the same time. Some things can be controlled
with startup order (first ZooKeeper, then Solr) on single machines, but if
multiple machines are involved, this gets difficult.
If such a problem can easily be circumvented by the Solr instance reconnecting
to ZooKeeper, then the Solr instance should just do that.
So I tried these things and wrote the bug report - as the developers
(who do a very good job!) may simply not have considered these cases.
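
The reconnect idea from point 2 could look roughly like the sketch below. This
is only an illustration under assumptions: RetryingConnector and
connectWithRetry are hypothetical names, not Solr or ZooKeeper API, and the
"attempt" stands in for whatever call actually opens the ZooKeeper session.
The real fix in Solr may well look different.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; not part of Solr or ZooKeeper. */
public final class RetryingConnector {

    private RetryingConnector() {}

    /**
     * Runs the attempt until it succeeds or maxAttempts is reached.
     * The pause between attempts doubles each time, capped at maxDelayMillis.
     */
    public static <T> T connectWithRetry(Callable<T> attempt, int maxAttempts,
                                         long initialDelayMillis, long maxDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread was asked to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (i < maxAttempts) {
                    Thread.sleep(delay);
                    delay = Math.min(delay * 2, maxDelayMillis);
                }
            }
        }
        // Every attempt failed; report the most recent failure.
        throw last;
    }
}
```

With something like this, a Solr node that comes up before ZooKeeper would keep
trying for a bounded time instead of depending on the startup order.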
> SolrCloud leader election on single node stucks the initialization
> ------------------------------------------------------------------
>
> Key: SOLR-3993
> URL: https://issues.apache.org/jira/browse/SOLR-3993
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.0
> Environment: Windows 7, Tomcat 6
> Reporter: Alexey Kudinov
> Assignee: Mark Miller
> Fix For: 4.1, 5.0
>
>
> setup:
> 1 node, 4 cores, 2 shards.
> 15 documents indexed.
> problem:
> init stage times out.
> probable cause:
> According to the init flow, cores are initialized one by one synchronously.
> The main thread waits in
> ShardLeaderElectionContext.waitForReplicasToComeUp until the retry threshold
> is reached, while the replica cores are not yet initialized. In other words,
> there is no chance for other replicas to come up in the meantime.
> stack trace:
> Thread [main] (Suspended)
> owns: HashMap<K,V> (id=3876)
> owns: StandardContext (id=3877)
> owns: HashMap<K,V> (id=3878)
> owns: StandardHost (id=3879)
> owns: StandardEngine (id=3880)
> owns: Service[] (id=3881)
> Thread.sleep(long) line: not available [native method]
> ShardLeaderElectionContext.waitForReplicasToComeUp(boolean, String)
> line: 298
> ShardLeaderElectionContext.runLeaderProcess(boolean) line: 143
> LeaderElector.runIamLeaderProcess(ElectionContext, boolean) line: 152
> LeaderElector.checkIfIamLeader(int, ElectionContext, boolean) line: 96
> LeaderElector.joinElection(ElectionContext) line: 262
> ZkController.joinElection(CoreDescriptor, boolean) line: 733
> ZkController.register(String, CoreDescriptor, boolean, boolean) line:
> 566
> ZkController.register(String, CoreDescriptor) line: 532
> CoreContainer.registerInZk(SolrCore) line: 709
> CoreContainer.register(String, SolrCore, boolean) line: 693
> CoreContainer.load(String, InputSource) line: 535
> CoreContainer.load(String, File) line: 356
> CoreContainer$Initializer.initialize() line: 308
> SolrDispatchFilter.init(FilterConfig) line: 107
> ApplicationFilterConfig.getFilter() line: 295
> ApplicationFilterConfig.setFilterDef(FilterDef) line: 422
> ApplicationFilterConfig.<init>(Context, FilterDef) line: 115
> StandardContext.filterStart() line: 4072
> StandardContext.start() line: 4726
> StandardHost(ContainerBase).addChildInternal(Container) line: 799
> StandardHost(ContainerBase).addChild(Container) line: 779
> StandardHost.addChild(Container) line: 601
> HostConfig.deployDescriptor(String, File, String) line: 675
> HostConfig.deployDescriptors(File, String[]) line: 601
> HostConfig.deployApps() line: 502
> HostConfig.start() line: 1317
> HostConfig.lifecycleEvent(LifecycleEvent) line: 324
> LifecycleSupport.fireLifecycleEvent(String, Object) line: 142
> StandardHost(ContainerBase).start() line: 1065
> StandardHost.start() line: 840
> StandardEngine(ContainerBase).start() line: 1057
> StandardEngine.start() line: 463
> StandardService.start() line: 525
> StandardServer.start() line: 754
> Catalina.start() line: 595
> NativeMethodAccessorImpl.invoke0(Method, Object, Object[]) line: not
> available [native method]
> NativeMethodAccessorImpl.invoke(Object, Object[]) line: not available
> DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: not
> available
> Method.invoke(Object, Object...) line: not available
> Bootstrap.start() line: 289
> Bootstrap.main(String[]) line: 414
>
> After a while, the session times out and the following exception appears:
> Oct 25, 2012 1:16:56 PM org.apache.solr.cloud.ShardLeaderElectionContext
> waitForReplicasToComeUp
> INFO: Waiting until we see more replicas up: total=2 found=0 timeoutin=-95
> Oct 25, 2012 1:16:56 PM org.apache.solr.cloud.ShardLeaderElectionContext
> waitForReplicasToComeUp
> INFO: Was waiting for replicas to come up, but they are taking too long -
> assuming they won't come back till later
> Oct 25, 2012 1:16:56 PM org.apache.solr.common.SolrException log
> SEVERE: Errir checking for the number of election
> participants:org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for
> /collections/collection1/leader_elect/shard2/election
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1249)
> at
> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:227)
> at
> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:224)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
> at
> org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:224)
> at
> org.apache.solr.cloud.ShardLeaderElectionContext.waitForReplicasToComeUp(ElectionContext.java:276)
> at
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:143)
> at
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:152)
> at
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:96)
> at
> org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:262)
> at
> org.apache.solr.cloud.ZkController.joinElection(ZkController.java:733)
> at org.apache.solr.cloud.ZkController.register(ZkController.java:566)
> at org.apache.solr.cloud.ZkController.register(ZkController.java:532)
> at
> org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:709)
> at org.apache.solr.core.CoreContainer.register(CoreContainer.java:693)
> at org.apache.solr.core.CoreContainer.load(CoreContainer.java:535)
> at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356)
> at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:308)
> at
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:107)
> at
> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
> at
> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
> at
> org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:115)
> at
> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4072)
> at
> org.apache.catalina.core.StandardContext.start(StandardContext.java:4726)
> at
> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
> at
> org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
> at
> org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601)
> at
> org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:675)
> at
> org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:601)
> at
> org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:502)
> at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1317)
> at
> org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:324)
> at
> org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
> at
> org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1065)
> at org.apache.catalina.core.StandardHost.start(StandardHost.java:840)
> at
> org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1057)
> at
> org.apache.catalina.core.StandardEngine.start(StandardEngine.java:463)
> at
> org.apache.catalina.core.StandardService.start(StandardService.java:525)
> at
> org.apache.catalina.core.StandardServer.start(StandardServer.java:754)
> at org.apache.catalina.startup.Catalina.start(Catalina.java:595)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
> at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
> Followed by:
> Oct 25, 2012 1:17:27 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> SEVERE: Recovery failed - trying again... core=collection1
> Oct 25, 2012 1:18:32 PM org.apache.solr.common.SolrException log
> SEVERE: Error while trying to recover. core=collection1
> Oct 25, 2012 1:18:32 PM org.apache.solr.common.SolrException log
> SEVERE: Error while trying to recover.
> core=collection1:org.apache.solr.common.SolrException: No registered leader
> was found, collection:collection1 slice:shard1
> at
> org.apache.solr.common.cloud.ZkStateReader.getLeaderProps(ZkStateReader.java:413)
> at
> org.apache.solr.common.cloud.ZkStateReader.getLeaderProps(ZkStateReader.java:399)
> at
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:318)
> at
> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:220)
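
The probable cause described in the quoted report can be illustrated with a toy
model. This is a sketch under assumptions, not Solr code: SequentialInitModel
and its members are hypothetical names, and the setup is simplified to a single
shard whose replicas all live on the same node. Each core joins and then polls
for the expected number of replicas on the one thread that would also have to
start the remaining cores.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of sequential core startup; all names are hypothetical. */
public final class SequentialInitModel {

    /** What one core observed while waiting for its replicas. */
    public static final class WaitResult {
        public final String core;
        public final int found;
        public final boolean timedOut;

        WaitResult(String core, int found, boolean timedOut) {
            this.core = core;
            this.found = found;
            this.timedOut = timedOut;
        }
    }

    private SequentialInitModel() {}

    /**
     * Starts the cores one by one on the calling thread. Each core joins,
     * then polls up to maxPolls times until expectedReplicas have joined.
     */
    public static List<WaitResult> initSequentially(List<String> cores,
                                                    int expectedReplicas,
                                                    int maxPolls) {
        List<WaitResult> results = new ArrayList<>();
        int joined = 0;
        for (String core : cores) {
            joined++; // this core joins, then waits for the others
            int polls = 0;
            // Nothing in this loop can change 'joined': the only thread that
            // could start another core is the one busy polling here.
            while (joined < expectedReplicas && polls < maxPolls) {
                polls++;
            }
            results.add(new WaitResult(core, joined, joined < expectedReplicas));
        }
        return results;
    }
}
```

In this model the first core always runs into its timeout, because the second
core can only join after the first one has stopped waiting.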