[ https://issues.apache.org/jira/browse/SOLR-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908116#comment-13908116 ]
Markus Jelsma commented on SOLR-5579:
-------------------------------------

Not sure if I'm struck by this too, but a cluster started to fail after a successful collection reload.

> Leader stops processing collection-work-queue after failed collection reload
> ----------------------------------------------------------------------------
>
> Key: SOLR-5579
> URL: https://issues.apache.org/jira/browse/SOLR-5579
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.5.1
> Environment: Debian Linux 6.0 running on VMWare
> Using embedded SOLR Jetty.
> Reporter: Eric Bus
> Assignee: Mark Miller
> Labels: collections, queue
>
> I've been experiencing the same problem a few times now. My leader in
> /overseer_elect/leader stops processing the collection queue at
> /overseer/collection-queue-work. The queue will build up and it will trigger
> an alert in my monitoring tool.
> I haven't been able to pinpoint the reason that the leader stops, but usually
> I kill the leader node to trigger a leader election. The new node will pick
> up the queue. And this is where the problems start.
> When the new leader is processing the queue and picks up a reload for a shard
> without an active leader, the queue stops. It keeps repeating the message
> that there is no active leader for the shard. But a new leader is never
> elected:
> {quote}
> ERROR - 2013-12-24 14:43:40.390; org.apache.solr.common.SolrException; Error while trying to recover. core=magento_349_shard1_replica1:org.apache.solr.common.SolrException: No registered leader was found, collection:magento_349 slice:shard1
>     at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:482)
>     at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:465)
>     at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:317)
>     at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)
> ERROR - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (7) core=magento_349_shard1_replica1
> INFO - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; Wait 256.0 seconds before trying to recover again (8)
> {quote}
> Is the leader election in some way connected to the collection queue? If so,
> can this be a deadlock, because it won't elect until the reload is complete?

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
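
For anyone trying to confirm whether they are hitting the same loop, here is a minimal sketch that scans a raw Solr log for the two messages quoted in the report. The function name `summarize_recovery_log` and both regular expressions are hypothetical; they are derived only from the 4.5.1 log excerpt in this issue, and other Solr versions may word these messages differently.

```python
import re

# Patterns are assumptions based solely on the log excerpt quoted in this issue.
NO_LEADER = re.compile(
    r"No\s+registered\s+leader\s+was\s+found,\s+collection:(\S+)\s+slice:(\S+)"
)
RETRY_WAIT = re.compile(
    r"Wait\s+([\d.]+)\s+seconds\s+before\s+trying\s+to\s+recover\s+again\s+\((\d+)\)"
)


def summarize_recovery_log(text):
    """Summarize recovery-loop messages in raw Solr log text.

    Returns a tuple of:
      - the set of (collection, slice) pairs reported as having no leader
      - a list of (attempt, wait_seconds) pairs, in log order
    """
    leaderless = set(NO_LEADER.findall(text))
    waits = [(int(attempt), float(secs)) for secs, attempt in RETRY_WAIT.findall(text)]
    return leaderless, waits


if __name__ == "__main__":
    # Two log lines copied from the report, rejoined onto single lines.
    sample = (
        "ERROR - 2013-12-24 14:43:40.390; org.apache.solr.common.SolrException; "
        "Error while trying to recover. "
        "core=magento_349_shard1_replica1:org.apache.solr.common.SolrException: "
        "No registered leader was found, collection:magento_349 slice:shard1\n"
        "INFO - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; "
        "Wait 256.0 seconds before trying to recover again (8)\n"
    )
    leaderless, waits = summarize_recovery_log(sample)
    print("Shards without a leader:", sorted(leaderless))
    print("Recovery retries (attempt, wait seconds):", waits)
```

A steadily growing attempt counter for the same core, together with a repeated "no registered leader" entry for its shard, would match the symptom described here. It does not by itself show that the Overseer queue is the cause.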