[ https://issues.apache.org/jira/browse/SOLR-5961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308716#comment-14308716 ]
Gopal Patwa commented on SOLR-5961:
-----------------------------------

We also hit a similar problem today in our production system, as Ugo mentioned. It was triggered after rebooting the machines of our ZooKeeper ensemble (5 nodes) and SolrCloud cluster (8 nodes, single shard) to install a Unix security patch. Environment: JDK 7, Solr 4.10.3, CentOS.

After the reboot, we saw a huge number of messages in overseer/queue:

./zkCli.sh -server localhost:2181 ls /search/catalog/overseer/queue | sed 's/,/\n/g' | wc -l
178587

We have a very small cluster (8 nodes); how can overseer/queue end up with 178k+ messages? Because of this, the leader node took several hours to come out of recovery.

Logs from ZooKeeper:

org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /overseer/queue

> Solr gets crazy on /overseer/queue state change
> -----------------------------------------------
>
>                 Key: SOLR-5961
>                 URL: https://issues.apache.org/jira/browse/SOLR-5961
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.7.1
>         Environment: CentOS, 1 shard - 3 replicas, ZK cluster with 3 nodes (separate machines)
>            Reporter: Maxim Novikov
>            Assignee: Shalin Shekhar Mangar
>            Priority: Critical
>
> No idea how to reproduce it, but sometimes Solr starts littering the log with the following messages:
>
> 419158 [localhost-startStop-1-EventThread] INFO  org.apache.solr.cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
> 419190 [Thread-3] INFO  org.apache.solr.cloud.Overseer - Update state numShards=1 message={
>   "operation":"state",
>   "state":"recovering",
>   "base_url":"http://${IP_ADDRESS}/solr",
>   "core":"${CORE_NAME}",
>   "roles":null,
>   "node_name":"${NODE_NAME}_solr",
>   "shard":"shard1",
>   "collection":"${COLLECTION_NAME}",
>   "numShards":"1",
>   "core_node_name":"core_node2"}
>
> It continues spamming these messages with no delay, and restarting all the nodes does not help.
> I have even tried stopping all the nodes in the cluster first, but when I start one back up, the behavior doesn't change: it goes crazy with this "/overseer/queue state" spam again.
>
> PS: The only way to handle this was to stop everything, manually clean up all the Solr-related data in ZooKeeper, and then rebuild everything from scratch. As you can understand, this is unbearable in a production environment.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
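As an aside, the counting pipeline used in the comment above can be sanity-checked offline against a mocked zkCli `ls` listing. This is only a sketch: the znode names below are made up for illustration, and only the sed/wc pipeline itself comes from the comment (it assumes GNU sed, as on CentOS, where `\n` in the replacement is honored):

```shell
# Hypothetical sample of what `zkCli.sh ... ls /overseer/queue` prints:
# a single bracketed, comma-separated list of child znodes.
sample='[qn-0000000001, qn-0000000002, qn-0000000003]'

# The pipeline from the comment: turn each comma into a newline,
# then count lines -- one line per queued overseer message.
count=$(echo "$sample" | sed 's/,/\n/g' | wc -l)
echo "$count"   # prints 3 (one line per child znode)
```

Against a live cluster, the same pipeline applied to the real `ls /overseer/queue` output yields the queue depth reported above (178587).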