[ 
https://issues.apache.org/jira/browse/SOLR-5961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121411#comment-14121411
 ] 

Ugo Matrangolo commented on SOLR-5961:
--------------------------------------

Hi,

we hit the same behaviour, and it caused a major outage. After a rolling restart 
of our production ZK cluster, /overseer/queue started to be spammed with events 
and the Overseer kept looping on the following:

2014-08-28 22:43:06,753 [main-EventThread] INFO  
org.apache.solr.cloud.DistributedQueue  - LatchChildWatcher fired on path: 
/overseer/queue state: SyncConnected type NodeChildrenChanged
2014-08-28 22:43:06,820 [Thread-125] INFO  org.apache.solr.cloud.Overseer  - 
Update state numShards=3 message={
  "operation":"state",
  "state":"recovering",
  "base_url":"http://{ip}:{port}",
  "core":"warehouse-skus_shard2_replica2",
  "roles":null,
  "node_name":"{ip}:{port}_",
  "shard":"shard2",
  "collection":"warehouse-skus",
  "numShards":"3",
  "core_node_name":"core_node4"}

The fix was to:
1. Shut down the cluster completely
2. Using zkCli, rmr the queues under /overseer
3. Restart the cluster

Solr started fine after that, as if nothing had happened.
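For reference, the cleanup above can be sketched with zkCli.sh roughly like this 
(a sketch, not an exact transcript: the ensemble address is a placeholder, and the 
set of queue paths under /overseer may differ in your deployment, so inspect 
before deleting; on newer ZooKeeper releases `rmr` has been replaced by 
`deleteall`):

```shell
# Placeholder ZK ensemble address -- adjust for your environment.
ZK=zk1:2181

# Inspect what actually lives under /overseer before deleting anything.
zkCli.sh -server "$ZK" ls /overseer

# With the whole Solr cluster fully shut down, recursively remove
# the Overseer work queues.
zkCli.sh -server "$ZK" rmr /overseer/queue
zkCli.sh -server "$ZK" rmr /overseer/queue-work
zkCli.sh -server "$ZK" rmr /overseer/collection-queue-work
```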

The outage was caused by clusterstate.json being in an inconsistent state (most 
of the nodes were incorrectly marked as down), with no chance of things righting 
themselves: the Overseer was too busy processing the spammed queues (more than 
13k messages in there) to update clusterstate.json with the actual state of the 
cluster.
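A quick way to gauge whether the Overseer is drowning like this is to count the 
children of /overseer/queue from the shell. This is only a sketch under 
assumptions: zkCli.sh on the PATH, a placeholder ensemble address, and the fact 
that zkCli prints the child list as a single bracketed last line:

```shell
# Placeholder ZK ensemble address -- adjust for your environment.
ZK=zk1:2181

# zkCli's `ls` prints the children as "[child1, child2, ...]" on its
# last output line; splitting on commas approximates the queue depth.
zkCli.sh -server "$ZK" ls /overseer/queue 2>/dev/null \
  | tail -n 1 \
  | tr ',' '\n' \
  | wc -l
```

A count in the thousands here, while the Solr logs loop on LatchChildWatcher 
messages like the ones above, matches the symptom described in this comment.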

> Solr gets crazy on /overseer/queue state change
> -----------------------------------------------
>
>                 Key: SOLR-5961
>                 URL: https://issues.apache.org/jira/browse/SOLR-5961
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.7.1
>         Environment: CentOS, 1 shard - 3 replicas, ZK cluster with 3 nodes 
> (separate machines)
>            Reporter: Maxim Novikov
>            Priority: Critical
>
> No idea how to reproduce it, but sometimes Solr starts littering the log with 
> the following messages:
> 419158 [localhost-startStop-1-EventThread] INFO  
> org.apache.solr.cloud.DistributedQueue  ? LatchChildWatcher fired on path: 
> /overseer/queue state: SyncConnected type NodeChildrenChanged
> 419190 [Thread-3] INFO  org.apache.solr.cloud.Overseer  ? Update state 
> numShards=1 message={
>   "operation":"state",
>   "state":"recovering",
>   "base_url":"http://${IP_ADDRESS}/solr",
>   "core":"${CORE_NAME}",
>   "roles":null,
>   "node_name":"${NODE_NAME}_solr",
>   "shard":"shard1",
>   "collection":"${COLLECTION_NAME}",
>   "numShards":"1",
>   "core_node_name":"core_node2"}
> It continues spamming these messages with no delay and the restarting of all 
> the nodes does not help. I have even tried to stop all the nodes in the 
> cluster first, but then when I start one, the behavior doesn't change, it 
> gets crazy nuts with this " /overseer/queue state" again.
> PS The only way to handle this was to stop everything, manually clean up all 
> the data in ZooKeeper related to Solr, and then rebuild everything from 
> scratch. As you should understand, it is kinda unbearable in the production 
> environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
