[ 
https://issues.apache.org/jira/browse/SOLR-5961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311681#comment-14311681
 ] 

Gopal Patwa commented on SOLR-5961:
-----------------------------------

Thanks Mark. Here are more details on our production issue; I will try to 
reproduce it.

Restart sequence:
Solr - Restarted 02/03/2015 (8 Nodes, 10 Collections)
ZKR - Restarted 02/04/2015 (5 Nodes)

Normal index size is approx. 5GB. Only a few nodes had this issue.

While a replica was in recovery, its transaction logs grew to over 100GB.
A possible reason is that the replica writes all updates sent by the leader 
during this period to the transaction log.
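
One way to watch for this while a replica recovers is to total the size of the
files under the core's tlog directory. The following is a minimal sketch,
assuming the default layout in which transaction logs sit in a tlog directory
under the core's data directory. The function names and the path in the usage
note are hypothetical.

```python
import os


def tlog_size_bytes(tlog_dir):
    """Return the total size in bytes of all files under tlog_dir."""
    total = 0
    for root, _dirs, files in os.walk(tlog_dir):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A log file may be rotated or deleted while we scan; skip it.
                continue
    return total


def exceeds_limit(tlog_dir, limit_gb):
    """True if the transaction logs under tlog_dir are larger than limit_gb."""
    return tlog_size_bytes(tlog_dir) > limit_gb * 1024 ** 3
```

For example, exceeds_limit("/path/to/city_shard1_replica2/data/tlog", 10)
would flag a replica whose transaction logs have grown past 10GB (the path is
a placeholder, not our actual layout).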

Because the overseer queue was very large, the Admin UI Cloud tree view hung. 
This may be similar to the following JIRA issue:
https://issues.apache.org/jira/browse/SOLR-6395
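
The queue depth can also be checked without the Admin UI by reading the stat
of /overseer/queue in ZooKeeper. The following is a rough sketch, assuming the
third-party kazoo Python client is installed. The host string, the threshold,
and the function names are placeholders chosen for illustration. Reading only
the stat avoids listing every child, which is itself expensive on a very large
queue.

```python
def queue_depth(zk, path="/overseer/queue"):
    """Return the number of children of path, or None if it does not exist.

    zk is any client whose exists(path) returns None or a stat object with
    a numChildren field, as kazoo's KazooClient does.
    """
    stat = zk.exists(path)
    return None if stat is None else stat.numChildren


def main(hosts="zk1.example.com:2181", warn_above=1000):
    # Imported here so queue_depth() can be used without kazoo installed.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts=hosts)  # hosts is a placeholder connect string
    zk.start()
    try:
        depth = queue_depth(zk)
        if depth is None:
            print("/overseer/queue does not exist")
        elif depth > warn_above:
            print("overseer queue is large: %d entries" % depth)
        else:
            print("overseer queue depth: %d" % depth)
    finally:
        zk.stop()
```

Calling main(hosts="...") with a real ZooKeeper connect string prints the
current depth; the warn_above threshold is arbitrary and should be tuned to
the cluster.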

Exceptions during this time:

Zookeeper Log:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /overseer/queue

Solr Log:

2015-02-05 23:23:13,174 [] priority=ERROR app_name= thread=RecoveryThread 
location=RecoveryStrategy line=142 Error while trying to recover. 
core=city_shard1_replica2:java.util.concurrent.ExecutionException: 
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: I was 
asked to wait on state recovering for shard1 in city on 
srwp01usc002.stubprod.com:8080_solr but I still do not see the requested state. 
I see state: active live:true leader from ZK: 
http://srwp01usc001.stubprod.com:8080/solr/city_shard1_replica1/
 at java.util.concurrent.FutureTask.report(FutureTask.java:122)
 at java.util.concurrent.FutureTask.get(FutureTask.java:188)
 at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:615)

> Solr gets crazy on /overseer/queue state change
> -----------------------------------------------
>
>                 Key: SOLR-5961
>                 URL: https://issues.apache.org/jira/browse/SOLR-5961
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.7.1
>         Environment: CentOS, 1 shard - 3 replicas, ZK cluster with 3 nodes 
> (separate machines)
>            Reporter: Maxim Novikov
>            Assignee: Shalin Shekhar Mangar
>            Priority: Critical
>
> No idea how to reproduce it, but sometimes Solr starts littering the log with 
> the following messages:
> 419158 [localhost-startStop-1-EventThread] INFO  
> org.apache.solr.cloud.DistributedQueue  - LatchChildWatcher fired on path: 
> /overseer/queue state: SyncConnected type NodeChildrenChanged
> 419190 [Thread-3] INFO  org.apache.solr.cloud.Overseer  - Update state 
> numShards=1 message={
>   "operation":"state",
>   "state":"recovering",
>   "base_url":"http://${IP_ADDRESS}/solr",
>   "core":"${CORE_NAME}",
>   "roles":null,
>   "node_name":"${NODE_NAME}_solr",
>   "shard":"shard1",
>   "collection":"${COLLECTION_NAME}",
>   "numShards":"1",
>   "core_node_name":"core_node2"}
> It continues spamming these messages with no delay and the restarting of all 
> the nodes does not help. I have even tried to stop all the nodes in the 
> cluster first, but then when I start one, the behavior doesn't change, it 
> gets crazy nuts with this " /overseer/queue state" again.
> PS The only way to handle this was to stop everything, manually clean up all 
> the data in ZooKeeper related to Solr, and then rebuild everything from 
> scratch. As you should understand, it is kinda unbearable in the production 
> environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
