[ https://issues.apache.org/jira/browse/SOLR-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029055#comment-14029055 ]

Shalin Shekhar Mangar commented on SOLR-6056:
---------------------------------------------

I don't understand #2: cancelRecovery makes sure that the old recovery thread is 
joined, so at any given point there should be only one thread doing the recovery. 
If that is true, the while (recoveryRunning) block is also redundant. Am I 
missing something?
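
To make sure we are talking about the same pattern, here is a minimal sketch 
(illustrative class and method names only, not the actual DefaultSolrCoreState 
code): if cancelRecovery() interrupts and joins the previous recovery thread 
before a new one is started, at most one recovery thread can be alive at a time, 
which is why an extra while (recoveryRunning) wait looks redundant.

{code:java}
// Minimal sketch of the pattern under discussion; names are illustrative.
public class RecoveryCoordinator {

    private Thread recoveryThread;            // guarded by "this"
    private volatile boolean recoveryRunning;

    public synchronized void doRecovery(Runnable recoveryTask) throws InterruptedException {
        cancelRecovery();                     // the old recovery thread is joined here
        recoveryRunning = true;
        recoveryThread = new Thread(() -> {
            try {
                recoveryTask.run();
            } finally {
                recoveryRunning = false;
            }
        }, "recovery");
        recoveryThread.start();
    }

    public synchronized void cancelRecovery() throws InterruptedException {
        if (recoveryThread != null) {
            recoveryThread.interrupt();       // ask the running recovery to stop
            recoveryThread.join();            // wait for it, so only one can ever run
            recoveryThread = null;
        }
    }
}
{code}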

Overall this doesn't seem like the best strategy, because each recovery request 
from DUPF.finish() will create a recovery thread, the next such request will 
cancel the last recovery and start a new recovery thread, and so on. Also, if a 
long-running recovery is in progress and additional recovery requests keep 
coming in, the core will rapidly tie up threads, increasing the chances of a 
deadlock.
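
One hypothetical alternative (a rough sketch only, not a patch; the names below 
are made up) would be to coalesce recovery requests per core: a request arriving 
while a recovery is already running just marks a rerun instead of cancelling the 
current one and spawning another thread, so at most one recovery thread exists 
at any time.

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical coalescing sketch: repeated recovery requests never create
// more than one recovery thread; they only schedule one extra pass.
public class CoalescingRecovery {

    private final AtomicBoolean running = new AtomicBoolean(false);
    private final AtomicBoolean rerunRequested = new AtomicBoolean(false);

    /** Called once per failed update; never starts more than one recovery thread. */
    public void requestRecovery(Runnable recoveryTask) {
        rerunRequested.set(true);
        maybeStart(recoveryTask);
    }

    private void maybeStart(Runnable recoveryTask) {
        if (running.compareAndSet(false, true)) {
            new Thread(() -> {
                try {
                    // consume all requests that arrived while we were recovering
                    while (rerunRequested.compareAndSet(true, false)) {
                        recoveryTask.run();
                    }
                } finally {
                    running.set(false);
                    // a request may have slipped in after the loop exited; re-check
                    if (rerunRequested.get()) {
                        maybeStart(recoveryTask);
                    }
                }
            }, "recovery").start();
        }
    }
}
{code}

With this shape, a burst of failed updates results in at most one extra recovery 
pass instead of one new thread per request.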

> Zookeeper crash JVM stack OOM because of recover strategy 
> ----------------------------------------------------------
>
>                 Key: SOLR-6056
>                 URL: https://issues.apache.org/jira/browse/SOLR-6056
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.6
>         Environment: Two linux servers, 65G memory, 16 core cpu
> 20 collections, every collection has one shard two replica 
> one zookeeper
>            Reporter: Raintung Li
>            Assignee: Shalin Shekhar Mangar
>            Priority: Critical
>              Labels: cluster, crash, recover
>         Attachments: patch-6056.txt
>
>
> Errors like "org.apache.solr.common.SolrException: Error opening new searcher. 
> exceeded limit of maxWarmingSearchers=2, try again later" cause 
> DistributedUpdateProcessor to trigger the core admin recovery process.
> That means every failed update request sends a core admin recovery request
> (see DistributedUpdateProcessor.java, doFinish()).
> The terrible thing is that CoreAdminHandler starts a new thread for each such 
> request to publish the recovering status and start recovery. Threads increase 
> very quickly until the thread stacks exhaust memory (OOM), the Overseer can't 
> handle that many status updates, and the number of /overseer/queue nodes 
> (e.g. /overseer/queue/qn-0000125553) grows by more than 40 thousand in two 
> minutes. In the end ZooKeeper crashed.
> Worse, with so many nodes left in the ZooKeeper queue the cluster can't 
> publish the right statuses, because only one Overseer works; I had to start 
> three threads to clear the queue nodes. The cluster didn't work normally for 
> nearly 30 minutes...
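
A hedged illustration of the reported failure mode (hypothetical names, not 
CoreAdminHandler's real code): each failed update turns into its own recovery 
request, every request spawns a thread, and every thread publishes status 
entries with no per-core deduplication, so a burst of N failed updates produces 
N threads and roughly 2N overseer-queue entries.

{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of the failure mode described in the report; names are hypothetical.
public class RecoveryRequestStorm {

    // stand-in for the /overseer/queue znode children in ZooKeeper
    private final Queue<String> overseerQueue = new ConcurrentLinkedQueue<>();

    // called once per failed update, with no check for an already-running recovery
    public void handleRequestRecovery(String coreName) {
        new Thread(() -> {
            overseerQueue.add(coreName + ":recovering");  // one status node per request
            recover(coreName);                            // long-running work
            overseerQueue.add(coreName + ":active");      // and another one afterwards
        }, "recovery-" + coreName).start();
    }

    private void recover(String coreName) {
        try {
            Thread.sleep(1_000);  // placeholder for the real recovery
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
{code}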


