[
https://issues.apache.org/jira/browse/SOLR-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996039#comment-13996039
]
Mark Miller edited comment on SOLR-6056 at 5/13/14 4:29 AM:
------------------------------------------------------------
Just reviewing by iPhone, but I think the move in 1 is likely unnecessary, as
it somewhat defeats the purpose of the quick publish comment. It's really just
a best-effort thing, and we can likely fall back to the natural publish that
running recovery does.
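
For illustration only, a minimal Java sketch of that best-effort reading, with
hypothetical names (quickPublishDown, runRecovery) that are not Solr's actual
API: the quick publish may fail harmlessly, because running recovery publishes
the state anyway.

{code:java}
// Hypothetical sketch, not Solr code: the quick publish is best effort;
// if it fails we do nothing special, because running recovery performs
// the "natural publish" of the core's state anyway.
public class QuickPublishSketch {

    void goIntoRecovery(String coreName) {
        try {
            quickPublishDown(coreName);   // best-effort fast publish
        } catch (Exception e) {
            // Swallow: the natural publish inside runRecovery() covers us.
        }
        runRecovery(coreName);            // publishes "recovering" itself
    }

    // Stubs standing in for the real publish/recovery machinery.
    void quickPublishDown(String coreName) { /* may throw at runtime */ }

    void runRecovery(String coreName) {
        System.out.println("recovering " + coreName);
    }

    public static void main(String[] args) {
        new QuickPublishSketch().goIntoRecovery("collection1_shard1_replica1");
    }
}
{code}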
> Zookeeper crash JVM stack OOM because of recover strategy
> ----------------------------------------------------------
>
> Key: SOLR-6056
> URL: https://issues.apache.org/jira/browse/SOLR-6056
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.6
> Environment: Two linux servers, 65G memory, 16 core cpu
> 20 collections, every collection has one shard two replica
> one zookeeper
> Reporter: Raintung Li
> Assignee: Shalin Shekhar Mangar
> Priority: Critical
> Labels: cluster, crash, recover
> Attachments: patch-6056.txt
>
>
> Errors such as "org.apache.solr.common.SolrException: Error opening new
> searcher. exceeded limit of maxWarmingSearchers=2, try again later" cause
> DistributedUpdateProcessor to trigger the core admin recovery process.
> That means every failing update request sends a core admin recovery request
> (see doFinish() in DistributedUpdateProcessor.java).
> The terrible part is that CoreAdminHandler starts a new thread per request to
> publish the recovery status and start recovery (a sketch of this pattern
> follows below the quoted description). Threads increase very quickly until
> the JVM hits a stack OOM, the Overseer can't keep up with the flood of status
> updates, and nodes under /overseer/queue (e.g. qn-0000125553) grow by more
> than 40 thousand in two minutes. In the end ZooKeeper crashes.
> Even worse, with so many nodes in the ZooKeeper queue, the cluster can't
> publish the correct status because only one Overseer does the work; I had to
> start three threads to clear the queue nodes. The cluster did not work
> normally for nearly 30 minutes...
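
To make the reported failure mode concrete, here is a minimal, self-contained
Java sketch. It is not Solr's actual code; all names (RecoveryDeduper,
requestRecoveryUnbounded, requestRecoveryDeduped, recover) are invented. It
contrasts the thread-per-request pattern described above with deduplicating
in-flight recoveries per core on a bounded pool, so an update storm collapses
into at most one pending recovery (and one status publish) per core.

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the SOLR-6056 failure mode: every failed update
// triggers a recovery request, and each request spawns a fresh thread.
public class RecoveryDeduper {

    // Problematic pattern: one new thread per request. Under an update storm
    // (e.g. maxWarmingSearchers exceeded on every request), thread count and
    // Overseer queue nodes grow without bound.
    static void requestRecoveryUnbounded(String coreName) {
        new Thread(() -> recover(coreName)).start();
    }

    // Mitigation sketch: dedupe in-flight recoveries per core and run them on
    // a small bounded pool, so thousands of failing updates collapse into at
    // most one pending recovery per core.
    private static final Set<String> inFlight = ConcurrentHashMap.newKeySet();
    private static final ExecutorService pool = Executors.newFixedThreadPool(4);

    static void requestRecoveryDeduped(String coreName) {
        if (inFlight.add(coreName)) {          // first request wins
            pool.submit(() -> {
                try {
                    recover(coreName);
                } finally {
                    inFlight.remove(coreName); // allow a later recovery
                }
            });
        } // duplicate requests while a recovery is pending are dropped
    }

    private static void recover(String coreName) {
        // Placeholder for publishing "recovering" state and running recovery.
        System.out.println("recovering " + coreName);
    }

    public static void main(String[] args) {
        // Simulate a burst of failing updates against the same core.
        for (int i = 0; i < 10_000; i++) {
            requestRecoveryDeduped("collection1_shard1_replica1");
        }
        pool.shutdown();
    }
}
{code}

The dedupe set is just one way to coalesce redundant recovery requests rather
than queueing each one to the Overseer, which is the direction the report
points toward.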
--
This message was sent by Atlassian JIRA
(v6.2#6252)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]