[ https://issues.apache.org/jira/browse/SOLR-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088466#comment-14088466 ]
Shalin Shekhar Mangar commented on SOLR-6261:
---------------------------------------------
Guys, I think something weird has been happening since this was committed.
Many tests, such as MultiThreadedOCPTest and ShardSplitTest, have been failing
with an OutOfMemoryError while trying to create new watcher threads. A typical
failure has the following in its logs:
{code}
[junit4] 2> 1218223 T4785 oasc.DistributedQueue$LatchChildWatcher.process LatchChildWatcher fired on path: /overseer/collection-queue-work state: SyncConnected type NodeChildrenChanged
[junit4] 2> 1218223 T4789 oasc.DistributedQueue$LatchChildWatcher.process LatchChildWatcher fired on path: /overseer/collection-queue-work state: SyncConnected type NodeChildrenChanged
[junit4] 2> 1218223 T4791 oasc.DistributedQueue$LatchChildWatcher.process LatchChildWatcher fired on path: /overseer/collection-queue-work state: SyncConnected type NodeChildrenChanged
[junit4] 2> 1218223 T4795 oasc.DistributedQueue$LatchChildWatcher.process LatchChildWatcher fired on path: /overseer/collection-queue-work state: SyncConnected type NodeChildrenChanged
[junit4] 2> 1218223 T4797 oasc.DistributedQueue$LatchChildWatcher.process LatchChildWatcher fired on path: /overseer/collection-queue-work state: SyncConnected type NodeChildrenChanged
[junit4] 2> 1218222 T4803 oasc.DistributedQueue$LatchChildWatcher.process LatchChildWatcher fired on path: /overseer/collection-queue-work state: SyncConnected type NodeChildrenChanged
[junit4] 2> 1218222 T3305 oaz.ClientCnxn$EventThread.processEvent ERROR Error while calling watcher java.lang.OutOfMemoryError: unable to create new native thread
[junit4] 2> at java.lang.Thread.start0(Native Method)
[junit4] 2> at java.lang.Thread.start(Thread.java:714)
[junit4] 2> at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
[junit4] 2> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
[junit4] 2> at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
[junit4] 2> at org.apache.solr.common.cloud.SolrZkClient$3.process(SolrZkClient.java:201)
[junit4] 2> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
[junit4] 2> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
[junit4] 2>
{code}
I see hundreds of LatchChildWatcher.process events, and then the node runs out
of memory.
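To illustrate what the stack trace suggests, here is a minimal standalone
sketch, not Solr code, of how submitting one task per watch event to a cached
pool can exhaust native threads when the callbacks block; the class name and
the never-completing callback are assumptions for illustration, and actually
running it would deliberately reproduce the OOM:
{code}
// Standalone sketch of the suspected pattern, not SolrZkClient itself.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WatcherThreadLeak {
  public static void main(String[] args) {
    // A cached pool starts a new thread whenever no idle worker exists.
    ExecutorService pool = Executors.newCachedThreadPool();
    final CountDownLatch blockForever = new CountDownLatch(1);
    while (true) { // stand-in for a storm of NodeChildrenChanged events
      pool.submit(new Runnable() {
        @Override
        public void run() {
          try {
            // Stand-in for a callback that never completes, so no worker
            // ever becomes idle and every submit adds another thread.
            blockForever.await();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      });
      // Eventually ThreadPoolExecutor.addWorker fails with
      // java.lang.OutOfMemoryError: unable to create new native thread,
      // matching the trace above.
    }
  }
}
{code}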
Here are some of the recent failures:
http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Windows/4233/
https://builds.apache.org/job/Lucene-Solr-NightlyTests-4.x/592/
https://builds.apache.org/job/Lucene-Solr-Tests-4.x-Java7/2048/
> Run ZK watch event callbacks in parallel to the event thread
> ------------------------------------------------------------
>
> Key: SOLR-6261
> URL: https://issues.apache.org/jira/browse/SOLR-6261
> Project: Solr
> Issue Type: Improvement
> Components: SolrCloud
> Affects Versions: 4.9
> Reporter: Ramkumar Aiyengar
> Assignee: Mark Miller
> Priority: Minor
> Fix For: 5.0, 4.10
>
>
> Currently, checking for leadership (triggered by the leader's ephemeral node
> going away) happens in ZK's event thread. If there are many cores and all of
> them are due to take over leadership, they have to go through the two-way
> sync and leadership takeover serially.
> For tens of cores, this could mean 30-40s without leadership before the last
> core in the list even gets to start the leadership process. If the
> leadership process instead happens in a separate thread, the cores could all
> take over in parallel (see the sketch below).
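As a rough sketch of the improvement described above (not the committed patch;
the wrapper class name and pool size are assumptions), watch callbacks can be
handed off to a bounded executor so ZK's event thread returns immediately:
{code}
// Sketch only: wrap a real Watcher so its work runs on a separate,
// bounded pool instead of on ZooKeeper's single event thread.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

public class OffloadingWatcher implements Watcher {
  // Bounded pool: an unbounded thread-per-event executor is exactly what
  // can exhaust native threads, as in the OOM reported above.
  private static final ExecutorService POOL = Executors.newFixedThreadPool(8);

  private final Watcher delegate;

  public OffloadingWatcher(Watcher delegate) {
    this.delegate = delegate;
  }

  @Override
  public void process(final WatchedEvent event) {
    // Hand the possibly slow callback (e.g. a leader election's two-way
    // sync) to the pool and return immediately, so other watch events are
    // not serialized behind it.
    POOL.submit(new Runnable() {
      @Override
      public void run() {
        delegate.process(event);
      }
    });
  }
}
{code}
With this shape, tens of cores could start their leadership takeovers
concurrently instead of one after another, while the fixed pool size keeps the
thread count bounded, which also matters given the OutOfMemoryError reported
in the comment above.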