[
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325575#comment-15325575
]
Erick Erickson commented on SOLR-7191:
--------------------------------------
I had to chase after this for a while, so I'm recording results
of some testing for posterity.
> Setup: 4 Solr JVMs, 8G each (64G total RAM on the machine).
> Create 100 4x4 collections (i.e. 4 shards, 4 replicas each). 400 shards,
> 1,600 replicas in total.
> Note that the cluster is fine at this point, everything's green.
> No data indexed at all.
> Shut all Solr instances down.
> Bring up a Solr on a different box. I did this to eliminate the chance
that the Overseer was somehow involved since it is now on the machine
with no replicas. I don't think this matters much though.
> Bring up one JVM.
> Wait for all the replicas on that JVM to come up. Now every shard has a
leader and the collections are all green. 3 of 4 replicas for each shard are
"gone" of course, but it's a functioning cluster.
> Bring up the next JVM: Kabloooey. Very shortly you'll start to see OOM
errors on the _second_ JVM but not the first.
> The numbers of threads on the first JVM are about 1,200. On the second,
they go over 2,000. Whether this would drop back down or not
is an open question.
> So I tried playing with -Xss to drop the size of the stack on the threads
and even dropping by half didn't help.
> Expanding the memory on the second JVM to 32G didn't help.
> I tried raising the max user processes limit (ulimit -u), on a hint
that there was a wonky effect there somehow, to no avail.
> Especially disconcerting is the fact that this node was running fine
when the collections were _created_, it just can't get past restart.
> Changing coreLoadThreads even down to 2 did not seem to help.
> At no point does the reported memory consumption via jConsole or top
show even getting close to the allocated JVM limits.
> I'd like to be able to just start all 4 JVMs at once, but didn't get
that far.
> If one tries to start additional JVMs anyway, there's a lot of thrashing
around, replicas go into recovery, go out of recovery, are permanently down
etc.
Of course with OOMs it's unclear what _should_ happen.
> The OOM killer script apparently does NOT get triggered. I think the OOM
is swallowed, perhaps in Zookeeper client code. Note that if the OOM
killer script _did_ get fired, the second & greater JVMs would
just die.
> Error is OOM: Unable to create new native thread.
> Here's a stack trace, there are a _lot_ of these...
ERROR - 2016-06-11 00:05:36.806; [ ] org.apache.zookeeper.ClientCnxn$EventThread; Error while calling watcher
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:714)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:214)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
    at org.apache.solr.common.cloud.SolrZkClient$3.process(SolrZkClient.java:266)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
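
Since the failure is about thread creation rather than heap, it may help to see which thread pools are growing. The following is a minimal sketch, not anything Solr ships: the class name ThreadCensus and its helpers are hypothetical. It uses the standard java.lang.management ThreadMXBean to report the live and peak thread counts next to heap usage, and groups threads by name prefix. It only sees threads of the JVM it runs in; for a running Solr node the same MXBean is reachable over a JMX connection, which is what jConsole uses.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Solr or ZooKeeper.
class ThreadCensus {

    // Strips a trailing number (and the separator before it) so that
    // "pool-1-thread-7" and "pool-1-thread-8" land in the same bucket.
    static String prefixOf(String threadName) {
        return threadName.replaceAll("[-_#\\s]*\\d+$", "");
    }

    // Counts the live threads of this JVM, grouped by name prefix.
    static Map<String, Integer> countByPrefix() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<String, Integer> counts = new TreeMap<>();
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info == null) {
                continue; // thread exited between the two calls
            }
            counts.merge(prefixOf(info.getThreadName()), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb = rt.maxMemory() / (1024 * 1024);

        System.out.println("live threads: " + mx.getThreadCount());
        System.out.println("peak threads: " + mx.getPeakThreadCount());
        System.out.println("heap used/max (MB): " + usedMb + "/" + maxMb);

        // One line per thread-name prefix, sorted alphabetically.
        countByPrefix().forEach((prefix, n) -> System.out.println(n + "\t" + prefix));
    }
}
```

A high thread count with heap usage far below the maximum would be consistent with the observations above: "unable to create new native thread" is raised when the JVM cannot obtain a new native thread from the operating system, which is independent of how much heap is free.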
> Improve stability and startup performance of SolrCloud with thousands of
> collections
> ------------------------------------------------------------------------------------
>
> Key: SOLR-7191
> URL: https://issues.apache.org/jira/browse/SOLR-7191
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 5.0
> Reporter: Shawn Heisey
> Assignee: Shalin Shekhar Mangar
> Labels: performance, scalability
> Attachments: SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch,
> SOLR-7191.patch, lots-of-zkstatereader-updates-branch_5x.log
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3,
> 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many
> problems myself even before I was able to get 4000 collections created on a
> 5.0 example cloud setup. Restarting Solr takes a very long time, and it is
> not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance
> and scalability. It doesn't help that I'm running both Solr nodes on one
> machine (I started with 'bin/solr -e cloud') and that ZK is embedded.