The unhealthy clusters had four or five nodes each. We switched to two two-node clusters, and those have been stable.
Bigdesk reports file descriptors, memory, and CPU all have plentiful headroom in all cases.

On Monday, October 20, 2014 11:54:21 AM UTC-4, Jörg Prante wrote:
>
> How many nodes do you have in your cluster?
>
> Have you checked if your nodes run out of file descriptors or heap memory?
>
> Jörg
>
> On Mon, Oct 20, 2014 at 5:52 PM, David Ashby <[email protected]> wrote:
>
>> I might also note: the size of these indexes varies wildly, some being
>> just a few documents, some being thousands, more or less following the
>> power law.
>>
>> On Monday, October 20, 2014 11:34:36 AM UTC-4, David Ashby wrote:
>>>
>>> Hi,
>>>
>>> We've been using Elasticsearch on AWS for our application for two
>>> purposes: as a search engine for user-created documents, and as a cache
>>> for activity feeds in our application. We made a decision early on to
>>> treat every customer's content as a distinct index, for full logical
>>> separation of customer data. We have about three hundred indexes in our
>>> cluster, with the default 5-shard/1-replica setup.
>>>
>>> Recently, we've had major problems with the cluster "locking up" on
>>> requests and losing track of its nodes. We initially responded by
>>> attempting to remove possible CPU and memory limits, and we placed all
>>> nodes in the same AWS placement group to maximize inter-node bandwidth,
>>> all to no avail. We eventually lost an entire production cluster, which
>>> led to a decision to split the indexes across two completely independent
>>> clusters, each cluster taking half of the indexes, with application-level
>>> logic determining where each index lived.
>>>
>>> All that is to say: with our setup, are we running into an undocumented
>>> *practical* limit on the number of indexes or shards in a cluster? It
>>> ends up being around 3000 shards with our setup. Our logs show evidence
>>> of nodes timing out their responses to massive shard status checks, and
>>> it gets *worse* the more nodes there are in the cluster. It's also
>>> stable with only *two* nodes.
>>>
>>> Thanks,
>>> -David
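
For what it's worth, here's roughly how we spot-check the same headroom numbers outside of Bigdesk, straight from the node-stats API. Treat it as a minimal sketch: the endpoint and field names (/_nodes/stats/jvm,process, heap_used_percent, open_file_descriptors, cpu.percent) are what I remember from the 1.x docs, and the localhost:9200 address is just a placeholder for any node in the cluster.

# Print per-node heap usage, open file descriptors, and process CPU
# from the Elasticsearch node-stats API (field names assumed from the 1.x docs).
import requests

ES_URL = "http://localhost:9200"  # placeholder: point at any node in the cluster

def print_node_headroom(es_url=ES_URL):
    stats = requests.get(es_url + "/_nodes/stats/jvm,process").json()
    for node_id, node in stats["nodes"].items():
        name = node.get("name", node_id)
        heap_pct = node["jvm"]["mem"].get("heap_used_percent")
        open_fds = node["process"].get("open_file_descriptors")
        cpu_pct = node["process"].get("cpu", {}).get("percent")
        print("%-25s heap %s%%  fds %s  cpu %s%%" % (name, heap_pct, open_fds, cpu_pct))

if __name__ == "__main__":
    print_node_headroom()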
