example log line: [DEBUG][action.admin.indices.status] [Red Ronin] [*index*][1], node[t60FJtJ-Qk-dQNrxyg8faA], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@36239161] org.elasticsearch.transport.NodeDisconnectedException: [Shotgun][inet[/IP:9300]][indices/status/s] disconnected
When the cluster gets into this state, all requests hang waiting for... something to happen. Each individual node still returns 200 when curled locally (a rough sketch of the check I run is below, after the quoted message). A huge number of copies of the above log line appear at the end of this process -- one for every single shard on the node -- which floods my logs. While hung, the cluster even stops responding to bigdesk requests. As soon as a node is restarted, the cluster "snaps back": it immediately fails the outstanding requests and begins rebalancing.

On Monday, October 20, 2014 11:34:36 AM UTC-4, David Ashby wrote:
>
> Hi,
>
> We've been using elasticsearch on AWS for our application for two
> purposes: as a search engine for user-created documents, and as a cache for
> activity feeds in our application. We made a decision early on to treat
> every customer's content as a distinct index, for full logical separation
> of customer data. We have about three hundred indexes in our cluster, with
> the default 5-shards/1-replica setup.
>
> Recently, we've had major problems with the cluster "locking up" to
> requests and losing track of its nodes. We initially responded by
> attempting to remove possible CPU and memory limits, and placed all nodes
> in the same AWS placement group to maximize inter-node bandwidth, all to
> no avail. We eventually lost an entire production cluster, resulting in a
> decision to split the indexes across two completely independent clusters,
> each cluster taking half of the indexes, with application-level logic
> determining where the indexes were.
>
> All that is to say: with our setup, are we running into an undocumented
> *practical* limit on the number of indexes or shards in a cluster? It
> ends up being around 3000 shards with our setup. Our logs show evidence of
> nodes timing out their responses to massive shard status-checks, and it
> gets *worse* the more nodes there are in the cluster. It's also stable
> with only *two* nodes.
>
> Thanks,
> -David
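For reference, here's roughly the check I run against each node while the cluster is hung. It's a sketch rather than our actual tooling: the host list is made up and I'm assuming the default HTTP port of 9200. The point is that the node-local root endpoint answers immediately, while anything that needs cluster coordination (like /_cluster/health) just hangs until the timeout:

import json
import urllib.request

NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # made-up addresses; substitute your own nodes
TIMEOUT = 5  # seconds -- the hung calls run far longer than this

def get(url):
    """Return (HTTP status, body) or (None, error) for a single request."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
            return resp.status, json.loads(resp.read().decode("utf-8"))
    except Exception as exc:  # timeout, connection refused, etc.
        return None, str(exc)

for host in NODES:
    # Root endpoint: answered by the node itself, no cluster coordination needed.
    status, _ = get("http://%s:9200/" % host)
    print("%s  /                -> %s" % (host, status))

    # Cluster health: the node has to coordinate with the rest of the cluster,
    # so this is the request that hangs when the cluster is wedged.
    status, body = get("http://%s:9200/_cluster/health" % host)
    print("%s  /_cluster/health -> %s %s" % (host, status, body))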
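And just to spell out where the "around 3000 shards" figure in my original message above comes from -- it's nothing more than the defaults multiplied out:

indexes = 300              # roughly one index per customer
primaries_per_index = 5    # default number_of_shards
replicas = 1               # default number_of_replicas

# every primary plus its replica is a shard copy the cluster has to track
total_shard_copies = indexes * primaries_per_index * (1 + replicas)
print(total_shard_copies)  # 3000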
