I should also note that I've been using OpenJDK. I'm currently in the process of moving to the official Oracle binaries; are there specific optimizations there that help with inter-node IO? There are some hints at that in this very old interview about GitHub's Elasticsearch setup: <http://exploringelasticsearch.com/github_interview.html>.
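In case it helps with the comparison, the nodes info API reports which JVM each node is actually running, so the vendor and version can be confirmed before and after the switch. A minimal sketch, assuming a node is reachable on localhost:9200:

    curl -s 'http://localhost:9200/_nodes/jvm?pretty'

The per-node "jvm" section in the response should show the VM vendor, name, and version for each node.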
On Monday, October 20, 2014 3:49:39 PM UTC-4, David Ashby wrote:
>
> Example log line:
>
> [DEBUG][action.admin.indices.status] [Red Ronin] [*index*][1], node[t60FJtJ-Qk-dQNrxyg8faA], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@36239161]
> org.elasticsearch.transport.NodeDisconnectedException: [Shotgun][inet[/IP:9300]][indices/status/s] disconnected
>
> When the cluster gets into this state, all requests hang, waiting for... something to happen. Each individual node returns 200 when curled locally. A huge number of copies of the log line above appear at the end of this process -- one for every single shard on the node, which is a huge mess in my logs. As soon as a node is restarted, the cluster "snaps back": it immediately fails the outstanding requests and begins rebalancing. It even stops responding to bigdesk requests.
>
> On Monday, October 20, 2014 11:34:36 AM UTC-4, David Ashby wrote:
>>
>> Hi,
>>
>> We've been using Elasticsearch on AWS for our application for two purposes: as a search engine for user-created documents, and as a cache for activity feeds in our application. We made a decision early on to treat every customer's content as a distinct index, for full logical separation of customer data. We have about three hundred indexes in our cluster, each with the default 5-shard/1-replica setup.
>>
>> Recently, we've had major problems with the cluster "locking up" on requests and losing track of its nodes. We initially responded by trying to remove possible CPU and memory limits, and placed all nodes in the same AWS placement group to maximize inter-node bandwidth, all to no avail. We eventually lost an entire production cluster, which led to a decision to split the indexes across two completely independent clusters, each taking half of the indexes, with application-level logic determining where each index lives.
>>
>> All that is to say: with our setup, are we running into an undocumented *practical* limit on the number of indexes or shards in a cluster? It ends up being around 3000 shards with our setup. Our logs show evidence of nodes timing out their responses to massive shard status checks, and it gets *worse* the more nodes there are in the cluster. It's also stable with only *two* nodes.
>>
>> Thanks,
>> -David
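Regarding the hang described above (every node answers 200 locally, but the cluster as a whole stops responding): next time it locks up, it may be worth capturing cluster health, the master's pending task queue, and hot threads from a few nodes. A rough sketch, assuming ES 1.x and a node on localhost:9200 (the pending tasks API needs 1.1 or later):

    # health as reported by whichever node you hit
    curl -s 'http://localhost:9200/_cluster/health?pretty'
    # cluster-state updates queued up on the master (1.1+)
    curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'
    # what the node's threads are busy with while requests hang
    curl -s 'http://localhost:9200/_nodes/hot_threads'

If the pending tasks queue is long during a lockup, that would line up with the master being overwhelmed by cluster-state updates rather than the individual nodes being unhealthy.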
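On the shard count mentioned in the original question: with the default 5 primaries plus 1 replica each, 300 indexes works out to 300 x 5 x 2 = 3000 shard copies the cluster has to track. One way to confirm the actual number, assuming the _cat API is available (ES 1.0+) and a node on localhost:9200:

    # one line per shard copy (primaries and replicas)
    curl -s 'http://localhost:9200/_cat/shards' | wc -l
    # or read active_shards from cluster health
    curl -s 'http://localhost:9200/_cluster/health?pretty'

Note that unassigned shards also show up in _cat/shards, so the line count can be higher than active_shards while the cluster is recovering.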
