Example log line:

[DEBUG][action.admin.indices.status] [Red Ronin] [*index*][1], node[t60FJtJ-Qk-dQNrxyg8faA], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@36239161]
org.elasticsearch.transport.NodeDisconnectedException: [Shotgun][inet[/IP:9300]][indices/status/s] disconnected

When the cluster gets into this state, all requests hang waiting for... 
something to happen; it even stops responding to bigdesk requests. Each 
individual node returns 200 when curled locally. A huge number of the log 
lines above appear at the end of this process -- one for every single shard 
on the node, flooding my logs. As soon as a node is restarted, the cluster 
"snaps back": it immediately fails the outstanding requests and begins 
rebalancing.
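
For reference, the per-node check is nothing fancy -- a minimal sketch in
Python (the node IPs are placeholders for our actual hosts, and 9200 is the
standard HTTP port):

import urllib.request

NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # placeholder node addresses

for host in NODES:
    url = "http://%s:9200/" % host
    try:
        # Hit the node's local HTTP port directly; each node still
        # answers 200 even while cluster-level requests hang.
        resp = urllib.request.urlopen(url, timeout=5)
        print("%s -> HTTP %d" % (host, resp.getcode()))
    except Exception as exc:
        print("%s -> unreachable: %s" % (host, exc))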

On Monday, October 20, 2014 11:34:36 AM UTC-4, David Ashby wrote:
>
> Hi,
>
> We've been using elasticsearch on AWS for our application for two 
> purposes: as a search engine for user-created documents, and as a cache for 
> activity feeds in our application. We made a decision early on to treat 
> every customer's content as a distinct index, for full logical separation 
> of customer data. We have about three hundred indexes in our cluster, with 
> the default 5-shards/1-replica setup.
>
> Recently, we've had major problems with the cluster "locking up" to 
> requests and losing track of its nodes. We initially responded by 
> attempting to remove possible CPU and memory limits and by placing all 
> nodes in the same AWS placement group to maximize inter-node bandwidth, 
> all to no avail. We eventually lost an entire production cluster, which 
> led us to split the indexes across two completely independent clusters, 
> each taking half of the indexes, with application-level logic determining 
> where each index lived.
>
> All that is to say: with our setup, are we running into an undocumented 
> *practical* limit on the number of indexes or shards in a cluster? It 
> ends up being around 3000 shards with our setup. Our logs show evidence of 
> nodes timing out their responses to massive shard status-checks, and it 
> gets *worse* the more nodes there are in the cluster; the cluster is 
> stable with only *two* nodes.
>
> Thanks,
> -David
>
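
For anyone checking the shard figure in the quoted message, it's just the
defaults multiplied out -- a quick sketch, with all counts taken from the
message above:

# ~300 indexes, each with the default 5 primaries and 1 replica per primary.
indexes = 300
primaries_per_index = 5
replicas_per_primary = 1

total_shards = indexes * primaries_per_index * (1 + replicas_per_primary)
print(total_shards)  # 3000 shards cluster-wide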
