I should also note that I've been using OpenJDK. I'm currently in the 
process of moving to the official Oracle binaries; are there specific 
optimization changes there that help with inter-node IO? There are some 
hints at that in this very old GitHub/Elasticsearch interview 
<http://exploringelasticsearch.com/github_interview.html>.

On Monday, October 20, 2014 3:49:39 PM UTC-4, David Ashby wrote:
>
> Example log line: [DEBUG][action.admin.indices.status] [Red Ronin] 
> [*index*][1], node[t60FJtJ-Qk-dQNrxyg8faA], [R], s[STARTED]: failed to 
> executed 
> [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@36239161] 
> org.elasticsearch.transport.NodeDisconnectedException: 
> [Shotgun][inet[/IP:9300]][indices/status/s] disconnected
>
> When the cluster gets into this state, all requests hang waiting for... 
> something to happen, and it even stops responding to bigdesk requests. 
> Each individual node still returns 200 when curled locally. A huge 
> number of the above log line appears at the end of this process -- one 
> for every single shard on the node -- flooding my logs. As soon as a 
> node is restarted, the cluster "snaps back": it immediately fails the 
> outstanding requests and begins rebalancing.
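>
> For reference, the local check is essentially this (a minimal sketch in 
> Python using the third-party requests library; the hostnames and the 
> 9200 HTTP port are assumptions, not our actual setup):
>
>     import requests  # pip install requests
>
>     NODES = ["node1", "node2", "node3"]  # hypothetical hostnames
>
>     for host in NODES:
>         try:
>             # The root endpoint answers per-node even while cluster-level
>             # requests hang; a short client timeout makes hangs visible.
>             r = requests.get("http://%s:9200/" % host, timeout=5)
>             print(host, r.status_code)
>         except requests.exceptions.RequestException as exc:
>             print(host, "unreachable:", exc)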
>
> On Monday, October 20, 2014 11:34:36 AM UTC-4, David Ashby wrote:
>>
>> Hi,
>>
>> We've been using Elasticsearch on AWS for two purposes: as a search 
>> engine for user-created documents, and as a cache for activity feeds in 
>> our application. We decided early on to treat every customer's content 
>> as a distinct index, for full logical separation of customer data. We 
>> have about three hundred indexes in our cluster, with the default 
>> 5-shard/1-replica setup.
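>>
>> For concreteness, that works out to roughly 300 indexes x 5 primary 
>> shards x 2 copies (primary plus one replica) = 3000 shards cluster-wide, 
>> which is where the figure below comes from.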
>>
>> Recently, we've had major problems with the cluster "locking up" on 
>> requests and losing track of its nodes. We initially responded by 
>> removing possible CPU and memory limits and placing all nodes in the 
>> same AWS placement group to maximize inter-node bandwidth, all to no 
>> avail. We eventually lost an entire production cluster, which forced a 
>> decision to split the indexes across two completely independent 
>> clusters, each taking half of the indexes, with application-level logic 
>> determining which cluster holds which index.
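>>
>> For illustration, routing of that sort can be as simple as a stable hash 
>> of the index name; a sketch with made-up endpoints, not our actual code:
>>
>>     import hashlib
>>
>>     # Hypothetical cluster endpoints.
>>     CLUSTERS = ["http://cluster-a:9200", "http://cluster-b:9200"]
>>
>>     def cluster_for_index(index_name):
>>         # md5 rather than the built-in hash() so the mapping is stable
>>         # across processes and restarts.
>>         digest = hashlib.md5(index_name.encode("utf-8")).hexdigest()
>>         return CLUSTERS[int(digest, 16) % len(CLUSTERS)]
>>
>>     print(cluster_for_index("customer-42"))  # always the same cluster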
>>
>> All that is to say: are we running into an undocumented *practical* 
>> limit on the number of indexes or shards in a cluster? Our setup works 
>> out to around 3000 shards. Our logs show evidence of nodes timing out 
>> their responses to massive shard status checks, and the problem gets 
>> *worse* the more nodes there are in the cluster; with only *two* nodes, 
>> the cluster is stable.
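>>
>> For anyone who wants to check the shard math on their own cluster, the 
>> standard _cluster/health endpoint reports the totals; a sketch using 
>> Python's requests library, assuming a node on localhost:9200:
>>
>>     import requests
>>
>>     # _cluster/health reports the shard totals the cluster is tracking.
>>     health = requests.get("http://localhost:9200/_cluster/health",
>>                           timeout=10).json()
>>     print("nodes:", health["number_of_nodes"])
>>     print("active primaries:", health["active_primary_shards"])
>>     print("active shards:", health["active_shards"])  # incl. replicas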
>>
>> Thanks,
>> -David
>>
>
