Yep, turns out that call runs _status against the entire cluster every time it fires. That might get... uncomfortable. We're filing a bug report with Elastica to, at the very least, get that codepath marked as deprecated in their documentation.
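In case it helps anyone else hitting this: below is a minimal sketch of the lighter liveness check we're switching to, going through Elastica's generic request() escape hatch to _cluster/health instead of the _status-backed getStatus(). The host/port config is a placeholder, not our production setup.

    <?php
    // Minimal sketch, assuming the Elastica versions current as of this
    // thread; host/port values are placeholders.
    // _cluster/health is one cheap cluster-level call, unlike _status,
    // which fans out to every shard copy in the cluster.
    require 'vendor/autoload.php';

    use Elastica\Client;
    use Elastica\Request;

    $client = new Client(['host' => 'localhost', 'port' => 9200]);

    function haveUsableNode(Client $client)
    {
        try {
            $health = $client->request('_cluster/health', Request::GET)->getData();
            // "yellow" is good enough for "is anyone answering at all".
            return in_array($health['status'], ['green', 'yellow'], true);
        } catch (\Exception $e) {
            // Any transport/connection failure means no reachable node.
            return false;
        }
    }

Two more notes, on the _status fan-out and on the shard math, follow the quoted thread below.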
On Tuesday, October 21, 2014 3:29:47 PM UTC-4, David Ashby wrote:

> Hmm, maybe. We are using the Elastica PHP library and call getStatus()->getServerStatus() relatively often (to try and work around Elastica's lack of proper error handling for unreachable nodes) to determine whether we have a node we can connect to. If that call maps to IndicesStatusRequest in the end, we might be shooting ourselves in the foot.
>
> On Tuesday, October 21, 2014 3:25:10 PM UTC-4, Jörg Prante wrote:
>
>> Maybe you are hit by https://github.com/elasticsearch/elasticsearch/issues/7385
>>
>> Jörg
>>
>> On Tue, Oct 21, 2014 at 9:17 PM, [email protected] <[email protected]> wrote:
>>
>>> This has nothing to do with OpenJDK.
>>>
>>> IndicesStatusRequest (deprecated, and to be removed in future versions) is a heavy request; there may be something on your machines that takes longer than 5 seconds, so the request times out.
>>>
>>> The IndicesStatus action uses Lucene's Directories.estimateSize. This call can take some time on large directories; maybe you have many segments or unoptimized shards/indices.
>>>
>>> Jörg
>>>
>>> On Tue, Oct 21, 2014 at 6:21 PM, David Ashby <[email protected]> wrote:
>>>
>>>> I should also note that I've been using OpenJDK. I'm currently in the process of moving to the official Oracle binaries; are there specific optimization changes there that help with inter-cluster IO? There are some hints at that in this very old interview about GitHub's Elasticsearch setup: http://exploringelasticsearch.com/github_interview.html
>>>>
>>>> On Monday, October 20, 2014 3:49:39 PM UTC-4, David Ashby wrote:
>>>>
>>>>> Example log line:
>>>>>
>>>>> [DEBUG][action.admin.indices.status] [Red Ronin] [*index*][1], node[t60FJtJ-Qk-dQNrxyg8faA], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@36239161]
>>>>> org.elasticsearch.transport.NodeDisconnectedException: [Shotgun][inet[/IP:9300]][indices/status/s] disconnected
>>>>>
>>>>> When the cluster gets into this state, all requests hang waiting for... something to happen, and it even stops responding to bigdesk requests. Each individual node returns 200 when curled locally. A huge number of the log lines above appear at the end of this process -- one for every single shard on the node, which is a huge vomit into my logs. As soon as a node is restarted, the cluster "snaps back": it immediately fails the outstanding requests and begins rebalancing.
>>>>>
>>>>> On Monday, October 20, 2014 11:34:36 AM UTC-4, David Ashby wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We've been using Elasticsearch on AWS for our application for two purposes: as a search engine for user-created documents, and as a cache for activity feeds in our application. We made a decision early on to treat every customer's content as a distinct index, for full logical separation of customer data. We have about three hundred indexes in our cluster, with the default 5-shard/1-replica setup.
>>>>>>
>>>>>> Recently, we've had major problems with the cluster "locking up" on requests and losing track of its nodes. We initially responded by attempting to remove possible CPU and memory limits, and placed all nodes in the same AWS placement group to maximize inter-node bandwidth, all to no avail.
>>>>>> We eventually lost an entire production cluster, which led to a decision to split the indexes across two completely independent clusters, each taking half of the indexes, with application-level logic determining where each index lives.
>>>>>>
>>>>>> All that is to say: with our setup, are we running into an undocumented *practical* limit on the number of indexes or shards in a cluster? It works out to around 3000 shards in our case. Our logs show evidence of nodes timing out their responses to massive shard status-checks, and it gets *worse* the more nodes there are in the cluster. It's also stable with only *two* nodes.
>>>>>>
>>>>>> Thanks,
>>>>>> -David
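A follow-up on Jörg's point above: a cluster-wide _status fans out one request per shard copy, so its cost tracks the total shard count rather than the index count. A rough sketch of reading that number from the standard _cluster/health response (same placeholder client setup as the sketch near the top):

    <?php
    require 'vendor/autoload.php';

    use Elastica\Client;
    use Elastica\Request;

    $client = new Client(['host' => 'localhost', 'port' => 9200]);

    // active_shards counts every started primary and replica copy, which is
    // roughly how many per-shard answers one cluster-wide _status must gather.
    $health = $client->request('_cluster/health', Request::GET)->getData();
    printf("_status fans out to roughly %d shard copies\n", $health['active_shards']);

In our case that number sat around 3000, which lines up with Jörg's note about the request timing out once any of those answers takes longer than 5 seconds.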
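And the arithmetic behind the ~3000-shard figure, plus what trimming it at index-creation time could look like. The index name and shard counts below are illustrative placeholders, not our production values:

    <?php
    require 'vendor/autoload.php';

    use Elastica\Client;

    $client = new Client(['host' => 'localhost', 'port' => 9200]);

    // 300 customer indexes x 5 primaries x (primary + 1 replica) = 3000 shard copies.
    $totalShards = 300 * 5 * (1 + 1); // 3000

    // A small per-customer index likely doesn't need the default 5 primaries.
    // 'customer_123' and both counts here are placeholders.
    $client->getIndex('customer_123')->create([
        'settings' => [
            'number_of_shards'   => 1,
            'number_of_replicas' => 1,
        ],
    ]);

With one primary per index, the same three hundred indexes come to 600 shard copies cluster-wide instead of 3000.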
