Updating again:

If we reduce the number of shards per node to below ~350, the system 
operates fine. Once the per-node shard count (number_of_indices * 
number_of_shards_per_index * number_of_replicas / number_of_nodes) goes 
above that, we start running into the described issues.
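
For reference, the per-node shard count can be read directly from the cat 
allocation API (the shards column shows how many shards each node holds):

curl -XGET 'localhost:9200/_cat/allocation?v'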

On Friday, December 12, 2014 2:11:08 PM UTC-5, Chris Moore wrote:
>
> Just a quick update, we duplicated our test environment to see if this 
> issue was fixed by upgrading to 1.4.1 instead. We received the same errors 
> under 1.4.1.
>
> On Friday, December 5, 2014 4:52:05 PM UTC-5, Chris Moore wrote:
>>
>> As a follow-up, I closed all the indices on the cluster and then opened 
>> them one at a time, optimizing each down to 1 segment. I made it through 
>> ~60% of the indices (and probably ~45% of the data) before the same errors 
>> showed up in the master log and the same behavior resumed.
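>>
>> Roughly, the per-index procedure was the following (index_name is a 
>> placeholder):
>>
>> # close everything up front
>> curl -XPOST 'localhost:9200/_all/_close'
>> # then, one index at a time:
>> curl -XPOST 'localhost:9200/index_name/_open'
>> curl -XPOST 'localhost:9200/index_name/_optimize?max_num_segments=1'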
>>
>> On Friday, December 5, 2014 3:57:12 PM UTC-5, Chris Moore wrote:
>>>
>>> I replied once, but it seems to have disappeared, so if this gets double 
>>> posted, I'm sorry.
>>>
>>> We disabled all monitoring when we started looking into the issues to 
>>> ensure there was no external load on ES. Everything we are currently seeing 
>>> is just whatever activity ES generates internally.
>>>
>>> My understanding regarding optimizing indices is that you shouldn't call 
>>> it explicitly on indices that are regularly updating; rather, you should 
>>> let the background merge process handle things. As the majority of our 
>>> indices update regularly, we don't explicitly call optimize on them. I 
>>> can try calling it on them all and see if it helps.
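>>>
>>> For reference, calling it across all indices at once would be something 
>>> like:
>>>
>>> curl -XPOST 'localhost:9200/_optimize?max_num_segments=1'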
>>>
>>> As for disk speed, we are currently running ES on SSDs. Moving to RAIDed 
>>> SSDs is on our roadmap, but it hasn't been a priority, as we have been 
>>> getting acceptable performance thus far.
>>>
>>> On Friday, December 5, 2014 2:59:11 PM UTC-5, Jörg Prante wrote:
>>>>
>>>> Do you have a monitoring tool running?
>>>>
>>>> I recommend switching it off, optimizing your indices, and then 
>>>> updating your monitoring tools.
>>>>
>>>> It seems you have too many segments, or too slow a disk, for the stats 
>>>> to be reported within 15s.
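>>>>
>>>> Segment counts per shard can be checked with the indices segments API, 
>>>> e.g.:
>>>>
>>>> curl -XGET 'localhost:9200/_segments?pretty'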
>>>>
>>>> Jörg
>>>> On 05.12.2014 at 16:10, "Chris Moore" <[email protected]> wrote:
>>>>
>>>>> This is running on Amazon EC2 in a VPC on dedicated instances. 
>>>>> Physical network infrastructure is likely fine. Are there specific 
>>>>> network issues you think we should look into?
>>>>>
>>>>> When we are in a problem state, we can communicate between the nodes 
>>>>> just fine. I can run curl requests to ES (health checks, etc.) from the 
>>>>> master node to the data nodes directly and they return as expected. So 
>>>>> there doesn't seem to be a socket exhaustion issue (additionally, there 
>>>>> are no kernel errors being reported).
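>>>>>
>>>>> For example, from the master, using the data node IPs from the original 
>>>>> report:
>>>>>
>>>>> curl -XGET 'http://10.0.1.19:9200/_cluster/health?pretty'
>>>>> curl -XGET 'http://10.0.1.20:9200/_cluster/health?pretty'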
>>>>>
>>>>> It feels like there is a queue/buffer filling up somewhere; once it 
>>>>> has availability again, things start working. But /_cat/thread_pool?v 
>>>>> doesn't show anything above 0 (although, when we are in the problem 
>>>>> state, it doesn't return a response if run on the master), 
>>>>> nodes/hot_threads doesn't show anything going on, etc.
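>>>>>
>>>>> For completeness, the two checks referred to above:
>>>>>
>>>>> curl -XGET 'localhost:9200/_cat/thread_pool?v'
>>>>> curl -XGET 'localhost:9200/_nodes/hot_threads'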
>>>>>
>>>>> On Thursday, December 4, 2014 4:10:37 PM UTC-5, Support Monkey wrote:
>>>>>>
>>>>>> I would think the network is a prime suspect then, as there is no 
>>>>>> significant difference between 1.2.x and 1.3.x in relation to memory 
>>>>>> usage. And you'd certainly see OOMs in node logs if it was a memory 
>>>>>> issue.
>>>>>>
>>>>>> On Thursday, December 4, 2014 12:45:58 PM UTC-8, Chris Moore wrote:
>>>>>>>
>>>>>>> There is nothing (literally) in the log of either data node after 
>>>>>>> the node joined events, and nothing in the master log between index 
>>>>>>> recovery and the first error message.
>>>>>>>
>>>>>>> There are 0 queries run before the errors start occurring (access to 
>>>>>>> the nodes is blocked via a firewall, so the only communications are 
>>>>>>> between the nodes). We have 50% of the RAM allocated to the heap on 
>>>>>>> each node (4GB each).
>>>>>>>
>>>>>>> This cluster operated without issue under 1.1.2. Did something 
>>>>>>> change between 1.1.2 and 1.3.5 that drastically increased idle heap 
>>>>>>> requirements?
>>>>>>>
>>>>>>>
>>>>>>> On Thursday, December 4, 2014 3:29:23 PM UTC-5, Support Monkey wrote:
>>>>>>>>
>>>>>>>> Generally, ReceiveTimeoutTransportException is due to network 
>>>>>>>> disconnects or a node failing to respond due to heavy load. What does 
>>>>>>>> the log of pYi3z5PgRh6msJX_armz_A show you? Perhaps it has too little 
>>>>>>>> heap allocated. The rule of thumb is half the available memory, but 
>>>>>>>> no more than 31GB.
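>>>>>>>>
>>>>>>>> On the Ubuntu packages the heap is typically set via ES_HEAP_SIZE, 
>>>>>>>> e.g. in /etc/default/elasticsearch:
>>>>>>>>
>>>>>>>> # half of the 8 GB on these machines
>>>>>>>> ES_HEAP_SIZE=4g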
>>>>>>>>
>>>>>>>> On Wednesday, December 3, 2014 12:52:58 PM UTC-8, Jeff Keller wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ES Version: 1.3.5
>>>>>>>>>
>>>>>>>>> OS: Ubuntu 14.04.1 LTS
>>>>>>>>>
>>>>>>>>> Machine: 2 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 8 GB RAM at AWS
>>>>>>>>>
>>>>>>>>> master (ip-10-0-1-18), 2 data nodes (ip-10-0-1-19, ip-10-0-1-20)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *After upgrading from ES 1.1.2...*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 1. Start up ES on the master
>>>>>>>>> 2. All nodes join the cluster
>>>>>>>>> 3. [2014-12-03 20:30:54,789][INFO ][gateway                  ] [ip-10-0-1-18.ec2.internal] recovered [157] indices into cluster_state
>>>>>>>>> 4. Checked health a few times:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> curl -XGET localhost:9200/_cat/health?v
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 5. 6 minutes after cluster recovery initiates (and 5:20 after the 
>>>>>>>>> recovery finishes), the log on the master node (10.0.1.18) reports:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [2014-12-03 20:36:57,532][DEBUG][action.admin.cluster.node.stats] [ip-10-0-1-18.ec2.internal] failed to execute on node [pYi3z5PgRh6msJX_armz_A]
>>>>>>>>> org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-0-1-20.ec2.internal][inet[/10.0.1.20:9300]][cluster/nodes/stats/n] request_id [17564] timed out after [15001ms]
>>>>>>>>>     at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 6. Every 30 or 60 seconds, the above error is reported for one or 
>>>>>>>>> more of the data nodes (a comparable stats request can be issued 
>>>>>>>>> manually; see the curl example after step 7)
>>>>>>>>>
>>>>>>>>> 7. During this time, queries (search, index, etc.) don't return. They 
>>>>>>>>> hang until the error state temporarily resolves itself (a varying 
>>>>>>>>> time of around 15-20 minutes), at which point the expected result is 
>>>>>>>>> returned.
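>>>>>>>>>
>>>>>>>>> For the manual check mentioned in step 6, a comparable stats request 
>>>>>>>>> can be sent over HTTP directly to a data node (this asks only the 
>>>>>>>>> receiving node for its own stats):
>>>>>>>>>
>>>>>>>>> curl -XGET 'http://10.0.1.20:9200/_nodes/_local/stats?pretty'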
>>>>>>>>>
>>>>>>>>>  -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "elasticsearch" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/elasticsearch/99a45801-2b95-4a21-a6bf-ca724f41bbc2%40googlegroups.com
>>>>>  
>>>>> <https://groups.google.com/d/msgid/elasticsearch/99a45801-2b95-4a21-a6bf-ca724f41bbc2%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/1ad26e40-a1bf-4302-aba4-551c7d862db1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to