I replied once, but it seems to have disappeared, so if this gets 
double-posted, I'm sorry.

We disabled all monitoring when we started looking into the issues to 
ensure there was no external load on ES. Everything we are currently seeing 
is just whatever activity ES generates internally.

My understanding of optimizing indices is that you shouldn't call optimize 
explicitly on indices that update regularly; rather, you should let the 
background merge process handle things. Since the majority of our indices 
update regularly, we don't explicitly call optimize on them. I can try 
calling it on all of them and see if it helps.
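If I do go ahead and optimize everything, the call would be something like 
the following for each index (default host/port assumed; "my_index" is a 
placeholder):

curl -XPOST 'http://localhost:9200/my_index/_optimize?max_num_segments=1'

and then I'd check segment counts afterwards to see whether it actually 
changed anything:

curl -XGET 'http://localhost:9200/_cat/segments?v'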

As for disk speed, we are currently running ES on SSDs. We have it on our 
roadmap to move to RAIDed SSDs, but it hasn't been a priority, as 
performance has been acceptable thus far.

On Friday, December 5, 2014 2:59:11 PM UTC-5, Jörg Prante wrote:
>
> Do you have a monitoring tool running?
>
> I recommend switching it off, optimizing your indices, and then updating 
> your monitoring tools.
>
> It seems you have so many segments, or such a slow disk, that reporting 
> them takes more than 15s.
>
> Jörg
> On 05.12.2014 16:10, "Chris Moore" <[email protected]> wrote:
>
>> This is running on Amazon EC2 in a VPC on dedicated instances. Physical 
>> network infrastructure is likely fine. Are there specific network issues 
>> you think we should look into?
>>
>> When we are in a problem state, we can communicate between the nodes just 
>> fine. I can run curl requests to ES (health checks, etc.) from the master 
>> node to the data nodes directly, and they return as expected. So there 
>> doesn't seem to be a socket-exhaustion issue (and no kernel errors are 
>> being reported).
>>
>> It feels like a queue or buffer is filling up somewhere, and once it has 
>> availability again, things start working. But /_cat/thread_pool?v doesn't 
>> show anything above 0 (although, when we are in the problem state, it 
>> doesn't return a response if run on the master), /_nodes/hot_threads 
>> doesn't show anything going on, etc.
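>>
>> For reference, these are the checks we run from the master while in the 
>> problem state (default host/port assumed):
>>
>> curl -XGET localhost:9200/_cat/thread_pool?v
>> curl -XGET localhost:9200/_nodes/hot_threads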
>>
>> On Thursday, December 4, 2014 4:10:37 PM UTC-5, Support Monkey wrote:
>>>
>>> I would think the network is a prime suspect then, as there is no 
>>> significant difference between 1.2.x and 1.3.x in relation to memory usage. 
>>> And you'd certainly see OOMs in the node logs if it were a memory issue.
>>>
>>> On Thursday, December 4, 2014 12:45:58 PM UTC-8, Chris Moore wrote:
>>>>
>>>> There is nothing (literally) in the log of either data node after the 
>>>> node-joined events, and nothing in the master log between index recovery 
>>>> and the first error message.
>>>>
>>>> There are zero queries run before the errors start occurring (access to 
>>>> the nodes is blocked by a firewall, so the only communication is between 
>>>> the nodes). We have 50% of the RAM allocated to the heap on each node 
>>>> (4 GB each).
>>>>
>>>> This cluster operated without issue under 1.1.2. Did something change 
>>>> between 1.1.2 and 1.3.5 that drastically increased idle heap requirements?
>>>>
>>>>
>>>> On Thursday, December 4, 2014 3:29:23 PM UTC-5, Support Monkey wrote:
>>>>>
>>>>> Generally, ReceiveTimeoutTransportException is due to network 
>>>>> disconnects or to a node failing to respond under heavy load. What does 
>>>>> the log of pYi3z5PgRh6msJX_armz_A show you? Perhaps it has too little 
>>>>> heap allocated. The rule of thumb is half of available memory, but no 
>>>>> more than 31GB.
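>>>>>
>>>>> For an 8 GB machine, that rule works out to min(8 GB / 2, 31 GB) = 4 GB, 
>>>>> i.e. something like ES_HEAP_SIZE=4g (the exact way to set it depends on 
>>>>> how you launch ES).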
>>>>>
>>>>> On Wednesday, December 3, 2014 12:52:58 PM UTC-8, Jeff Keller wrote:
>>>>>>
>>>>>>
>>>>>> ES Version: 1.3.5
>>>>>>
>>>>>> OS: Ubuntu 14.04.1 LTS
>>>>>>
>>>>>> Machine: 2 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 8 GB RAM at AWS
>>>>>>
>>>>>> master (ip-10-0-1-18), 2 data nodes (ip-10-0-1-19, ip-10-0-1-20)
>>>>>>
>>>>>>
>>>>>> *After upgrading from ES 1.1.2...*
>>>>>>
>>>>>>
>>>>>> 1. Startup ES on master
>>>>>> 2. All nodes join cluster
>>>>>> 3. [2014-12-03 20:30:54,789][INFO ][gateway                  ] 
>>>>>> [ip-10-0-1-18.ec2.internal] recovered [157] indices into cluster_state
>>>>>> 4. Checked health a few times
>>>>>>
>>>>>>
>>>>>> curl -XGET localhost:9200/_cat/health?v
>>>>>>
>>>>>>
>>>>>> 5. 6 minutes after cluster recovery initiates (and 5:20 after the 
>>>>>> recovery finishes), the log on the master node (10.0.1.18) reports:
>>>>>>
>>>>>>
>>>>>> [2014-12-03 20:36:57,532][DEBUG][action.admin.cluster.node.stats] [ip-10-0-1-18.ec2.internal] failed to execute on node [pYi3z5PgRh6msJX_armz_A]
>>>>>>
>>>>>> org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-0-1-20.ec2.internal][inet[/10.0.1.20:9300]][cluster/nodes/stats/n] request_id [17564] timed out after [15001ms]
>>>>>>     at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>
>>>>>>
>>>>>> 6. Every 30 or 60 seconds, the above error is reported for one or more 
>>>>>> of the data nodes.
>>>>>>
>>>>>> 7. During this time, queries (search, index, etc.) don't return. They 
>>>>>> hang until the error state temporarily resolves itself (after a varying 
>>>>>> interval of roughly 15-20 minutes), at which point the expected result 
>>>>>> is returned.
>>>>>>
>>>>>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected] <javascript:>.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/99a45801-2b95-4a21-a6bf-ca724f41bbc2%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
