As a follow-up, I closed all the indices on the cluster, then opened them one at a time and optimized each down to 1 segment (rough commands below). I made it through ~60% of the indices (and probably ~45% of the data) before the same errors showed up in the master log and the same behavior resumed.
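For reference, the sequence looked roughly like this ("my_index" stands in for each real index name; _optimize is the 1.x API name, and the health call just waits for the index to come open before optimizing):

curl -XPOST 'localhost:9200/_all/_close'

# then, for each index in turn:
curl -XPOST 'localhost:9200/my_index/_open'
curl -XGET 'localhost:9200/_cluster/health/my_index?wait_for_status=yellow&timeout=5m'
curl -XPOST 'localhost:9200/my_index/_optimize?max_num_segments=1'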
On Friday, December 5, 2014 3:57:12 PM UTC-5, Chris Moore wrote:
>
> I replied once, but it seems to have disappeared, so if this gets double posted, I'm sorry.
>
> We disabled all monitoring when we started looking into the issues to ensure there was no external load on ES. Everything we are currently seeing is just whatever activity ES generates internally.
>
> My understanding regarding optimizing indices is that you shouldn't call it explicitly on indices that are regularly updated; rather, you should let the background merge process handle things. As the majority of our indices are regularly updated, we don't explicitly call optimize on them. I can try calling it on all of them and see if it helps.
>
> As for disk speed, we are currently running ES on SSDs. We have it in our roadmap to change that to RAIDed SSDs, but it hasn't been a priority, as we have been getting acceptable performance thus far.
>
> On Friday, December 5, 2014 2:59:11 PM UTC-5, Jörg Prante wrote:
>>
>> Do you have a monitoring tool running?
>>
>> I recommend switching it off, optimizing your indices, and then updating your monitoring tools.
>>
>> It seems you have many segments or a slow disk if they can't be reported within 15s.
>>
>> Jörg
>>
>> On 05.12.2014 at 16:10, "Chris Moore" <[email protected]> wrote:
>>
>>> This is running on Amazon EC2 in a VPC on dedicated instances. The physical network infrastructure is likely fine. Are there specific network issues you think we should look into?
>>>
>>> When we are in a problem state, we can communicate between the nodes just fine. I can run curl requests to ES (health checks, etc.) from the master node to the data nodes directly, and they return as expected. So there doesn't seem to be a socket exhaustion issue (additionally, there are no kernel errors being reported).
>>>
>>> It feels like there is a queue/buffer filling up somewhere and that, once it has availability again, things start working. But /_cat/thread_pool?v doesn't show anything above 0 (although, when we are in the problem state, it doesn't return a response if run on master), nodes/hot_threads doesn't show anything going on, etc.
>>>
>>> On Thursday, December 4, 2014 4:10:37 PM UTC-5, Support Monkey wrote:
>>>>
>>>> I would think the network is a prime suspect then, as there is no significant difference between 1.2.x and 1.3.x in relation to memory usage. And you'd certainly see OOMs in the node logs if it was a memory issue.
>>>>
>>>> On Thursday, December 4, 2014 12:45:58 PM UTC-8, Chris Moore wrote:
>>>>>
>>>>> There is nothing (literally) in the log of either data node after the node-joined events, and nothing in the master log between index recovery and the first error message.
>>>>>
>>>>> There are 0 queries run before the errors start occurring (access to the nodes is blocked via a firewall, so the only communications are between the nodes). We have 50% of the RAM allocated to the heap on each node (4 GB each).
>>>>>
>>>>> This cluster operated without issue under 1.1.2. Did something change between 1.1.2 and 1.3.5 that drastically increased idle heap requirements?
>>>>>
>>>>> On Thursday, December 4, 2014 3:29:23 PM UTC-5, Support Monkey wrote:
>>>>>>
>>>>>> Generally, ReceiveTimeoutTransportException is due to network disconnects or a node failing to respond due to heavy load. What does the log of pYi3z5PgRh6msJX_armz_A show you? Perhaps it has too little heap allocated. The rule of thumb is 1/2 of available memory, but <= 31GB.
>>>>>>
>>>>>> On Wednesday, December 3, 2014 12:52:58 PM UTC-8, Jeff Keller wrote:
>>>>>>>
>>>>>>> ES Version: 1.3.5
>>>>>>> OS: Ubuntu 14.04.1 LTS
>>>>>>> Machine: 2x Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 8 GB RAM, at AWS
>>>>>>> master (ip-10-0-1-18), 2 data nodes (ip-10-0-1-19, ip-10-0-1-20)
>>>>>>>
>>>>>>> *After upgrading from ES 1.1.2...*
>>>>>>>
>>>>>>> 1. Start up ES on master
>>>>>>> 2. All nodes join the cluster
>>>>>>> 3. [2014-12-03 20:30:54,789][INFO ][gateway ] [ip-10-0-1-18.ec2.internal] recovered [157] indices into cluster_state
>>>>>>> 4. Checked health a few times:
>>>>>>>
>>>>>>> curl -XGET localhost:9200/_cat/health?v
>>>>>>>
>>>>>>> 5. 6 minutes after cluster recovery initiates (and 5:20 after the recovery finishes), the log on the master node (10.0.1.18) reports:
>>>>>>>
>>>>>>> [2014-12-03 20:36:57,532][DEBUG][action.admin.cluster.node.stats] [ip-10-0-1-18.ec2.internal] failed to execute on node [pYi3z5PgRh6msJX_armz_A]
>>>>>>> org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-0-1-20.ec2.internal][inet[/10.0.1.20:9300]][cluster/nodes/stats/n] request_id [17564] timed out after [15001ms]
>>>>>>>     at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>>
>>>>>>> 6. Every 30 or 60 seconds, the above error is reported for one or more of the data nodes.
>>>>>>>
>>>>>>> 7. During this time, queries (search, index, etc.) don't return. They hang until the error state temporarily resolves itself (after a varying time of around 15-20 minutes), at which point the expected result is returned.
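P.S. For anyone digging through this thread later, these are the ES 1.x diagnostic calls being discussed above (run against any node; host/port assumed to be the local defaults):

curl -XGET 'localhost:9200/_cat/health?v'
curl -XGET 'localhost:9200/_cat/thread_pool?v'
curl -XGET 'localhost:9200/_nodes/hot_threads'
curl -XGET 'localhost:9200/_nodes/stats/jvm?pretty'   # per-node heap used/committed

And the heap rule of thumb above works out to something like ES_HEAP_SIZE=4g (typically set in /etc/default/elasticsearch on Ubuntu) for these 8 GB machines.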
