Can you elaborate on your dataset and structure: how many indices, how many shards, how big they are, etc.?
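If it helps, the 1.x cat APIs summarize most of that in one place; a quick sketch, assuming the default host and port:

    curl 'localhost:9200/_cat/indices?v'     # per index: shard/replica counts, doc count, store size
    curl 'localhost:9200/_cat/allocation?v'  # per node: shard count and disk used
    curl 'localhost:9200/_cat/shards?v'      # per shard: placement and size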
On 24 December 2014 at 07:36, Chris Moore <[email protected]> wrote:

> Updating again:
>
> If we reduce the number of shards per node to below ~350, the system operates fine. Once we go above that (number_of_indices * number_of_shards_per_index * number_of_replicas / number_of_nodes), we start running into the described issues.
>
> On Friday, December 12, 2014 2:11:08 PM UTC-5, Chris Moore wrote:
>>
>> Just a quick update: we duplicated our test environment to see if this issue was fixed by upgrading to 1.4.1 instead. We received the same errors under 1.4.1.
>>
>> On Friday, December 5, 2014 4:52:05 PM UTC-5, Chris Moore wrote:
>>>
>>> As a follow-up, I closed all the indices on the cluster. I would then open one index and optimize it down to 1 segment. I made it through ~60% of the indices (and probably ~45% of the data) before the same errors showed up in the master log and the same behavior resumed.
>>>
>>> On Friday, December 5, 2014 3:57:12 PM UTC-5, Chris Moore wrote:
>>>>
>>>> I replied once, but it seems to have disappeared, so if this gets double posted, I'm sorry.
>>>>
>>>> We disabled all monitoring when we started looking into the issues to ensure there was no external load on ES. Everything we are currently seeing is just whatever activity ES generates internally.
>>>>
>>>> My understanding regarding optimizing indices is that you shouldn't call it explicitly on indices that are regularly updated; rather, you should let the background merge process handle things. As the majority of our indices update regularly, we don't explicitly call optimize on them. I can try calling it on all of them and see if it helps.
>>>>
>>>> As for disk speed, we are currently running ES on SSDs. We have it on our roadmap to change that to RAIDed SSDs, but it hasn't been a priority as we have been getting acceptable performance thus far.
>>>>
>>>> On Friday, December 5, 2014 2:59:11 PM UTC-5, Jörg Prante wrote:
>>>>>
>>>>> Do you have a monitoring tool running?
>>>>>
>>>>> I recommend switching it off, optimizing your indices, and then updating your monitoring tools.
>>>>>
>>>>> It seems you have too many segments, or too slow a disk, for them to be reported within 15s.
>>>>>
>>>>> Jörg
>>>>>
>>>>> On 05.12.2014 at 16:10, "Chris Moore" <[email protected]> wrote:
>>>>>
>>>>>> This is running on Amazon EC2 in a VPC on dedicated instances. Physical network infrastructure is likely fine. Are there specific network issues you think we should look into?
>>>>>>
>>>>>> When we are in a problem state, we can communicate between the nodes just fine. I can run curl requests to ES (health checks, etc.) from the master node to the data nodes directly and they return as expected. So there doesn't seem to be a socket exhaustion issue (additionally, there are no kernel errors being reported).
>>>>>>
>>>>>> It feels like there is a queue/buffer filling up somewhere; once it has availability again, things start working. But /_cat/thread_pool?v doesn't show anything above 0 (although, when we are in the problem state, it doesn't return a response if run on the master), nodes/hot_threads doesn't show anything going on, etc.
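For anyone following along, the checks and the optimize pass described above correspond roughly to the commands below. This is only a sketch: "index_name" is a placeholder, the host/port are assumed defaults, and the per-node count assumes the node name is the last column of the default _cat/shards output. As a rough sanity check on the ~350 figure: if the 157 recovered indices mentioned further down used the 1.x defaults of 5 shards and 1 replica (not stated in the thread), that would be about 157 * 5 * 2 / 2 = 785 shard copies on each of the two data nodes.

    # Count shard copies per node (node name assumed to be the last column)
    curl -s 'localhost:9200/_cat/shards' | awk '{print $NF}' | sort | uniq -c

    # The diagnostics mentioned above
    curl 'localhost:9200/_cat/thread_pool?v'
    curl 'localhost:9200/_nodes/hot_threads'

    # One-index-at-a-time optimize pass ("index_name" is a placeholder)
    curl -XPOST 'localhost:9200/index_name/_close'
    curl -XPOST 'localhost:9200/index_name/_open'
    curl -XPOST 'localhost:9200/index_name/_optimize?max_num_segments=1'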
>>>>>> On Thursday, December 4, 2014 4:10:37 PM UTC-5, Support Monkey wrote:
>>>>>>>
>>>>>>> I would think the network is a prime suspect then, as there is no significant difference between 1.2.x and 1.3.x in relation to memory usage. And you'd certainly see OOMs in the node logs if it was a memory issue.
>>>>>>>
>>>>>>> On Thursday, December 4, 2014 12:45:58 PM UTC-8, Chris Moore wrote:
>>>>>>>>
>>>>>>>> There is nothing (literally) in the log of either data node after the node-joined events, and nothing in the master log between index recovery and the first error message.
>>>>>>>>
>>>>>>>> There are 0 queries run before the errors start occurring (access to the nodes is blocked via a firewall, so the only communications are between the nodes). We have 50% of the RAM allocated to the heap on each node (4 GB each).
>>>>>>>>
>>>>>>>> This cluster operated without issue under 1.1.2. Did something change between 1.1.2 and 1.3.5 that drastically increased idle heap requirements?
>>>>>>>>
>>>>>>>> On Thursday, December 4, 2014 3:29:23 PM UTC-5, Support Monkey wrote:
>>>>>>>>>
>>>>>>>>> Generally, ReceiveTimeoutTransportException is due to network disconnects or a node failing to respond due to heavy load. What does the log of pYi3z5PgRh6msJX_armz_A show you? Perhaps it has too little heap allocated. The rule of thumb is 1/2 of available memory, but <= 31 GB.
>>>>>>>>>
>>>>>>>>> On Wednesday, December 3, 2014 12:52:58 PM UTC-8, Jeff Keller wrote:
>>>>>>>>>>
>>>>>>>>>> ES Version: 1.3.5
>>>>>>>>>> OS: Ubuntu 14.04.1 LTS
>>>>>>>>>> Machine: 2 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 8 GB RAM, at AWS
>>>>>>>>>> master (ip-10-0-1-18), 2 data nodes (ip-10-0-1-19, ip-10-0-1-20)
>>>>>>>>>>
>>>>>>>>>> *After upgrading from ES 1.1.2...*
>>>>>>>>>>
>>>>>>>>>> 1. Start up ES on the master.
>>>>>>>>>> 2. All nodes join the cluster.
>>>>>>>>>> 3. [2014-12-03 20:30:54,789][INFO ][gateway ] [ip-10-0-1-18.ec2.internal] recovered [157] indices into cluster_state
>>>>>>>>>> 4. Checked health a few times:
>>>>>>>>>>
>>>>>>>>>> curl -XGET localhost:9200/_cat/health?v
>>>>>>>>>>
>>>>>>>>>> 5. Six minutes after cluster recovery initiates (and 5:20 after the recovery finishes), the log on the master node (10.0.1.18) reports:
>>>>>>>>>>
>>>>>>>>>> [2014-12-03 20:36:57,532][DEBUG][action.admin.cluster.node.stats] [ip-10-0-1-18.ec2.internal] failed to execute on node [pYi3z5PgRh6msJX_armz_A]
>>>>>>>>>> org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-0-1-20.ec2.internal][inet[/10.0.1.20:9300]][cluster/nodes/stats/n] request_id [17564] timed out after [15001ms]
>>>>>>>>>>     at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
>>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>>>>>
>>>>>>>>>> 6. Every 30 or 60 seconds, the above error is reported for one or more of the data nodes.
>>>>>>>>>> 7. During this time, queries (search, index, etc.) don't return. They hang until the error state temporarily resolves itself (after a varying time, around 15-20 minutes), at which point the expected result is returned.
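Two follow-ups to the heap advice and the timed-out nodes/stats call above, as a sketch only (the address comes from this thread; the file path assumes the stock Ubuntu package layout, and the cat column names are assumptions to verify against your version):

    # In 1.x the heap is set via ES_HEAP_SIZE, e.g. in /etc/default/elasticsearch
    # (4g matches the 50%-of-8GB rule of thumb mentioned above)
    ES_HEAP_SIZE=4g

    # Ask the data node for its own stats directly, bypassing the master, to see
    # whether the node itself is slow to gather stats or only the master-side call stalls
    curl -XGET 'http://10.0.1.20:9200/_nodes/_local/stats?pretty'

    # Quick view of heap pressure across nodes
    curl 'http://10.0.1.20:9200/_cat/nodes?v&h=host,heap.percent,ram.percent,load'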
