Do you have a monitoring tool running?

I recommend switching it off, optimizing your indices, and then updating
your monitoring tools.

It seems you have so many segments, or such a slow disk, that the node stats
cannot be reported within 15s.
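
For example (assuming the cluster answers on localhost:9200, and with
"your_index" as a placeholder for one of your index names), you can inspect
the segment counts and then merge them down with the 1.x optimize API:

curl -XGET 'localhost:9200/_segments?pretty'
curl -XPOST 'localhost:9200/your_index/_optimize?max_num_segments=1'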

Jörg
On 05.12.2014 at 16:10, "Chris Moore" <[email protected]> wrote:

> This is running on Amazon EC2 in a VPC on dedicated instances. Physical
> network infrastructure is likely fine. Are there specific network issues
> you think we should look into?
>
> When we are in a problem state, we can still communicate between the nodes
> just fine. I can run curl requests to ES (health checks, etc.) from the
> master node directly to the data nodes and they return as expected. So there
> doesn't seem to be a socket exhaustion issue (and no kernel errors are being
> reported).
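>
> For example, this is the kind of direct check I mean (using the data-node
> IPs from Jeff's original post):
>
> curl -XGET 'http://10.0.1.19:9200/_cluster/health?pretty'
> curl -XGET 'http://10.0.1.20:9200/_cluster/health?pretty'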
>
> It feels like a queue or buffer is filling up somewhere, and once it has
> availability again, things start working. But /_cat/thread_pool?v doesn't
> show anything above 0 (although, when we are in the problem state, it
> doesn't return a response if run on the master), nodes/hot_threads doesn't
> show anything going on, etc.
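>
> (Concretely, the checks referred to above, run against the master:
>
> curl -XGET 'localhost:9200/_cat/thread_pool?v'
> curl -XGET 'localhost:9200/_nodes/hot_threads'
>
> The first lists active/queue/rejected counts for the bulk, index, and search
> pools by default; the second dumps the busiest threads on each node.)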
>
> On Thursday, December 4, 2014 4:10:37 PM UTC-5, Support Monkey wrote:
>>
>> I would think the network is a prime suspect then, as there is no
>> significant difference between 1.2.x and 1.3.x in relation to memory usage.
>> And you'd certainly see OOMs in the node logs if it were a memory issue.
>>
>> On Thursday, December 4, 2014 12:45:58 PM UTC-8, Chris Moore wrote:
>>>
>>> There is nothing (literally) in the log of either data node after the
>>> node joined events and nothing in the master log between index recovery and
>>> the first error message.
>>>
>>> There are zero queries run before the errors start occurring (access to
>>> the nodes is blocked by a firewall, so the only communications are between
>>> the nodes). We have 50% of the RAM allocated to the heap on each node (4GB
>>> each).
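>>>
>>> (For reference, that heap is set via the ES_HEAP_SIZE environment variable,
>>> which for the Ubuntu/deb package lives in /etc/default/elasticsearch, i.e.
>>> something equivalent to:
>>>
>>> ES_HEAP_SIZE=4g
>>>
>>> on each 8 GB node.)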
>>>
>>> This cluster operated without issue under 1.1.2. Did something change
>>> between 1.1.2 and 1.3.5 that drastically increased idle heap requirements?
>>>
>>>
>>> On Thursday, December 4, 2014 3:29:23 PM UTC-5, Support Monkey wrote:
>>>>
>>>> Generally, ReceiveTimeoutTransportException is due to network
>>>> disconnects or to a node failing to respond under heavy load. What does
>>>> the log of pYi3z5PgRh6msJX_armz_A show you? Perhaps it has too little heap
>>>> allocated. The rule of thumb is half the available memory, but <= 31GB.
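>>>>
>>>> You can confirm the configured and used heap on that node with something
>>>> like the following (hitting the data node directly; adjust the host to
>>>> wherever pYi3z5PgRh6msJX_armz_A is running):
>>>>
>>>> curl -XGET 'http://10.0.1.20:9200/_nodes/_local/stats/jvm?pretty'
>>>>
>>>> and compare heap_used_in_bytes against heap_max_in_bytes.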
>>>>
>>>> On Wednesday, December 3, 2014 12:52:58 PM UTC-8, Jeff Keller wrote:
>>>>>
>>>>>
>>>>> ES Version: 1.3.5
>>>>>
>>>>> OS: Ubuntu 14.04.1 LTS
>>>>>
>>>>> Machine: 2 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 8 GB RAM at AWS
>>>>>
>>>>> master (ip-10-0-1-18), 2 data nodes (ip-10-0-1-19, ip-10-0-1-20)
>>>>>
>>>>>
>>>>> *After upgrading from ES 1.1.2...*
>>>>>
>>>>>
>>>>> 1. Start up ES on the master
>>>>> 2. All nodes join cluster
>>>>> 3. [2014-12-03 20:30:54,789][INFO ][gateway                  ] [ip-10-0-1-18.ec2.internal] recovered [157] indices into cluster_state
>>>>> 4. Checked health a few times
>>>>>
>>>>>
>>>>> curl -XGET localhost:9200/_cat/health?v
>>>>>
>>>>>
>>>>> 5. Six minutes after cluster recovery initiates (and 5:20 after the
>>>>> recovery finishes), the log on the master node (10.0.1.18) reports:
>>>>>
>>>>>
>>>>> [2014-12-03 20:36:57,532][DEBUG][action.admin.cluster.node.stats] [ip-10-0-1-18.ec2.internal] failed to execute on node [pYi3z5PgRh6msJX_armz_A]
>>>>>
>>>>> org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-0-1-20.ec2.internal][inet[/10.0.1.20:9300]][cluster/nodes/stats/n] request_id [17564] timed out after [15001ms]
>>>>>     at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>
>>>>>
>>>>> 6. Every 30 or 60 seconds, the above error is reported for one or more
>>>>> of the data nodes.
>>>>>
>>>>> 7. During this time, queries (search, index, etc.) don’t return. They
>>>>> hang until the error state temporarily resolves itself (after a varying
>>>>> 15-20 minutes), at which point the expected result is returned.
>>>>>
>>>>>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/99a45801-2b95-4a21-a6bf-ca724f41bbc2%40googlegroups.com
> <https://groups.google.com/d/msgid/elasticsearch/99a45801-2b95-4a21-a6bf-ca724f41bbc2%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
