Do you have a monitoring tool running? I recommend switching it off, optimizing your indices, and then updating your monitoring tools.
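To check the segment counts before optimizing, something along these lines should work (a minimal sketch, assuming the default localhost:9200 endpoint and the ES 1.x APIs; the optimize call is heavy on disk I/O, so run it while the cluster is otherwise quiet):

    # Indices segments API: per-shard segment counts and sizes
    curl -XGET 'localhost:9200/_segments?pretty'

    # Merge every index down to one segment per shard (ES 1.x optimize API)
    curl -XPOST 'localhost:9200/_optimize?max_num_segments=1'
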
It seems you have many segments or a slow disk if it takes more than 15s to report them.

Jörg

On Dec 5, 2014, at 16:10, "Chris Moore" <[email protected]> wrote:

> This is running on Amazon EC2 in a VPC on dedicated instances. The physical
> network infrastructure is likely fine. Are there specific network issues
> you think we should look into?
>
> When we are in a problem state, we can communicate between the nodes just
> fine. I can run curl requests to ES (health checks, etc.) from the master
> node to the data nodes directly and they return as expected. So there
> doesn't seem to be a socket exhaustion issue (additionally, there are no
> kernel errors being reported).
>
> It feels like there is a queue/buffer filling up somewhere; once it has
> availability again, things start working. But /_cat/thread_pool?v doesn't
> show anything above 0 (although, when we are in the problem state, it
> doesn't return a response if run on the master), nodes/hot_threads doesn't
> show anything going on, etc.
>
> On Thursday, December 4, 2014 4:10:37 PM UTC-5, Support Monkey wrote:
>>
>> I would think the network is a prime suspect then, as there is no
>> significant difference between 1.2.x and 1.3.x in relation to memory usage.
>> And you'd certainly see OOMs in node logs if it was a memory issue.
>>
>> On Thursday, December 4, 2014 12:45:58 PM UTC-8, Chris Moore wrote:
>>>
>>> There is nothing (literally) in the log of either data node after the
>>> node joined events, and nothing in the master log between index recovery
>>> and the first error message.
>>>
>>> No queries are run before the errors start occurring (access to the
>>> nodes is blocked via a firewall, so the only communications are between
>>> the nodes). We have 50% of the RAM allocated to the heap on each node
>>> (4 GB each).
>>>
>>> This cluster operated without issue under 1.1.2. Did something change
>>> between 1.1.2 and 1.3.5 that drastically increased idle heap requirements?
>>>
>>> On Thursday, December 4, 2014 3:29:23 PM UTC-5, Support Monkey wrote:
>>>>
>>>> Generally, ReceiveTimeoutTransportException is due to network
>>>> disconnects or a node failing to respond due to heavy load. What does
>>>> the log of pYi3z5PgRh6msJX_armz_A show you? Perhaps it has too little
>>>> heap allocated. The rule of thumb is 1/2 of available memory, but <= 31 GB.
>>>>
>>>> On Wednesday, December 3, 2014 12:52:58 PM UTC-8, Jeff Keller wrote:
>>>>>
>>>>> ES Version: 1.3.5
>>>>> OS: Ubuntu 14.04.1 LTS
>>>>> Machine: 2x Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 8 GB RAM, at AWS
>>>>> master (ip-10-0-1-18), 2 data nodes (ip-10-0-1-19, ip-10-0-1-20)
>>>>>
>>>>> *After upgrading from ES 1.1.2...*
>>>>>
>>>>> 1. Start up ES on the master
>>>>> 2. All nodes join the cluster
>>>>> 3. The master logs:
>>>>>    [2014-12-03 20:30:54,789][INFO ][gateway ] [ip-10-0-1-18.ec2.internal] recovered [157] indices into cluster_state
>>>>> 4. Checked health a few times:
>>>>>    curl -XGET localhost:9200/_cat/health?v
>>>>> 5. Six minutes after cluster recovery initiates (and 5:20 after the
>>>>>    recovery finishes), the log on the master node (10.0.1.18) reports:
>>>>>
>>>>>    [2014-12-03 20:36:57,532][DEBUG][action.admin.cluster.node.stats] [ip-10-0-1-18.ec2.internal] failed to execute on node [pYi3z5PgRh6msJX_armz_A]
>>>>>    org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-0-1-20.ec2.internal][inet[/10.0.1.20:9300]][cluster/nodes/stats/n] request_id [17564] timed out after [15001ms]
>>>>>        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
>>>>>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>        at java.lang.Thread.run(Thread.java:745)
>>>>>
>>>>> 6. Every 30 or 60 seconds, the above error is reported for one or more
>>>>>    of the data nodes.
>>>>> 7. During this time, queries (search, index, etc.) don't return. They
>>>>>    hang until the error state temporarily resolves itself (a varying
>>>>>    time, around 15-20 minutes), at which point the expected result is
>>>>>    returned.
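For reference, a minimal sketch of the checks discussed above, to capture while the cluster is in the hung state (assuming the default HTTP port 9200; the pending-tasks call is an extra check not mentioned in the thread, added on the assumption that a backed-up master task queue could explain the hang):

    # Per-node thread pool active/queue/rejected counts
    curl -XGET 'localhost:9200/_cat/thread_pool?v'

    # Stack samples of the busiest threads on every node
    curl -XGET 'localhost:9200/_nodes/hot_threads'

    # Cluster-state update tasks queued on the master (assumption: a
    # backed-up pending-tasks queue could make everything hang)
    curl -XGET 'localhost:9200/_cluster/pending_tasks?pretty'

    # The same node stats that are timing out, requested by hand, limited
    # to JVM stats to see heap usage per node
    curl -XGET 'localhost:9200/_nodes/stats/jvm?pretty'

Running the same commands directly against each data node's own port 9200 (as Chris describes doing) helps separate a master-side problem from a data-node-side one.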
