We tried many different test setups yesterday. The first setup we tried was:

1 master, 2 data nodes
38 indices
10 shards per index
1 replica per index
760 total shards (380 primary, 380 replica)
Each index had 2,745 documents
Each index was 218.9kb in size (according to the _cat/indices API)

We realize that 10 shards per index with only 2 nodes is not a good idea, so we changed that and reran the tests. We changed shards per index to the default of 5, put 100 indices on the 2 boxes, and ran into the same issue. It was the same dataset, so the per-index size information above still applies.

After that, we turned off one of the data nodes, set replicas to 0 and shards per index to 1. With the same dataset, I loaded ~440 indices and ran into the timeout issues with the master and data nodes just idling.

This is just a test dataset, containing no confidential information, that we came up with to quickly reproduce our issues. Once we figure out the issues affecting this test dataset, we'll try things with our real dataset.

All of this works fine on ES 1.1.2, but not on 1.3.x (1.3.5 is our current test version). We have also tried our real setup on 1.4.1 to no avail.

On Tuesday, December 23, 2014 5:03:30 PM UTC-5, Mark Walkom wrote:
>
> Can you elaborate on your dataset and structure; how many indexes, how
> many shards, how big they are etc.
>
> On 24 December 2014 at 07:36, Chris Moore <[email protected]> wrote:
>
>> Updating again:
>>
>> If we reduce the number of shards per node to below ~350, the system
>> operates fine. Once we go above that (number_of_indices *
>> number_of_shards_per_index * number_of_replicas / number_of_nodes), we
>> start running into the described issues.
>>
>> On Friday, December 12, 2014 2:11:08 PM UTC-5, Chris Moore wrote:
>>>
>>> Just a quick update, we duplicated our test environment to see if this
>>> issue was fixed by upgrading to 1.4.1 instead. We received the same errors
>>> under 1.4.1.
>>>
>>> On Friday, December 5, 2014 4:52:05 PM UTC-5, Chris Moore wrote:
>>>>
>>>> As a followup, I closed all the indices on the cluster. I would then
>>>> open 1 index and optimize it down to 1 segment. I made it through ~60% of
>>>> the indices (and probably ~45% of the data) before the same errors showed
>>>> up in the master log and the same behavior resumed.
>>>>
>>>> On Friday, December 5, 2014 3:57:12 PM UTC-5, Chris Moore wrote:
>>>>>
>>>>> I replied once, but it seems to have disappeared, so if this gets
>>>>> double posted, I'm sorry.
>>>>>
>>>>> We disabled all monitoring when we started looking into the issues to
>>>>> ensure there was no external load on ES. Everything we are currently
>>>>> seeing is just whatever activity ES generates internally.
>>>>>
>>>>> My understanding regarding optimizing indices is that you shouldn't
>>>>> call it explicitly on indices that are regularly updating, rather you
>>>>> should let the background merge process handle things. As the majority of
>>>>> our indices regularly update, we don't explicitly call optimize on them.
>>>>> I can try to call it on them all and see if it helps.
>>>>>
>>>>> As for disk speed, we are currently running ES on SSDs. We have it in
>>>>> our roadmap to change that to RAIDed SSDs, but it hasn't been a priority
>>>>> as we have been getting acceptable performance thus far.
>>>>>
>>>>> On Friday, December 5, 2014 2:59:11 PM UTC-5, Jörg Prante wrote:
>>>>>>
>>>>>> Do you have a monitor tool running?
>>>>>>
>>>>>> I recommend to switch it off, and optimize your indices, and then
>>>>>> update your monitoring tools.
>>>>>>
>>>>>> Seems you have many segments/slow disk to get them reported in 15s.
>>>>>>
>>>>>> Jörg
>>>>>> On 05.12.2014 at 16:10, "Chris Moore" <[email protected]> wrote:
>>>>>>
>>>>>>> This is running on Amazon EC2 in a VPC on dedicated instances.
>>>>>>> Physical network infrastructure is likely fine.
>>>>>>> Are there specific network issues you think we should look into?
>>>>>>>
>>>>>>> When we are in a problem state, we can communicate between the nodes
>>>>>>> just fine. I can run curl requests to ES (health checks, etc) from the
>>>>>>> master node to the data nodes directly and they return as expected. So,
>>>>>>> there doesn't seem to be a socket exhaustion issue (additionally there
>>>>>>> are no kernel errors being reported).
>>>>>>>
>>>>>>> It feels like there is a queue/buffer filling up somewhere that once
>>>>>>> it has availability again, things start working. But,
>>>>>>> /_cat/thread_pool?v doesn't show anything above 0 (although, when we
>>>>>>> are in the problem state, it doesn't return a response if run on
>>>>>>> master), nodes/hot_threads doesn't show anything going on, etc.
>>>>>>>
>>>>>>> On Thursday, December 4, 2014 4:10:37 PM UTC-5, Support Monkey wrote:
>>>>>>>>
>>>>>>>> I would think the network is a prime suspect then, as there is no
>>>>>>>> significant difference between 1.2.x and 1.3.x in relation to memory
>>>>>>>> usage. And you'd certainly see OOMs in node logs if it was a memory
>>>>>>>> issue.
>>>>>>>>
>>>>>>>> On Thursday, December 4, 2014 12:45:58 PM UTC-8, Chris Moore wrote:
>>>>>>>>>
>>>>>>>>> There is nothing (literally) in the log of either data node after
>>>>>>>>> the node joined events and nothing in the master log between index
>>>>>>>>> recovery and the first error message.
>>>>>>>>>
>>>>>>>>> There are 0 queries run before the errors start occurring (access
>>>>>>>>> to the nodes is blocked via a firewall, so the only communications
>>>>>>>>> are between the nodes). We have 50% of the RAM allocated to the heap
>>>>>>>>> on each node (4GB each).
>>>>>>>>>
>>>>>>>>> This cluster operated without issue under 1.1.2. Did something
>>>>>>>>> change between 1.1.2 and 1.3.5 that drastically increased idle heap
>>>>>>>>> requirements?
>>>>>>>>>
>>>>>>>>> On Thursday, December 4, 2014 3:29:23 PM UTC-5, Support Monkey wrote:
>>>>>>>>>>
>>>>>>>>>> Generally ReceiveTimeoutTransportException is due to network
>>>>>>>>>> disconnects or a node failing to respond due to heavy load. What
>>>>>>>>>> does the log of pYi3z5PgRh6msJX_armz_A show you? Perhaps it has too
>>>>>>>>>> little heap allocated. Rule of thumb is 1/2 available memory but
>>>>>>>>>> <= 31GB
>>>>>>>>>>
>>>>>>>>>> On Wednesday, December 3, 2014 12:52:58 PM UTC-8, Jeff Keller wrote:
>>>>>>>>>>>
>>>>>>>>>>> ES Version: 1.3.5
>>>>>>>>>>>
>>>>>>>>>>> OS: Ubuntu 14.04.1 LTS
>>>>>>>>>>>
>>>>>>>>>>> Machine: 2 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 8 GB RAM
>>>>>>>>>>> at AWS
>>>>>>>>>>>
>>>>>>>>>>> master (ip-10-0-1-18), 2 data nodes (ip-10-0-1-19, ip-10-0-1-20)
>>>>>>>>>>>
>>>>>>>>>>> *After upgrading from ES 1.1.2...*
>>>>>>>>>>>
>>>>>>>>>>> 1. Startup ES on master
>>>>>>>>>>> 2. All nodes join cluster
>>>>>>>>>>> 3. [2014-12-03 20:30:54,789][INFO ][gateway ]
>>>>>>>>>>> [ip-10-0-1-18.ec2.internal] recovered [157] indices into
>>>>>>>>>>> cluster_state
>>>>>>>>>>> 4. Checked health a few times
>>>>>>>>>>>
>>>>>>>>>>> curl -XGET localhost:9200/_cat/health?v
>>>>>>>>>>>
>>>>>>>>>>> 5. 6 minutes after cluster recovery initiates (and 5:20 after
>>>>>>>>>>> the recovery finishes), the log on the master node (10.0.1.18)
>>>>>>>>>>> reports:
>>>>>>>>>>>
>>>>>>>>>>> [2014-12-03 20:36:57,532][DEBUG][action.admin.cluster.node.stats]
>>>>>>>>>>> [ip-10-0-1-18.ec2.internal] failed to execute on node
>>>>>>>>>>> [pYi3z5PgRh6msJX_armz_A]
>>>>>>>>>>>
>>>>>>>>>>> org.elasticsearch.transport.ReceiveTimeoutTransportException:
>>>>>>>>>>> [ip-10-0-1-20.ec2.internal][inet[/10.0.1.20:9300]][cluster/nodes/stats/n]
>>>>>>>>>>> request_id [17564] timed out after [15001ms]
>>>>>>>>>>>
>>>>>>>>>>> at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
>>>>>>>>>>>
>>>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>>>>>
>>>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>>>>>
>>>>>>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>>>>>>
>>>>>>>>>>> 6. Every 30 seconds or 60 seconds, the above error is reported
>>>>>>>>>>> for one or more of the data nodes
>>>>>>>>>>>
>>>>>>>>>>> 7. During this time, queries (search, index, etc.) don’t return.
>>>>>>>>>>> They hang until the error state temporarily resolves itself (a
>>>>>>>>>>> varying time around 15-20 minutes) at which point the expected
>>>>>>>>>>> result is returned.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "elasticsearch" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/elasticsearch/99a45801-2b95-4a21-a6bf-ca724f41bbc2%40googlegroups.com.
>>>>>>> For more options, visit https://groups.google.com/d/optout.
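
For reference, here is a rough sketch of how the shards-per-node figures work out for the three test setups described at the top of this message. It assumes each replica adds one full copy of every primary (hence the 1 + replicas factor, which matches the 760 total reported for the first setup) and that the 100-index run kept 1 replica, which the message does not state explicitly. The function name and the APPROX_THRESHOLD constant are made up for illustration; ~350 is only the rough figure observed in this thread, not a documented Elasticsearch limit.

```python
# Rough shards-per-data-node arithmetic for the setups in this thread.
# APPROX_THRESHOLD is the ~350 figure observed here, not an ES setting.
APPROX_THRESHOLD = 350


def shards_per_data_node(indices, shards_per_index, replicas, data_nodes):
    """Return total shard copies (primaries + replicas) per data node."""
    total_shards = indices * shards_per_index * (1 + replicas)
    return total_shards / data_nodes


# (label, indices, shards per index, replicas, data nodes)
setups = [
    ("38 indices x 10 shards, 1 replica, 2 data nodes", 38, 10, 1, 2),
    # Assumption: replicas stayed at 1 for this run.
    ("100 indices x 5 shards, 1 replica, 2 data nodes", 100, 5, 1, 2),
    ("~440 indices x 1 shard, 0 replicas, 1 data node", 440, 1, 0, 1),
]

for label, indices, shards, replicas, nodes in setups:
    per_node = shards_per_data_node(indices, shards, replicas, nodes)
    status = "above" if per_node > APPROX_THRESHOLD else "at or below"
    print(f"{label}: {per_node:.0f} shards/node ({status} ~{APPROX_THRESHOLD})")
```

Under those assumptions all three setups come out above the ~350 shards-per-node mark, which is consistent with each of them hitting the timeouts.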
