Can you elaborate on your dataset and structure: how many indices, how many shards, how big they are, etc.?
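If it helps, the 1.x cat APIs summarize most of that in one place; a quick sketch, assuming the default host and port:

    curl 'localhost:9200/_cat/indices?v'     # per index: shard/replica counts, doc count, store size
    curl 'localhost:9200/_cat/allocation?v'  # per node: shard count and disk used
    curl 'localhost:9200/_cat/shards?v'      # per shard: placement and size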
On 24 December 2014 at 07:36, Chris Moore <[email protected]> wrote:

> Updating again:
>
> If we reduce the number of shards per node to below ~350, the system operates fine. Once we go above that (number_of_indices * number_of_shards_per_index * number_of_replicas / number_of_nodes), we start running into the described issues.
>
> On Friday, December 12, 2014 2:11:08 PM UTC-5, Chris Moore wrote:
>>
>> Just a quick update: we duplicated our test environment to see if this issue was fixed by upgrading to 1.4.1 instead. We received the same errors under 1.4.1.
>>
>> On Friday, December 5, 2014 4:52:05 PM UTC-5, Chris Moore wrote:
>>>
>>> As a follow-up, I closed all the indices on the cluster. I would then open one index and optimize it down to 1 segment. I made it through ~60% of the indices (and probably ~45% of the data) before the same errors showed up in the master log and the same behavior resumed.
>>>
>>> On Friday, December 5, 2014 3:57:12 PM UTC-5, Chris Moore wrote:
>>>>
>>>> I replied once, but it seems to have disappeared, so if this gets double posted, I'm sorry.
>>>>
>>>> We disabled all monitoring when we started looking into the issues to ensure there was no external load on ES. Everything we are currently seeing is just whatever activity ES generates internally.
>>>>
>>>> My understanding regarding optimizing indices is that you shouldn't call it explicitly on indices that are regularly updated; rather, you should let the background merge process handle things. As the majority of our indices update regularly, we don't explicitly call optimize on them. I can try calling it on all of them and see if it helps.
>>>>
>>>> As for disk speed, we are currently running ES on SSDs. We have it on our roadmap to change that to RAIDed SSDs, but it hasn't been a priority as we have been getting acceptable performance thus far.
>>>>
>>>> On Friday, December 5, 2014 2:59:11 PM UTC-5, Jörg Prante wrote:
>>>>>
>>>>> Do you have a monitoring tool running?
>>>>>
>>>>> I recommend switching it off, optimizing your indices, and then updating your monitoring tools.
>>>>>
>>>>> It seems you have too many segments, or too slow a disk, for them to be reported within 15s.
>>>>>
>>>>> Jörg
>>>>>
>>>>> On 05.12.2014 at 16:10, "Chris Moore" <[email protected]> wrote:
>>>>>
>>>>>> This is running on Amazon EC2 in a VPC on dedicated instances. Physical network infrastructure is likely fine. Are there specific network issues you think we should look into?
>>>>>>
>>>>>> When we are in a problem state, we can communicate between the nodes just fine. I can run curl requests to ES (health checks, etc.) from the master node to the data nodes directly and they return as expected. So there doesn't seem to be a socket exhaustion issue (additionally, there are no kernel errors being reported).
>>>>>>
>>>>>> It feels like there is a queue/buffer filling up somewhere; once it has availability again, things start working. But /_cat/thread_pool?v doesn't show anything above 0 (although, when we are in the problem state, it doesn't return a response if run on the master), nodes/hot_threads doesn't show anything going on, etc.
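For anyone following along, the checks and the optimize pass described above correspond roughly to the commands below. This is only a sketch: "index_name" is a placeholder, the host/port are assumed defaults, and the per-node count assumes the node name is the last column of the default _cat/shards output. As a rough sanity check on the ~350 figure: if the 157 recovered indices mentioned further down used the 1.x defaults of 5 shards and 1 replica (not stated in the thread), that would be about 157 * 5 * 2 / 2 = 785 shard copies on each of the two data nodes.

    # Count shard copies per node (node name assumed to be the last column)
    curl -s 'localhost:9200/_cat/shards' | awk '{print $NF}' | sort | uniq -c

    # The diagnostics mentioned above
    curl 'localhost:9200/_cat/thread_pool?v'
    curl 'localhost:9200/_nodes/hot_threads'

    # One-index-at-a-time optimize pass ("index_name" is a placeholder)
    curl -XPOST 'localhost:9200/index_name/_close'
    curl -XPOST 'localhost:9200/index_name/_open'
    curl -XPOST 'localhost:9200/index_name/_optimize?max_num_segments=1'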
>>>>>> On Thursday, December 4, 2014 4:10:37 PM UTC-5, Support Monkey wrote:
>>>>>>>
>>>>>>> I would think the network is a prime suspect then, as there is no significant difference between 1.2.x and 1.3.x in relation to memory usage. And you'd certainly see OOMs in the node logs if it was a memory issue.
>>>>>>>
>>>>>>> On Thursday, December 4, 2014 12:45:58 PM UTC-8, Chris Moore wrote:
>>>>>>>>
>>>>>>>> There is nothing (literally) in the log of either data node after the node-joined events, and nothing in the master log between index recovery and the first error message.
>>>>>>>>
>>>>>>>> There are 0 queries run before the errors start occurring (access to the nodes is blocked via a firewall, so the only communications are between the nodes). We have 50% of the RAM allocated to the heap on each node (4 GB each).
>>>>>>>>
>>>>>>>> This cluster operated without issue under 1.1.2. Did something change between 1.1.2 and 1.3.5 that drastically increased idle heap requirements?
>>>>>>>>
>>>>>>>> On Thursday, December 4, 2014 3:29:23 PM UTC-5, Support Monkey wrote:
>>>>>>>>>
>>>>>>>>> Generally, ReceiveTimeoutTransportException is due to network disconnects or a node failing to respond due to heavy load. What does the log of pYi3z5PgRh6msJX_armz_A show you? Perhaps it has too little heap allocated. The rule of thumb is 1/2 of available memory, but <= 31 GB.
>>>>>>>>>
>>>>>>>>> On Wednesday, December 3, 2014 12:52:58 PM UTC-8, Jeff Keller wrote:
>>>>>>>>>>
>>>>>>>>>> ES Version: 1.3.5
>>>>>>>>>> OS: Ubuntu 14.04.1 LTS
>>>>>>>>>> Machine: 2 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 8 GB RAM, at AWS
>>>>>>>>>> master (ip-10-0-1-18), 2 data nodes (ip-10-0-1-19, ip-10-0-1-20)
>>>>>>>>>>
>>>>>>>>>> *After upgrading from ES 1.1.2...*
>>>>>>>>>>
>>>>>>>>>> 1. Start up ES on the master.
>>>>>>>>>> 2. All nodes join the cluster.
>>>>>>>>>> 3. [2014-12-03 20:30:54,789][INFO ][gateway ] [ip-10-0-1-18.ec2.internal] recovered [157] indices into cluster_state
>>>>>>>>>> 4. Checked health a few times:
>>>>>>>>>>
>>>>>>>>>> curl -XGET localhost:9200/_cat/health?v
>>>>>>>>>>
>>>>>>>>>> 5. Six minutes after cluster recovery initiates (and 5:20 after the recovery finishes), the log on the master node (10.0.1.18) reports:
>>>>>>>>>>
>>>>>>>>>> [2014-12-03 20:36:57,532][DEBUG][action.admin.cluster.node.stats] [ip-10-0-1-18.ec2.internal] failed to execute on node [pYi3z5PgRh6msJX_armz_A]
>>>>>>>>>> org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-0-1-20.ec2.internal][inet[/10.0.1.20:9300]][cluster/nodes/stats/n] request_id [17564] timed out after [15001ms]
>>>>>>>>>>     at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
>>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>>>>>
>>>>>>>>>> 6. Every 30 or 60 seconds, the above error is reported for one or more of the data nodes.
>>>>>>>>>> 7. During this time, queries (search, index, etc.) don't return. They hang until the error state temporarily resolves itself (after a varying time, around 15-20 minutes), at which point the expected result is returned.
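Two follow-ups to the heap advice and the timed-out nodes/stats call above, as a sketch only (the address comes from this thread; the file path assumes the stock Ubuntu package layout, and the cat column names are assumptions to verify against your version):

    # In 1.x the heap is set via ES_HEAP_SIZE, e.g. in /etc/default/elasticsearch
    # (4g matches the 50%-of-8GB rule of thumb mentioned above)
    ES_HEAP_SIZE=4g

    # Ask the data node for its own stats directly, bypassing the master, to see
    # whether the node itself is slow to gather stats or only the master-side call stalls
    curl -XGET 'http://10.0.1.20:9200/_nodes/_local/stats?pretty'

    # Quick view of heap pressure across nodes
    curl 'http://10.0.1.20:9200/_cat/nodes?v&h=host,heap.percent,ram.percent,load'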
