Master and data are separate nodes. The problem node (the master) never leaves 
the cluster (there are no messages in the logs of the other nodes, and 
/_cat/health reports it is still there). It will respond to requests that 
don't require checking with other nodes for any data (so /_cat/health is 
fine but /_search is not). Detaching jstack does not fix that behavior.
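
As a rough illustration of the two cases (host and port are placeholders, 
assuming the default HTTP port 9200 on the master while it is in the problem 
state):

  # node-local request: returns normally even in the problem state
  curl -XGET 'http://localhost:9200/_cat/health?v'

  # request that has to consult the data nodes: hangs until the state clears
  curl -XGET 'http://localhost:9200/_search?size=0'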

On Friday, December 26, 2014 10:55:52 AM UTC-5, Gurvinder Singh wrote:
>
> Do you have the master and data nodes separate, or are they running in the 
> same ES node process? Another thing: after jstack, does the process become 
> responsive again, or does it remain out of the cluster? 
>
> On 12/26/2014 04:43 PM, Chris Moore wrote: 
> > I tried your configuration suggestions, but the behavior was no 
> > different. I have attached the jstack output from the troubled node 
> > (master); it didn't appear to indicate anything of note. 
> > 
> > On Thursday, December 25, 2014 8:33:20 AM UTC-5, Gurvinder Singh wrote: 
> > 
> >     We may have faced a similar problem with ES 1.3.6. The cause we found 
> >     was likely concurrent merges. These settings have helped us fix the 
> >     issue: 
> >     merge: 
> >       policy: 
> >         max_merge_at_once: 5 
> >         reclaim_deletes_weight: 4.0 
> >         segments_per_tier: 5 
> >     indices: 
> >       store: 
> >         throttle: 
> >           max_bytes_per_sec: 40mb  # as we have a few SATA disks for storage 
> >           type: merge 
> > 
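> >     If it helps, the store throttle part can also be applied at runtime 
> >     (just a sketch, assuming ES 1.x dynamic cluster settings; the host is 
> >     a placeholder): 
> > 
> >     curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{ 
> >       "persistent" : { 
> >         "indices.store.throttle.type" : "merge", 
> >         "indices.store.throttle.max_bytes_per_sec" : "40mb" 
> >       } 
> >     }' 
> > 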
> >     You can check your hung process by attaching jstack to it: 
> > 
> >     jstack -F <pid> 
> > 
> >     Also, once you detach jstack, the process becomes responsive again and 
> >     rejoins the cluster. That should not happen at all; even if the disk is 
> >     the limitation, ES should not stop responding. 
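> > 
> >     For reference, one way to grab the ES pid is via pgrep (assuming the 
> >     stock 1.x bootstrap class; adapt to your setup): 
> > 
> >     jstack -F $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) 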
> > 
> >     - Gurvinder 
> >     On 12/24/2014 08:00 PM, Mark Walkom wrote: 
> >     > Ok, a few things that don't make sense to me: 
> >     > 
> >     > 1. 10 indexes of only ~220Kb? Are you sure of this? 
> >     > 2. If so, why not just one index? 
> >     > 3. Is baseball_data.json the data for an entire index? If not, can 
> >     >    you clarify? 
> >     > 4. What Java version are you on? 
> >     > 5. What monitoring were you using? 
> >     > 6. Can you delete all your data, switch monitoring on, start 
> >     >    reindexing and then watch what happens? Marvel would be ideal 
> >     >    for this. 
> >     > 
> >     > What you are seeing is really, really weird. That is a high shard 
> >     > count, however given the dataset is small I wouldn't think it'd 
> >     > cause problems (but I could be wrong). 
> >     > 
> >     > On 25 December 2014 at 02:27, Chris Moore <[email protected]> wrote: 
> >     > 
> >     > Attached is the script we've been using to load the data, along with 
> >     > the dataset. This is the mapping and a sample document: 
> >     > 
> >     > { 
> >     >   "baseball_1" : { 
> >     >     "mappings" : { 
> >     >       "team" : { 
> >     >         "properties" : { 
> >     >           "L"      : { "type" : "integer", "store" : true }, 
> >     >           "W"      : { "type" : "integer", "store" : true }, 
> >     >           "name"   : { "type" : "string",  "store" : true }, 
> >     >           "teamID" : { "type" : "string",  "store" : true }, 
> >     >           "yearID" : { "type" : "string",  "store" : true } 
> >     >         } 
> >     >       } 
> >     >     } 
> >     >   } 
> >     > } 
> >     > 
> >     > {"yearID":"1871", "teamID":"PH1", "W":"21", "L":"7", "name":"Philadelphia Athletics"} 
> >     > 
> >     > On Wednesday, December 24, 2014 10:22:00 AM UTC-5, Chris Moore 
> >     > wrote: 
> >     > 
> >     > We tried many different test setups yesterday. The first setup we 
> >     > tried was: 
> >     > 
> >     > 1 master, 2 data nodes 
> >     > 38 indices 
> >     > 10 shards per index 
> >     > 1 replica per index 
> >     > 760 total shards (380 primary, 380 replica) 
> >     > Each index had 2,745 documents 
> >     > Each index was 218.9kb in size (according to the _cat/indices API) 
> >     > 
> >     > We realize that 10 shards per index with only 2 nodes is not a good 
> >     > idea, so we changed that and reran the tests. 
> >     > 
> >     > We changed shards per index to the default of 5 and put 100 indices 
> >     > on the 2 boxes and ran into the same issue. It was the same 
> >     > dataset, so all other size information is correct. 
> >     > 
> >     > After that, we turned off one of the data nodes, set replicas to 0 
> >     > and shards per index to 1. With the same dataset, I loaded ~440 
> >     > indices and ran into the timeout issues with the Master and Data 
> >     > nodes just idling. 
> >     > 
> >     > This is just a test dataset that we came up with to quickly reproduce 
> >     > our issues; it contains no confidential information. Once we 
> >     > figure out the issues affecting this test dataset, we'll try things 
> >     > with our real dataset. 
> >     > 
> >     > 
> >     > All of this works fine on ES 1.1.2, but not on 1.3.x (1.3.5 is our 
> >     > current test version). We have also tried our real setup on 1.4.1 
> >     > to no avail. 
> >     > 
> >     > 
> >     > On Tuesday, December 23, 2014 5:03:30 PM UTC-5, Mark Walkom wrote: 
> >     > 
> >     > Can you elaborate on your dataset and structure: how many indexes, 
> >     > how many shards, how big they are, etc.? 
> >     > 
> >     > On 24 December 2014 at 07:36, Chris Moore <[email protected]> 
> >     > wrote: 
> >     > 
> >     > Updating again: 
> >     > 
> >     > If we reduce the number of shards per node to below ~350, the 
> >     > system operates fine. Once we go above that (number_of_indices * 
> >     > number_of_shards_per_index * number_of_replicas / number_of_nodes), 
> >     > we start running into the described issues. 
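> >     > 
> >     > (Reading that as primaries plus replicas divided over the data nodes 
> >     > only, the test setups work out to roughly 38 * 10 * 2 / 2 = 380, 
> >     > 100 * 5 * 2 / 2 = 500, and 440 * 1 * 1 / 1 = 440 shards per node, 
> >     > all of which sit above the ~350 mark.) 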
> >     > 
> >     > On Friday, December 12, 2014 2:11:08 PM UTC-5, Chris Moore wrote: 
> >     > 
> >     > Just a quick update: we duplicated our test environment to see if 
> >     > this issue was fixed by upgrading to 1.4.1 instead. We received the 
> >     > same errors under 1.4.1. 
> >     > 
> >     > On Friday, December 5, 2014 4:52:05 PM UTC-5, Chris Moore wrote: 
> >     > 
> >     > As a followup, I closed all the indices on the cluster. I would 
> >     > then open 1 index and optimize it down to 1 segment. I made it 
> >     > through ~60% of the indices (and probably ~45% of the data) before 
> >     > the same errors showed up in the master log and the same behavior 
> >     > resumed. 
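> >     > 
> >     > (Per index, the calls were roughly the following; the index name and 
> >     > host are placeholders: 
> >     > 
> >     > curl -XPOST 'http://localhost:9200/index_name/_open' 
> >     > curl -XPOST 'http://localhost:9200/index_name/_optimize?max_num_segments=1' ) 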
> >     > 
> >     > On Friday, December 5, 2014 3:57:12 PM UTC-5, Chris Moore wrote: 
> >     > 
> >     > I replied once, but it seems to have disappeared, so if this gets 
> >     > double posted, I'm sorry. 
> >     > 
> >     > We disabled all monitoring when we started looking into the issues 
> >     > to ensure there was no external load on ES. Everything we are 
> >     > currently seeing is just whatever activity ES generates 
> >     > internally. 
> >     > 
> >     > My understanding regarding optimizing indices is that you shouldn't 
> >     > call it explicitly on indices that are regularly updated; rather, 
> >     > you should let the background merge process handle things. As the 
> >     > majority of our indices update regularly, we don't explicitly call 
> >     > optimize on them. I can try calling it on them all and see if it 
> >     > helps. 
> >     > 
> >     > As for disk speed, we are currently running ES on SSDs. We have it 
> >     > in our roadmap to change that to RAIDed SSDs, but it hasn't been a 
> >     > priority as we have been getting acceptable performance thus far. 
> >     > 
> >     > On Friday, December 5, 2014 2:59:11 PM UTC-5, Jörg Prante wrote: 
> >     > 
> >     > Do you have a monitoring tool running? 
> >     > 
> >     > I recommend switching it off, optimizing your indices, and then 
> >     > updating your monitoring tools. 
> >     > 
> >     > It seems you have so many segments, or such a slow disk, that they 
> >     > cannot be reported within 15s. 
> >     > 
> >     > Jörg 
> >     > 
> >     > On 05.12.2014 16:10, "Chris Moore" <[email protected]> wrote: 
> >     > 
> >     > This is running on Amazon EC2 in a VPC on dedicated instances. 
> >     > Physical network infrastructure is likely fine. Are there specific 
> >     > network issues you think we should look into? 
> >     > 
> >     > When we are in a problem state, we can communicate between the 
> >     > nodes just fine. I can run curl requests to ES (health checks, etc.) 
> >     > from the master node to the data nodes directly and they return as 
> >     > expected. So there doesn't seem to be a socket exhaustion issue 
> >     > (additionally, there are no kernel errors being reported). 
> >     > 
> >     > It feels like there is a queue/buffer filling up somewhere; once it 
> >     > has availability again, things start working. But 
> >     > /_cat/thread_pool?v doesn't show anything above 0 (although, when 
> >     > we are in the problem state, it doesn't return a response if run on 
> >     > master), _nodes/hot_threads doesn't show anything going on, etc. 
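> >     > 
> >     > (Concretely, the checks are along these lines, run from a data node 
> >     > while the master is unresponsive; the host is a placeholder: 
> >     > 
> >     > curl 'http://localhost:9200/_cat/thread_pool?v' 
> >     > curl 'http://localhost:9200/_nodes/hot_threads' ) 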
> >     > 
> >     > On Thursday, December 4, 2014 4:10:37 PM UTC-5, Support Monkey 
> >     > wrote: 
> >     > 
> >     > I would think the network is a prime suspect then, as there is no 
> >     > significant difference between 1.2.x and 1.3.x in relation to 
> >     > memory usage. And you'd certainly see OOMs in node logs if it 
> >     > was a memory issue. 
> >     > 
> >     > On Thursday, December 4, 2014 12:45:58 PM UTC-8, Chris Moore 
> >     > wrote: 
> >     > 
> >     > There is nothing (literally) in the log of either data node after 
> >     > the node joined events and nothing in the master log between index 
> >     > recovery and the first error message. 
> >     > 
> >     > There are 0 queries run before the errors start occurring (access 
> >     > to the nodes is blocked via a firewall, so the only communications 
> >     > are between the nodes). We have 50% of the RAM allocated to the 
> >     > heap on each node (4GB each). 
> >     > 
> >     > This cluster operated without issue under 1.1.2. Did something 
> >     > change between 1.1.2 and 1.3.5 that drastically increased idle heap 
> >     > requirements? 
> >     > 
> >     > 
> >     > On Thursday, December 4, 2014 3:29:23 PM UTC-5, Support Monkey 
> >     > wrote: 
> >     > 
> >     > Generally, ReceiveTimeoutTransportException is due to 
> >     > network disconnects or a node failing to respond due to heavy load. 
> >     > What does the log of pYi3z5PgRh6msJX_armz_A show you? Perhaps it 
> >     > has too little heap allocated. The rule of thumb is 1/2 of available 
> >     > memory, but <= 31GB. 
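> >     > 
> >     > (On an 8 GB box that works out to roughly a 4 GB heap, e.g. set via 
> >     > the ES_HEAP_SIZE environment variable, assuming the stock init 
> >     > scripts pick it up: ES_HEAP_SIZE=4g) 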
> >     > 
> >     > On Wednesday, December 3, 2014 12:52:58 PM UTC-8, Jeff Keller 
> >     > wrote: 
> >     > 
> >     > 
> >     > ES Version: 1.3.5 
> >     > 
> >     > OS: Ubuntu 14.04.1 LTS 
> >     > 
> >     > Machine: 2 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 8 GB RAM at 
> >     > AWS 
> >     > 
> >     > master (ip-10-0-1-18), 2 data nodes (ip-10-0-1-19, ip-10-0-1-20) 
> >     > 
> >     > After upgrading from ES 1.1.2... 
> >     > 
> >     > 
> >     > 1. Start up ES on master 
> >     > 2. All nodes join the cluster 
> >     > 3. [2014-12-03 20:30:54,789][INFO ][gateway ] [ip-10-0-1-18.ec2.internal] 
> >     >    recovered [157] indices into cluster_state 
> >     > 4. Checked health a few times: 
> >     > 
> >     > 
> >     > curl -XGET localhost:9200/_cat/health?v 
> >     > 
> >     > 5. 6 minutes after cluster recovery initiates (and 5:20 after the 
> >     > recovery finishes), the log on the master node (10.0.1.18) 
> >     > reports: 
> >     > 
> >     > 
> >     > [2014-12-03 20:36:57,532][DEBUG][action.admin.cluster.node.stats] 
> >     > [ip-10-0-1-18.ec2.internal] failed to execute on node 
> >     > [pYi3z5PgRh6msJX_armz_A] 
> >     > org.elasticsearch.transport.ReceiveTimeoutTransportException: 
> >     > [ip-10-0-1-20.ec2.internal][inet[/10.0.1.20:9300]][cluster/nodes/stats/n] 
> >     > request_id [17564] timed out after [15001ms] 
> >     >     at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356) 
> >     >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
> >     >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
> >     >     at java.lang.Thread.run(Thread.java:745) 
> >     > 
> >     > 
> >     > 6. Every 30 or 60 seconds, the above error is reported for one or 
> >     > more of the data nodes. 
> >     > 
> >     > 7. During this time, queries (search, index, etc.) don’t return. 
> >     > They hang until the error state temporarily resolves itself (a 
> >     > varying time around 15-20 minutes) at which point the expected 
> >     > result is returned. 
> >     > 