Moving data around at the OS level without making ES aware of it can cause exactly the difficulties you are seeing.
A few suggestions on how to resolve this and improve things in general:

1. Set your heap size to 31GB.
2. Use Oracle's Java, not OpenJDK.
3. Set bootstrap.mlockall to true; you don't want to swap, ever.

Given the large number of indexes you have on node1, and to get to a point where you can move some of them to a new node and stop the root problem, it's going to be worth closing some of the older indexes. So try these steps:

1. Stop node2.
2. Delete any data from the second node, to prevent things being auto-imported again.
3. Start node1, or restart it if it's running.
4. Close all your indexes older than a month - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-open-close.html. You can use wildcards in index names to make the update easier. This tells ES not to load the index metadata into memory, which will help with your OOM issue.
5. Start node2 and let it join the cluster.
6. Make sure the cluster is in a green state. If you're not already, use something like ElasticHQ, kopf or Marvel to monitor things.
7. Let the cluster rebalance the currently open indexes.
8. Once that is OK and things are stable, reopen your closed indexes a month at a time, and let them rebalance.

That should get you back up and running. Once you're there we can go back to your original post :)

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: [email protected]
web: www.campaignmonitor.com

On 6 May 2014 11:15, Nishchay Shah <[email protected]> wrote:

Thanks Nate, but this doesn't work. node2 is not the master.
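Step 4 of Mark's list (closing indexes older than a month) is easy to script with wildcards, since logstash names its daily indexes logstash-YYYY.MM.dd by default. A minimal sketch, echoing the curl calls as a dry run so you can review them first; the month list is just an example, and you'd remove the leading `echo` to actually close the indexes:

```shell
# Dry run: print one _close call per month of daily logstash indexes.
# Remove the leading `echo` to actually close them (ES 1.x open/close API).
for month in 2014.01 2014.02 2014.03; do
  echo curl -XPOST "localhost:9200/logstash-${month}.*/_close"
done
```

Closed indexes stay on disk but their shards are no longer allocated, so their metadata and Lucene structures drop out of the heap, which is what buys the memory back.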
So starting it first didn't make sense; anyway, I tried it and I couldn't execute anything on a non-master node (node2) unless the master was started.

I started node2 (non-master) and ran this:

curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'

After 30s I got this:

{"error":"MasterNotDiscoveredException[waited for [30s]]","status":503}

I started node1 and, as bloody expected, elasticsearch copied all the indexes :( ..
"auto importing dangled indices"

I cannot believe I am unable to get this fundamental elasticsearch feature working!

On Mon, May 5, 2014 at 4:25 PM, Nate Fox <[email protected]> wrote:

Get node2 running with rock. Then issue a disable_allocation and then bring up node1.

curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'

From there, adjust the replica settings on the indexes down to 0 so they don't copy. Once that's set, change disable_allocation to false.

On Mon, May 5, 2014 at 1:19 PM, Nish <[email protected]> wrote:

"..- Fire up both nodes, make sure they both have the same cluster name" <= This is exactly where, as I wrote in my second message, Elasticsearch is messing up. When I move the index to a new node, delete that index from the master, and then start the master node and the other data node, the master throws a message:

"auto importing dangled indices"

This means the master is now copying the "deleted" index, which exists only on the other node, back to itself!

Basically this is what happens:

1. Node1 (master): rock, paper, scissors
2. I move rock from Node1 to Node2 (I verify by starting ONLY node1, and I can see that I am missing the data that was originally in the "rock" index; as expected, all good)
3. So node1 now has paper, scissors
4. I start Node2 with ONLY the "rock" index (verified independently, it works)
5. Then I start node1 (master) and node2 (data)
6. Node1 says "hey, I don't have rock, but node2 has it, let me copy it to myself"

On Monday, May 5, 2014 3:44:17 PM UTC-4, Nate Fox wrote:

You might turn off the bootstrap.mlockall flag just for now - it'll make ES swap a ton, but your error message looks like an OS-level issue. Make sure you have lots of swap available and grab some coffee.

What I'd also try if turning off bootstrap.mlockall doesn't work:
- Tarball the entire data directory and save the tarball somewhere (unless you don't care about the data)
- Set 31GB for your ES heap. There are plenty of docs out there that say not to go over a 32GB heap, because it makes the JVM lose compressed object pointers.
- Copy the entire data dir to node2
- Go into the data dir on node1 and delete half of the indexes
- Go into the data dir on node2 and delete the *other* half of the indexes
- Fire up both nodes, make sure they both have the same cluster name

I have no idea if this'll work; I'm by no means an ES expert. :)

On Mon, May 5, 2014 at 12:32 PM, Nish <[email protected]> wrote:

Currently I have 279 indexes on a single node, and elasticsearch starts for a few minutes and then dies; I only have 60G of RAM, and as far as I know 60% is the max that one should allocate to elasticsearch. I tried allocating 38G and it lasted a few more minutes before it died.

*(I think there's some state files that tell ES/Lucene which indexes are on disk)* => Where is this? How do I fix it so that it doesn't move all indexes to all nodes? I want to split the ~280 indexes into two nodes of 140 each. So far I am not able to achieve this, as the master keeps moving indexes to itself!
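The "auto importing dangled indices" message comes from the local gateway module: when a node joins carrying index data the master doesn't know about, the master imports it by default. In ES 1.x this is governed by the gateway.local.auto_import_dangled setting (values yes, closed, no - verify the exact name against your version's docs); `closed` imports dangling indexes in a closed state so nothing is allocated or copied, while `no` reportedly deletes dangled data after a timeout, so avoid it here. A sketch of the two knobs relevant to the plan above, again echoed as a dry run:

```shell
# 1) Example elasticsearch.yml line: import dangling indexes as closed
#    instead of opening and copying them (ES 1.x local gateway setting).
echo 'gateway.local.auto_import_dangled: closed'

# 2) Drop replicas to 0 on every index so shards aren't copied between the
#    two nodes (dry run: remove the leading `echo` to send the request).
echo curl -XPUT "localhost:9200/_settings" -d '{"index":{"number_of_replicas":0}}'
```

With one replica (the default) a two-node cluster will try to hold a full copy of every index on each node, which defeats the point of splitting the indexes across nodes.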
On Monday, May 5, 2014 3:25:05 PM UTC-4, Nate Fox wrote:

How many indexes do you have? It almost looks like the system itself can't allocate the RAM needed. You might try jacking up the nofile limit to something like 999999 as well. I'd definitely go with a 31g heap size.

As for moving indexes, you might be able to copy the entire data store, then remove some (I think there's some state files that tell ES/Lucene which indexes are on disk), so it might recover if it's missing some and sees the others on another node?

As for your other questions, it depends on usage as to how many nodes - especially search activity while indexing. We have 230 indexes (1740 shards) on 8 data nodes (5.7Tb / 6.1B docs), so it can definitely handle a lot more than what you're throwing at it. We don't search often, nor do we load a ton of data at once.

On Sunday, May 4, 2014 7:13:09 AM UTC-7, Nish wrote:

elasticsearch is set up as a single-node instance on a machine with 60G of RAM and 32 * 2.6GHz cores. I am actively indexing historic data with logstash. It worked well with ~300 million documents (search and indexing were doing ok), but all of a sudden es fails to start and stay up. It starts for a few minutes and I can query, but then it fails with an out of memory error. I monitor the memory, and at least 12G of memory is available when it fails.
I had set the es_heap_size to 31G and then reduced it to 28, 24 and 18, with the same error every time (see dump below).

My security limits are as under (this is a test/POC server, thus the "root" user):

root soft nofile 65536
root hard nofile 65536
root - memlock unlimited

ES settings:

config]# grep -v "^#" elasticsearch.yml | grep -v "^$"
bootstrap.mlockall: true

echo $ES_HEAP_SIZE
18432m

---DUMP----

# bin/elasticsearch
[2014-05-04 13:30:12,653][INFO ][node] [Sabretooth] version[1.1.1], pid[19309], build[f1585f0/2014-04-16T14:27:12Z]
[2014-05-04 13:30:12,653][INFO ][node] [Sabretooth] initializing ...
[2014-05-04 13:30:12,669][INFO ][plugins] [Sabretooth] loaded [], sites []
[2014-05-04 13:30:15,390][INFO ][node] [Sabretooth] initialized
[2014-05-04 13:30:15,390][INFO ][node] [Sabretooth] starting ...
[2014-05-04 13:30:15,531][INFO ][transport] [Sabretooth] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.109.136.59:9300]}
[2014-05-04 13:30:18,553][INFO ][cluster.service] [Sabretooth] new_master [Sabretooth][eocFkTYMQnSTUar94A2vHw][ip-10-109-136-59][inet[/10.109.136.59:9300]], reason: zen-disco-join (elected_as_master)
[2014-05-04 13:30:18,579][INFO ][discovery] [Sabretooth] elasticsearch/eocFkTYMQnSTUar94A2vHw
[2014-05-04 13:30:18,790][INFO ][http] [Sabretooth] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.109.136.59:9200]}
[2014-05-04 13:30:19,976][INFO ][gateway] [Sabretooth] recovered [278] indices into cluster_state
[2014-05-04 13:30:19,984][INFO ][node] [Sabretooth] started
OpenJDK 64-Bit Server VM warning: Attempt to protect stack guard pages failed.
OpenJDK 64-Bit Server VM warning: Attempt to deallocate stack guard pages failed.
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007f7c70000, 196608, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 196608 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/jvm-19309/hs_error.log

----

User untergeek on #logstash told me that I have reached the max number of indices for a single node. Here are my questions:

1. Can I move half of my indexes to a new node? If yes, how do I do that without compromising the indexes?
2. Logstash makes 1 index per day, and I want to have 2 years of data indexable; can I combine multiple indexes into one, like one index per month? That would mean I never have more than 24 indexes.
3. How many nodes are ideal for 24 months of data at ~1.5G a day?
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAEM624Zg%2B%3D9%3Dy5b%2BP81_%2BVduTRAV__cg2FYNoYxFtcjLYMm-QA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
