Re: jdk fails with out of memory error / es critical index counts

Nishchay Shah Mon, 05 May 2014 20:07:24 -0700

FYI settings:
*Master*:
[root@ip-10-169-36-251 logstash-2013.12.05]# grep -vE "^$|^#" /xx/elasticsearch-1.1.1/config/elasticsearch.yml
cluster.name: elasticsearchtest
node.name: "node1"
node.master: true
node.data: true
index.number_of_replicas: 0
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.169.36.251", "10.186.152.19"]
*Non Master*:
[root@ip-10-186-152-19 logstash-2013.12.05]# grep -vE "^$|^#" /elasticsearch/es/elasticsearch-1.1.1/config/elasticsearch.yml
cluster.name: elasticsearchtest
node.name: "node2"
node.master: false
node.data: true
index.number_of_replicas: 0
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.169.36.251","10.186.152.19"]
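
With replicas at 0 and both nodes holding data, each shard should end up on exactly one node. du only shows what is on disk, not what the cluster considers allocated, so here is a rough sketch for asking the cluster itself through the _cat/shards API. It assumes ES answers on localhost:9200; the helper names are made up for illustration.

```shell
# Assumed endpoint; point it at either node's HTTP address.
ES_URL="${ES_URL:-http://localhost:9200}"

# Read `_cat/shards` rows on stdin (columns: index shard prirep state node)
# and print one "node shard_count" line per node, counting STARTED shards only.
count_shards_per_node() {
  awk '$4 == "STARTED" { n[$5]++ } END { for (k in n) print k, n[k] }' | sort
}

# Fetch shard placement for one index (hypothetical helper name).
shards_for_index() {
  curl -s "$ES_URL/_cat/shards/$1?h=index,shard,prirep,state,node"
}
```

Usage would be something like: shards_for_index logstash-2013.12.22 | count_shards_per_node
If the cluster is balancing, the five shards of that index should be spread over node1 and node2 rather than all reported on one node.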



On Mon, May 5, 2014 at 11:01 PM, Nishchay Shah <[email protected]>wrote:

> Probably not.
>
> I deleted all data from slave and restarted both servers and I see this:
>
> *Master: *
> [root@ip-10-169-36-251 logstash-2013.12.22]#  du -h --max-depth=1
> 16M    ./0
> 16M    ./1
> 8.0K    ./_state
> 15M    ./4
> 15M    ./3
> 15M    ./2
> 75M    .
>
> *Data: *
>
> [root@ip-10-186-152-19 logstash-2013.12.22]# du -h --max-depth=1
> 16M    ./0
> 16M    ./1
> 15M    ./4
> 15M    ./3
> 15M    ./2
> 75M    .
>
>
> On Mon, May 5, 2014 at 10:53 PM, Mark Walkom <[email protected]>wrote:
>
>> Don't copy indexes on the OS level!
>>
>> Is your new cluster balancing the shards?
>>
>> Regards,
>> Mark Walkom
>>
>> Infrastructure Engineer
>> Campaign Monitor
>> email: [email protected]
>> web: www.campaignmonitor.com
>>
>>
>> On 6 May 2014 12:46, Nishchay Shah <[email protected]> wrote:
>>
>>> Hey Mark,
>>> Thanks for the response. I have currently created two new medium test
>>> instances (1 master 1 data only) because I didn't want to mess with the
>>> main dataset. In my test setup, I have about 600MB of data ; 7 indexes
>>>
>>> After looking around a lot I saw that the directory organization is
>>> /elasticsearch/es/elasticsearch-1.1.1/data/elasticsearchtest/nodes/*<node number>*/
>>> and the master node has only 1 directory
>>>
>>> (master)
>>> # ls /elasticsearch/es/elasticsearch-1.1.1/data/elasticsearchtest/nodes
>>> 0
>>>
>>> So on node2 I created a "1" directory and moved 1 index from master to
>>> data, so master now has six indexes in 0 and data has one in 1.
>>> When I started elasticsearch after that, I got to a point where the
>>> master is now NOT copying the data back to itself, but node2 is
>>> copying master's data and making a "0" directory. Also, I am unable to
>>> query node2's data!
>>>
>>>
>>>
>>>
>>> On Mon, May 5, 2014 at 9:34 PM, Mark Walkom <[email protected]>wrote:
>>>
>>>> Moving data on the OS level without making ES aware can cause
>>>> difficulties as you are seeing.
>>>>
>>>>  A few suggestions on how to resolve this and improve things in
>>>> general;
>>>>
>>>>    1. Set your heap size to 31GB.
>>>>    2. Use Oracle's java, not OpenJDK.
>>>>    3. Set bootstrap.mlockall to true, you don't want to swap, ever.
>>>>
>>>> Given the large number of indexes you have on node1, and to get to a
>>>> point where you can move some of these to a new node and stop the root
>>>> problem, it's going to be worth closing some of the older indexes. So try
>>>> these steps;
>>>>
>>>>    1. Stop node2.
>>>>    2. Delete any data from the second node, to prevent things being
>>>>    auto imported again.
>>>>    3. Start node1, or restart it if it's running.
>>>>    4. Close all your indexes older than a month -
>>>>    http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-open-close.html.
>>>>    You can use wildcards in index names to make the update easier. What
>>>>    this will do is tell ES to not load the index metadata into memory,
>>>>    which will help with your OOM issue.
>>>>    5. Start node2 and let it join the cluster.
>>>>    6. Make sure the cluster is in a green state. If you're not
>>>>    already, use something like ElasticHQ, kopf or Marvel to monitor things.
>>>>    7. Let the cluster rebalance the current open indexes.
>>>>    8. Once that is ok and things are stable, reopen your closed
>>>>    indexes a month at a time, and let them rebalance.
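
Steps 4, 6 and 8 above could look roughly like this with curl. This is only a sketch: it assumes ES on localhost:9200 and logstash-style index names such as logstash-2013.12.05, and the function names are invented here. The status helper relies on the compact JSON that _cluster/health returns without ?pretty.

```shell
# Assumed endpoint; adjust as needed.
ES_URL="${ES_URL:-http://localhost:9200}"

# Close or reopen every index matching a wildcard pattern,
# e.g. 'logstash-2013.*' for a year or 'logstash-2013.12.*' for one month.
close_indices() { curl -s -XPOST "$ES_URL/$1/_close"; }
open_indices()  { curl -s -XPOST "$ES_URL/$1/_open"; }

# Pull the "status" value (green/yellow/red) out of compact health JSON on stdin.
extract_status() {
  sed -n 's/.*"status":"\([a-z]*\)".*/\1/p'
}

# Print the current cluster status.
cluster_status() {
  curl -s "$ES_URL/_cluster/health" | extract_status
}
```

For example: close_indices 'logstash-2013.*', wait until cluster_status prints green, then open_indices 'logstash-2013.12.*' and repeat one month at a time.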
>>>>
>>>> That should get you back up and running. Once you're there we can go
>>>> back to your original post :)
>>>>
>>>> Regards,
>>>> Mark Walkom
>>>>
>>>> Infrastructure Engineer
>>>> Campaign Monitor
>>>> email: [email protected]
>>>> web: www.campaignmonitor.com
>>>>
>>>>
>>>> On 6 May 2014 11:15, Nishchay Shah <[email protected]> wrote:
>>>>
>>>>>
>>>>> Thanks Nate, but this doesn't work. node2 is not the master. So
>>>>> starting it first didn't make sense, anyway I tried it and I couldn't
>>>>> execute anything on a nonmaster node (node2) unless master was started
>>>>>
>>>>> I started node2 (non master) and ran this:
>>>>> curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
>>>>> after 30s I got this:
>>>>> {"error":"MasterNotDiscoveredException[waited for [30s]]","status":503}
>>>>>
>>>>> I started node1 and as bloody expected elasticsearch copied all the
>>>>> indexes :( ..
>>>>> *"auto importing dangled indices"*
>>>>>
>>>>> I cannot believe I am unable to get this fundamental elasticsearch
>>>>> feature working !
>>>>>
>>>>>
>>>>> On Mon, May 5, 2014 at 4:25 PM, Nate Fox <[email protected]> wrote:
>>>>>
>>>>>> Get node2 running with rock. Then issue a disable_allocation and then
>>>>>> bring up node1.
>>>>>> curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
>>>>>>
>>>>>> From there, adjust the replica settings on the indexes down to 0 so
>>>>>> they dont copy. Once thats set, change disable_allocation to false.
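
A possible shape for that sequence, reusing the setting name quoted above. It assumes ES on localhost:9200 and a master that is up (cluster settings cannot be changed without one); the function names are hypothetical.

```shell
# Assumed endpoint; adjust as needed.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the JSON body for the transient allocation toggle; $1 is true or false.
allocation_body() {
  printf '{"transient":{"cluster.routing.allocation.disable_allocation":%s}}' "$1"
}

# Turn shard allocation off (true) or back on (false).
set_allocation_disabled() {
  curl -s -XPUT "$ES_URL/_cluster/settings" -d "$(allocation_body "$1")"
}

# Set the replica count on all indexes matching $1 (use _all for everything).
set_replicas() {
  curl -s -XPUT "$ES_URL/$1/_settings" -d "{\"index\":{\"number_of_replicas\":$2}}"
}
```

The order would be: set_allocation_disabled true, then set_replicas _all 0, then set_allocation_disabled false.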
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, May 5, 2014 at 1:19 PM, Nish <[email protected]> wrote:
>>>>>>
>>>>>>> *.."- Fire up both nodes, make sure they both have the same cluster
>>>>>>> name"* <= This is exactly where, as I wrote in my second message,
>>>>>>> Elasticsearch is messing up. When I move the index to a new node and
>>>>>>> delete that index from master and then start master node and other data
>>>>>>> node, it (master) throws a message:
>>>>>>> "auto importing dangled indices"
>>>>>>> This means master is now copying the "deleted" index that exists
>>>>>>> only on other node to itself !
>>>>>>>
>>>>>>>
>>>>>>> Basically this is what happens:
>>>>>>>
>>>>>>>    1. Node1 Master: rock,paper,scissors
>>>>>>>    2. I move rock from Node 1 to Node 2 (I verify by starting ONLY
>>>>>>>    node1 and I can see that I am missing data that was originally in
>>>>>>>    "rock" index, as expected, all good)
>>>>>>>    3. So node1 now has paper,scissors
>>>>>>>    4. I start Node2 with ONLY "rock" index (verify independently,
>>>>>>>    it works)
>>>>>>>    5. Then I start node 1 (master) and node 2 (data)
>>>>>>>    6. Node1 says "hey I don't have rock, but node2 has it, let
>>>>>>>    me copy it to myself"
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Monday, May 5, 2014 3:44:17 PM UTC-4, Nate Fox wrote:
>>>>>>>
>>>>>>>> You might turn off the bootstrap.mlockall flag just for now - it'll
>>>>>>>> make ES swap a ton, but your error message looks like an OS level
>>>>>>>> issue. Make sure you have lots of swap available and grab some coffee.
>>>>>>>>
>>>>>>>> What I'd also try if turning off bootstrap.mlockall doesnt work:
>>>>>>>> - Tarball the entire data directory and save the tarball somewhere
>>>>>>>> (unless you dont care about the data)
>>>>>>>> - Set 31Gb for your ES HEAP. There's plenty of docs out there that
>>>>>>>> say not to go over 32Gb of heap, since past that the JVM can no longer
>>>>>>>> use compressed object pointers.
>>>>>>>> - Copy the entire data dir to node2
>>>>>>>> - Go into the data dir on node1 and delete half of the indexes
>>>>>>>> - Go into the data dir on node2 and delete the *other* half of the
>>>>>>>> indexes
>>>>>>>> - Fire up both nodes, make sure they both have the same cluster name
>>>>>>>>
>>>>>>>> I have no idea if this'll work, I'm by no means an ES expert. :)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, May 5, 2014 at 12:32 PM, Nish <[email protected]> wrote:
>>>>>>>>
>>>>>>>>>  Currently I have 279 indexes on a single node and elasticsearch
>>>>>>>>> starts for a few minutes and dies. I only have 60G RAM and as far as
>>>>>>>>> I know 60% is the max that one should allocate to elasticsearch. I
>>>>>>>>> tried allocating 38G and it lasted a few more minutes and then died.
>>>>>>>>>
>>>>>>>>> *(I think there's some state files that tell ES/Lucene which
>>>>>>>>> indexes are on disk)* => Where is this? How do I fix it so that
>>>>>>>>> it doesn't move all indexes to all nodes? I want to split the ~280
>>>>>>>>> indexes into two nodes of 140 each. So far I am not able to achieve
>>>>>>>>> this as the master keeps moving indexes to itself!
>>>>>>>>>
>>>>>>>>> On Monday, May 5, 2014 3:25:05 PM UTC-4, Nate Fox wrote:
>>>>>>>>>>
>>>>>>>>>> How many indexes do you have? It almost looks like the system
>>>>>>>>>> itself cant allocate the ram needed?
>>>>>>>>>> You might try jacking up the nofile to something like 999999 as
>>>>>>>>>> well? I'd definitely go with 31g heapsize.
>>>>>>>>>>
>>>>>>>>>> As for moving indexes, you might be able to copy the entire data
>>>>>>>>>> store, then remove some (I think there's some state files that tell
>>>>>>>>>> ES/Lucene which indexes are on disk), so it might recover if its
>>>>>>>>>> missing some and sees the others on another node?
>>>>>>>>>>
>>>>>>>>>> As for your other questions, it depends on usage as to how many
>>>>>>>>>> nodes - especially search activity while indexing. We have 230
>>>>>>>>>> indexes (1740 shards) on 8 data nodes (5.7Tb / 6.1B docs). So it can
>>>>>>>>>> definitely handle a lot more than what you're throwing at it. We dont
>>>>>>>>>> search often nor do we load a ton of data at once.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sunday, May 4, 2014 7:13:09 AM UTC-7, Nish wrote:
>>>>>>>>>>>
>>>>>>>>>>> elasticsearch is set up as a single node instance on a 60G RAM,
>>>>>>>>>>> 32*2.6GHz machine. I am actively indexing historic data with
>>>>>>>>>>> logstash. It worked well with ~300 million documents (search and
>>>>>>>>>>> indexing were doing ok), but all of a sudden es fails to start and
>>>>>>>>>>> keep itself up. It starts for a few minutes and I can query, but then
>>>>>>>>>>> fails with an out of memory error. I monitor the memory and at least
>>>>>>>>>>> 12G of memory is available when it fails. I had set the es_heap_size
>>>>>>>>>>> to 31G and then reduced it to 28, 24 and 18, and got the same error
>>>>>>>>>>> every time (see dump below)
>>>>>>>>>>>
>>>>>>>>>>> *My security limits are as under  (this is a test/POC server
>>>>>>>>>>> thus "root" user) *
>>>>>>>>>>>
>>>>>>>>>>> root   soft    nofile          65536
>>>>>>>>>>> root   hard    nofile          65536
>>>>>>>>>>> root   -       memlock         unlimited
>>>>>>>>>>>
>>>>>>>>>>> *ES settings *
>>>>>>>>>>> config]# grep -v "^#" elasticsearch.yml | grep -v "^$"
>>>>>>>>>>>  bootstrap.mlockall: true
>>>>>>>>>>>
>>>>>>>>>>> *echo $ES_HEAP_SIZE*
>>>>>>>>>>> 18432m
>>>>>>>>>>>
>>>>>>>>>>> ---DUMP----
>>>>>>>>>>>
>>>>>>>>>>> # bin/elasticsearch
>>>>>>>>>>> [2014-05-04 13:30:12,653][INFO ][node                     ] [Sabretooth] version[1.1.1], pid[19309], build[f1585f0/2014-04-16T14:27:12Z]
>>>>>>>>>>> [2014-05-04 13:30:12,653][INFO ][node                     ] [Sabretooth] initializing ...
>>>>>>>>>>> [2014-05-04 13:30:12,669][INFO ][plugins                  ] [Sabretooth] loaded [], sites []
>>>>>>>>>>> [2014-05-04 13:30:15,390][INFO ][node                     ] [Sabretooth] initialized
>>>>>>>>>>> [2014-05-04 13:30:15,390][INFO ][node                     ] [Sabretooth] starting ...
>>>>>>>>>>> [2014-05-04 13:30:15,531][INFO ][transport                ] [Sabretooth] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.109.136.59:9300]}
>>>>>>>>>>> [2014-05-04 13:30:18,553][INFO ][cluster.service          ] [Sabretooth] new_master [Sabretooth][eocFkTYMQnSTUar94A2vHw][ip-10-109-136-59][inet[/10.109.136.59:9300]], reason: zen-disco-join (elected_as_master)
>>>>>>>>>>> [2014-05-04 13:30:18,579][INFO ][discovery                ] [Sabretooth] elasticsearch/eocFkTYMQnSTUar94A2vHw
>>>>>>>>>>> [2014-05-04 13:30:18,790][INFO ][http                     ] [Sabretooth] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.109.136.59:9200]}
>>>>>>>>>>> [2014-05-04 13:30:19,976][INFO ][gateway                  ] [Sabretooth] recovered [278] indices into cluster_state
>>>>>>>>>>> [2014-05-04 13:30:19,984][INFO ][node                     ] [Sabretooth] started
>>>>>>>>>>> OpenJDK 64-Bit Server VM warning: Attempt to protect stack guard pages failed.
>>>>>>>>>>> OpenJDK 64-Bit Server VM warning: Attempt to deallocate stack guard pages failed.
>>>>>>>>>>> OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007f7c70000, 196608, 0) failed; error='Cannot allocate memory' (errno=12)
>>>>>>>>>>> #
>>>>>>>>>>> # There is insufficient memory for the Java Runtime Environment to continue.
>>>>>>>>>>> # Native memory allocation (malloc) failed to allocate 196608 bytes for committing reserved memory.
>>>>>>>>>>> # An error report file with more information is saved as:
>>>>>>>>>>> # /tmp/jvm-19309/hs_error.log
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ----
>>>>>>>>>>> *user untergeek on #logstash told me that I have reached a max
>>>>>>>>>>> number of indices on a single node. Here are my questions: *
>>>>>>>>>>>
>>>>>>>>>>>    1. Can I move half of my indexes to a new node? If yes, how
>>>>>>>>>>>    to do that without compromising indexes?
>>>>>>>>>>>    2. Logstash makes 1 index per day and I want to have 2 years
>>>>>>>>>>>    of data indexable. Can I combine multiple indexes into one, like
>>>>>>>>>>>    one index per month? This will mean I will not have more than 24
>>>>>>>>>>>    indexes.
>>>>>>>>>>>    3. How many nodes are ideal for 24 months of data at ~1.5G a day?
>>>>>>>>>>>  --
>>>>>>>>> You received this message because you are subscribed to a topic in
>>>>>>>>> the Google Groups "elasticsearch" group.
>>>>>>>>> To unsubscribe from this topic, visit https://groups.google.com/d/
>>>>>>>>> topic/elasticsearch/cEimyMnhSv0/unsubscribe.
>>>>>>>>>  To unsubscribe from this group and all its topics, send an email
>>>>>>>>> to [email protected].
>>>>>>>>>
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/elasticsearch/564e2951-
>>>>>>>>> ed54-4f34-97a9-4de88f187a7a%40googlegroups.com<https://groups.google.com/d/msgid/elasticsearch/564e2951-ed54-4f34-97a9-4de88f187a7a%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>
