No further input on this? The Graylog master node now also seems to drop out
regularly with the "Did not find meta info of this node. Re-registering."
message, and it is under no load, as our load balancer doesn't direct any
input messages to it.

Cheers, Pete
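One cheap way to see whether the registration really is vanishing is to watch
it from the MongoDB side. A minimal sketch, assuming the node documents live
in the "nodes" collection as they did in Graylog 1.x; host and database names
are taken from the config quoted further down, and credentials are needed
since that config enables mongodb_useauth:

# Each graylog-server node keeps one document in "nodes" with a
# last-seen timestamp. If the master's document disappears between
# NodePingThread runs, the "Did not find meta info" message follows.
# Add -u/-p for the <Censored> mongodb_user/mongodb_password.
mongo bne3-0001ladb.server-web.com:27017/graylog2 \
    --eval 'db.nodes.find().forEach(printjson)'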
On Thursday, 7 May 2015 07:44:41 UTC+10, Pete GS wrote:
>
> I've come back to the office this morning and discovered we had an
> ElasticSearch issue last night which has resulted in lots of unprocessed
> messages in the journal.
>
> All the Graylog nodes are busy processing these and seem to be slowly
> crunching through them.
>
> Load average (using htop) varies across the four nodes, but I'm seeing a
> minimum of 13.59 / 11.80 and a maximum of 24.81 / 24.64.
>
> Interestingly enough, the process buffer is only full on one of the nodes;
> the other three appear to be 10% full or less.
>
> The output buffers are all empty.
>
> The issue with ElasticSearch was running out of disk space, which I've
> resolved for the moment, but my business case for new hardware should
> solve that permanently.
>
> What other info can I give you guys to help me look in the right direction?
>
> Cheers, Pete
>
> On Wednesday, 6 May 2015 07:33:31 UTC+10, Pete GS wrote:
>>
>> Thanks for the replies, guys. I'm away from the office today but will
>> check these things tomorrow.
>>
>> Mathieu, I will check the load average, but from memory the 5-minute
>> average was around 12 or 18. I will confirm this tomorrow though.
>>
>> As for the "co-stop" metric, I haven't used esxtop on these hosts, but I
>> have looked at the CPU Ready metric and it seems to be OK (sub 5%
>> sustained). One of the physical hosts has exactly the same number of
>> CPUs allocated as the VMs running on it, but the other two physical
>> hosts have no over-subscription of CPUs at all. There is no memory
>> over-subscription on any host either.
>>
>> For the moment I have simply increased the CPUs on the existing nodes as
>> well as adding the two new ones. I am putting together a business case
>> for new hardware for the ElasticSearch cluster, and if this goes ahead I
>> will move to a model of more Graylog nodes with fewer CPUs and less
>> memory per node, as I think that will scale better.
>>
>> Arie, I will increase the output buffer processors tomorrow to see what
>> happens, but I do know that the process buffer gets quite full at times
>> while the output buffer is usually almost empty.
>>
>> On Wed, May 6, 2015 at 3:05 AM, Mathieu Grzybek <[email protected]> wrote:
>>>
>>> Also check the "co-stop" metric on VMware. I am sure you have too many
>>> vCPUs.
>>>
>>> On 5 May 2015 at 16:21, Arie <[email protected]> wrote:
>>>
>>> What happens when you raise "outputbuffer_processors = 5" to
>>> "outputbuffer_processors = 10"?
>>>
>>> On Tuesday, 5 May 2015 02:23:37 UTC+2, Pete GS wrote:
>>>>
>>>> Yesterday I did a yum update on all Graylog and MongoDB nodes, and
>>>> since doing that and rebooting them all (there was a kernel update)
>>>> it seems that there are no longer issues connecting to the Mongo
>>>> database.
>>>>
>>>> However, I'm still seeing excessively high CPU usage on the Graylog
>>>> nodes, where all vCPUs are regularly exceeding 95%.
>>>>
>>>> What can contribute to this? I'm a little stumped at present.
>>>>
>>>> I would say our average messages/second is around 5,000 to 6,000,
>>>> with peaks up to about 12,000.
>>>>
>>>> Cheers, Pete
>>>>
>>>> On Friday, 1 May 2015 08:20:35 UTC+10, Pete GS wrote:
>>>>>
>>>>> Does anyone have any thoughts on this?
>>>>>
>>>>> Even if someone could identify some scenarios that would cause high
>>>>> CPU on Graylog servers, and in what circumstances Graylog would have
>>>>> trouble contacting the MongoDB servers.
>>>>>
>>>>> Cheers, Pete
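Arie's suggestion above is a one-line change. A minimal sketch of the edit,
assuming the config file sits next to the node_id_file path shown in the
config further down; a graylog-server restart is needed for it to take
effect:

# /etc/graylog2/server/server.conf -- path is an assumption based on
# the node_id_file setting; adjust to wherever your config lives.
# Was: outputbuffer_processors = 5
outputbuffer_processors = 10

As for co-stop: it is only visible from the hypervisor side, not from inside
the guest, e.g. in esxtop's CPU view as the %CSTP column (with %RDY alongside
it for CPU ready).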
>>>>>
>>>>> On Wednesday, 29 April 2015 10:34:28 UTC+10, Pete GS wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We acquired a company a while ago, and last week we added all of
>>>>>> their logs to our Graylog environment; they all come in from their
>>>>>> Syslog server via UDP.
>>>>>>
>>>>>> After this, I noticed that the Graylog servers were maxing out their
>>>>>> CPUs, so to alleviate this I increased CPU resources on the existing
>>>>>> servers and added two new servers.
>>>>>>
>>>>>> I'm still seeing generally high CPU usage with peaks of 100% on all
>>>>>> four of the Graylog servers, but I now also have issues where they
>>>>>> seem to have trouble connecting to MongoDB.
>>>>>>
>>>>>> I see lots of "[NodePingThread] Did not find meta info of this node.
>>>>>> Re-registering." streaming through the log files, but it only seems
>>>>>> to happen when I have more than two Graylog servers running.
>>>>>>
>>>>>> I have verified NTP is installed and configured, and all servers,
>>>>>> including the MongoDB and ElasticSearch servers, are syncing with
>>>>>> the same NTP servers.
>>>>>>
>>>>>> We're doing less than 10,000 messages per second, so with the
>>>>>> resources I've allocated I would have expected no issues whatsoever.
>>>>>>
>>>>>> I have seen this link:
>>>>>> https://groups.google.com/forum/?hl=en#!topic/graylog2/bW2glCdBIUI
>>>>>> but I don't believe it is our issue.
>>>>>>
>>>>>> If it truly is being caused by doing lots of reverse DNS lookups, I
>>>>>> would expect tcpdump to show me that traffic to our DNS servers, but
>>>>>> I see almost no DNS lookups at all (see the capture sketch below).
>>>>>>
>>>>>> We have 6 inputs in total, but only one receives the bulk of the
>>>>>> Syslog UDP messages. Most of the other inputs are GELF UDP inputs.
>>>>>>
>>>>>> We also have 11 streams; however, pausing these streams seems to
>>>>>> have little to no impact on the CPU usage.
>>>>>>
>>>>>> All the Graylog servers are virtualised on top of vSphere 5.5
>>>>>> Update 2, with plenty of physical hardware available to service the
>>>>>> workload (little to no contention).
>>>>>>
>>>>>> The original two have 20 vCPUs and 32 GB RAM; the additional two
>>>>>> have 16 vCPUs and 32 GB RAM.
>>>>>>
>>>>>> Java heap on all is set to 16 GB.
>>>>>>
>>>>>> This is all running on CentOS 6.
>>>>>>
>>>>>> Any input would be greatly appreciated, as I'm a bit stumped on how
>>>>>> to get this resolved at present.
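The reverse-DNS theory above is cheap to confirm or rule out. A sketch,
assuming the capture interface is eth0 (run as root; -l keeps tcpdump
line-buffered so the pipe works):

# Count DNS queries leaving this node over 60 seconds. A flood of
# PTR? queries would point at reverse lookups; a near-zero count rules
# them out, matching what Pete observed.
timeout 60 tcpdump -l -n -i eth0 'udp port 53' 2>/dev/null | grep -c 'PTR?'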
>>>>>>
>>>>>> Here is the config file I'm using (censored where appropriate):
>>>>>>
>>>>>> is_master = false
>>>>>> node_id_file = /etc/graylog2/server/node-id
>>>>>> password_secret = <Censored>
>>>>>> root_username = <Censored>
>>>>>> root_password_sha2 = <Censored>
>>>>>> plugin_dir = /usr/share/graylog2-server/plugin
>>>>>> rest_listen_uri = http://172.22.20.66:12900/
>>>>>>
>>>>>> elasticsearch_max_docs_per_index = 20000000
>>>>>> elasticsearch_max_number_of_indices = 999
>>>>>> retention_strategy = close
>>>>>> elasticsearch_shards = 4
>>>>>> elasticsearch_replicas = 1
>>>>>> elasticsearch_index_prefix = graylog2
>>>>>> allow_leading_wildcard_searches = true
>>>>>> allow_highlighting = true
>>>>>> elasticsearch_cluster_name = graylog2
>>>>>> elasticsearch_node_name = bne3-0002las
>>>>>> elasticsearch_node_master = false
>>>>>> elasticsearch_node_data = false
>>>>>> elasticsearch_discovery_zen_ping_multicast_enabled = false
>>>>>> elasticsearch_discovery_zen_ping_unicast_hosts = bne3-0001lai.server-web.com:9300,bne3-0002lai.server-web.com:9300,bne3-0003lai.server-web.com:9300,bne3-0004lai.server-web.com:9300,bne3-0005lai.server-web.com:9300,bne3-0006lai.server-web.com:9300,bne3-0007lai.server-web.com:9300,bne3-0008lai.server-web.com:9300,bne3-0009lai.server-web.com:9300
>>>>>> elasticsearch_cluster_discovery_timeout = 5000
>>>>>> elasticsearch_discovery_initial_state_timeout = 3s
>>>>>> elasticsearch_analyzer = standard
>>>>>>
>>>>>> output_batch_size = 5000
>>>>>> output_flush_interval = 1
>>>>>> processbuffer_processors = 20
>>>>>> outputbuffer_processors = 5
>>>>>> #outputbuffer_processor_keep_alive_time = 5000
>>>>>> #outputbuffer_processor_threads_core_pool_size = 3
>>>>>> #outputbuffer_processor_threads_max_pool_size = 30
>>>>>> #udp_recvbuffer_sizes = 1048576
>>>>>> processor_wait_strategy = blocking
>>>>>> ring_size = 65536
>>>>>>
>>>>>> inputbuffer_ring_size = 65536
>>>>>> inputbuffer_processors = 2
>>>>>> inputbuffer_wait_strategy = blocking
>>>>>>
>>>>>> message_journal_enabled = true
>>>>>> message_journal_dir = /var/lib/graylog-server/journal
>>>>>> message_journal_max_age = 24h
>>>>>> message_journal_max_size = 150gb
>>>>>> message_journal_flush_age = 1m
>>>>>> message_journal_flush_interval = 1000000
>>>>>> message_journal_segment_age = 1h
>>>>>> message_journal_segment_size = 1gb
>>>>>>
>>>>>> dead_letters_enabled = false
>>>>>> lb_recognition_period_seconds = 3
>>>>>>
>>>>>> mongodb_useauth = true
>>>>>> mongodb_user = <Censored>
>>>>>> mongodb_password = <Censored>
>>>>>> mongodb_replica_set = bne3-0001ladb.server-web.com:27017,bne3-0002ladb.server-web.com:27017
>>>>>> mongodb_database = graylog2
>>>>>> mongodb_max_connections = 200
>>>>>> mongodb_threads_allowed_to_block_multiplier = 5
>>>>>>
>>>>>> #rules_file = /etc/graylog2.drl
>>>>>>
>>>>>> # Email transport
>>>>>> transport_email_enabled = true
>>>>>> transport_email_hostname = <Censored>
>>>>>> transport_email_port = 25
>>>>>> transport_email_use_auth = false
>>>>>> transport_email_use_tls = false
>>>>>> transport_email_use_ssl = false
>>>>>> transport_email_auth_username = [email protected]
>>>>>> transport_email_auth_password = secret
>>>>>> transport_email_subject_prefix = [graylog2]
>>>>>> transport_email_from_email = <Censored>
>>>>>> transport_email_web_interface_url = <Censored>
>>>>>>
>>>>>> message_cache_off_heap = false
>>>>>> message_cache_spool_dir = /var/lib/graylog2-server/message-cache-spool
>>>>>> #message_cache_commit_interval = 1000
>>>>>> #input_cache_max_size = 0
>>>>>>
>>>>>> #ldap_connection_timeout = 2000
>>>>>>
>>>>>> versionchecks = false
>>>>>>
>>>>>> #enable_metrics_collection = false
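Given the message_journal settings above, one way to watch the backlog Pete
describes draining is through the REST API. A sketch, assuming the
/system/journal resource present in Graylog 1.x and the rest_listen_uri from
the config; the credentials are placeholders:

# Poll the journal state on one node. The response reports journal
# utilisation and append/read rates; reads persistently lagging appends
# means the node is journaling messages faster than it can process them.
curl -s -u admin:<password> http://172.22.20.66:12900/system/journal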
