Thanks for the replies, guys. I'm away from the office today but will check these things tomorrow.
Mathieu, I will check the load average, but from memory the 5-minute average was around 12 to 18; I will confirm this tomorrow. As for the "co-stop" metric, I haven't used esxtop on these hosts, but I have looked at the CPU Ready metric and it seems to be OK (sub-5% sustained). One of the physical hosts has exactly the same number of CPUs allocated as the VMs running on it, but the other two physical hosts have no over-subscription of CPUs at all. There is no memory over-subscription on any host either.

For the moment I have simply increased the CPUs on the existing nodes as well as adding the two new ones. I am putting together a business case for new hardware for the Elasticsearch cluster, and if this goes ahead I will move to a model of more Graylog nodes with fewer CPUs and less memory per node, as I think that will scale better.

Arie, I will increase the output buffer processors tomorrow to see what happens, but I do know that the process buffer gets quite full at times while the output buffer is usually almost empty.

On Wed, May 6, 2015 at 3:05 AM, Mathieu Grzybek <[email protected]> wrote:

> Also check the "co-stop" metric on VMware. I am sure you have too many vCPUs.
>
> On 5 May 2015, at 16:21, Arie <[email protected]> wrote:
>
> What happens when you raise "outputbuffer_processors = 5" to "outputbuffer_processors = 10"?
>
> On Tuesday, 5 May 2015 02:23:37 UTC+2, Pete GS wrote:
>>
>> Yesterday I did a yum update on all Graylog and MongoDB nodes, and since doing that and rebooting them all (there was a kernel update) it seems there are no longer issues connecting to the Mongo database.
>>
>> However, I'm still seeing excessively high CPU usage on the Graylog nodes, where all vCPUs regularly exceed 95%.
>>
>> What can contribute to this? I'm a little stumped at present.
>>
>> I would say our average messages/second is around 5,000 to 6,000, with peaks up to about 12,000.
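[Editor's note: for anyone comparing Pete's "sub 5% sustained" CPU Ready reading against esxtop, a minimal sketch of the usual conversion from vCenter's CPU Ready summation (milliseconds) to a percentage. The 20 s interval is the vCenter real-time chart default and is an assumption here; adjust it for rolled-up charts.]

```python
# Sketch: convert vCenter's "CPU Ready" summation (milliseconds per
# sample) into the percentage figure esxtop reports as %RDY.
# Assumption: 20 s real-time sample interval (the vCenter default).

def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """CPU Ready as a percentage of one sampling interval, per vCPU."""
    return ready_ms / (interval_s * 1000.0) * 100.0

print(cpu_ready_percent(1000))  # 1000 ms ready in a 20 s sample -> 5.0
```

A sustained value above roughly 5% per vCPU is commonly treated as a sign of CPU contention, which is why Pete cites that threshold.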
>>
>> Cheers, Pete
>>
>> On Friday, 1 May 2015 08:20:35 UTC+10, Pete GS wrote:
>>>
>>> Does anyone have any thoughts on this?
>>>
>>> Even if someone could just identify some scenarios that would cause high CPU on Graylog servers, and the circumstances in which Graylog would have trouble contacting the MongoDB servers.
>>>
>>> Cheers, Pete
>>>
>>> On Wednesday, 29 April 2015 10:34:28 UTC+10, Pete GS wrote:
>>>>
>>>> Hi all,
>>>>
>>>> We acquired a company a while ago, and last week we added all of their logs to our Graylog environment; they all come in from their Syslog server via UDP.
>>>>
>>>> After this I noticed that the Graylog servers were maxing out their CPUs, so to alleviate this I increased CPU resources on the existing servers and added two new servers.
>>>>
>>>> I'm still seeing generally high CPU usage, with peaks of 100% on all four Graylog servers, but now they also seem to have trouble connecting to MongoDB.
>>>>
>>>> I see lots of "[NodePingThread] Did not find meta info of this node. Re-registering." streaming through the log files, but it only seems to happen when I have more than two Graylog servers running.
>>>>
>>>> I have verified that NTP is installed and configured, and all servers, including the MongoDB and Elasticsearch servers, are syncing with the same NTP servers.
>>>>
>>>> We're doing fewer than 10,000 messages per second, so with the resources I've allocated I would have expected no issues whatsoever.
>>>>
>>>> I have seen this link: https://groups.google.com/forum/?hl=en#!topic/graylog2/bW2glCdBIUI but I don't believe it is our issue.
>>>>
>>>> If it truly is being caused by doing lots of reverse DNS lookups, I would expect tcpdump to show me that traffic to our DNS servers, but I see almost no DNS lookups at all.
>>>>
>>>> We have 6 inputs in total, but only one receives the bulk of the Syslog UDP messages. Most of the other inputs are GELF UDP inputs.
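[Editor's note: a back-of-the-envelope check on the rates in this thread. If the stated 12,000 msgs/s peak is spread evenly across four nodes, each node sees about 3,000 msgs/s, well below what the posted output settings (output_batch_size = 5000, outputbuffer_processors = 5, output_flush_interval = 1) could move. The even-load and one-batch-per-flush assumptions are mine and deliberately crude.]

```python
# Back-of-the-envelope throughput check. Assumptions (not from the
# thread): load balancing is even across nodes, and each output
# processor can flush one full batch per flush interval.

def per_node_rate(total_msgs_per_s: float, nodes: int) -> float:
    """Peak message rate each node sees if load is spread evenly."""
    return total_msgs_per_s / nodes

def output_capacity(batch_size: int, processors: int, flush_interval_s: float) -> float:
    """Crude upper bound on messages/s one node can hand to Elasticsearch."""
    return batch_size * processors / flush_interval_s

print(per_node_rate(12_000, 4))        # peak per node at 12k msgs/s total
print(output_capacity(5_000, 5, 1.0))  # with Pete's output settings
```

The headroom on the output side is consistent with Pete's later observation that the output buffer sits almost empty while the process buffer fills, pointing at message processing (extractors, streams) rather than Elasticsearch output as the bottleneck.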
>>>>
>>>> We also have 11 streams; however, pausing these streams seems to have little to no impact on the CPU usage.
>>>>
>>>> All the Graylog servers are virtualised on top of vSphere 5.5 Update 2, with plenty of physical hardware available to service the workload (little to no contention).
>>>>
>>>> The original two have 20 vCPUs and 32 GB RAM; the additional two have 16 vCPUs and 32 GB RAM.
>>>>
>>>> Java heap on all is set to 16 GB.
>>>>
>>>> This is all running on CentOS 6.
>>>>
>>>> Any input would be greatly appreciated, as I'm a bit stumped on how to get this resolved at present.
>>>>
>>>> Here is the config file I'm using (censored where appropriate):
>>>>
>>>> is_master = false
>>>> node_id_file = /etc/graylog2/server/node-id
>>>> password_secret = <Censored>
>>>> root_username = <Censored>
>>>> root_password_sha2 = <Censored>
>>>> plugin_dir = /usr/share/graylog2-server/plugin
>>>> rest_listen_uri = http://172.22.20.66:12900/
>>>>
>>>> elasticsearch_max_docs_per_index = 20000000
>>>> elasticsearch_max_number_of_indices = 999
>>>> retention_strategy = close
>>>> elasticsearch_shards = 4
>>>> elasticsearch_replicas = 1
>>>> elasticsearch_index_prefix = graylog2
>>>> allow_leading_wildcard_searches = true
>>>> allow_highlighting = true
>>>> elasticsearch_cluster_name = graylog2
>>>> elasticsearch_node_name = bne3-0002las
>>>> elasticsearch_node_master = false
>>>> elasticsearch_node_data = false
>>>> elasticsearch_discovery_zen_ping_multicast_enabled = false
>>>> elasticsearch_discovery_zen_ping_unicast_hosts = bne3-0001lai.server-web.com:9300,bne3-0002lai.server-web.com:9300,bne3-0003lai.server-web.com:9300,bne3-0004lai.server-web.com:9300,bne3-0005lai.server-web.com:9300,bne3-0006lai.server-web.com:9300,bne3-0007lai.server-web.com:9300,bne3-0008lai.server-web.com:9300,bne3-0009lai.server-web.com:9300
>>>> elasticsearch_cluster_discovery_timeout = 5000
>>>> elasticsearch_discovery_initial_state_timeout = 3s
>>>> elasticsearch_analyzer = standard
>>>>
>>>> output_batch_size = 5000
>>>> output_flush_interval = 1
>>>> processbuffer_processors = 20
>>>> outputbuffer_processors = 5
>>>> #outputbuffer_processor_keep_alive_time = 5000
>>>> #outputbuffer_processor_threads_core_pool_size = 3
>>>> #outputbuffer_processor_threads_max_pool_size = 30
>>>> #udp_recvbuffer_sizes = 1048576
>>>> processor_wait_strategy = blocking
>>>> ring_size = 65536
>>>>
>>>> inputbuffer_ring_size = 65536
>>>> inputbuffer_processors = 2
>>>> inputbuffer_wait_strategy = blocking
>>>>
>>>> message_journal_enabled = true
>>>> message_journal_dir = /var/lib/graylog-server/journal
>>>> message_journal_max_age = 24h
>>>> message_journal_max_size = 150gb
>>>> message_journal_flush_age = 1m
>>>> message_journal_flush_interval = 1000000
>>>> message_journal_segment_age = 1h
>>>> message_journal_segment_size = 1gb
>>>>
>>>> dead_letters_enabled = false
>>>> lb_recognition_period_seconds = 3
>>>>
>>>> mongodb_useauth = true
>>>> mongodb_user = <Censored>
>>>> mongodb_password = <Censored>
>>>> mongodb_replica_set = bne3-0001ladb.server-web.com:27017,bne3-0002ladb.server-web.com:27017
>>>> mongodb_database = graylog2
>>>> mongodb_max_connections = 200
>>>> mongodb_threads_allowed_to_block_multiplier = 5
>>>>
>>>> #rules_file = /etc/graylog2.drl
>>>>
>>>> # Email transport
>>>> transport_email_enabled = true
>>>> transport_email_hostname = <Censored>
>>>> transport_email_port = 25
>>>> transport_email_use_auth = false
>>>> transport_email_use_tls = false
>>>> transport_email_use_ssl = false
>>>> transport_email_auth_username = [email protected]
>>>> transport_email_auth_password = secret
>>>> transport_email_subject_prefix = [graylog2]
>>>> transport_email_from_email = <Censored>
>>>> transport_email_web_interface_url = <Censored>
>>>>
>>>> message_cache_off_heap = false
>>>> message_cache_spool_dir = /var/lib/graylog2-server/message-cache-spool
>>>> #message_cache_commit_interval = 1000
>>>> #input_cache_max_size = 0
>>>>
>>>> #ldap_connection_timeout = 2000
>>>>
>>>> versionchecks = false
>>>>
>>>> #enable_metrics_collection = false
>
> --
> You received this message because you are subscribed to the Google Groups "graylog2" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
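[Editor's note: Mathieu's "too many vCPUs" point has a corollary in the posted server.conf: the buffer-processor settings alone (processbuffer_processors = 20, outputbuffer_processors = 5, inputbuffer_processors = 2) configure 27 always-runnable threads, which exceeds both the 16-vCPU and 20-vCPU VM sizes. A sketch of that arithmetic; GC and input threads are deliberately excluded, so it understates real demand.]

```python
# Sketch: total configured Graylog buffer-processor threads versus the
# vCPUs of each VM size in the thread. Values come from the server.conf
# above; JVM GC and network input threads are ignored (assumption).

PROCESSBUFFER = 20  # processbuffer_processors
OUTPUTBUFFER = 5    # outputbuffer_processors
INPUTBUFFER = 2     # inputbuffer_processors

def total_processor_threads() -> int:
    return PROCESSBUFFER + OUTPUTBUFFER + INPUTBUFFER

for vcpus in (16, 20):
    threads = total_processor_threads()
    print(f"{threads} processor threads on {vcpus} vCPUs "
          f"-> oversubscribed={threads > vcpus}")
```

If this arithmetic holds, shrinking the VMs' vCPU counts without also lowering processbuffer_processors would make the intra-VM contention worse, not better, which fits Pete's plan of more nodes with fewer CPUs each.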
