I've come back to the office this morning and discovered we had an 
Elasticsearch issue last night, which has left a large backlog of unprocessed 
messages in the journal.
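
For reference, a quick way to gauge the backlog is the on-disk size of the 
journal directory (the path matches message_journal_dir in my config quoted 
below; adjust if yours differs):

```shell
# Gauge the journal backlog from disk usage; the path matches
# message_journal_dir in the config further down this thread.
du -sh /var/lib/graylog-server/journal 2>/dev/null \
  || echo "journal dir not found at /var/lib/graylog-server/journal"
```

The node's REST API (http://172.22.20.66:12900/system/journal in my setup) 
should report uncommitted entries too, if your Graylog version exposes that 
endpoint.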

All the Graylog nodes are busy processing these and seem to be slowly 
crunching through the backlog.

Load average (via htop) varies across the four nodes: the lowest I'm seeing 
is 13.59 / 11.80 and the highest 24.81 / 24.64 (1- and 5-minute averages).
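
For context, it's the load per vCPU that matters rather than the raw figure. 
A quick sketch using the worst 1-minute figure above against a 20-vCPU node 
(vCPU counts per my earlier mail):

```shell
# Load average divided by vCPU count: > 1.0 means runnable tasks are queuing.
# 24.81 is the worst 1-minute load above; 20 vCPUs per the original node spec.
awk 'BEGIN { printf "load per vCPU: %.2f\n", 24.81 / 20 }'
# prints "load per vCPU: 1.24"
```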

Interestingly, the process buffer is only full on one of the nodes; the 
other three appear to be 10% full or less.

The output buffers are all empty.

The Elasticsearch issue was the cluster running out of disk space, which 
I've resolved for the moment; my business case for new hardware should solve 
it permanently.
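
To keep an eye on it in the meantime, I'm watching free space on the data 
path (the path below is an assumption from the default RPM layout; check 
path.data in your elasticsearch.yml). Elasticsearch's cat API can show 
per-node disk use across the cluster as well:

```shell
# Free space on the assumed Elasticsearch data path; falls back to / so the
# command still shows something if the path differs on your install.
df -h /var/lib/elasticsearch 2>/dev/null || df -h /
# Per-node disk usage across the cluster (run against any ES node):
#   curl -s 'http://bne3-0001lai.server-web.com:9200/_cat/allocation?v'
```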

What other info can I give you guys to help me look in the right direction?
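
Here's the kind of snapshot I can attach to start with: vmstat separates 
user/system/iowait CPU time (which should distinguish message-processing 
load from disk pressure), and /proc/loadavg has the raw figures htop 
summarises. Nothing Graylog-specific, just standard Linux tooling:

```shell
# Second vmstat sample is the meaningful one (the first is an average since
# boot); guarded with || true in case procps isn't installed.
vmstat 1 2 2>/dev/null || true
# Raw 1/5/15-minute load averages plus runnable/total task counts
cat /proc/loadavg
```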

Cheers, Pete

On Wednesday, 6 May 2015 07:33:31 UTC+10, Pete GS wrote:
>
> Thanks for the replies guys. I'm away from the office today but will check 
> these things tomorrow.
>
> Mathieu, I will check the load average but from memory the 5 minute 
> average was around 12 or 18. I will confirm this tomorrow though.
>
> As for the "co-stop" metric, I haven't used esxtop on these hosts, but I 
> have looked at the CPU Ready metric and it seems to be OK (sub-5% 
> sustained). One of the physical hosts has exactly the same number of CPUs 
> allocated as the VMs running on it, but the other two physical hosts have 
> no over-subscription of CPUs at all. There is no memory over-subscription 
> on any of the hosts either.
>
> For the moment I have simply increased the CPUs on the existing nodes as 
> well as adding the two new ones. I am putting together a business case for 
> new hardware for the Elasticsearch cluster, and if this goes ahead I will 
> move to a model of more Graylog nodes with fewer CPUs and less memory per 
> node, as I think that will scale better.
>
> Arie, I will increase the output buffer processors tomorrow to see what 
> happens, but I do know that the process buffer gets quite full at times 
> while the output buffer is usually almost empty.
>
> On Wed, May 6, 2015 at 3:05 AM, Mathieu Grzybek <[email protected]
> > wrote:
>
>> Also check the "co-stop" metric on VMware. I am sure you have too many 
>> vCPUs.
>>
>> On 5 May 2015 at 16:21, Arie <[email protected]> wrote:
>>
>> What happens when you raise "outputbuffer_processors = 5" to 
>> "outputbuffer_processors = 10" ?
>>
>> On Tuesday, 5 May 2015 02:23:37 UTC+2, Pete GS wrote:
>>>
>>> Yesterday I did a yum update on all Graylog and MongoDB nodes and since 
>>> doing that and rebooting them all (there was a kernel update) it seems that 
>>> there are no longer issues connecting to the Mongo database.
>>>
>>> However, I'm still seeing excessively high CPU usage on the Graylog 
>>> nodes, where all vCPUs are regularly exceeding 95%.
>>>
>>> What can contribute to this? I'm a little stumped at present.
>>>
>>> I would say our average messages/second is around 5,000 to 6,000 with 
>>> peaks up to about 12,000.
>>>
>>> Cheers, Pete
>>>
>>> On Friday, 1 May 2015 08:20:35 UTC+10, Pete GS wrote:
>>>>
>>>> Does anyone have any thoughts on this?
>>>>
>>>> Even pointers to scenarios that cause high CPU on Graylog servers, or 
>>>> the circumstances in which Graylog would have trouble contacting the 
>>>> MongoDB servers, would help.
>>>>
>>>> Cheers, Pete
>>>>
>>>> On Wednesday, 29 April 2015 10:34:28 UTC+10, Pete GS wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> We acquired a company a while ago and last week we added all of their 
>>>>> logs to our Graylog environment which all come in from their Syslog 
>>>>> server 
>>>>> via UDP.
>>>>>
>>>>> After this, I noticed that the Graylog servers were maxing CPU so to 
>>>>> alleviate this I increased CPU resources to the existing servers and 
>>>>> added 
>>>>> two new servers.
>>>>>
>>>>> I'm still seeing generally high CPU usage with peaks of 100% on all 
>>>>> four of the Graylog servers, and now they also seem to have trouble 
>>>>> connecting to MongoDB.
>>>>>
>>>>> I see lots of "[NodePingThread] Did not find meta info of this node. 
>>>>> Re-registering." streaming through the log files but it only seems to 
>>>>> happen when I have more than two Graylog servers running.
>>>>>
>>>>> I have verified NTP is installed and configured, and all servers, 
>>>>> including the MongoDB and Elasticsearch servers, are syncing with the 
>>>>> same NTP servers.
>>>>>
>>>>> We're doing less than 10,000 messages per second so with the resources 
>>>>> I've allocated I would have expected no issues whatsoever.
>>>>>
>>>>> I have seen this link: 
>>>>> https://groups.google.com/forum/?hl=en#!topic/graylog2/bW2glCdBIUI but 
>>>>> I don't believe it is our issue.
>>>>>
>>>>> If it truly is being caused by doing lots of reverse DNS lookups, I 
>>>>> would expect tcpdump to show me that traffic to our DNS servers, but I 
>>>>> see 
>>>>> almost no DNS lookups at all.
>>>>>
>>>>> We have 6 inputs in total but only one receives the bulk of the Syslog 
>>>>> UDP messages. Most of the other inputs are GELF UDP inputs.
>>>>>
>>>>> We also have 11 streams, however pausing these streams seems to have 
>>>>> little to no impact on the CPU usage.
>>>>>
>>>>> All the Graylog servers are virtualised on top of vSphere 5.5 Update 2 
>>>>> with plenty of physical hardware available to service the workload 
>>>>> (little 
>>>>> to no contention).
>>>>>
>>>>> The original two have 20 vCPUs and 32GB RAM; the additional two have 
>>>>> 16 vCPUs and 32GB RAM.
>>>>>
>>>>> Java heap on all is set to 16GB.
>>>>>
>>>>> This is all running on CentOS 6.
>>>>>
>>>>> Any input would be greatly appreciated as I'm a bit stumped on how to 
>>>>> get this resolved at present.
>>>>>
>>>>> Here is the config file I'm using (censored where appropriate):
>>>>>
>>>>> is_master = false
>>>>> node_id_file = /etc/graylog2/server/node-id
>>>>> password_secret = <Censored>
>>>>> root_username = <Censored>
>>>>> root_password_sha2 = <Censored>
>>>>> plugin_dir = /usr/share/graylog2-server/plugin
>>>>> rest_listen_uri = http://172.22.20.66:12900/
>>>>>
>>>>> elasticsearch_max_docs_per_index = 20000000
>>>>> elasticsearch_max_number_of_indices = 999
>>>>> retention_strategy = close
>>>>> elasticsearch_shards = 4
>>>>> elasticsearch_replicas = 1
>>>>> elasticsearch_index_prefix = graylog2
>>>>> allow_leading_wildcard_searches = true
>>>>> allow_highlighting = true
>>>>> elasticsearch_cluster_name = graylog2
>>>>> elasticsearch_node_name = bne3-0002las
>>>>> elasticsearch_node_master = false
>>>>> elasticsearch_node_data = false
>>>>> elasticsearch_discovery_zen_ping_multicast_enabled = false
>>>>> elasticsearch_discovery_zen_ping_unicast_hosts = bne3-0001lai.server-web.com:9300,bne3-0002lai.server-web.com:9300,bne3-0003lai.server-web.com:9300,bne3-0004lai.server-web.com:9300,bne3-0005lai.server-web.com:9300,bne3-0006lai.server-web.com:9300,bne3-0007lai.server-web.com:9300,bne3-0008lai.server-web.com:9300,bne3-0009lai.server-web.com:9300
>>>>> elasticsearch_cluster_discovery_timeout = 5000
>>>>> elasticsearch_discovery_initial_state_timeout = 3s
>>>>> elasticsearch_analyzer = standard
>>>>>
>>>>> output_batch_size = 5000
>>>>> output_flush_interval = 1
>>>>> processbuffer_processors = 20
>>>>> outputbuffer_processors = 5
>>>>> #outputbuffer_processor_keep_alive_time = 5000
>>>>> #outputbuffer_processor_threads_core_pool_size = 3
>>>>> #outputbuffer_processor_threads_max_pool_size = 30
>>>>> #udp_recvbuffer_sizes = 1048576
>>>>> processor_wait_strategy = blocking
>>>>> ring_size = 65536
>>>>>
>>>>> inputbuffer_ring_size = 65536
>>>>> inputbuffer_processors = 2
>>>>> inputbuffer_wait_strategy = blocking
>>>>>
>>>>> message_journal_enabled = true
>>>>> message_journal_dir = /var/lib/graylog-server/journal
>>>>> message_journal_max_age = 24h
>>>>> message_journal_max_size = 150gb
>>>>> message_journal_flush_age = 1m
>>>>> message_journal_flush_interval = 1000000
>>>>> message_journal_segment_age = 1h
>>>>> message_journal_segment_size = 1gb
>>>>>
>>>>> dead_letters_enabled = false
>>>>> lb_recognition_period_seconds = 3
>>>>>
>>>>> mongodb_useauth = true
>>>>> mongodb_user = <Censored>
>>>>> mongodb_password = <Censored>
>>>>> mongodb_replica_set = bne3-0001ladb.server-web.com:27017,bne3-0002ladb.server-web.com:27017
>>>>> mongodb_database = graylog2
>>>>> mongodb_max_connections = 200
>>>>> mongodb_threads_allowed_to_block_multiplier = 5
>>>>>
>>>>> #rules_file = /etc/graylog2.drl
>>>>>
>>>>> # Email transport
>>>>> transport_email_enabled = true
>>>>> transport_email_hostname = <Censored>
>>>>> transport_email_port = 25
>>>>> transport_email_use_auth = false
>>>>> transport_email_use_tls = false
>>>>> transport_email_use_ssl = false
>>>>> transport_email_auth_username = [email protected]
>>>>> transport_email_auth_password = secret
>>>>> transport_email_subject_prefix = [graylog2]
>>>>> transport_email_from_email = <Censored>
>>>>> transport_email_web_interface_url = <Censored>
>>>>>
>>>>> message_cache_off_heap = false
>>>>> message_cache_spool_dir = /var/lib/graylog2-server/message-cache-spool
>>>>> #message_cache_commit_interval = 1000
>>>>> #input_cache_max_size = 0
>>>>>
>>>>> #ldap_connection_timeout = 2000
>>>>>
>>>>> versionchecks = false
>>>>>
>>>>> #enable_metrics_collection = false
>>>>>
>>>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "graylog2" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> For more options, visit https://groups.google.com/d/optout.
>>
>>
>
>
