Apologies, I should've clarified we're running Graylog 1.0.1.

On Wednesday, April 29, 2015 at 10:34:28 AM UTC+10, Pete GS wrote:
>
> Hi all,
>
> We acquired a company a while ago and last week we added all of their logs 
> to our Graylog environment which all come in from their Syslog server via 
> UDP.
>
> After this, I noticed the Graylog servers were maxing out their CPUs, so to 
> alleviate this I increased the CPU resources on the existing servers and 
> added two new servers.
>
> I'm still seeing generally high CPU usage, with peaks of 100% on all four 
> of the Graylog servers, and now they also seem to have trouble connecting 
> to MongoDB.
>
> I see lots of "[NodePingThread] Did not find meta info of this node. 
> Re-registering." streaming through the log files but it only seems to 
> happen when I have more than two Graylog servers running.
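[Inline note: in case it helps with triage, here's a small sketch to count those re-registration events per hour out of server.log. The timestamp format is an assumption based on the default log4j pattern, so adjust the regex if your layout differs.]

```python
import re
from collections import Counter

LOG_MARKER = "Did not find meta info of this node. Re-registering."

def count_reregistrations(lines):
    """Count NodePingThread re-registration events per hour.

    Assumes lines start with a log4j-style timestamp such as
    "2015-04-29 10:12:01,100" (assumption, not confirmed from this setup).
    """
    per_hour = Counter()
    for line in lines:
        if LOG_MARKER in line:
            m = re.match(r"(\d{4}-\d{2}-\d{2} \d{2}):", line)
            if m:
                per_hour[m.group(1)] += 1
    return per_hour

# Illustrative sample lines (hypothetical log content):
sample = [
    "2015-04-29 10:12:01,100 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.",
    "2015-04-29 10:47:33,412 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.",
    "2015-04-29 11:03:09,020 INFO  [InputSetupService] something unrelated",
]
print(count_reregistrations(sample))  # Counter({'2015-04-29 10': 2})
```

If the counts spike only when the third and fourth nodes are running, that would at least confirm the correlation you're describing.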
>
> I have verified NTP is installed and configured, and all servers, 
> including the MongoDB and Elasticsearch servers, are syncing with the same 
> NTP servers.
>
> We're doing less than 10,000 messages per second so with the resources 
> I've allocated I would have expected no issues whatsoever.
>
> I have seen this link: 
> https://groups.google.com/forum/?hl=en#!topic/graylog2/bW2glCdBIUI but I 
> don't believe it is our issue.
>
> If it truly is being caused by doing lots of reverse DNS lookups, I would 
> expect tcpdump to show me that traffic to our DNS servers, but I see almost 
> no DNS lookups at all.
>
> We have 6 inputs in total but only one receives the bulk of the Syslog UDP 
> messages. Most of the other inputs are GELF UDP inputs.
>
> We also have 11 streams; however, pausing these streams seems to have 
> little to no impact on CPU usage.
>
> All the Graylog servers are virtualised on top of vSphere 5.5 Update 2 
> with plenty of physical hardware available to service the workload (little 
> to no contention).
>
> The original two have 20 vCPUs and 32 GB RAM; the additional two have 16 
> vCPUs and 32 GB RAM.
>
> Java heap on all is set to 16GB.
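[Inline note: for anyone reproducing this, on the CentOS RPM packaging the heap is usually set via a sysconfig file. The path and variable name below are assumptions for the 1.x packages, not taken from this setup.]

```shell
# /etc/sysconfig/graylog-server -- path and variable name assumed for the 1.x RPM
GRAYLOG_SERVER_JAVA_OPTS="-Xms16g -Xmx16g"
```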
>
> This is all running on CentOS 6.
>
> Any input would be greatly appreciated as I'm a bit stumped on how to get 
> this resolved at present.
>
> Here is the config file I'm using (censored where appropriate):
>
> is_master = false
> node_id_file = /etc/graylog2/server/node-id
> password_secret = <Censored>
> root_username = <Censored>
> root_password_sha2 = <Censored>
> plugin_dir = /usr/share/graylog2-server/plugin
> rest_listen_uri = http://172.22.20.66:12900/
>
> elasticsearch_max_docs_per_index = 20000000
> elasticsearch_max_number_of_indices = 999
> retention_strategy = close
> elasticsearch_shards = 4
> elasticsearch_replicas = 1
> elasticsearch_index_prefix = graylog2
> allow_leading_wildcard_searches = true
> allow_highlighting = true
> elasticsearch_cluster_name = graylog2
> elasticsearch_node_name = bne3-0002las
> elasticsearch_node_master = false
> elasticsearch_node_data = false
> elasticsearch_discovery_zen_ping_multicast_enabled = false
> elasticsearch_discovery_zen_ping_unicast_hosts = bne3-0001lai.server-web.com:9300,bne3-0002lai.server-web.com:9300,bne3-0003lai.server-web.com:9300,bne3-0004lai.server-web.com:9300,bne3-0005lai.server-web.com:9300,bne3-0006lai.server-web.com:9300,bne3-0007lai.server-web.com:9300,bne3-0008lai.server-web.com:9300,bne3-0009lai.server-web.com:9300
> elasticsearch_cluster_discovery_timeout = 5000
> elasticsearch_discovery_initial_state_timeout = 3s
> elasticsearch_analyzer = standard
>
> output_batch_size = 5000
> output_flush_interval = 1
> processbuffer_processors = 20
> outputbuffer_processors = 5
> #outputbuffer_processor_keep_alive_time = 5000
> #outputbuffer_processor_threads_core_pool_size = 3
> #outputbuffer_processor_threads_max_pool_size = 30
> #udp_recvbuffer_sizes = 1048576
> processor_wait_strategy = blocking
> ring_size = 65536
>
> inputbuffer_ring_size = 65536
> inputbuffer_processors = 2
> inputbuffer_wait_strategy = blocking
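[Inline note: one thing worth checking in the settings above is that the buffer processor thread counts alone already exceed the vCPU count of the two newer nodes. The numbers below are from this config; the oversubscription point is my inference, not a measurement.]

```python
# Processor thread budget implied by the config above.
processbuffer_processors = 20
outputbuffer_processors = 5
inputbuffer_processors = 2

total_processor_threads = (processbuffer_processors
                           + outputbuffer_processors
                           + inputbuffer_processors)
print(total_processor_threads)  # 27

# The two newer nodes have 16 vCPUs, so processor threads alone
# oversubscribe them before JVM GC, input, and output I/O threads
# are even counted.
vcpus_new_nodes = 16
print(total_processor_threads > vcpus_new_nodes)  # True
```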
>
> message_journal_enabled = true
> message_journal_dir = /var/lib/graylog-server/journal
> message_journal_max_age = 24h
> message_journal_max_size = 150gb
> message_journal_flush_age = 1m
> message_journal_flush_interval = 1000000
> message_journal_segment_age = 1h
> message_journal_segment_size = 1gb
>
> dead_letters_enabled = false
> lb_recognition_period_seconds = 3
>
> mongodb_useauth = true
> mongodb_user = <Censored>
> mongodb_password = <Censored>
> mongodb_replica_set = bne3-0001ladb.server-web.com:27017,bne3-0002ladb.server-web.com:27017
> mongodb_database = graylog2
> mongodb_max_connections = 200
> mongodb_threads_allowed_to_block_multiplier = 5
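[Inline note: the MongoDB pool settings also multiply out quickly across four server nodes. The arithmetic below uses only the values from this config; what ceiling it should be compared against depends on your mongod version and its connection/ulimit settings, which I can't see from here.]

```python
graylog_nodes = 4
mongodb_max_connections = 200   # per Graylog node, from the config above
block_multiplier = 5            # mongodb_threads_allowed_to_block_multiplier

# Worst case, the four Graylog nodes can open this many connections
# to the replica set:
print(graylog_nodes * mongodb_max_connections)  # 800

# And per node, up to this many threads may queue waiting for a
# connection from the pool:
print(mongodb_max_connections * block_multiplier)  # 1000
```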
>
> #rules_file = /etc/graylog2.drl
>
> # Email transport
> transport_email_enabled = true
> transport_email_hostname = <Censored>
> transport_email_port = 25
> transport_email_use_auth = false
> transport_email_use_tls = false
> transport_email_use_ssl = false
> transport_email_auth_username = [email protected]
> transport_email_auth_password = secret
> transport_email_subject_prefix = [graylog2]
> transport_email_from_email = <Censored>
> transport_email_web_interface_url = <Censored>
>
> message_cache_off_heap = false
> message_cache_spool_dir = /var/lib/graylog2-server/message-cache-spool
> #message_cache_commit_interval = 1000
> #input_cache_max_size = 0
>
> #ldap_connection_timeout = 2000
>
> versionchecks = false
>
> #enable_metrics_collection = false
>
