Thanks for the replies, guys. I'm away from the office today but will check these things tomorrow.
Mathieu, I will check the load average, but from memory the 5-minute average was around 12 to 18; I will confirm this tomorrow. As for the "co-stop" metric, I haven't used esxtop on these hosts, but I have looked at the CPU Ready metric and it seems to be OK (sub-5% sustained). One of the physical hosts has exactly the same number of CPUs allocated as the VMs running on it, but the other two physical hosts have no over-subscription of CPUs at all. There is no memory over-subscription on any host either.

For the moment I have simply increased the CPUs on the existing nodes as well as adding the two new ones. I am putting together a business case for new hardware for the Elasticsearch cluster, and if this goes ahead I will move to a model of more Graylog nodes with fewer CPUs and less memory per node, as I think that will scale better.

Arie, I will increase the output buffer processors tomorrow to see what happens, but I do know that the process buffer gets quite full at times while the output buffer is usually almost empty.

On Wed, May 6, 2015 at 3:05 AM, Mathieu Grzybek <[email protected]> wrote:

> Also check the "co-stop" metric on VMware. I am sure you have too many vCPUs.
>
> On 5 May 2015, at 16:21, Arie <[email protected]> wrote:
>
> What happens when you raise "outputbuffer_processors = 5" to "outputbuffer_processors = 10"?
>
> On Tuesday, 5 May 2015 02:23:37 UTC+2, Pete GS wrote:
>>
>> Yesterday I did a yum update on all Graylog and MongoDB nodes, and since doing that and rebooting them all (there was a kernel update) it seems there are no longer issues connecting to the Mongo database.
>>
>> However, I'm still seeing excessively high CPU usage on the Graylog nodes, where all vCPUs regularly exceed 95%.
>>
>> What can contribute to this? I'm a little stumped at present.
>>
>> I would say our average messages/second is around 5,000 to 6,000, with peaks up to about 12,000.
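[Editor's note: for anyone comparing Pete's "sub 5% sustained" CPU Ready reading against esxtop, a minimal sketch of the usual conversion from vCenter's CPU Ready summation (milliseconds) to a percentage. The 20 s interval is the vCenter real-time chart default and is an assumption here; adjust it for rolled-up charts.]

```python
# Sketch: convert vCenter's "CPU Ready" summation (milliseconds per
# sample) into the percentage figure esxtop reports as %RDY.
# Assumption: 20 s real-time sample interval (the vCenter default).

def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """CPU Ready as a percentage of one sampling interval, per vCPU."""
    return ready_ms / (interval_s * 1000.0) * 100.0

print(cpu_ready_percent(1000))  # 1000 ms ready in a 20 s sample -> 5.0
```

A sustained value above roughly 5% per vCPU is commonly treated as a sign of CPU contention, which is why Pete cites that threshold.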
>>
>> Cheers, Pete
>>
>> On Friday, 1 May 2015 08:20:35 UTC+10, Pete GS wrote:
>>>
>>> Does anyone have any thoughts on this?
>>>
>>> Even if someone could just identify some scenarios that would cause high CPU on Graylog servers, and the circumstances in which Graylog would have trouble contacting the MongoDB servers.
>>>
>>> Cheers, Pete
>>>
>>> On Wednesday, 29 April 2015 10:34:28 UTC+10, Pete GS wrote:
>>>>
>>>> Hi all,
>>>>
>>>> We acquired a company a while ago, and last week we added all of their logs to our Graylog environment; they all come in from their Syslog server via UDP.
>>>>
>>>> After this I noticed that the Graylog servers were maxing out their CPUs, so to alleviate this I increased CPU resources on the existing servers and added two new servers.
>>>>
>>>> I'm still seeing generally high CPU usage, with peaks of 100% on all four Graylog servers, but now they also seem to have trouble connecting to MongoDB.
>>>>
>>>> I see lots of "[NodePingThread] Did not find meta info of this node. Re-registering." streaming through the log files, but it only seems to happen when I have more than two Graylog servers running.
>>>>
>>>> I have verified that NTP is installed and configured, and all servers, including the MongoDB and Elasticsearch servers, are syncing with the same NTP servers.
>>>>
>>>> We're doing fewer than 10,000 messages per second, so with the resources I've allocated I would have expected no issues whatsoever.
>>>>
>>>> I have seen this link: https://groups.google.com/forum/?hl=en#!topic/graylog2/bW2glCdBIUI but I don't believe it is our issue.
>>>>
>>>> If it truly is being caused by doing lots of reverse DNS lookups, I would expect tcpdump to show me that traffic to our DNS servers, but I see almost no DNS lookups at all.
>>>>
>>>> We have 6 inputs in total, but only one receives the bulk of the Syslog UDP messages. Most of the other inputs are GELF UDP inputs.
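[Editor's note: a back-of-the-envelope check on the rates in this thread. If the stated 12,000 msgs/s peak is spread evenly across four nodes, each node sees about 3,000 msgs/s, well below what the posted output settings (output_batch_size = 5000, outputbuffer_processors = 5, output_flush_interval = 1) could move. The even-load and one-batch-per-flush assumptions are mine and deliberately crude.]

```python
# Back-of-the-envelope throughput check. Assumptions (not from the
# thread): load balancing is even across nodes, and each output
# processor can flush one full batch per flush interval.

def per_node_rate(total_msgs_per_s: float, nodes: int) -> float:
    """Peak message rate each node sees if load is spread evenly."""
    return total_msgs_per_s / nodes

def output_capacity(batch_size: int, processors: int, flush_interval_s: float) -> float:
    """Crude upper bound on messages/s one node can hand to Elasticsearch."""
    return batch_size * processors / flush_interval_s

print(per_node_rate(12_000, 4))        # peak per node at 12k msgs/s total
print(output_capacity(5_000, 5, 1.0))  # with Pete's output settings
```

The headroom on the output side is consistent with Pete's later observation that the output buffer sits almost empty while the process buffer fills, pointing at message processing (extractors, streams) rather than Elasticsearch output as the bottleneck.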
>>>>
>>>> We also have 11 streams; however, pausing these streams seems to have little to no impact on the CPU usage.
>>>>
>>>> All the Graylog servers are virtualised on top of vSphere 5.5 Update 2, with plenty of physical hardware available to service the workload (little to no contention).
>>>>
>>>> The original two have 20 vCPUs and 32 GB RAM; the additional two have 16 vCPUs and 32 GB RAM.
>>>>
>>>> Java heap on all is set to 16 GB.
>>>>
>>>> This is all running on CentOS 6.
>>>>
>>>> Any input would be greatly appreciated, as I'm a bit stumped on how to get this resolved at present.
>>>>
>>>> Here is the config file I'm using (censored where appropriate):
>>>>
>>>> is_master = false
>>>> node_id_file = /etc/graylog2/server/node-id
>>>> password_secret = <Censored>
>>>> root_username = <Censored>
>>>> root_password_sha2 = <Censored>
>>>> plugin_dir = /usr/share/graylog2-server/plugin
>>>> rest_listen_uri = http://172.22.20.66:12900/
>>>>
>>>> elasticsearch_max_docs_per_index = 20000000
>>>> elasticsearch_max_number_of_indices = 999
>>>> retention_strategy = close
>>>> elasticsearch_shards = 4
>>>> elasticsearch_replicas = 1
>>>> elasticsearch_index_prefix = graylog2
>>>> allow_leading_wildcard_searches = true
>>>> allow_highlighting = true
>>>> elasticsearch_cluster_name = graylog2
>>>> elasticsearch_node_name = bne3-0002las
>>>> elasticsearch_node_master = false
>>>> elasticsearch_node_data = false
>>>> elasticsearch_discovery_zen_ping_multicast_enabled = false
>>>> elasticsearch_discovery_zen_ping_unicast_hosts = bne3-0001lai.server-web.com:9300,bne3-0002lai.server-web.com:9300,bne3-0003lai.server-web.com:9300,bne3-0004lai.server-web.com:9300,bne3-0005lai.server-web.com:9300,bne3-0006lai.server-web.com:9300,bne3-0007lai.server-web.com:9300,bne3-0008lai.server-web.com:9300,bne3-0009lai.server-web.com:9300
>>>> elasticsearch_cluster_discovery_timeout = 5000
>>>> elasticsearch_discovery_initial_state_timeout = 3s
>>>> elasticsearch_analyzer = standard
>>>>
>>>> output_batch_size = 5000
>>>> output_flush_interval = 1
>>>> processbuffer_processors = 20
>>>> outputbuffer_processors = 5
>>>> #outputbuffer_processor_keep_alive_time = 5000
>>>> #outputbuffer_processor_threads_core_pool_size = 3
>>>> #outputbuffer_processor_threads_max_pool_size = 30
>>>> #udp_recvbuffer_sizes = 1048576
>>>> processor_wait_strategy = blocking
>>>> ring_size = 65536
>>>>
>>>> inputbuffer_ring_size = 65536
>>>> inputbuffer_processors = 2
>>>> inputbuffer_wait_strategy = blocking
>>>>
>>>> message_journal_enabled = true
>>>> message_journal_dir = /var/lib/graylog-server/journal
>>>> message_journal_max_age = 24h
>>>> message_journal_max_size = 150gb
>>>> message_journal_flush_age = 1m
>>>> message_journal_flush_interval = 1000000
>>>> message_journal_segment_age = 1h
>>>> message_journal_segment_size = 1gb
>>>>
>>>> dead_letters_enabled = false
>>>> lb_recognition_period_seconds = 3
>>>>
>>>> mongodb_useauth = true
>>>> mongodb_user = <Censored>
>>>> mongodb_password = <Censored>
>>>> mongodb_replica_set = bne3-0001ladb.server-web.com:27017,bne3-0002ladb.server-web.com:27017
>>>> mongodb_database = graylog2
>>>> mongodb_max_connections = 200
>>>> mongodb_threads_allowed_to_block_multiplier = 5
>>>>
>>>> #rules_file = /etc/graylog2.drl
>>>>
>>>> # Email transport
>>>> transport_email_enabled = true
>>>> transport_email_hostname = <Censored>
>>>> transport_email_port = 25
>>>> transport_email_use_auth = false
>>>> transport_email_use_tls = false
>>>> transport_email_use_ssl = false
>>>> transport_email_auth_username = [email protected]
>>>> transport_email_auth_password = secret
>>>> transport_email_subject_prefix = [graylog2]
>>>> transport_email_from_email = <Censored>
>>>> transport_email_web_interface_url = <Censored>
>>>>
>>>> message_cache_off_heap = false
>>>> message_cache_spool_dir = /var/lib/graylog2-server/message-cache-spool
>>>> #message_cache_commit_interval = 1000
>>>> #input_cache_max_size = 0
>>>>
>>>> #ldap_connection_timeout = 2000
>>>>
>>>> versionchecks = false
>>>>
>>>> #enable_metrics_collection = false
>
> --
> You received this message because you are subscribed to the Google Groups "graylog2" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
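[Editor's note: Mathieu's "too many vCPUs" point has a corollary in the posted server.conf: the buffer-processor settings alone (processbuffer_processors = 20, outputbuffer_processors = 5, inputbuffer_processors = 2) configure 27 always-runnable threads, which exceeds both the 16-vCPU and 20-vCPU VM sizes. A sketch of that arithmetic; GC and input threads are deliberately excluded, so it understates real demand.]

```python
# Sketch: total configured Graylog buffer-processor threads versus the
# vCPUs of each VM size in the thread. Values come from the server.conf
# above; JVM GC and network input threads are ignored (assumption).

PROCESSBUFFER = 20  # processbuffer_processors
OUTPUTBUFFER = 5    # outputbuffer_processors
INPUTBUFFER = 2     # inputbuffer_processors

def total_processor_threads() -> int:
    return PROCESSBUFFER + OUTPUTBUFFER + INPUTBUFFER

for vcpus in (16, 20):
    threads = total_processor_threads()
    print(f"{threads} processor threads on {vcpus} vCPUs "
          f"-> oversubscribed={threads > vcpus}")
```

If this arithmetic holds, shrinking the VMs' vCPU counts without also lowering processbuffer_processors would make the intra-VM contention worse, not better, which fits Pete's plan of more nodes with fewer CPUs each.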
