No further input on this? The Graylog master node now also seems to drop out
regularly with the "Did not find meta info of this node. Re-registering."
message, and it is under no load, as our load balancer doesn't direct any
input messages to it.

Cheers, Pete
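One cheap way to see whether the registration really is vanishing is to watch
it from the MongoDB side. A minimal sketch, assuming the node documents live
in the "nodes" collection as they did in Graylog 1.x; host and database names
are taken from the config quoted further down, and credentials are needed
since that config enables mongodb_useauth:

# Each graylog-server node keeps one document in "nodes" with a
# last-seen timestamp. If the master's document disappears between
# NodePingThread runs, the "Did not find meta info" message follows.
# Add -u/-p for the <Censored> mongodb_user/mongodb_password.
mongo bne3-0001ladb.server-web.com:27017/graylog2 \
    --eval 'db.nodes.find().forEach(printjson)'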
On Thursday, 7 May 2015 07:44:41 UTC+10, Pete GS wrote:
>
> I've come back to the office this morning and discovered we had an
> ElasticSearch issue last night which has resulted in lots of unprocessed
> messages in the journal.
>
> All the Graylog nodes are busy processing these and seem to be slowly
> crunching through them.
>
> Load average (using htop) varies across the four nodes, but I'm seeing a
> minimum of 13.59 / 11.80 and a maximum of 24.81 / 24.64.
>
> Interestingly enough, the process buffer is only full on one of the nodes;
> the other three appear to be 10% full or less.
>
> The output buffers are all empty.
>
> The issue with ElasticSearch was running out of disk space, which I've
> resolved for the moment, but my business case for new hardware should
> solve that permanently.
>
> What other info can I give you guys to help me look in the right direction?
>
> Cheers, Pete
>
> On Wednesday, 6 May 2015 07:33:31 UTC+10, Pete GS wrote:
>>
>> Thanks for the replies, guys. I'm away from the office today but will
>> check these things tomorrow.
>>
>> Mathieu, I will check the load average, but from memory the 5-minute
>> average was around 12 or 18. I will confirm this tomorrow though.
>>
>> As for the "co-stop" metric, I haven't used esxtop on these hosts, but I
>> have looked at the CPU Ready metric and it seems to be OK (sub 5%
>> sustained). One of the physical hosts has exactly the same number of
>> CPUs allocated as the VMs running on it, but the other two physical
>> hosts have no over-subscription of CPUs at all. There is no memory
>> over-subscription on any host either.
>>
>> For the moment I have simply increased the CPUs on the existing nodes as
>> well as adding the two new ones. I am putting together a business case
>> for new hardware for the ElasticSearch cluster, and if this goes ahead I
>> will move to a model of more Graylog nodes with fewer CPUs and less
>> memory per node, as I think that will scale better.
>>
>> Arie, I will increase the output buffer processors tomorrow to see what
>> happens, but I do know that the process buffer gets quite full at times
>> while the output buffer is usually almost empty.
>>
>> On Wed, May 6, 2015 at 3:05 AM, Mathieu Grzybek <[email protected]> wrote:
>>>
>>> Also check the "co-stop" metric on VMware. I am sure you have too many
>>> vCPUs.
>>>
>>> On 5 May 2015 at 16:21, Arie <[email protected]> wrote:
>>>
>>> What happens when you raise "outputbuffer_processors = 5" to
>>> "outputbuffer_processors = 10"?
>>>
>>> On Tuesday, 5 May 2015 02:23:37 UTC+2, Pete GS wrote:
>>>>
>>>> Yesterday I did a yum update on all Graylog and MongoDB nodes, and
>>>> since doing that and rebooting them all (there was a kernel update)
>>>> it seems that there are no longer issues connecting to the Mongo
>>>> database.
>>>>
>>>> However, I'm still seeing excessively high CPU usage on the Graylog
>>>> nodes, where all vCPUs are regularly exceeding 95%.
>>>>
>>>> What can contribute to this? I'm a little stumped at present.
>>>>
>>>> I would say our average messages/second is around 5,000 to 6,000,
>>>> with peaks up to about 12,000.
>>>>
>>>> Cheers, Pete
>>>>
>>>> On Friday, 1 May 2015 08:20:35 UTC+10, Pete GS wrote:
>>>>>
>>>>> Does anyone have any thoughts on this?
>>>>>
>>>>> Even if someone could identify some scenarios that would cause high
>>>>> CPU on Graylog servers, and in what circumstances Graylog would have
>>>>> trouble contacting the MongoDB servers.
>>>>>
>>>>> Cheers, Pete
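Arie's suggestion above is a one-line change. A minimal sketch of the edit,
assuming the config file sits next to the node_id_file path shown in the
config further down; a graylog-server restart is needed for it to take
effect:

# /etc/graylog2/server/server.conf -- path is an assumption based on
# the node_id_file setting; adjust to wherever your config lives.
# Was: outputbuffer_processors = 5
outputbuffer_processors = 10

As for co-stop: it is only visible from the hypervisor side, not from inside
the guest, e.g. in esxtop's CPU view as the %CSTP column (with %RDY alongside
it for CPU ready).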
>>>>>
>>>>> On Wednesday, 29 April 2015 10:34:28 UTC+10, Pete GS wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We acquired a company a while ago, and last week we added all of
>>>>>> their logs to our Graylog environment; they all come in from their
>>>>>> Syslog server via UDP.
>>>>>>
>>>>>> After this, I noticed that the Graylog servers were maxing out their
>>>>>> CPUs, so to alleviate this I increased CPU resources on the existing
>>>>>> servers and added two new servers.
>>>>>>
>>>>>> I'm still seeing generally high CPU usage with peaks of 100% on all
>>>>>> four of the Graylog servers, but I now also have issues where they
>>>>>> seem to have trouble connecting to MongoDB.
>>>>>>
>>>>>> I see lots of "[NodePingThread] Did not find meta info of this node.
>>>>>> Re-registering." streaming through the log files, but it only seems
>>>>>> to happen when I have more than two Graylog servers running.
>>>>>>
>>>>>> I have verified NTP is installed and configured, and all servers,
>>>>>> including the MongoDB and ElasticSearch servers, are syncing with
>>>>>> the same NTP servers.
>>>>>>
>>>>>> We're doing less than 10,000 messages per second, so with the
>>>>>> resources I've allocated I would have expected no issues whatsoever.
>>>>>>
>>>>>> I have seen this link:
>>>>>> https://groups.google.com/forum/?hl=en#!topic/graylog2/bW2glCdBIUI
>>>>>> but I don't believe it is our issue.
>>>>>>
>>>>>> If it truly is being caused by doing lots of reverse DNS lookups, I
>>>>>> would expect tcpdump to show me that traffic to our DNS servers, but
>>>>>> I see almost no DNS lookups at all (see the capture sketch below).
>>>>>>
>>>>>> We have 6 inputs in total, but only one receives the bulk of the
>>>>>> Syslog UDP messages. Most of the other inputs are GELF UDP inputs.
>>>>>>
>>>>>> We also have 11 streams; however, pausing these streams seems to
>>>>>> have little to no impact on the CPU usage.
>>>>>>
>>>>>> All the Graylog servers are virtualised on top of vSphere 5.5
>>>>>> Update 2, with plenty of physical hardware available to service the
>>>>>> workload (little to no contention).
>>>>>>
>>>>>> The original two have 20 vCPUs and 32 GB RAM; the additional two
>>>>>> have 16 vCPUs and 32 GB RAM.
>>>>>>
>>>>>> Java heap on all is set to 16 GB.
>>>>>>
>>>>>> This is all running on CentOS 6.
>>>>>>
>>>>>> Any input would be greatly appreciated, as I'm a bit stumped on how
>>>>>> to get this resolved at present.
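The reverse-DNS theory above is cheap to confirm or rule out. A sketch,
assuming the capture interface is eth0 (run as root; -l keeps tcpdump
line-buffered so the pipe works):

# Count DNS queries leaving this node over 60 seconds. A flood of
# PTR? queries would point at reverse lookups; a near-zero count rules
# them out, matching what Pete observed.
timeout 60 tcpdump -l -n -i eth0 'udp port 53' 2>/dev/null | grep -c 'PTR?'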
>>>>>>
>>>>>> Here is the config file I'm using (censored where appropriate):
>>>>>>
>>>>>> is_master = false
>>>>>> node_id_file = /etc/graylog2/server/node-id
>>>>>> password_secret = <Censored>
>>>>>> root_username = <Censored>
>>>>>> root_password_sha2 = <Censored>
>>>>>> plugin_dir = /usr/share/graylog2-server/plugin
>>>>>> rest_listen_uri = http://172.22.20.66:12900/
>>>>>>
>>>>>> elasticsearch_max_docs_per_index = 20000000
>>>>>> elasticsearch_max_number_of_indices = 999
>>>>>> retention_strategy = close
>>>>>> elasticsearch_shards = 4
>>>>>> elasticsearch_replicas = 1
>>>>>> elasticsearch_index_prefix = graylog2
>>>>>> allow_leading_wildcard_searches = true
>>>>>> allow_highlighting = true
>>>>>> elasticsearch_cluster_name = graylog2
>>>>>> elasticsearch_node_name = bne3-0002las
>>>>>> elasticsearch_node_master = false
>>>>>> elasticsearch_node_data = false
>>>>>> elasticsearch_discovery_zen_ping_multicast_enabled = false
>>>>>> elasticsearch_discovery_zen_ping_unicast_hosts = bne3-0001lai.server-web.com:9300,bne3-0002lai.server-web.com:9300,bne3-0003lai.server-web.com:9300,bne3-0004lai.server-web.com:9300,bne3-0005lai.server-web.com:9300,bne3-0006lai.server-web.com:9300,bne3-0007lai.server-web.com:9300,bne3-0008lai.server-web.com:9300,bne3-0009lai.server-web.com:9300
>>>>>> elasticsearch_cluster_discovery_timeout = 5000
>>>>>> elasticsearch_discovery_initial_state_timeout = 3s
>>>>>> elasticsearch_analyzer = standard
>>>>>>
>>>>>> output_batch_size = 5000
>>>>>> output_flush_interval = 1
>>>>>> processbuffer_processors = 20
>>>>>> outputbuffer_processors = 5
>>>>>> #outputbuffer_processor_keep_alive_time = 5000
>>>>>> #outputbuffer_processor_threads_core_pool_size = 3
>>>>>> #outputbuffer_processor_threads_max_pool_size = 30
>>>>>> #udp_recvbuffer_sizes = 1048576
>>>>>> processor_wait_strategy = blocking
>>>>>> ring_size = 65536
>>>>>>
>>>>>> inputbuffer_ring_size = 65536
>>>>>> inputbuffer_processors = 2
>>>>>> inputbuffer_wait_strategy = blocking
>>>>>>
>>>>>> message_journal_enabled = true
>>>>>> message_journal_dir = /var/lib/graylog-server/journal
>>>>>> message_journal_max_age = 24h
>>>>>> message_journal_max_size = 150gb
>>>>>> message_journal_flush_age = 1m
>>>>>> message_journal_flush_interval = 1000000
>>>>>> message_journal_segment_age = 1h
>>>>>> message_journal_segment_size = 1gb
>>>>>>
>>>>>> dead_letters_enabled = false
>>>>>> lb_recognition_period_seconds = 3
>>>>>>
>>>>>> mongodb_useauth = true
>>>>>> mongodb_user = <Censored>
>>>>>> mongodb_password = <Censored>
>>>>>> mongodb_replica_set = bne3-0001ladb.server-web.com:27017,bne3-0002ladb.server-web.com:27017
>>>>>> mongodb_database = graylog2
>>>>>> mongodb_max_connections = 200
>>>>>> mongodb_threads_allowed_to_block_multiplier = 5
>>>>>>
>>>>>> #rules_file = /etc/graylog2.drl
>>>>>>
>>>>>> # Email transport
>>>>>> transport_email_enabled = true
>>>>>> transport_email_hostname = <Censored>
>>>>>> transport_email_port = 25
>>>>>> transport_email_use_auth = false
>>>>>> transport_email_use_tls = false
>>>>>> transport_email_use_ssl = false
>>>>>> transport_email_auth_username = [email protected]
>>>>>> transport_email_auth_password = secret
>>>>>> transport_email_subject_prefix = [graylog2]
>>>>>> transport_email_from_email = <Censored>
>>>>>> transport_email_web_interface_url = <Censored>
>>>>>>
>>>>>> message_cache_off_heap = false
>>>>>> message_cache_spool_dir = /var/lib/graylog2-server/message-cache-spool
>>>>>> #message_cache_commit_interval = 1000
>>>>>> #input_cache_max_size = 0
>>>>>>
>>>>>> #ldap_connection_timeout = 2000
>>>>>>
>>>>>> versionchecks = false
>>>>>>
>>>>>> #enable_metrics_collection = false
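Given the message_journal settings above, one way to watch the backlog Pete
describes draining is through the REST API. A sketch, assuming the
/system/journal resource present in Graylog 1.x and the rest_listen_uri from
the config; the credentials are placeholders:

# Poll the journal state on one node. The response reports journal
# utilisation and append/read rates; reads persistently lagging appends
# means the node is journaling messages faster than it can process them.
curl -s -u admin:<password> http://172.22.20.66:12900/system/journal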
