Thanks very much Arie, I will check these tomorrow and report back. One thing I can confirm is that the heap size is configured correctly.
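For reference, it's in /etc/sysconfig/elasticsearch as you describe, i.e. a line of the form:

ES_HEAP_SIZE=16g

(with the value sized per node, at most half its RAM).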
Cheers, Pete

> On 14 May 2015, at 05:35, Arie <[email protected]> wrote:
>
> Let's try some more options.
>
> I see you are running your stuff virtualised, so you can consider the
> following for CentOS 6.
>
> In your startup kernel config (/etc/grub.conf) you can add the following
> options:
>
> nohz=off (for highly CPU-intensive systems)
> elevator=noop (disk scheduling is done by the virtual layer, so disable it
> in the guest)
> cgroup_disable=memory (if memory cgroups are not used; it frees up some
> memory and allocation overhead)
>
> If you use the pvscsi device, add the following:
>
> vmw_pvscsi.cmd_per_lun=254
> vmw_pvscsi.ring_pages=32
>
> Check disk buffers on the virtual layer too: VMware KB 2053145, see
> http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=2053145&sliceId=1&docTypeID=DT_KB_1_1&dialogID=621755330&stateId=1%200%20593866502
>
> Optimize your disks for performance (up to 30%! yes):
>
> For the filesystems where graylog and/or elasticsearch are located, add the
> following to /etc/fstab. Example:
>
> /dev/mapper/vg_nagios-lv_root / ext4 defaults,noatime,nobarrier,data=writeback 1 1
>
> and if you want to be safer:
>
> /dev/mapper/vg_nagios-lv_root / ext4 defaults,noatime,nobarrier 1 1
>
> Is ES_HEAP_SIZE configured in the correct place? (I got that wrong at
> first.) It belongs in /etc/sysconfig/elasticsearch.
>
> All these options together can improve system performance hugely,
> especially on virtual machines.
>
> PS: did you change your file descriptor limits correctly?
>
> /etc/sysctl.conf
>
> fs.file-max = 65536
>
> /etc/security/limits.conf
>
> * soft nproc 65535
> * hard nproc 65535
> * soft nofile 65535
> * hard nofile 65535
>
> /etc/security/limits.d/90-nproc.conf
>
> * soft nproc 65535
> * hard nproc 65535
> * soft nofile 65535
> * hard nofile 65535
>
> Check filesystem performance with iotop -a to see how it is doing.
>
> HTH,
>
> Arie
>
>
> On Tuesday, 12 May 2015 at 23:52:19 UTC+2, Pete GS wrote:
>>
>> No further input on this?
>>
>> The Graylog master node now also seems to regularly drop out with the "Did
>> not find meta info of this node. Re-registering." message, and it is under
>> no load as our load balancer doesn't direct any input messages to it.
>>
>> Cheers, Pete
>>
>>> On Thursday, 7 May 2015 07:44:41 UTC+10, Pete GS wrote:
>>> I've come back to the office this morning and discovered we had an
>>> ElasticSearch issue last night which has resulted in lots of unprocessed
>>> messages in the journal.
>>>
>>> All the Graylog nodes are busy processing these and seem to be slowly
>>> crunching through them.
>>>
>>> Load average (using htop) varies across the four nodes, but I'm seeing a
>>> minimum of 13.59 / 11.80 and a maximum of 24.81 / 24.64.
>>>
>>> Interestingly enough, the process buffer is only full on one of the
>>> nodes; the other three appear to be 10% full or less.
>>>
>>> The output buffers are all empty.
>>>
>>> The issue with ElasticSearch was running out of disk space, which I've
>>> resolved for the moment, but my business case for new hardware should
>>> solve that permanently.
>>>
>>> What other info can I give you guys to help me look in the right
>>> direction?
>>>
>>> Cheers, Pete
>>>
>>>> On Wednesday, 6 May 2015 07:33:31 UTC+10, Pete GS wrote:
>>>> Thanks for the replies guys. I'm away from the office today but will
>>>> check these things tomorrow.
>>>>
>>>> Mathieu, I will check the load average, but from memory the 5-minute
>>>> average was around 12 or 18. I will confirm this tomorrow though.
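>>>>
>>>> (I'll grab it with plain uptime, which prints the 1, 5 and 15 minute
>>>> load averages on one line; as a rough yardstick, a sustained figure
>>>> above the vCPU count, i.e. above 20 on our larger nodes, would mean the
>>>> run queue is outgrowing the cores.)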
>>>>
>>>> As for the "co stop" metric, I haven't used esxtop on these hosts, but I
>>>> have looked at the CPU Ready metric and it seems to be OK (below 5%
>>>> sustained). One of the physical hosts has exactly the same number of
>>>> CPUs allocated as the VMs running on it, but the other two physical
>>>> hosts have no over-subscription of CPUs at all. There is no memory
>>>> over-subscription on any host either.
>>>>
>>>> For the moment I have simply increased the CPUs on the existing nodes as
>>>> well as adding the two new ones. I am putting together a business case
>>>> for new hardware for the ElasticSearch cluster, and if that goes ahead I
>>>> will move to a model of more Graylog nodes with fewer CPUs and less
>>>> memory per node, as I think that will scale better.
>>>>
>>>> Arie, I will increase the output buffer processors tomorrow to see what
>>>> happens, but I do know that the process buffer gets quite full at times
>>>> while the output buffer is usually almost empty.
>>>>
>>>>> On Wed, May 6, 2015 at 3:05 AM, Mathieu Grzybek <[email protected]>
>>>>> wrote:
>>>>> Also check the "co stop" metric on VMware. I am sure you have too many
>>>>> vCPUs.
>>>>>
>>>>>> On 5 May 2015, at 16:21, Arie <[email protected]> wrote:
>>>>>>
>>>>>> What happens when you raise "outputbuffer_processors = 5" to
>>>>>> "outputbuffer_processors = 10"?
>>>>>>
>>>>>> On Tuesday, 5 May 2015 at 02:23:37 UTC+2, Pete GS wrote:
>>>>>>>
>>>>>>> Yesterday I ran a yum update on all Graylog and MongoDB nodes, and
>>>>>>> since doing that and rebooting them all (there was a kernel update)
>>>>>>> it seems that there are no longer issues connecting to the Mongo
>>>>>>> database.
>>>>>>>
>>>>>>> However, I'm still seeing excessively high CPU usage on the Graylog
>>>>>>> nodes, where all vCPUs regularly exceed 95%.
>>>>>>>
>>>>>>> What can contribute to this? I'm a little stumped at present.
>>>>>>>
>>>>>>> I would say our average is around 5,000 to 6,000 messages/second,
>>>>>>> with peaks up to about 12,000.
>>>>>>>
>>>>>>> Cheers, Pete
>>>>>>>
>>>>>>>> On Friday, 1 May 2015 08:20:35 UTC+10, Pete GS wrote:
>>>>>>>> Does anyone have any thoughts on this?
>>>>>>>>
>>>>>>>> Even if someone could identify some scenarios that would cause high
>>>>>>>> CPU on Graylog servers, and the circumstances in which Graylog would
>>>>>>>> have trouble contacting the MongoDB servers.
>>>>>>>>
>>>>>>>> Cheers, Pete
>>>>>>>>
>>>>>>>>> On Wednesday, 29 April 2015 10:34:28 UTC+10, Pete GS wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> We acquired a company a while ago, and last week we added all of
>>>>>>>>> their logs to our Graylog environment; they all come in from their
>>>>>>>>> Syslog server via UDP.
>>>>>>>>>
>>>>>>>>> After this, I noticed that the Graylog servers were maxing out
>>>>>>>>> their CPUs, so to alleviate this I increased CPU resources on the
>>>>>>>>> existing servers and added two new servers.
>>>>>>>>>
>>>>>>>>> I'm still seeing generally high CPU usage with peaks of 100% on all
>>>>>>>>> four of the Graylog servers, but now they also seem to have trouble
>>>>>>>>> connecting to MongoDB.
>>>>>>>>>
>>>>>>>>> I see lots of "[NodePingThread] Did not find meta info of this
>>>>>>>>> node. Re-registering." streaming through the log files, but it only
>>>>>>>>> seems to happen when I have more than two Graylog servers running.
>>>>>>>>>
>>>>>>>>> I have verified that NTP is installed and configured and that all
>>>>>>>>> servers, including the MongoDB and ElasticSearch servers, are
>>>>>>>>> syncing with the same NTP servers.
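>>>>>>>>>
>>>>>>>>> (Checked with ntpq -p on each box; every server lists the same
>>>>>>>>> upstream peers, with the selected source marked by an asterisk.)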
>>>>>>>>>
>>>>>>>>> We're doing less than 10,000 messages per second, so with the
>>>>>>>>> resources I've allocated I would have expected no issues
>>>>>>>>> whatsoever.
>>>>>>>>>
>>>>>>>>> I have seen this thread:
>>>>>>>>> https://groups.google.com/forum/?hl=en#!topic/graylog2/bW2glCdBIUI
>>>>>>>>> but I don't believe it is our issue.
>>>>>>>>>
>>>>>>>>> If it truly were being caused by lots of reverse DNS lookups, I
>>>>>>>>> would expect tcpdump to show that traffic to our DNS servers, but I
>>>>>>>>> see almost no DNS lookups at all.
>>>>>>>>>
>>>>>>>>> We have six inputs in total, but only one receives the bulk of the
>>>>>>>>> Syslog UDP messages. Most of the other inputs are GELF UDP inputs.
>>>>>>>>>
>>>>>>>>> We also have 11 streams; however, pausing these streams seems to
>>>>>>>>> have little to no impact on the CPU usage.
>>>>>>>>>
>>>>>>>>> All the Graylog servers are virtualised on top of vSphere 5.5
>>>>>>>>> Update 2 with plenty of physical hardware available to service the
>>>>>>>>> workload (little to no contention).
>>>>>>>>>
>>>>>>>>> The original two have 20 vCPUs and 32GB RAM; the additional two
>>>>>>>>> have 16 vCPUs and 32GB RAM.
>>>>>>>>>
>>>>>>>>> The Java heap on all of them is set to 16GB.
>>>>>>>>>
>>>>>>>>> This is all running on CentOS 6.
>>>>>>>>>
>>>>>>>>> Any input would be greatly appreciated, as I'm a bit stumped on how
>>>>>>>>> to get this resolved at present.
>>>>>>>>>
>>>>>>>>> Here is the config file I'm using (censored where appropriate):
>>>>>>>>>
>>>>>>>>> is_master = false
>>>>>>>>> node_id_file = /etc/graylog2/server/node-id
>>>>>>>>> password_secret = <Censored>
>>>>>>>>> root_username = <Censored>
>>>>>>>>> root_password_sha2 = <Censored>
>>>>>>>>> plugin_dir = /usr/share/graylog2-server/plugin
>>>>>>>>> rest_listen_uri = http://172.22.20.66:12900/
>>>>>>>>>
>>>>>>>>> elasticsearch_max_docs_per_index = 20000000
>>>>>>>>> elasticsearch_max_number_of_indices = 999
>>>>>>>>> retention_strategy = close
>>>>>>>>> elasticsearch_shards = 4
>>>>>>>>> elasticsearch_replicas = 1
>>>>>>>>> elasticsearch_index_prefix = graylog2
>>>>>>>>> allow_leading_wildcard_searches = true
>>>>>>>>> allow_highlighting = true
>>>>>>>>> elasticsearch_cluster_name = graylog2
>>>>>>>>> elasticsearch_node_name = bne3-0002las
>>>>>>>>> elasticsearch_node_master = false
>>>>>>>>> elasticsearch_node_data = false
>>>>>>>>> elasticsearch_discovery_zen_ping_multicast_enabled = false
>>>>>>>>> elasticsearch_discovery_zen_ping_unicast_hosts = bne3-0001lai.server-web.com:9300,bne3-0002lai.server-web.com:9300,bne3-0003lai.server-web.com:9300,bne3-0004lai.server-web.com:9300,bne3-0005lai.server-web.com:9300,bne3-0006lai.server-web.com:9300,bne3-0007lai.server-web.com:9300,bne3-0008lai.server-web.com:9300,bne3-0009lai.server-web.com:9300
>>>>>>>>> elasticsearch_cluster_discovery_timeout = 5000
>>>>>>>>> elasticsearch_discovery_initial_state_timeout = 3s
>>>>>>>>> elasticsearch_analyzer = standard
>>>>>>>>>
>>>>>>>>> output_batch_size = 5000
>>>>>>>>> output_flush_interval = 1
>>>>>>>>> processbuffer_processors = 20
>>>>>>>>> outputbuffer_processors = 5
>>>>>>>>> #outputbuffer_processor_keep_alive_time = 5000
>>>>>>>>> #outputbuffer_processor_threads_core_pool_size = 3
>>>>>>>>> #outputbuffer_processor_threads_max_pool_size = 30
>>>>>>>>> #udp_recvbuffer_sizes = 1048576
>>>>>>>>> processor_wait_strategy = blocking
>>>>>>>>> ring_size = 65536
>>>>>>>>>
>>>>>>>>> inputbuffer_ring_size = 65536
>>>>>>>>> inputbuffer_processors = 2
>>>>>>>>> inputbuffer_wait_strategy = blocking
>>>>>>>>>
>>>>>>>>> message_journal_enabled = true
>>>>>>>>> message_journal_dir = /var/lib/graylog-server/journal
>>>>>>>>> message_journal_max_age = 24h
>>>>>>>>> message_journal_max_size = 150gb
>>>>>>>>> message_journal_flush_age = 1m
>>>>>>>>> message_journal_flush_interval = 1000000
>>>>>>>>> message_journal_segment_age = 1h
>>>>>>>>> message_journal_segment_size = 1gb
>>>>>>>>>
>>>>>>>>> dead_letters_enabled = false
>>>>>>>>> lb_recognition_period_seconds = 3
>>>>>>>>>
>>>>>>>>> mongodb_useauth = true
>>>>>>>>> mongodb_user = <Censored>
>>>>>>>>> mongodb_password = <Censored>
>>>>>>>>> mongodb_replica_set = bne3-0001ladb.server-web.com:27017,bne3-0002ladb.server-web.com:27017
>>>>>>>>> mongodb_database = graylog2
>>>>>>>>> mongodb_max_connections = 200
>>>>>>>>> mongodb_threads_allowed_to_block_multiplier = 5
>>>>>>>>>
>>>>>>>>> #rules_file = /etc/graylog2.drl
>>>>>>>>>
>>>>>>>>> # Email transport
>>>>>>>>> transport_email_enabled = true
>>>>>>>>> transport_email_hostname = <Censored>
>>>>>>>>> transport_email_port = 25
>>>>>>>>> transport_email_use_auth = false
>>>>>>>>> transport_email_use_tls = false
>>>>>>>>> transport_email_use_ssl = false
>>>>>>>>> transport_email_auth_username = [email protected]
>>>>>>>>> transport_email_auth_password = secret
>>>>>>>>> transport_email_subject_prefix = [graylog2]
>>>>>>>>> transport_email_from_email = <Censored>
>>>>>>>>> transport_email_web_interface_url = <Censored>
>>>>>>>>>
>>>>>>>>> message_cache_off_heap = false
>>>>>>>>> message_cache_spool_dir = /var/lib/graylog2-server/message-cache-spool
>>>>>>>>> #message_cache_commit_interval = 1000
>>>>>>>>> #input_cache_max_size = 0
>>>>>>>>>
>>>>>>>>> #ldap_connection_timeout = 2000
>>>>>>>>>
>>>>>>>>> versionchecks = false
>>>>>>>>>
>>>>>>>>> #enable_metrics_collection = false
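>>>>>>>>>
>>>>>>>>> (For context on the sizing above: processbuffer_processors 20 +
>>>>>>>>> outputbuffer_processors 5 + inputbuffer_processors 2 = 27 dedicated
>>>>>>>>> buffer threads, which is already more than the 20 vCPUs on our
>>>>>>>>> largest nodes before any input or JVM housekeeping threads are
>>>>>>>>> counted.)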
