Let's try some more options.

I see you are running your stuff virtualized. In that case you can consider 
the following for CentOS 6.

In your startup kernel config (/etc/grub.conf) you can add the following 
options:

  nohz=off (for highly CPU-intensive systems)
  elevator=noop (disk scheduling is done by the virtual layer, so disable 
it in the guest)
  cgroup_disable=memory (if you don't use memory cgroups; disabling them 
frees up some memory and accounting overhead)
  
If you use the pvscsi device, also add:
  vmw_pvscsi.cmd_per_lun=254
  vmw_pvscsi.ring_pages=32
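Put together, the kernel line in /etc/grub.conf then looks something like 
this (the kernel version and root device below are placeholders; keep your 
own existing parameters and just append the new ones):

```
kernel /vmlinuz-2.6.32-504.el6.x86_64 ro root=/dev/mapper/vg_root-lv_root quiet nohz=off elevator=noop cgroup_disable=memory vmw_pvscsi.cmd_per_lun=254 vmw_pvscsi.ring_pages=32
```

A reboot is needed for these to take effect; afterwards you can verify with 
cat /proc/cmdline.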

Check disk buffers on the virtual layer too; see VMware KB 2053145:
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=2053145&sliceId=1&docTypeID=DT_KB_1_1&dialogID=621755330&stateId=1%200%20593866502

Optimize your disks for performance (gains of up to 30% are possible):

For the filesystems where graylog and/or elasticsearch data lives, add the 
following options in /etc/fstab.

Example:
/dev/mapper/vg_nagios-lv_root /  ext4 defaults,noatime,nobarrier,data=writeback 1 1

And if you want to be safer (data=writeback relaxes journaling guarantees):
/dev/mapper/vg_nagios-lv_root /  ext4 defaults,noatime,nobarrier 1 1
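A quick way to double-check an fstab line before rebooting is to split out 
its fields; this is just a sketch using the example line from above (field 3 
is the filesystem type, field 4 the mount options):

```shell
# Split an fstab line into its fields to verify what will be mounted how.
line='/dev/mapper/vg_nagios-lv_root /  ext4 defaults,noatime,nobarrier 1 1'
fstype=$(echo "$line" | awk '{print $3}')
opts=$(echo "$line" | awk '{print $4}')
echo "type=$fstype opts=$opts"

# To apply changed options without a reboot (as root):
#   mount -o remount /
```

Note that not every mount option can be changed on a live remount, so a 
reboot is the safe way to be sure everything is applied.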

Is ES_HEAP_SIZE configured in the correct place? (I got that wrong at first.)
It belongs in /etc/sysconfig/elasticsearch.
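For reference, the line there is simply (16g just matches the heap size 
mentioned elsewhere in this thread; the usual guideline is about half the 
machine's RAM, and not more than ~31GB so compressed object pointers stay 
enabled):

```
# /etc/sysconfig/elasticsearch
ES_HEAP_SIZE=16g
```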


Together, these options can improve system performance considerably, 
especially on virtualized systems.

PS: did you correctly raise your file descriptor limits?

/etc/sysctl.conf

fs.file-max = 65536

 /etc/security/limits.conf

*          soft     nproc       65535
*          hard     nproc       65535
*          soft     nofile      65535
*          hard     nofile      65535

 

 /etc/security/limits.d/90-nproc.conf

*          soft     nproc       65535
*          hard     nproc       65535
*          soft     nofile      65535
*          hard     nofile      65535
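These limits only apply to sessions started after the change; to check they 
actually took effect, log in again and run something like:

```
sysctl fs.file-max   # kernel-wide cap, should report 65536
ulimit -Sn           # soft nofile for this session, should report 65535
ulimit -Hn           # hard nofile
ulimit -u            # nproc
```

(sysctl -p reloads /etc/sysctl.conf immediately; the limits.conf values need 
a fresh login.)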

Check filesystem I/O with iotop -a to see how it is doing.

HTH,

Arie


On Tuesday 12 May 2015 at 23:52:19 UTC+2, Pete GS wrote:
>
> No further input on this?
>
> The Graylog master node now seems to regularly drop out also with the "Did 
> not find meta info of this node. Re-registering." message and it is under 
> no load as our load balancer doesn't direct any input messages to it.
>
> Cheers, Pete
>
> On Thursday, 7 May 2015 07:44:41 UTC+10, Pete GS wrote:
>>
>> I've come back to the office this morning and discovered we had an 
>> ElasticSearch issue last night which has resulted in lots of unprocessed 
>> messages in the journal.
>>
>> All the Graylog nodes are busy processing these and it seems to be slowly 
>> crunching through them.
>>
>> Load average (using htop) varies across the four nodes but I'm seeing a 
>> minimum of 13.59 11.80 and a maximum of 24.81 24.64.
>>
>> Interestingly enough the process buffer is only full on one of the nodes, 
>> the other three appear to be 10% full or less.
>>
>> The output buffers are all empty.
>>
>> The issue with ElasticSearch was running out of disk space which I've 
>> resolved for the moment but my business case for new hardware should solve 
>> that permanently.
>>
>> What other info can I give you guys to help me look in the right 
>> direction?
>>
>> Cheers, Pete
>>
>> On Wednesday, 6 May 2015 07:33:31 UTC+10, Pete GS wrote:
>>>
>>> Thanks for the replies guys. I'm away from the office today but will 
>>> check these things tomorrow.
>>>
>>> Mathieu, I will check the load average but from memory the 5 minute 
>>> average was around 12 or 18. I will confirm this tomorrow though.
>>>
>>> As for the "co stop" metric, I haven't used esxtop on these hosts but I 
>>> have looked at the CPU Ready metric and it seems to be ok (sub 5% 
>>> sustained). One of the physical hosts has exactly the same number of CPUs 
>>> allocated as the VMs running on it, but the other two physical hosts have 
>>> no over-subscription of CPUs at all. There is no memory over-subscription 
>>> on any hosts either.
>>>
>>> For the moment I have simply increased the CPU's on the existing nodes 
>>> as well as adding the two new ones. I am putting together a business case 
>>> for new hardware for the ElasticSearch cluster and if this goes ahead I 
>>> will move to a model of more Graylog nodes with less CPU's and memory for 
>>> each node as I think that will scale better.
>>>
>>> Arie, I will increase the output buffer processors tomorrow to see what 
>>> happens, but I do know that the process buffer gets quite full at times 
>>> while the output buffer is usually almost empty.
>>>
>>> On Wed, May 6, 2015 at 3:05 AM, Mathieu Grzybek <[email protected]> wrote:
>>>
>>>> Also check « co stop » metric on VMware. I am sure you have too many 
>>>> vCPUs.
>>>>
>>>> On 5 May 2015 at 16:21, Arie <[email protected]> wrote:
>>>>
>>>> What happens when you raise "outputbuffer_processors = 5" to 
>>>> "outputbuffer_processors = 10" ?
>>>>
>>>> On Tuesday 5 May 2015 at 02:23:37 UTC+2, Pete GS wrote:
>>>>>
>>>>> Yesterday I did a yum update on all Graylog and MongoDB nodes and 
>>>>> since doing that and rebooting them all (there was a kernel update) it 
>>>>> seems that there are no longer issues connecting to the Mongo database.
>>>>>
>>>>> However, I'm still seeing excessively high CPU usage on the Graylog 
>>>>> nodes where all vCPU's are regularly exceeding 95%.
>>>>>
>>>>> What can contribute to this? I'm a little stumped at present.
>>>>>
>>>>> I would say our average messages/second is around 5,000 to 6,000 with 
>>>>> peaks up to about 12,000.
>>>>>
>>>>> Cheers, Pete
>>>>>
>>>>> On Friday, 1 May 2015 08:20:35 UTC+10, Pete GS wrote:
>>>>>>
>>>>>> Does anyone have any thoughts on this?
>>>>>>
>>>>>> Even if someone could identify some scenarios that would cause high 
>>>>>> CPU on Graylog servers and in what circumstances Graylog would have 
>>>>>> trouble 
>>>>>> contacting the MongoDB servers.
>>>>>>
>>>>>> Cheers, Pete
>>>>>>
>>>>>> On Wednesday, 29 April 2015 10:34:28 UTC+10, Pete GS wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> We acquired a company a while ago and last week we added all of 
>>>>>>> their logs to our Graylog environment which all come in from their 
>>>>>>> Syslog 
>>>>>>> server via UDP.
>>>>>>>
>>>>>>> After this, I noticed that the Graylog servers were maxing CPU so to 
>>>>>>> alleviate this I increased CPU resources to the existing servers and 
>>>>>>> added 
>>>>>>> two new servers.
>>>>>>>
>>>>>>> I'm still seeing generally high CPU usage with peaks of 100% on all 
>>>>>>> four of the Graylog servers but I now have issues where they also seem 
>>>>>>> to 
>>>>>>> have issues connecting to MongoDB.
>>>>>>>
>>>>>>> I see lots of "[NodePingThread] Did not find meta info of this node. 
>>>>>>> Re-registering." streaming through the log files but it only seems to 
>>>>>>> happen when I have more than two Graylog servers running.
>>>>>>>
>>>>>>> I have verified NTP is installed and configured and all servers 
>>>>>>> including the MongoDB and ElasticSearch servers are sync'ing with the 
>>>>>>> same 
>>>>>>> NTP servers.
>>>>>>>
>>>>>>> We're doing less than 10,000 messages per second so with the 
>>>>>>> resources I've allocated I would have expected no issues whatsoever.
>>>>>>>
>>>>>>> I have seen this link: 
>>>>>>> https://groups.google.com/forum/?hl=en#!topic/graylog2/bW2glCdBIUI but 
>>>>>>> I don't believe it is our issue.
>>>>>>>
>>>>>>> If it truly is being caused by doing lots of reverse DNS lookups, I 
>>>>>>> would expect tcpdump to show me that traffic to our DNS servers, but I 
>>>>>>> see 
>>>>>>> almost no DNS lookups at all.
>>>>>>>
>>>>>>> We have 6 inputs in total but only one receives the bulk of the 
>>>>>>> Syslog UDP messages. Most of the other inputs are GELF UDP inputs.
>>>>>>>
>>>>>>> We also have 11 streams, however pausing these streams seems to have 
>>>>>>> little to no impact on the CPU usage.
>>>>>>>
>>>>>>> All the Graylog servers are virtualised on top of vSphere 5.5 Update 
>>>>>>> 2 with plenty of physical hardware available to service the workload 
>>>>>>> (little to no contention).
>>>>>>>
>>>>>>> The original two have 20 vCPU's and 32GB RAM, the additional two 
>>>>>>> have 16 vCPU's and 32GB RAM.
>>>>>>>
>>>>>>> Java heap on all is set to 16GB.
>>>>>>>
>>>>>>> This is all running on CentOS 6.
>>>>>>>
>>>>>>> Any input would be greatly appreciated as I'm a bit stumped on how 
>>>>>>> to get this resolved at present.
>>>>>>>
>>>>>>> Here is the config file I'm using (censored where appropriate):
>>>>>>>
>>>>>>> is_master = false
>>>>>>> node_id_file = /etc/graylog2/server/node-id
>>>>>>> password_secret = <Censored>
>>>>>>> root_username = <Censored>
>>>>>>> root_password_sha2 = <Censored>
>>>>>>> plugin_dir = /usr/share/graylog2-server/plugin
>>>>>>> rest_listen_uri = http://172.22.20.66:12900/
>>>>>>>
>>>>>>> elasticsearch_max_docs_per_index = 20000000
>>>>>>> elasticsearch_max_number_of_indices = 999
>>>>>>> retention_strategy = close
>>>>>>> elasticsearch_shards = 4
>>>>>>> elasticsearch_replicas = 1
>>>>>>> elasticsearch_index_prefix = graylog2
>>>>>>> allow_leading_wildcard_searches = true
>>>>>>> allow_highlighting = true
>>>>>>> elasticsearch_cluster_name = graylog2
>>>>>>> elasticsearch_node_name = bne3-0002las
>>>>>>> elasticsearch_node_master = false
>>>>>>> elasticsearch_node_data = false
>>>>>>> elasticsearch_discovery_zen_ping_multicast_enabled = false
>>>>>>> elasticsearch_discovery_zen_ping_unicast_hosts = 
>>>>>>> bne3-0001lai.server-web.com:9300,bne3-0002lai.server-web.com:9300,
>>>>>>> bne3-0003lai.server-web.com:9300,bne3-0004lai.server-web.com:9300,
>>>>>>> bne3-0005lai.server-web.com:9300,bne3-0006lai.server-web.com:9300,
>>>>>>> bne3-0007lai.server-web.com:9300,bne3-0008lai.server-web.com:9300,
>>>>>>> bne3-0009lai.server-web.com:9300
>>>>>>> elasticsearch_cluster_discovery_timeout = 5000
>>>>>>> elasticsearch_discovery_initial_state_timeout = 3s
>>>>>>> elasticsearch_analyzer = standard
>>>>>>>
>>>>>>> output_batch_size = 5000
>>>>>>> output_flush_interval = 1
>>>>>>> processbuffer_processors = 20
>>>>>>> outputbuffer_processors = 5
>>>>>>> #outputbuffer_processor_keep_alive_time = 5000
>>>>>>> #outputbuffer_processor_threads_core_pool_size = 3
>>>>>>> #outputbuffer_processor_threads_max_pool_size = 30
>>>>>>> #udp_recvbuffer_sizes = 1048576
>>>>>>> processor_wait_strategy = blocking
>>>>>>> ring_size = 65536
>>>>>>>
>>>>>>> inputbuffer_ring_size = 65536
>>>>>>> inputbuffer_processors = 2
>>>>>>> inputbuffer_wait_strategy = blocking
>>>>>>>
>>>>>>> message_journal_enabled = true
>>>>>>> message_journal_dir = /var/lib/graylog-server/journal
>>>>>>> message_journal_max_age = 24h
>>>>>>> message_journal_max_size = 150gb
>>>>>>> message_journal_flush_age = 1m
>>>>>>> message_journal_flush_interval = 1000000
>>>>>>> message_journal_segment_age = 1h
>>>>>>> message_journal_segment_size = 1gb
>>>>>>>
>>>>>>> dead_letters_enabled = false
>>>>>>> lb_recognition_period_seconds = 3
>>>>>>>
>>>>>>> mongodb_useauth = true
>>>>>>> mongodb_user = <Censored>
>>>>>>> mongodb_password = <Censored>
>>>>>>> mongodb_replica_set = bne3-0001ladb.server-web.com:27017,
>>>>>>> bne3-0002ladb.server-web.com:27017
>>>>>>> mongodb_database = graylog2
>>>>>>> mongodb_max_connections = 200
>>>>>>> mongodb_threads_allowed_to_block_multiplier = 5
>>>>>>>
>>>>>>> #rules_file = /etc/graylog2.drl
>>>>>>>
>>>>>>> # Email transport
>>>>>>> transport_email_enabled = true
>>>>>>> transport_email_hostname = <Censored>
>>>>>>> transport_email_port = 25
>>>>>>> transport_email_use_auth = false
>>>>>>> transport_email_use_tls = false
>>>>>>> transport_email_use_ssl = false
>>>>>>> transport_email_auth_username = [email protected]
>>>>>>> transport_email_auth_password = secret
>>>>>>> transport_email_subject_prefix = [graylog2]
>>>>>>> transport_email_from_email = <Censored>
>>>>>>> transport_email_web_interface_url = <Censored>
>>>>>>>
>>>>>>> message_cache_off_heap = false
>>>>>>> message_cache_spool_dir = 
>>>>>>> /var/lib/graylog2-server/message-cache-spool
>>>>>>> #message_cache_commit_interval = 1000
>>>>>>> #input_cache_max_size = 0
>>>>>>>
>>>>>>> #ldap_connection_timeout = 2000
>>>>>>>
>>>>>>> versionchecks = false
>>>>>>>
>>>>>>> #enable_metrics_collection = false
>>>>>>>
>>>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "graylog2" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>>
>>>
>>>
