Yes that's very true! A vCPU has definitely gotten closer to a physical CPU but there will always be overheads due to the hypervisor.
I've never encountered this many vCPUs, and therefore these issues, before, so now I have another metric to add to my toolbox for future reference. Thanks again Mathieu!

Cheers, Pete

On Tuesday, 26 May 2015 04:59:50 UTC+10, Mathieu Grzybek wrote:
>
> Glad to hear that!
>
> In a virtualized environment the most difficult fact to understand is:
> « a vCPU is NOT a physical CPU! ». Most of the time, we need to reduce the
> number of vCPUs because of a stuck core (kernel panic) or very slow
> behaviour.
>
> That is why the co-stop metric needs to be monitored next to load average.
>
> Mathieu
>
> On 24 May 2015 at 23:59, Pete GS <[email protected]> wrote:
>
> Good to know, Arie!
>
> And thanks again to Mathieu.
>
> I have lots of experience with the virtualisation layer (ESX from 2.5
> onwards) but almost none with SMP workloads in a guest above about 8
> vCPUs, so I had never observed this behaviour before. Now I know a bit
> more, and I like to keep learning!
>
> I've come in after not looking at or touching Graylog over the weekend and
> there is not a single instance of any of the Graylog servers being unable
> to contact MongoDB, so it looks like this is fixed.
>
> The Elasticsearch cluster is green, none of the Graylog input servers have
> a queue of unprocessed messages in the journal, and all are online and
> working well.
>
> Fingers crossed I can get the new Elasticsearch hardware purchase through,
> which will solve my remaining problems and let the platform be 100% stable.
>
> Cheers, Pete
>
> On Saturday, 23 May 2015 07:49:24 UTC+10, Arie wrote:
>>
>> Same problem here with too many CPUs (not with the Graylog application).
>>
>> What happens is that code swaps continuously between cores; in our case
>> it helps to bind the application to a core, but managing that is a chore.
>> The virtual layer loses a lot of resources by constantly managing
>> resources across the cores. With 2 cores it can already be up to 10%!
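For readers wanting to try Arie's core-binding workaround: on Linux this can be done with `taskset` from util-linux. A minimal sketch, using the current shell as a stand-in for the real process (core numbers and the `pgrep` pattern are illustrative only):

```shell
# Show this shell's current CPU affinity,
# e.g. "pid 1234's current affinity list: 0-7"
taskset -cp $$

# Restrict this shell to core 0 (illustrative; for a service you would
# pin its PID instead, e.g. taskset -cp 0-5 "$(pgrep -f graylog-server)")
taskset -cp 0 $$

# Confirm: the affinity list now reads "0"
taskset -cp $$
```

Note the pinning is not inherited across restarts, which is part of why Arie found it a chore to manage.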
>>
>> We are running a lot of real-time applications and the customer wants
>> everything in the cloud. In our experience, 'cloud' is the source of most
>> of our problems/glitches. I love having some older iron to run Graylog
>> and Elasticsearch on.
>>
>> On Friday, 22 May 2015 06:08:44 UTC+2, Pete GS wrote:
>>>
>>> Ok, here's where I'm at with this...
>>>
>>> I tried implementing the kernel options on one of the Graylog servers
>>> as a test, but it made no appreciable difference. In fact, shortly after
>>> the first reboot the VM froze with a locked-CPU error. It hasn't done
>>> that since a subsequent reboot, though. We're not running the PVSCSI
>>> adapter either.
>>>
>>> After observing this, I revisited Mathieu's comment regarding too many
>>> CPUs.
>>>
>>> While I still see no contention for CPU resources, I started wondering
>>> whether there was some SMP-related issue in CentOS where the extra vCPUs
>>> just weren't providing enough extra capacity for the workload.
>>>
>>> I scaled all the nodes back to 8 vCPUs and added another four Graylog
>>> servers, so I now have 8 servers receiving the inputs.
>>>
>>> So far this is running a lot better than the four servers with 16 and
>>> 20 vCPUs. They still peak at 100%, but this is not sustained, even after
>>> an Elasticsearch issue (filling the disks again) that caused a backlog
>>> in the message journal overnight.
>>>
>>> Almost all of the message backlog in the journals has been processed
>>> again and it's still working well so far, after 24 hours or so.
>>>
>>> I'll see how it runs over the weekend.
>>>
>>> Incidentally, it seems I have inadvertently stumbled across a good
>>> number for the process buffer processors: two less than the number of
>>> vCPUs available to the server seems to work well, i.e. 6 buffer
>>> processors with 8 vCPUs. Of course, I'm not sure if this is just my
>>> particular environment or a general thing.
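For reference, the knob Pete is tuning here lives in Graylog's server configuration file. A sketch of his "vCPUs minus two" observation on an 8-vCPU node (the file path varies by install and Graylog version, and the output-buffer value is purely illustrative):

```
# /etc/graylog/server/server.conf (path may differ on your install)
processbuffer_processors = 6   # 8 vCPUs - 2, per Pete's observation
outputbuffer_processors = 3    # illustrative; size to your own output load
```

As Pete says, treat this as a starting point for your own environment rather than a general rule.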
>>>
>>> Cheers, Pete
>>>
>>> On Thursday, 14 May 2015 19:13:24 UTC+10, Pete GS wrote:
>>>>
>>>> Thanks very much Arie, I will check these tomorrow and report back.
>>>>
>>>> One thing I can confirm is that the heap size is configured correctly.
>>>>
>>>> Cheers, Pete
>>>>
>>>> On 14 May 2015, at 05:35, Arie <[email protected]> wrote:
>>>>
>>>> Let's try some more options.
>>>>
>>>> I see you are running your stuff virtualized, so you can consider the
>>>> following for CentOS 6.
>>>>
>>>> In your kernel boot config (/etc/grub.conf) you can add the following
>>>> options:
>>>>
>>>> nohz=off (for CPU-intensive systems)
>>>> elevator=noop (disk scheduling is done by the virtual layer, so
>>>> disable it in the guest)
>>>> cgroup_disable=memory (possibly not used; it frees up some memory and
>>>> allocation overhead)
>>>>
>>>> If you use the PVSCSI device, add the following:
>>>> vmw_pvscsi.cmd_per_lun=254
>>>> vmw_pvscsi.ring_pages=32
>>>>
>>>> Check disk buffers on the virtual layer too: see VMware KB 2053145,
>>>> http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=2053145&sliceId=1&docTypeID=DT_KB_1_1&dialogID=621755330&stateId=1%200%20593866502
>>>>
>>>> Optimize your disks for performance (up to 30%! yes):
>>>>
>>>> For the filesystems where Graylog and/or Elasticsearch are located,
>>>> add the following to /etc/fstab.
>>>>
>>>> Example:
>>>> /dev/mapper/vg_nagios-lv_root / ext4 defaults,noatime,nobarrier,data=writeback 1 1
>>>> and if you want to be safer:
>>>> /dev/mapper/vg_nagios-lv_root / ext4 defaults,noatime,nobarrier 1 1
>>>>
>>>> Is ES_HEAP_SIZE configured in the correct place? (I did that wrong at
>>>> first.) It is in /etc/sysconfig/elasticsearch.
>>>>
>>>> All these options together can improve system performance hugely,
>>>> especially when the systems are virtual.
>>>>
>>>> ps did you correctly cha
>>>>
>>>> ...
>>>
>>>
> --
> You received this message because you are subscribed to the Google Groups
> "graylog2" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
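For anyone landing on this thread later, Arie's tuning suggestions from up-thread collect into three files on CentOS 6, roughly as sketched below. Device names, the mount point, and the heap size are illustrative; note that `nobarrier` and `data=writeback` trade crash safety for speed, so test before using them in production:

```
# /etc/grub.conf -- append to the existing kernel line (shown as a comment
# here because the full line is host-specific):
#   kernel /vmlinuz-2.6.32-... ro root=... nohz=off elevator=noop cgroup_disable=memory
# and, only if the guest uses the PVSCSI adapter:
#   ... vmw_pvscsi.cmd_per_lun=254 vmw_pvscsi.ring_pages=32

# /etc/fstab -- data filesystem for Graylog/Elasticsearch (illustrative):
/dev/mapper/vg_data-lv_es  /var/lib/elasticsearch  ext4  defaults,noatime,nobarrier  1 2

# /etc/sysconfig/elasticsearch -- heap goes here on CentOS, not in
# elasticsearch.yml (the mistake Arie mentions making at first):
ES_HEAP_SIZE=16g   # illustrative; common advice is ~half of RAM, below 32g
```

A reboot is needed for the kernel options, and a remount (or reboot) for the fstab changes.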
