Glad to hear that! In a virtualized environment the hardest fact to grasp is: "a vCPU is NOT a physical CPU!". Most of the time we need to reduce the number of vCPUs because of a stuck core (kernel panic) or very slow behaviour.
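When more vCPUs are configured than the host can co-schedule, the hypervisor accumulates co-stop time, which esxtop exposes as the %CSTP column. The sketch below flags VMs with sustained co-stop from sampled readings; the ~3% rule-of-thumb threshold and the sample figures are my own assumptions for illustration, not numbers from this thread:

```python
# Sketch: flag VMs whose average co-stop (%CSTP, e.g. from esxtop batch
# output) suggests they have more vCPUs than the host can co-schedule.
# The 3.0% threshold is a common rule of thumb, not an official limit.

def costop_suspects(samples, threshold=3.0):
    """samples maps VM name -> list of %CSTP readings; returns suspects."""
    return sorted(
        vm for vm, readings in samples.items()
        if sum(readings) / len(readings) > threshold
    )

# Hypothetical readings for three VMs:
samples = {
    "graylog-01": [0.4, 0.9, 0.7],   # healthy
    "graylog-02": [5.2, 6.8, 4.9],   # sustained co-stop: too many vCPUs?
    "es-01":      [1.1, 2.0, 1.5],
}
print(costop_suspects(samples))  # → ['graylog-02']
```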
That is why the co-stop metric needs to be monitored next to load average.

Mathieu

> On 24 May 2015, at 23:59, Pete GS <[email protected]> wrote:
>
> Good to know Arie!
>
> And thanks again to Mathieu.
>
> I have lots of experience with the virtualisation layer (ESX from 2.5 onwards) but almost none with SMP workloads in a guest above about 8 vCPUs, so I had never observed this behaviour before. Now I know a bit more, and I like to keep learning!
>
> I've come in after not looking at or touching Graylog over the weekend and there is not a single instance of any of the Graylog servers being unable to contact MongoDB, so it looks like this is fixed.
>
> The Elasticsearch cluster is green, none of the Graylog input servers have a queue of unprocessed messages in the journal, and all are online and working well.
>
> Fingers crossed I can get the new Elasticsearch hardware purchase through, which will solve my remaining problems and let the platform be 100% stable.
>
> Cheers, Pete
>
> On Saturday, 23 May 2015 07:49:24 UTC+10, Arie wrote:
>
> Same problem here with too many CPUs (not on the Graylog application).
>
> What happens is that code swaps continuously between cores; in our case it helps to bind the application to a core, but managing that is a chore. The virtual layer loses a lot of resources in constantly juggling work across the cores. With 2 cores it can already be up to 10%!
>
> We run a lot of real-time applications and the customer wants everything in the cloud. In our experience, "cloud" delivers most of our problems/glitches. I love having some older iron to run Graylog and Elasticsearch on.
>
> On Friday, 22 May 2015 06:08:44 UTC+2, Pete GS wrote:
>
> Ok, here's where I'm at with this...
>
> I tried implementing the kernel options on one of the Graylog servers as a test, but it made no appreciable difference. In fact, shortly after the first reboot the VM froze with a locked CPU error.
> It hasn't done that since a subsequent reboot, though. We're not running the PVSCSI adapter either.
>
> After observing this, I revisited Mathieu's comment regarding too many CPUs.
>
> While I still see no contention issues for CPU resources, I started wondering if there was some SMP-related issue with CentOS where the extra vCPUs just weren't providing enough extra capacity for the workload.
>
> I scaled all the nodes back to 8 vCPUs and added another four Graylog servers, so I now have 8 servers receiving the inputs.
>
> So far this is running a lot better than the four servers with 16 and 20 vCPUs. They still peak at 100%, but it is not sustained, even after an Elasticsearch issue (filling the disks again) caused a backlog in the message journal overnight.
>
> Almost all the message backlog in the journals has been processed again and it's still working well so far, after 24 hours or so.
>
> I'll see how it runs over the weekend.
>
> Incidentally, it seems I have inadvertently stumbled across a good number for the process buffer processors: it seems to work well at 2 less than the number of CPUs available to the server. Running with a buffer number of 6 on 8 vCPUs works well. Of course, I'm not sure if this is just my particular environment or a general thing.
>
> Cheers, Pete
>
> On Thursday, 14 May 2015 19:13:24 UTC+10, Pete GS wrote:
>
> Thanks very much Arie, I will check these tomorrow and report back.
>
> One thing I can confirm is that the heap size is configured correctly.
>
> Cheers, Pete
>
> On 14 May 2015, at 05:35, Arie <[email protected]> wrote:
>
> Let's try some more options.
>
> I see you are running your stuff virtualized.
> Then you can consider the following for CentOS 6.
>
> In your kernel boot config (/etc/grub.conf) you can add the following options:
>
> nohz=off (for CPU-intensive systems)
> elevator=noop (disk scheduling is done by the virtual layer, so disable it in the guest)
> cgroup_disable=memory (possibly not used; it frees up some memory and allocation overhead)
>
> If you use the pvscsi device, add the following:
> vmw_pvscsi.cmd_per_lun=254
> vmw_pvscsi.ring_pages=32
>
> Check disk buffers on the virtual layer too: VMware KB 2053145, see
> http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=2053145&sliceId=1&docTypeID=DT_KB_1_1&dialogID=621755330&stateId=1%200%20593866502
>
> Optimize your disks for performance (up to 30%! yes):
>
> For the filesystems where Graylog and/or Elasticsearch are located, add the following to /etc/fstab.
>
> Example:
> /dev/mapper/vg_nagios-lv_root / ext4 defaults,noatime,nobarrier,data=writeback 1 1
> And if you want to be safer:
> /dev/mapper/vg_nagios-lv_root / ext4 defaults,noatime,nobarrier 1 1
>
> Is ES_HEAP_SIZE configured in the correct place? (I did that wrong at first.) It is in /etc/sysconfig/elasticsearch.
>
> All these options together can improve system performance hugely, especially when the systems are virtual.
>
> PS: did you correctly cha
> ...
>
> --
> You received this message because you are subscribed to the Google Groups "graylog2" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
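Pete's sizing rule from earlier in the thread (process buffer processors = vCPU count minus 2) can be expressed as a small helper. The setting name `processbuffer_processors` comes from Graylog's server.conf; the helper itself is my own illustration, not something posted in the thread:

```python
# Sketch of the rule of thumb: leave two vCPUs free for the OS, GC, and
# the other Graylog buffers, and give the rest to the process buffer.

def processbuffer_processors(vcpus):
    """Suggested value for 'processbuffer_processors' in server.conf."""
    return max(1, vcpus - 2)

for vcpus in (4, 8, 16):
    print(f"{vcpus} vCPUs -> processbuffer_processors = {processbuffer_processors(vcpus)}")
# 8 vCPUs gives 6, matching the value Pete settled on.
```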
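Arie's fstab example can be sanity-checked with a few lines that parse the mount-options field; the parsing helper and the check are my own sketch (the mount options themselves come from his post, and note that nobarrier and data=writeback trade crash safety for speed, as his "safer" variant implies):

```python
# Check whether an fstab entry carries the suggested performance options.
# The fourth whitespace-separated field of an fstab line is the
# comma-separated mount-option list.

def mount_options(fstab_line):
    return set(fstab_line.split()[3].split(","))

line = "/dev/mapper/vg_nagios-lv_root / ext4 defaults,noatime,nobarrier,data=writeback 1 1"
opts = mount_options(line)
print("noatime" in opts and "nobarrier" in opts)  # → True
print("data=writeback" in opts)                   # → True
```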
