Hi Nick,

On the VM side we monitor page usage and page I/O. We monitor our guests from VM for queue state and storage usage (main, xstor and swap). We also monitor guest CPU usage and metrics like the limit list.

Linux memory is always at 100%, so there is no sense in monitoring it over there, but we do monitor swap usage. Linux CPU gives bad numbers to start with (yes, even on current kernel levels they are still wrong), so don't monitor CPU on the guests.

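A minimal sketch of the swap check on a Linux guest, assuming the figures are read from /proc/meminfo (the SwapTotal and SwapFree fields). The function name is made up for illustration and this is not necessarily how we collect it ourselves:

```python
def swap_used_percent(meminfo_text):
    """Return swap usage in percent from /proc/meminfo-style text.

    Returns None when no swap is configured (SwapTotal is 0 or missing).
    """
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value kB" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # /proc/meminfo reports kB
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return None
    free = fields.get("SwapFree", 0)
    return 100.0 * (total - free) / total


# Usage on a guest (reads the live values):
#   with open("/proc/meminfo") as f:
#       print(swap_used_percent(f.read()))
```

Memory "used" is deliberately left out, since Linux will fill free memory with buffers and cache anyway.
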
Actually, 100% CPU is not a bad thing at all. Where most operating systems become less responsive above 90%, z/VM will still give you good response even at high utilization. We like to have it above 90%. Obviously you need some capacity for new guests, so when you are running at 100% CPU all the time there can be a case for an additional IFL. But also look at the guests: determine whether they are running processes you don't need or that hurt overall performance. Watch your Linux guests on response times and batch run times. Set a good relative share, and if that doesn't help you could consider adding IFLs.

Keep VM paging (page space usage) below 50%; add paging DASD when needed. We have a VM that is overcommitted at 9:1. Our production Linux VM is at 2:1 with room to spare. Expect even high page I/O rates; thousands of I/Os per second don't have to be bad.

Keep an eye on guests that are competing for storage. Loading users and E-lists in particular can point to a resource problem. Try to fix it on the guest first (eliminate processes, reduce memory sizes, etc.).

Make sure the guests don't stay in Q3; it will hurt other servers. So eliminate unused processes and don't use pings or other keep-alive tooling. Be aware that most regular Linux tooling keeps the guest active. Obviously when you are running batch the guest will stay in Q3, but then it is in there for a reason.

Some of these issues are also covered on the linux-390 list (http://www2.marist.edu/htbin/wlvindex?LINUX-390). Take a look over there as well.

Regards,
Berry.

On 02-03-11 23:28, Nick Warren wrote:
> Hi Tony, Thanks for the response.
>
> I probably didn't ask the question(s) very well. I'm working with a customer
> that has no capacity plan regarding the use of z/VM as a linux host. We're
> seeing both CPU and Memory usage on the z/VM side increasing. Performance on
> the linux guests is acceptable at this time.
>
> Aside from waiting for the linux users to start complaining - what metrics
> and thresholds should I be tracking as early predictors of capacity problems?
>
> Obviously if CPU usage is constantly 100% that's probably not good. I'm
> currently watching CPU, IOWait and Stolen time but wonder if those are
> sufficient. Any suggestion as to what a good maximum number is?
>
> Memory is a larger concern - In a previous life as a mvs sysprog I would
> watch paging/swapping and delay times among others. Are there any rules of
> thumb regarding paging or swapping in z/VM? Is there something better than
> paging/swapping for capacity prediction?
>
> Thanks again,
>
> Nick
>
> ----------------------------------------
>
>> Date: Wed, 2 Mar 2011 13:47:42 -0800
>> From: [email protected]
>> Subject: Re: Capacity Monitoring question
>> To: [email protected]
>>
>> We use Performance Toolkit with APPLDATA enabled, then from option 29 in Perf
>> toolkit we get
>>
>> Linux screens selection
>> S Display  Description
>> . LINUX    RMF PM system selection menu
>> . LXCPU    Summary CPU activity display
>> . LXMEM    Summary memory util. & activity display
>> . LXNETWRK Summary network activity display
>>
>> Interval 02:11:28-08:44:10, on 2011/03/03 (CURRENT interval, select interim or average data)
>>
>>               <----------------------- Total CPU -----------------------> <------------- Processes -------------->
>> Linux    Virt <-------------------- Utilization (%) --------------------> <---- Current -----> <-Average Running-> Nr of
>> Userid   CPUs TotCPU  User Kernel  Nice   IRQ SoftIRQ IOWait  Idle Stolen Runabl Waiting Total 1_Min 5_Min  15_Min Users
>> >System<  2.0    4.4   2.3    1.9    .0    .0      .1     .9 193.2    1.6    2.0      .0 434.5   .08   .15     .12     4
>> DLVOMG01    2     .4    .2     .2    .0    .0      .0     .2 198.8     .6      2       0   215   .00   .00     .00
>>
>> Interval 02:11:28-08:44:10, on 2011/03/03 (CURRENT interval, select interim or average data)
>>
>>          <------------ Memory Allocation (MB) ------------> <------- Swapping -------> <--- Pages/s ---> <-BlockIO->
>> Linux    <--- Main ---> <--- High --->        Buffers Cache <-Space (MB)-> <-Pgs/sec->  Allo <-Faults--> <--kB/sec-> Nr of
>> Userid   M_Total %MUsed H_Total %HUsed Shared /CaFree  Used S_Total %SUsed    In   Out cates Major Minor  Read Write Users
>> >System<    3516   98.2      .0     .0     .0   240.7  1855    1744     .1  .000  .000 331.1  .000 916.1 43.57 37.53     4
>> DLVOMG01    2007   98.1      .0     .0     .0   225.7  1495   256.0     .0  .000  .000 103.1  .000 229.5 55.16 18.51
>>
>> Regards,
>> Tony
>>
>> ----- Original Message ----
>> From: Nick Warren
>> To: [email protected]
>> Sent: Thu, 3 March, 2011 7:24:09 AM
>> Subject: Re: Capacity Monitoring question
>>
>> Sorry, Didn't know protocol.
>>
>> Nick Warren - freelance consultant. New to z/VM - experience with AIX, HPUX,
>> MVS, Windows.
>> ________________________________
>>
>>> Date: Wed, 2 Mar 2011 20:43:35 +0000
>>> From: [email protected]
>>> Subject: Re: Capacity Monitoring question
>>> To: [email protected]
>>>
>>> If you want help - you're gonna have to introduce yourself...
>>>
>>> Scott Rohling
>>>
>>> On Mar 2, 2011 12:22pm, new zvm wrote:
>>>
>>>> I'm relatively new to z/VM. I have a couple of z/VM LPARs running Linux
>>>> guests and have more coming.
>>>>
>>>> What I'm wondering is this:
>>>>
>>>> What z/VM metrics are you monitoring and what thresholds do you use
>>>> as indicators that more capacity is needed - Specifically CPU and
>>>> Memory.
>>>>
>>>> Thanks in advance
>>>>
>>>> New 2 zVM
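
The rules of thumb in Berry's reply at the top of this thread (paging below 50%, watch for loading users and E-lists, sustained 100% CPU as a possible case for another IFL, overcommit ratios such as 2:1 or 9:1) could be written down as a small check. This is only a sketch: the function names, the parameters and the 99% cut-off used to mean "100% all the time" are assumptions made for illustration, and the input values would have to come from your own monitor (for example Performance Toolkit).

```python
def overcommit_ratio(guest_virtual_mb, real_storage_mb):
    """Total virtual storage defined to the guests divided by real storage."""
    return sum(guest_virtual_mb) / real_storage_mb


def capacity_warnings(cpu_busy_samples, page_space_used_pct,
                      elist_users, loading_users):
    """Return a list of warnings based on the rules of thumb in this thread.

    cpu_busy_samples: z/VM CPU busy percentages over the period of interest
    page_space_used_pct: how full the VM paging space is, in percent
    elist_users / loading_users: counts taken from the VM monitor
    """
    warnings = []
    # High CPU alone is fine; only flag it when every sample is pegged.
    # The 99.0 cut-off is an assumption, not a figure from the thread.
    if cpu_busy_samples and all(s >= 99.0 for s in cpu_busy_samples):
        warnings.append("CPU at 100% all the time: check guests and "
                        "relative share, then consider another IFL")
    if page_space_used_pct >= 50.0:
        warnings.append("paging above 50%: add paging DASD")
    if elist_users > 0 or loading_users > 0:
        warnings.append("E-list or loading users present: guests are "
                        "competing for storage")
    return warnings


# Example: three guests defined with 2 GB, 2 GB and 4 GB on 4 GB real storage
print(overcommit_ratio([2048, 2048, 4096], 4096))
print(capacity_warnings([95.0, 92.0, 97.0], 30.0, 0, 0))
```

Note that a busy but healthy system (CPU in the 90s, paging at 30%, nothing in the E-list) produces no warnings, which matches the point that high CPU on z/VM is not a problem in itself.
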
