Don,

Thank you for taking the time to respond to my question. All of your suggestions were well thought out, and appreciated. However, I think I've tried them already. I'll try to explain what I saw, what led up to it, and the conclusions I came to, as best I can. Forgive me if it ends up being rather lengthy -- I'm frequently groaned at for not being brief enough. :o)
As a reminder, the boiled-down question is: why didn't IRD take enough LCPUs offline to prevent short engines? When I took some off manually, the short-engine effect was greatly reduced and the CPU input queue quickly cleared to more sane levels.

First of all, undoubtedly like many of you, we try to get as much work out of the existing hardware as possible, without upgrading. Being in the insurance business puts the onus on us to maintain white space of overhead to account for spikes in activity (say, when a hurricane strikes and lots of claims come in, and the like). But even with that being the case, our systems are tight enough that twice a month our usual spikes in workload run the systems right up to 100% for the better part of prime shift, for several days. Our users know to expect slightly degraded response time during these periods.

In the past, the twice-a-month effect has been pronounced enough that on LPAR'd systems I've had to reduce the number of LCPUs online to match the number of PCPUs configured, to reduce queuing and allow even the most important work to get done in a timely manner. I've been on phone conferences where the user's usual response time is subsecond, and during the busy time their response has dipped to 30 seconds or longer -- and taking LCPUs offline to "square" the box has had the dramatic effect of reducing that 30 seconds to something more like 5 seconds.

I've run thousands of REPORTS(CPU) and SCPER reports through the RMF postprocessor, and what Peter Enrico says bears out in production -- any LPAR-to-MVS busy percentage that differs by more than 10% puts you in the "danger zone" for possible short-engine queuing during 100% busy times. I've spent hundreds of hours tuning these systems to balance them to the point where the IN READY list of DISTRIBUTION OF QUEUE LENGTHS stays within the rule of thumb, "80% <= 3 to 4 times the number of CPs available to the LPAR."
So on an LPAR with 3 PCPUs available to it, 80% should be reached by summarizing the first 9 buckets (3 PCPUs * 3) to 12 buckets (3 PCPUs * 4). When a system is overcommitted, with multiple LPARs and queue lengths mostly falling into bucket 14, and is then "squared up" so the number of LCPUs matches the PCPUs, the DISTRIBUTION OF QUEUE LENGTHS migrates left (to lower buckets).

Having seen this situation many times in the past, and handled it manually, I naturally expected that IRD would "square" a box whenever the queue lengths began to reach for the sky. The Redbook even says that IRD's role is "to bring the number of logical CPs in line with the capacity required by the LP." When we brought in IRD, we turned on Weight Management and CPU Vary Management at the same time. However, there were a few lessons learned here.

First, if the LCPUs aren't online, then IRD won't mess with them -- so after watching the systems for a week, we noticed that even at times of 100% busy, CPU Vary was not happening, even though Weight Management was happily tweaking weights by +/- 5% all day long. No LCPU in D M=CPU ever got the "W" flag on it; IRD didn't touch them at all. More hours reading the Redbook, Al Sherkow's paper "Engines, Weights, Shares, & Defined Capacities - Specifying the size of your IRD LPARs," and Walt Caprice's "A System's Programmer View of IRD" revealed several items that we had not tried yet. The recommendations said to set a minimum weight of 1 and leave the maximum weight blank for Weight Management -- which upon rollout we had not done: we'd been conservative and kept the minimum weight equal to at least two engines of capacity, because we'd heard of people whose systems got varied down to just one LCPU, which killed CICS multiprocessing. So we picked the two CECs that had the most PCPUs, and set their min to 1, max to blank.
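As an aside, the bucket rule of thumb above is easy to automate. Here's a little Python sketch of the check I'm describing; the `in_ready_ok` function name and the sample distributions are just illustrations I made up, not real RMF output -- only the bucket layout (1 through 14+) and the 80% threshold come from the rule of thumb itself.

```python
# Hypothetical check of the "80% within 3-4x the CP count" rule of thumb.
# `buckets` maps an RMF DISTRIBUTION OF QUEUE LENGTHS bucket number
# (1..14, where 14 means "14 and over") to the percentage of samples
# that fell into it.
def in_ready_ok(buckets, pcpus, multiplier=3):
    limit = pcpus * multiplier          # e.g. 3 PCPUs * 3 = first 9 buckets
    covered = sum(pct for b, pct in buckets.items() if b <= limit)
    return covered >= 80.0, covered

# A healthy 3-PCPU LPAR: most samples in the low buckets.
healthy = {1: 40.0, 2: 25.0, 3: 12.0, 5: 6.0, 9: 5.0, 14: 12.0}
ok, pct = in_ready_ok(healthy, pcpus=3)   # (True, 88.0) -- within the rule

# The overcommitted shape: nearly everything in bucket 14+.
starved = {13: 0.4, 14: 99.6}
bad, _ = in_ready_ok(starved, pcpus=3)    # False -- fails the rule of thumb
```

Nothing fancy, but it's exactly the summarization I do by hand across the first 9 to 12 buckets on a 3-way.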
Second, the recommendations said to put all available LCPUs online to every LPAR, and let IRD take care of them. As the Redbook states, "We also recommend defining your production LPs that will be using WLM LPAR CPU Management to have the Initial number of logical CPs equal to the number of shared physical CPs on the CPC." AHA! We hadn't done this, thinking "let's avoid short engines," and also expecting that IRD would bring the RESERVED LCPUs online on its own. This was faulty thinking, so on those same two CECs we brought all LCPUs online. After all, the Redbook states right there, "...WLM LPAR Vary CPU Management will work to minimize LPAR overhead..."

One of those CECs, which has 11 PCPUs and two LPARs (LPA1, which had 6 LCPUs online, and LPA2, which had 5 LCPUs online prior to IRD), is what I'm going to concentrate on here. We brought all 11 LCPUs online to each LPAR, set a minimum LPAR weight of 1 and a blank maximum weight, and let IRD go to town. We watched those two LPARs and the CEC itself very closely, to ensure there would be no negative impact to the regular workloads. Here are the basics of what we saw:

1. When both LPARs are idle (50% MVS busy or less), both get all 11 LCPUs online. The book states this is "so the workload can take advantage of increased multiprocessing." This doesn't seem to be an issue. I don't know if it really helps, but it doesn't seem to hurt.

2. When one LPAR is trying to "take over" the CEC and the other LPAR is idle, CPU Vary _ALWAYS_ put all 11 LCPUs online to the busy LPAR and cut back the idle one to no fewer than 5 LCPUs. Why 5? I don't know, and this was one of the questions I asked the list -- EXACTLY how does IRD determine how many LCPUs to leave online (i.e., give me the calculation, please)? There was a period of reports I ran where LPA1 was trying to get the whole box, and got 84% of it with 11 LCPUs online, and LPA2 wasn't suffering with its 16% of the box -- but 16% of the CEC is 1.76 PCPUs. So why didn't IRD drop it to 2 LCPUs?
Or even 3? I know it's supposed to maintain a "margin of extra LCPUs online" in case of sudden capacity demand, but why let that run the engines short? Multiprocessing or no, keeping the input queue low is clearly a key attribute of getting more work done at busy times.

3. When both LPARs are trying to take over the CEC with low importance-5 work (a test we contrived), we were frustrated and disappointed by IRD's behavior. It took the "current" LPAR weights (not initial, not min), enforced them as maximum shares, and DID NOT change from these weights at all during the 3 hours of testing (though both systems were running 11 CPU loopers to simulate a 50/50 workload split). The weights stayed at what they had been going into the test, keeping a 28/72 ratio and enforcing it at the PR/SM level. CPUs came off, controlled by IRD, as we moved the heavy load from one system to the other and back again, but we still ended up with 5 logicals on one and 10 logicals on the other. I guess IRD knew the work was low priority, and didn't care about changing the weights to help it out. It's puzzling, though -- so here's another lesson learned: don't expect IRD to shift weights to help workloads of extremely low priority, even if that's the ONLY workload that needs CPU on an LPAR.

4. During the usual production load on the two LPARs, IRD seemed to prefer to keep the LPAR and MVS busy within 20% of each other, which is a good thing. It would take LCPUs offline, or put them on, at times, but the LPAR/MVS busy differential never seemed to be more than 20% under normal circumstances, even as workload shifted from LPA1-heavy to LPA2-heavy (as batch schedules on each kicked off at their usual times). It seemed to try to keep as many LCPUs online as it could while staying within this 20% range.

5.
With production work, during one of the twice-monthly busy times, with importance 1 and 2 work reaching for more CPU on both LPARs, and plenty of importance 3, 4, and 5 work in and ready, IRD seemingly failed us. The CEC was at 100.0% busy.

LPA1, which "normally" uses an average of 30% of the CEC, and used to be constrained to 5/11 (45%) of the CEC at busy time due to having only 5 LCPUs online, was now starving its importance 3, 4, and 5 work (performance indexes of 14 to 280 on average) and getting right at 30% of the CEC. IRD had varied LCPUs off, down to 5. REPORTS(CPU) showed 66.76% LPAR busy versus 100.0% MVS busy (can we say "short friggin engines?!?"). The distribution of queue lengths showed 99.6% of samples in bucket 14+, with 35 ASIDs on average in and ready. LPAR management overhead was 0.06%, very low.

LPA2, which "normally" uses an average of 45% of the CEC, and used to be constrained to 6/11 (54%) of the CEC at busy time due to having only 6 LCPUs online, was now kicking LPA1 to the curb. Its own importance 3, 4, and 5 work was hurting a little, with performance indexes ranging from 1 to 4. It was getting a good solid 70% of the CEC, with 9 LCPUs online. Granted, it needed 7.7 PCPUs to get that much, but why 9 LCPUs when LPA1 was tanking so badly? IRD had taken the other two LCPUs offline when it went down to 9. REPORTS(CPU) showed 83.72% LPAR busy versus 95.37% MVS busy. The distribution of queue lengths showed 64.9% in the 14+ bucket, with an average of 18.8 IN and READY ASIDs; the rest of the queue lengths were spread pretty evenly across buckets 7-8, 9-10, 11-12, and 13-14, at around 9% each.

If IRD really did intend to kick LPA1 in the head, deny it the CPU to meet its workload demand, and favor LPA2, then why didn't it take LPA1 down to 4 LCPUs and give 8 to LPA2? That's still 12/11, but it would have been much better than what LPA1 was getting in terms of out/ready queuing.
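For what it's worth, here is the arithmetic I keep doing by hand in the numbers above, sketched in Python. To be clear, this is NOT IBM's actual IRD algorithm (which, as far as I know, isn't published at this level of detail) -- the one-engine margin is my own guess at the "margin of extra LCPUs" behavior, and the 10% threshold is Peter Enrico's rule of thumb, not anything IBM documents.

```python
import math

# My back-of-the-envelope LCPU sizing -- an assumption, not IRD's real math.
# share of the pool comes from the LPAR's weight; round the entitled
# physical capacity up to whole engines and add a margin for sudden demand.
def lcpus_needed(weight, total_weight, shared_pcpus, margin=1):
    share = weight / total_weight            # LPAR's fraction of the pool
    entitled = share * shared_pcpus          # e.g. 16% of 11 = 1.76 PCPUs
    return math.ceil(entitled) + margin      # whole engines plus headroom

# The 84/16 split on the 11-way CEC: 1.76 PCPUs suggests 2 or 3 LCPUs,
# not the 5 that IRD left online.
lcpus_needed(16, 100, 11)                    # -> 3

# Short-engine check: MVS busy running more than ~10% ahead of LPAR busy
# (Peter Enrico's rule of thumb) means the logicals are being starved.
def short_engines(lpar_busy, mvs_busy, threshold=10.0):
    return (mvs_busy - lpar_busy) > threshold

short_engines(66.76, 100.0)                  # LPA1 during the incident: True
short_engines(83.72, 95.37)                  # LPA2 at the same time: True
```

By that sizing, even with a spare engine of headroom, LPA2's 16%-share scenario never justifies 5 LCPUs -- which is the heart of my question.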
Doesn't IRD look at the queue-length distribution, see that nearly all the work is in bucket 14+, and go "uh oh, I'd better see if I can help that one"? Apparently not. I saw the CPU input queue climb over 30, then over 40, then over 50, and when it got to 58 for LPA1, I'd had enough. While I watched, LPA2 even moved back up to 10 LCPUs, with 5 still on LPA1. I looked at how much PCPU each LPAR was getting, and I dropped LPA2 down to 8 LCPUs. As soon as I did, IRD responded by putting a 6th LCPU online to LPA1, and LPA1 jumped to 40% utilization of the box. Within a minute or two, the CPU in/ready queue dropped to the 40s, and in 10 minutes it was down in the low 30s. It stayed in the mid 20s for most of the rest of prime shift. The in/ready queue for LPA2 never got over 20 the whole time, and was mostly in the 10-to-13 range. LPA1's LPAR busy came up to 75% versus 100.0% MVS busy after the LCPUs were taken off of LPA2. After just a few 15-minute intervals, LPA1's utilization of the CEC returned to around 30%, and it seemed satisfied with that. LPA2's LPAR and MVS busy matched to nearly the exact same number after taking those LCPUs offline.

Why did I have to take LCPUs off of LPA2 to get LPA1 out of a hole? We run the same kind of importance 1 and 2 work (and 3-5, for that matter) on both LPARs. There is more workload demand on LPA2, but why should LPA2 choke out LPA1 just because it's bigger? Does that make it more important? (It shouldn't.) LPA1 was suffering more than LPA2 at all importance levels, though importance 1 and 2 on LPA2 was in the 0.8-to-1.0 PI range and on LPA1 was in the 0.8-to-1.2 PI range. So what gives?

That should pretty much bring you up to speed on where I'm at. Note that, for privacy reasons, I didn't tell you the actual LPAR names, nor did I name any specific workload volumes, workload-specific names, user names, hardware makes, hardware models, locations, or anything other than internal performance numbers.
I didn't tell you the names of all the wonderful individuals who helped make IRD possible, nor did I reveal the full testing plan for pushing IRD to its max, and who helped me make that happen. I tried to keep all of the company-specific jargon out of it and just give you the performance-related numbers. Gotta keep the lawyers happy, you know. :o)

I apologize for my lack of brevity. I welcome any and all questions, comments, criticisms, or random statements. My take on IRD, at the moment, is: "It's great, but watch out when you want to push the CEC to its limits; IRD doesn't like to do that. You're better off capping LPARs yourself by taking offline the LCPUs you never want the LPAR to use, regardless of IRD's pie-in-the-sky multiprocessing recommendations."

The very best regards to you,

Gary Diehl
MVS System Performance & JOAT

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html

