Don,

Thank you for taking the time to try to respond to my question.  All of
your suggestions were well thought out, and appreciated.  However, I
think I've tried them already.  I'll try to explain what I saw, what led
up to it, and the conclusions I came to as best I can.  Forgive me if
it ends up being rather lengthy -- I'm frequently groaned at for not
being brief enough. :o)

As a reminder, the "boiled down" question is : Why didn't IRD take
enough LPs off to prevent short engines?  When I took some off manually,
the short engine effect was greatly reduced and the CPU input queue
quickly cleared to more sane levels.

First of all, undoubtedly like many of you, we try to get as much work
out of the existing hardware, without upgrading, as possible.  Being in
the Insurance business puts the onus upon us to maintain a white space
of overhead to account for spikes in activity (say, when a hurricane
strikes and lots of claims come in, and the like).  But, even with that
being the case, our systems are tight enough that twice a month our
usual spikes in workload run the systems right up to 100%, for the
better part of prime shift, for several days.  Our users know to expect
a slightly degraded response time during these periods.

In the past, the twice-a-month effect has been pronounced enough that on
LPAR'd systems I've had to reduce the number of LCPUs online to match
the number of PCPUs configured, to reduce queuing and allow
even the most important work to get done in a timely manner.  I've been
on phone conferences where the "usual response" the user gets is
subsecond, and during the busy time their response has dipped to 30
seconds or longer -- and taking LCPUs offline to "square" the box has
had the dramatic effect of reducing that 30 seconds to something more
like 5 seconds.  I've run thousands of REPORTS(CPU) and SCPER reports
through the RMF post processor, and what Peter Enrico says bears out
in production -- any LPAR-to-MVS busy percentage that differs by more
than 10% puts you in the "danger zone" for possible short engine queuing
during 100% busy times.  I've spent hundreds of hours tuning these
systems to attempt to balance them to the point where the IN READY list
of DISTRIBUTION OF QUEUE LENGTHS stays within the rule of thumb, "80% <=
3 to 4 times the number of CPs available to the LPAR."  So on an LPAR
with 3 PCPUs available to it, 80% should be reached by summing the
first 9 buckets (3 PCPUs * 3) to 12 buckets (3 PCPUs * 4).  When a
system is overcommitted, with multiple LPARs and queue lengths mostly
falling into bucket 14, and then is "squared up" so the number of LCPUs
matches the PCPUs, the DISTRIBUTION OF QUEUE LENGTHS migrates left (to
lower buckets).  Having seen this situation many times in the past, and
handling it manually, I naturally expected that IRD would "square" a box
whenever the queue lengths began to reach for the sky.  In the Redbook
it even says that IRD's role is "to bring the number of logical CPs in
line with the capacity required by the LP."
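That rule of thumb can be checked mechanically against an RMF queue-length distribution.  Here's a minimal Python sketch, under the assumption that each bucket holds the percentage of samples at that queue length (the function name and bucket layout are mine, not RMF's):

```python
def meets_rule_of_thumb(bucket_pcts, pcpus, factor=4):
    """Check the '80% within 3-4x the CPs' rule of thumb.

    bucket_pcts: percent of IN READY samples per queue-length bucket,
    bucket 1 first (hypothetical layout, one bucket per queue length).
    """
    cutoff = pcpus * factor          # e.g. 3 PCPUs * 4 = first 12 buckets
    return sum(bucket_pcts[:cutoff]) >= 80.0

# A healthy 3-PCPU LPAR: most samples land in the low buckets.
healthy = [30, 20, 15, 10, 8, 5, 4, 3, 3, 1, 0.5, 0.5, 0, 0]
print(meets_rule_of_thumb(healthy, pcpus=3))   # True

# An overcommitted LPAR with nearly everything in bucket 14+.
queued = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.4, 99.6]
print(meets_rule_of_thumb(queued, pcpus=3))    # False
```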

When we brought in IRD, we turned on weight management and CPU Vary
management at the same time.  However, there were a few lessons learned
here.  First, if the LCPUs aren't online, then IRD won't mess with them
- so after watching the systems for a week, we noticed even at times of
100% busy that CPU Vary was not happening, even though Weight Management
was happily tweaking weights by +/- 5% all day long.  No LCPUs in D
M=CPU ever got the "W" flag on them; IRD didn't touch them at all.  More
hours reading the Redbook, Al Sherkow's paper "Engines, Weights, Shares,
& Defined Capacities - Specifying the size of your IRD LPARs," and Walt
Caprice's "A System's Programmer View of IRD" revealed several items
that we had not tried yet.  First, recommendations said to put a Minimum
Weight of 1, and leave Maximum Weight blank for Weight Management -
which upon rollout we had not done: we'd been conservative and kept
the minimum weight equal to at least two engines of capacity because
we'd heard of people who complained of systems getting varied down
to just one LCPU, killing CICS multiprocessing.  So we picked the two
CECs that had the most PCPUs, and set their min to 1, max to blank.
Second, recommendations said to put all available LCPUs online to every
LPAR, and let IRD take care of them.  As the redbook states, "We also
recommend defining your production LPs that will be using WLM LPAR CPU
Management to have the Initial number of logical CPs equal to the number
of shared physical CPs on the CPC."  AHA!  We hadn't done this, thinking
"let's avoid short engines", and also expecting that IRD would bring the
RESERVE LCPUs online on its own.  This was faulty thinking, so on
those same two CECs, we brought all LCPUs online.  After all, the
redbook states right there, "...WLM LPAR Vary CPU Management will work
to minimize LPAR overhead..."

One of those CECs, which has 11 PCPUs and two LPARs (LPA1 which had 6
LCPUs on, LPA2 which had 5 LCPUs on prior to IRD) is what I'm going to
concentrate on here.  We brought all 11 LCPUs online to each LPAR,
minimum LPAR weight of 1, maximum weight blank, and let IRD go to town.
We watched those two LPARs and the CEC itself very closely, to ensure
that there would be no negative impact to the regular workloads.

Here's the basics of what we saw:

1.  When both LPARs are idle (50% MVS busy or less), both get all 11
LCPUs online.  The book states this is "so the workload can take
advantage of increased multiprocessing".  This doesn't seem to be an
issue.  I don't know if it really helps, but it doesn't seem to hurt.

2.  When one LPAR is trying to "take over" the CEC and the other LPAR is
idle, CPU Vary _ALWAYS_ put all 11 LCPUs online to the busy LPAR and cut
back the idle one to no less than 5 LCPUs.  Why 5?  I don't know, and
this was one of the questions I asked the list -- EXACTLY how does
IRD determine how many LCPUs to leave online (i.e. give me the
calculation, please)?  There was a period of reports that I ran where
LPA1 was trying to get the whole box, and got 84% of it with 11 LCPUs
online, and LPA2 wasn't suffering with its 16% of the box, but 16% of
the CEC is 1.76 PCPUs.  So why didn't IRD drop it to 2 LCPUs?  Or even
3?  I know it's supposed to maintain a "margin of extra LCPUs online" in
case of sudden capacity demand, but why allow that to run the engines
short?  Multiprocessing or no, keeping the input queue low is clearly a
key attribute to getting more work done at busy times.
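To make that arithmetic concrete, here's a hypothetical sketch of the kind of calculation I expected IRD to do.  The one-engine margin is purely my assumption -- this is NOT IRD's documented formula, which is exactly what I'm asking for:

```python
import math

def expected_lcpus(cec_share_pct, cec_pcpus, margin=1):
    """Physical engines behind an LPAR's current share, rounded up,
    plus a margin for sudden demand (the margin is an assumption)."""
    effective = cec_share_pct / 100.0 * cec_pcpus
    return math.ceil(effective) + margin

# LPA2 with 16% of an 11-way CEC: 1.76 physical engines behind it.
print(round(16 / 100.0 * 11, 2))   # 1.76
print(expected_lcpus(16, 11))      # 3 -- yet IRD left 5 LCPUs online
```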

3.  When both LPARs are trying to take over the CEC, with low
importance-5 work (a test we contrived), we were frustrated and
disappointed by IRD's behavior.  It took the "current" LPAR weights (not
initial, not min), and enforced them as maximum shares, and DID NOT
change from these weights at all during the 3 hours of testing (though
both systems were running 11 CPU loopers to simulate a 50/50 workload
split).  The weights stayed at what they had been going in to the test,
keeping a 28/72 ratio and enforcing it at the LPAR PR/SM level.  CPUs
came off, controlled by IRD as we moved the heavy load from one system
to the other and back again, but we still ended up with 5 logicals on
one, and 10 logicals on the other.  I guess that IRD knew it was low
priority, and didn't care about changing the weights to help it out.
It's puzzling, though -- so here's another lesson learned: don't expect
IRD to shift weights to help workloads of extremely low priority, even
if it's the ONLY workload that needs CPU on an LPAR.

4.  During the usual production load on the two LPARs, IRD seemed to
prefer to keep the LPAR versus MVS busy within 20% of each other, which
is a good thing.  It would take LCPUs offline, or put them on, at times,
but the LPAR/MVS differential never seemed to be more than 20% under
normal circumstances even as workload shifted from LPA1-heavy to
LPA2-heavy (as batch schedules on each kicked off at their usual times).
It seemed to try to keep as many LCPUs online as it could, and still
keep it within this 20% range.

5.  With production work, during one of the two-times-per-month busy
times, with importance 1 and 2 work reaching for more CPU on both LPARs,
and plenty of importance 3, 4, and 5 work in and ready, IRD seemingly
failed us.

The CEC was at 100.0 % busy.

LPA1, which "normally" uses an average of 30% of the CEC, and used to be
constrained to 5/11 (45%) of the CEC at busy time due to only having 5
LCPUs online to it, was now starving its importance 3, 4, and 5 work
(perf indexes of 14 to 280 on average) and getting right at 30% of the
CEC.  IRD had varied off LCPUs down to 5 LCPUs.  MVS busy in
REPORTS(CPU) showed 66.76% LPAR busy versus 100.0% MVS busy (can we say
"short friggin engines?!?").  The distribution of queue lengths showed
99.6% of them in bucket 14+, with 35 asids on average in and ready.
LPAR management overhead was 0.06%, very low.

LPA2, which "normally" uses an average of 45% of the CEC, and used to be
constrained to 6/11 (54%) of the CEC at busy time due to only having 6
LCPUs online to it, was now kicking LPA1 to the curb.  Its own
importance 3, 4, and 5 work was hurting a little bit, with performance
indexes ranging from 1 to 4.  It was getting a good solid 70% of the
CEC, with 9 LCPUs online.  Granted, it needed 7.7 PCPUs to get that
much, but why 9 LCPUs when LPA1 was tanking so bad?  IRD had taken the
other two LCPUs offline, when it went down to 9.  MVS BUSY in
REPORTS(CPU) showed 83.72% LPAR busy versus 95.37% MVS busy.  The
distributions of queue lengths showed 64.9% in the 14+ bucket, average
of 18.8 IN and READY asids.  The rest of the queue lengths were pretty
much even between buckets 7-8, 9-10, 11-12, and 13-14 with around 9%
each.
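Putting the two LPARs' numbers side by side makes the short-engine picture plain.  A quick sketch using the 10% LPAR-to-MVS differential rule of thumb I mentioned earlier (the function is mine, the numbers are straight from the REPORTS(CPU) output above):

```python
def busy_gap(lpar_busy, mvs_busy):
    """Gap between MVS busy and LPAR busy from REPORTS(CPU); anything
    over roughly 10 points is the short-engine danger zone."""
    return round(mvs_busy - lpar_busy, 2)

# The incident numbers above:
print(busy_gap(66.76, 100.0))  # 33.24 -- LPA1, deep in the danger zone
print(busy_gap(83.72, 95.37))  # 11.65 -- LPA2, just over the line
```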

If IRD really did intend to kick LPA1 in the head, not give it CPU
to meet its workload demand, and favor LPA2, then why didn't it take
LPA1 down to 4 LCPUs and give 8 to LPA2?  That's still 12 LCPUs on 11
PCPUs, but would
have been much better than what it was getting in terms of out/ready
queuing.

Doesn't IRD look at the queue length distribution, see that nearly all
the work is in bucket 14+ and go "uh oh, I better see if I can help that
one"?  Apparently not.

I saw the CPU Input Queue climb over 30, then over 40, then over 50, and
when it got to 58 for LPA1, I'd had enough.  While I watched, LPA2 even
moved back up to 10 LCPUs, with 5 still on LPA1.  I looked at how much
PCPU each LPAR was getting, and I dropped LPA2 down to 8 LCPUs.

As soon as I dropped LPA2 down to 8 LCPUs, IRD responded by putting a
6th LCPU online to LPA1, and LPA1 jumped to 40% utilization of the box.
Within a minute or two, the CPU In/Ready queue dropped to the 40's, and
in 10 minutes it was down in the low 30's.  It stayed in the mid 20's
for most of the rest of prime shift.  The In/Ready queue for LPA2 never
got over 20 the whole time, and was mostly in the 10 to 13 range.

LPA1's LPAR busy came up to 75% busy versus 100.0% MVS busy after the
LCPUs were taken off of LPA2.  After just a few 15 minute intervals,
LPA1's utilization of the CEC returned to around 30%, and it seemed
satisfied by that.  LPA2's LPAR versus MVS busy matched to nearly the
exact same number after taking those LCPUs offline.

Why did I have to take LCPUs off of LPA2 to get LPA1 out of a hole?

We run the same kind of Importance 1 and 2 work (and 3-5 for that
matter) on both LPARs.  There is more workload demand on LPA2, but why
should LPA2 choke out LPA1 just because it's bigger?  Does that make it
more important?  (It shouldn't.)  LPA1 was suffering, more than LPA2, at
all importance levels, though Importance 1 and 2 on LPA2 was in the 0.8
to 1.0 PI range and on LPA1 was in the 0.8 to 1.2 PI range.  So what
gives?

That should pretty much bring you up to speed with where I'm at.  Note
that, for privacy reasons, I didn't tell you the actual LPAR names, nor
did I name any specific workload volumes, workload specific names, user
names, hardware makes, hardware models, locations, or anything other
than internal performance numbers.  I didn't tell you the names of all
the wonderful individuals that helped make IRD possible, nor did I
reveal the full testing plan for pushing IRD to its max, and who helped
me make that happen.  I tried to keep all of the company-specific jargon
out of it and just give you the performance related numbers.  Gotta keep
the lawyers happy, you know. :o)

I apologize for my lack of brevity.  I welcome any and all questions,
comments, criticisms, or random statements.

My take on IRD, at the moment, is: "it's great, but watch out when
you want to push the CEC to its limits -- IRD doesn't like to do that.
You're better off capping LPARs yourself by taking offline the LCPUs
you never want the LPAR to use, regardless of IRD's pie-in-the-sky
multiprocessing recommendations."

The very best regards to you,

Gary Diehl
MVS System Performance & JOAT

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html