Hi List,
I've got a few questions about how z hardware handles I/O and LPAR dispatching.

I've done a fair bit of reading, but there are still some things I don't 
understand. We are on z9s, using shared CPs, and we are not using IRD. We have 
2 large production LPARs and several smaller LPARs. The 2 prod LPARs have 
substantially different weights (1:4) because of the CPU workload spread. We 
also use group capacity limits and an individual capacity limit on the largest 
LPAR. While the CPU balance is different, the I/O profile is similar, about 
5,000-6,000 IOPS on each LPAR.

We have 11 logical CPs active on each of the 2 LPARs. We expect peak 4HRAs of 
90 and 400 MSUs respectively, and the weights are set to reflect this.
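
As a rough check of how the weights translate into entitlement, here is a small 
Python sketch. The LPAR names, the weight values (100 and 400, chosen only to 
match the 1:4 ratio) and the physical CP count of 12 are placeholders for 
illustration, not our real configuration. The smaller LPARs are ignored, so the 
real shares would be somewhat lower.

```python
def weight_shares(values):
    """Return each LPAR's fraction of the total of the given values."""
    total = sum(values.values())
    return {name: v / total for name, v in values.items()}


def share_per_logical_cp(share, physical_cps, logical_cps):
    """Average fraction of one physical CP behind each logical CP
    when the LPAR is held to its weight entitlement."""
    return share * physical_cps / logical_cps


# Placeholder weights in the 1:4 ratio; the other small LPARs are ignored.
weights = {"SMALL": 100, "LARGE": 400}
# Expected peak 4HRA figures from the post, in MSUs.
msus = {"SMALL": 90, "LARGE": 400}

LOGICAL_CPS = 11   # logical CPs online to each production LPAR
PHYSICAL_CPS = 12  # ASSUMPTION: hypothetical shared physical CP count

shares = weight_shares(weights)
msu_shares = weight_shares(msus)

for name in weights:
    per_cp = share_per_logical_cp(shares[name], PHYSICAL_CPS, LOGICAL_CPS)
    print(f"{name}: weight share {shares[name]:.1%}, "
          f"MSU share {msu_shares[name]:.1%}, "
          f"about {per_cp:.2f} physical CP per logical CP")
```

With these placeholder numbers, each of the small LPAR's 11 logical CPs would 
be backed by roughly a fifth of a physical CP at its weight entitlement.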

The work is split between the 2 LPARs for licensing reasons. The small LPAR is 
mainly batch; the large LPAR is online and batch.

I believe we are seeing I/O elongation on the smaller LPAR at peak times, 
particularly when the systems are capped. An I/O-bound batch job may run 2-3 
times longer on the small LPAR when the systems are busy. The I/O response 
times look only slightly worse on the small LPAR, but the throughput is much 
worse.
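
To put rough numbers on this, the sketch below works out how much extra delay 
per I/O would be needed to explain the elongation. The 2 ms unconstrained time 
per I/O is a made-up figure for illustration, not a measurement from our 
systems, and the function name is my own.

```python
def extra_delay_per_io_ms(base_ms_per_io, elongation):
    """Extra time per I/O implied when an I/O-bound job runs
    `elongation` times longer than its unconstrained elapsed time."""
    return base_ms_per_io * (elongation - 1.0)


# ASSUMPTION: hypothetical unconstrained elapsed time per I/O, in ms.
BASE_MS_PER_IO = 2.0

for factor in (2.0, 3.0):
    extra = extra_delay_per_io_ms(BASE_MS_PER_IO, factor)
    print(f"{factor:.0f}x elongation implies about {extra:.1f} ms "
          f"of extra delay per I/O")
```

If the measured I/O response time grows by much less than this, the rest of 
the delay has to come from somewhere other than the device, which is what 
makes me suspect dispatching.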

So here are my questions.

My understanding of the channel program is that it moves the data into the 
page-fixed I/O buffer and then interrupts a CP to process the I/O. How is the 
candidate CP chosen? I know z/OS may disable some CPs for I/O interrupts based 
on CPENABLE, but of the CPs that are enabled, how is one chosen? Is the choice 
made at the physical or the logical level, and how is it related to the LPAR 
that requested the I/O?

We have CPENABLE set to (10,30). RMF shows all 11 logical CPs are taking 
interrupts (and have TPI counts), but CP A (the highest-numbered) is handling 
by far the most. Could this be a cause of contention between the 2 LPARs, or 
are they likely to be dispatched on separate physical CPs?

The next question is about the dispatch time PR/SM gives an LPAR on a CP. The 
PR/SM Planning Guide says the maximum time may be between 12.5 and 25 ms (a lot 
longer than an I/O). I am thinking that if an LPAR is constrained by capping, 
it is more likely to have a queue of ready work and so hold on to a CP for 
close to the maximum when it is given one. Is that right?
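
To show why the time slice looks long next to our I/O rate, here is a quick 
Python calculation of how many I/O completions would arrive for one LPAR 
during an interval as long as a maximum time slice. This is only arithmetic on 
the figures above; it does not model how PR/SM actually decides when to 
redispatch a logical CP, and the function name is my own.

```python
def completions_per_interval(iops, interval_ms):
    """Number of I/O completions arriving in interval_ms at a steady rate."""
    return iops * interval_ms / 1000.0


# I/O rates and maximum dispatch times quoted in the post.
for iops in (5000, 6000):
    for slice_ms in (12.5, 25.0):
        n = completions_per_interval(iops, slice_ms)
        print(f"{iops} IOPS over {slice_ms} ms: about {n:.0f} completions")
```

At 5,000-6,000 IOPS that is somewhere between about 60 and 150 completions 
per interval.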

How can I tell for sure whether I am on the right track? Are there any metrics 
that will show what is causing the longer elapsed times on one LPAR?

What is the best way to stop or reduce it, given that we have to run capped on 
peak days, we have to live with the workload separation, and we don't have the 
capacity for dedicated CPs?

Would WLM/IRD management of CPs help?
Would offlining logical CPs help? Each LPAR has many more CPs online than its 
normal MSU usage needs, but it gives flexibility for workload peaks.
Would offlining specific logical CPs help? I.e. if the 2 LPARs had a different 
highest logical CP number, would this reduce contention, or is PR/SM again 
likely to use different physical CPs for different LPARs?
Would tuning the LPAR dispatch time help? We do not specify this, and the 
recommendation is to let the system choose.

Sorry about the length of the post, but hopefully someone will find this an 
interesting problem.

Joe Owens


