Hi List, I've got a few questions about how z hardware handles I/O and LPAR dispatching.
I've done a fair bit of reading, but there are still some things I don't understand.

We are on z9s, using shared CPs. We are not using IRD. We have 2 large production LPARs and several smaller LPARs. The 2 prod LPARs have substantially different weights (1:4) because of the CPU workload spread. We also use group capacity limits, and an individual capacity limit on the largest LPAR. While the CPU balance is different, the I/O profile is similar: about 5,000-6,000 IOPS on each LPAR. We have 11 logical CPs active on each of the 2 LPARs. We expect peak 4-hour rolling averages (4HRA) of 90 and 400 MSUs, and the weights are set to reflect this. The work is split between the 2 LPARs for licensing reasons. The small LPAR is mainly batch; the large LPAR is online and batch.

I believe we are seeing I/O elongation on the smaller LPAR at peak times, particularly when the systems are capped. An I/O-bound batch job may run 2-3 times longer on the small LPAR when the systems are busy. The I/O response times look slightly worse on the small LPAR, but the throughput is much worse.

So here are my questions.

My understanding of the channel program is that it moves the data into the page-fixed I/O buffer and then interrupts a CP to process the I/O. How is the candidate CP chosen? I know z/OS may make some CPs uninterruptible for I/O based on CPENABLE, but of the CPs that are enabled, how is one chosen? Is it chosen at the physical or the logical level, and how is it related to the LPAR that requested the I/O? We have CPENABLE set to (10,30). RMF shows all 11 logical CPs are taking interrupts (and have TPI counts), but CP A (the highest number) is doing by far the most. Could this be a cause of contention between the 2 LPARs, or will they likely be dispatched on separate physical CPs?

The next question is about the dispatch time PR/SM gives an LPAR on a CP. The PR/SM Planning Guide says the maximum time may be between 12.5 and 25 ms (a lot longer than an I/O).
I am thinking that if an LPAR is constrained by capping, it is more likely to have a queue of ready work and so hold on to a CP for close to the maximum time when it is given one. Is that right?

How can I tell for sure whether I am on the right track? Are there any metrics that will prove what is causing the longer elapsed times on one LPAR?

What is the best way to stop it or reduce it, given that we have to run capped on peak days, we have to live with the workload separation, and we don't have the capacity for dedicated CPs?

Would WLM/IRD management of CPs help?

Would offlining logical CPs help? We have many more CPs online to each LPAR than its normal MSU usage needs, but it gives flexibility for workload peaks.

Would offlining specific logical CPs help? I.e., if the 2 LPARs had a different highest logical CP number, would this reduce contention, or is PR/SM again likely to use different physical CPs for different LPARs?

Would tuning the LPAR dispatch time help? We do not specify this, and the recommendation is to let the system choose.

Sorry about the length of the post, but hopefully someone will find this an interesting problem.

Joe Owens

