David Boyes wrote:
>Field testing seems to indicate that there is a noticeable benefit to
>this approach.
Could you describe the scenarios where this setup provides benefit?  I
guess there might be some benefit in cases where you expect requests to
be satisfied from the MDC frequently, i.e. read-only shared devices with
multiple guests accessing them.  In the case of read/write unshared
devices I don't see why this approach should help, though.

>Is it equivalent to PAV? Maybe not. Perhaps I don't understand PAV well
>enough.
>
>Can you recommend some further background reading?

Well, I don't have any specific documents about PAV; I found this link
which (amongst WLM-related stuff) describes its basic operation:
http://www-1.ibm.com/servers/eserver/zseries/zos/wlm/release_history/sharkptf.html

What you describe is certainly similar in some ways, but the question is
whether it addresses the actual bottleneck ...

When you run an I/O-intensive workload on a Linux guest, what happens
can be summarized roughly like this:

Linux user space processes (database, etc.) access files in the Linux
file system.  The Linux kernel first matches those accesses against the
page cache entries backing those files; on a cache hit, the user space
application continues immediately.

If a file page is not present in the page cache (or if the application
has modified a page and requested synchronous write-out), Linux needs to
perform an actual I/O operation.  To this end, an 'I/O request' data
structure is generated and queued on the I/O request queue of the device
underlying the file system.  If that device is LVM-mapped, the request
is redistributed to the underlying device(s) by the LVM layer.

At the layer of real block devices, all such requests from all processes
in the system are collected and queued up.  The requests in the queue
are merged and possibly reordered to optimize the physical accesses
performed.  The resulting requests are then passed to the device driver,
which translates each request into a CCW chain.
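To make the merging step above concrete, here is a minimal toy model of
what the block layer does with the request queue.  This is purely
illustrative pseudocode-made-runnable, not the actual Linux elevator
code, which is far more involved (it also handles reordering, read/write
separation, and fairness):

```python
# Toy model of block-layer request merging: adjacent or overlapping
# sector ranges queued for the same device are coalesced into a single
# request, reducing the number of physical I/O operations issued.

def merge_requests(requests):
    """requests: list of (start_sector, num_sectors) tuples."""
    merged = []
    for start, count in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] >= start:
            # Contiguous or overlapping with the previous request: merge.
            prev_start, prev_count = merged[-1]
            end = max(prev_start + prev_count, start + count)
            merged[-1] = (prev_start, end - prev_start)
        else:
            merged.append((start, count))
    return merged

# Four requests from different processes collapse into two physical I/Os:
print(merge_requests([(100, 8), (0, 8), (8, 8), (108, 4)]))
# -> [(0, 16), (100, 12)]
```

Each merged request would then be translated into one CCW chain by the
device driver.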
Those CCW chains are then used to initiate S/390 I/O operations (START
SUBCHANNEL).  Of course, only one such operation can be outstanding at a
time on any one device.

When running under VM, CP intercepts the SSCH and interprets the CCW
chain.  If the request can be satisfied from the minidisk cache, CP will
do so at this point and signal completion of the request to the guest.
If not, CP queues the request on its own queue for the physical device,
where it is potentially merged with requests to the same physical device
from other guests.

The request then goes out via the I/O subsystem through the channel to
the actual device, which usually means some storage subsystem, say an
ESS (Shark).  The Shark now tries to satisfy the request from its own
cache; if it misses there, the Shark control unit needs to perform
physical I/O operations across its internal data busses to the real
hard disks integrated in the Shark.  Once physical I/O has completed,
that event is signalled back across this whole chain.

If you want to optimize overall performance in this scenario, the most
important thing is to first identify the actual bottleneck that is
currently limiting performance.  Depending on the workload, this could
be in a variety of places in the chain described above.  However, when
using ESCON channels, what we have usually seen is that the raw
throughput of a single physical ESCON channel is the bottleneck.

If this is the case, then the only thing that helps is to eliminate this
bottleneck, either by going to FICON channels, or else by using multiple
ESCON channels in parallel.  If you want to try the second option, the
problem becomes that any single physical device can only be accessed via
one physical channel at a time.
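The channel-bottleneck argument can be put in back-of-envelope numbers.
The figure below assumes a nominal 17 MB/s per ESCON channel, which is
the commonly quoted rate; treat it as an illustrative assumption, since
real sustained throughput varies with configuration:

```python
# Back-of-envelope: if the channel is the bottleneck, adding parallel
# device queues on the same channel does not help; adding channels does.

ESCON_MB_S = 17.0    # assumed nominal per-channel throughput
WORKLOAD_MB = 1700.0  # hypothetical total amount of I/O to move

def transfer_time(channels, queues_per_channel=1):
    # Queues sharing one channel still serialize on that channel's
    # bandwidth, so queues_per_channel drops out of the calculation
    # entirely -- that is the whole point.
    return WORKLOAD_MB / (channels * ESCON_MB_S)

print(transfer_time(channels=1, queues_per_channel=4))  # -> 100.0 seconds
print(transfer_time(channels=4))                        # -> 25.0 seconds
```

Four virtual devices behind one channel still take the full 100 seconds;
only four real channels (or PAV over multiple channels) cut it to 25.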
To circumvent this limitation you can either use PAV, which directly addresses this specific problem, or else try to spread your workload across multiple *physical* devices, so that at least your total I/O load makes use of all the channels you have. This scenario is one you typically want to use an LVM-striped volume for. However, for this to have its intended effect, it is crucial that the multiple volumes underlying the LVM go indeed to multiple physical devices at the VM level, so that you actually have the benefit of driving load across multiple ESCON channels. (Once you have done that, the next bottleneck you will run into is the internal bandwidth of the Shark subsystem; to circumvent that you'll want to make sure that those multiple physical volumes are also spread out across multiple internal busses of the Shark. The final limit that you cannot circumvent any more is then the total bandwidth of the Shark controller itself. You can get up to this limit with I/O from just a single Linux guest if you have set up everything right.) Now, in the scenario that you describe, you have CP set up multiple virtual devices for the guest which actually map to the same physical device. This means that Linux can perform multiple SSCHs on those devices in parallel. However, what benefit does this bring? If the requests are satisfied out of the minidisk cache, this will happen in a short time anyway, so you don't gain much by parallelism. If the requests need to go to the real device anyway, then they will queue up and be limited by the channel bandwidth anyway; only the queuing now happens in CP instead of Linux. The only situation where you conceivably have an advantage is if while one request is processed at the physical device, another request can be satisfied from the minidisk cache. Now, how likely is this to happen? 
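As an aside on the striping point above: the reason striping spreads
load across channels is that consecutive chunks of the logical volume go
round-robin to the underlying physical volumes.  A minimal sketch of
that address mapping (chunk size and the DASD names are made up for
illustration; this is the general striping idea, not LVM's actual code):

```python
# Toy model of LVM striping: logical sectors are mapped round-robin, one
# chunk at a time, onto the underlying physical volumes.

CHUNK_SECTORS = 16  # assumed stripe chunk size, in sectors

def stripe_map(logical_sector, physical_volumes):
    chunk = logical_sector // CHUNK_SECTORS
    pv = physical_volumes[chunk % len(physical_volumes)]
    physical_sector = ((chunk // len(physical_volumes)) * CHUNK_SECTORS
                       + logical_sector % CHUNK_SECTORS)
    return pv, physical_sector

pvs = ['dasda', 'dasdb', 'dasdc']  # hypothetical DASD volumes
for ls in (0, 16, 32, 48):
    print(ls, stripe_map(ls, pvs))
# -> 0 ('dasda', 0) / 16 ('dasdb', 0) / 32 ('dasdc', 0) / 48 ('dasda', 16)
```

A sequential I/O stream thus hits all three volumes in turn; but as
stressed above, this only helps if 'dasda', 'dasdb' and 'dasdc' are
really distinct physical devices reachable over distinct channels at the
VM level, not three minidisks on the same pack.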
Note that for a request to be passed up this far at all, it cannot be in
Linux's page cache, which means it hasn't been accessed by the guest for
a while.  Isn't it then likely also not in VM's minidisk cache?  In
particular on read/write data volumes (which are nearly always private
to one guest), I don't see how the minidisk cache can help much here.
(Well, unless you have restricted the guest to so little memory that it
can't maintain a proper page cache ...)

On the other hand, if you have a large number of guests using a
read-only shared device, then I guess minidisk caching certainly becomes
important.  But as always with performance questions: 'It depends!'
For different application scenarios, different setups will be needed to
give optimal performance.  My comments were aimed at a situation like
the one the original poster described: running a Linux application with
a high I/O load to non-shared read/write devices.

Mit freundlichen Gruessen / Best Regards

Ulrich Weigand

--
  Dr. Ulrich Weigand
  Linux for S/390 Design & Development
  IBM Deutschland Entwicklung GmbH
  Schoenaicher Str. 220, 71032 Boeblingen
  Phone: +49-7031/16-3727   ---   Email: [EMAIL PROTECTED]
