David Boyes wrote:

>Field testing seems to indicate that there is a noticeable benefit to
>this approach.

Could you describe the scenarios where this setup provides benefit?
I guess there might be some benefit in cases where you expect requests
to be satisfied from the MDC frequently, i.e. read/only shared devices
with multiple guests accessing them.

In the case of read/write unshared devices I don't see why this
approach should help, though.

>Is it equivalent to PAV? Maybe not. Perhaps I don't understand PAV well
>enough.
>
>Can you recommend some further background reading?

Well, I don't have any specific documents about PAV; I found this
link which (amongst WLM-related stuff) describes its basic operation:
http://www-1.ibm.com/servers/eserver/zseries/zos/wlm/release_history/sharkptf.html

What you describe is certainly in some way similar, but the question
is whether it addresses the actual bottleneck ...


When you run an I/O intensive workload on a Linux guest, what happens
can be summarized roughly like this:

Linux user space processes (database, etc.) access files in the Linux
file system.  The Linux kernel initially matches those accesses
against the page cache entries underlying those files; if the cache
is hit, the user space application will continue immediately.

If a file page is not present in the page cache (or if the application
has modified a page and requested synchronous write-out), Linux needs
to perform an actual I/O operation.  To this purpose, an 'I/O request'
data structure is generated, and queued on the I/O request queue for
the device underlying the file system.  If that device is LVM-mapped,
the request will be redistributed to the underlying device(s) by the
LVM layer.
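As a minimal illustration of the "synchronous write-out" case (a toy sketch; the temp file and helper name are invented for this example), a plain write() only dirties pages in the page cache, while fsync() forces the kernel to generate real I/O requests and wait for them:

```python
import os
import tempfile

def write_sync(data: bytes) -> bytes:
    """Write data, force write-out with fsync, and read it back."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, data)   # lands in the kernel page cache only
        os.fsync(fd)         # forces real I/O requests down the stack
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(data))
    finally:
        os.close(fd)
        os.remove(path)
```

Without the fsync(), the application continues immediately and the actual I/O may happen much later, driven by the kernel's write-back machinery.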

At the layer of real block devices, all such requests from all processes
in the system are collected and queued up.  The requests in the queue
are merged and possibly reordered to optimize the physical accesses
performed.  The resulting requests are then passed to the device driver,
which translates each request into a CCW chain.  Those are then used to
initiate S/390 I/O operations (START SUBCHANNEL).  Of course, only one
such operation can be outstanding at the same time on one device.
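A toy model of that merge step (nothing like the actual kernel elevator code, just the idea) might look like:

```python
def merge_requests(requests):
    """Toy I/O scheduler: sort requests by start sector and merge
    adjacent ones into a single larger request -- roughly what the
    block layer does before handing requests to the device driver.
    Each request is a (start_sector, sector_count) pair."""
    merged = []
    for start, count in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            # Request begins exactly where the previous one ends:
            # coalesce them into one larger physical access.
            merged[-1] = (merged[-1][0], merged[-1][1] + count)
        else:
            merged.append((start, count))
    return merged
```

So three small requests from different processes can leave the queue as one large sequential access plus one seek, which is much cheaper on the physical device.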

When running under VM, CP intercepts the SSCH and interprets the CCW
chain.  If the requests can be satisfied from the minidisk cache, CP
will do so at this point, and signal completion of the request to the
guest.  If not, CP will queue the request on its own queue for the
physical device, where it is potentially merged with requests to the
same physical device from other guests.

Then the request goes out via the I/O subsystem through the channel
to the actual device, which usually means some storage subsystem, say
an ESS (Shark).  The Shark now tries to satisfy that request from its own
cache; if it misses there, the Shark control unit needs to perform
physical I/O operations across its internal data busses to the real
hard disks integrated in the Shark.

Once physical I/O has completed, that event is then signalled back
across this whole chain.
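The whole chain can be pictured as a cascade of cache lookups (the function and level names here are invented for illustration, not real interfaces):

```python
def resolve(block, page_cache, mdc, shark_cache):
    """Walk the cache hierarchy described above and report which
    level satisfies a read for the given block number."""
    if block in page_cache:
        return "linux page cache"         # no I/O operation at all
    if block in mdc:
        return "vm minidisk cache"        # SSCH intercepted by CP
    if block in shark_cache:
        return "storage subsystem cache"  # channel I/O, but no disk access
    return "physical disk"                # full trip to the spindles
```

Each level only ever sees the misses of the level above it, which matters for the minidisk cache discussion further down.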


If you want to optimize overall performance in this scenario, the
most important thing is to first identify the actual bottleneck
that is currently limiting performance.  Depending on the workload,
this could be in a variety of places in the chain described above.
However, when using ESCON channels, what we have usually seen is
that the raw throughput of a single physical ESCON channel is the
bottleneck.

If this is the case, then the only thing that helps is to eliminate
this bottleneck, either by going to FICON channels, or else by using
multiple ESCON channels in parallel.  If you want to try the second
option, the problem becomes that any single physical device can only
be accessed via one physical channel at the same time.  To circumvent
this limitation you can either use PAV, which directly addresses this
specific problem, or else try to spread your workload across multiple
*physical* devices, so that at least your total I/O load makes use
of all the channels you have.
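To make the arithmetic concrete (the ~17 MB/s figure is the commonly quoted ESCON peak rate; treat it as an approximation):

```python
ESCON_MBS = 17  # approximate peak data rate of one ESCON channel, MB/s

def max_throughput(n_channels, per_channel_mbs=ESCON_MBS):
    """Upper bound on aggregate I/O throughput when the load is
    spread evenly across n physical channels.  A single physical
    device can only ever use one channel at a time, so a lone
    device is stuck at per_channel_mbs no matter what."""
    return n_channels * per_channel_mbs
```

Four ESCON channels driven in parallel thus cap out around 68 MB/s, while any workload confined to one physical device stays at roughly 17.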

This scenario is one you typically want to use an LVM-striped volume
for.  However, for this to have its intended effect, it is crucial
that the multiple volumes underlying the LVM go indeed to multiple
physical devices at the VM level, so that you actually have the
benefit of driving load across multiple ESCON channels.
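The effect of striping can be sketched with a simplified block-to-device mapping (stripe size and device count here are arbitrary example values, not LVM defaults):

```python
def stripe_map(logical_block, n_devices, stripe_blocks=16):
    """Map a logical block of a striped volume to (device index,
    block on that device) -- the way LVM striping distributes
    consecutive stripes round-robin across the underlying devices."""
    stripe = logical_block // stripe_blocks
    device = stripe % n_devices
    block_in_stripe = logical_block % stripe_blocks
    device_block = (stripe // n_devices) * stripe_blocks + block_in_stripe
    return device, device_block
```

A sequential scan thus touches all devices in turn, which is exactly what lets the load use multiple channels at once -- but only if those minidisks really sit on different physical volumes.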

(Once you have done that, the next bottleneck you will run into is
the internal bandwidth of the Shark subsystem; to circumvent that
you'll want to make sure that those multiple physical volumes are
also spread out across multiple internal busses of the Shark.

The final limit that you cannot circumvent any more is then the
total bandwidth of the Shark controller itself.  You can get up
to this limit with I/O from just a single Linux guest if you have
set up everything right.)


Now, in the scenario that you describe, you have CP set up multiple
virtual devices for the guest which actually map to the same physical
device.  This means that Linux can perform multiple SSCHs on those
devices in parallel.  However, what benefit does this bring?

If the requests are satisfied out of the minidisk cache, this will
happen in a short time anyway, so you don't gain much by parallelism.
If the requests need to go to the real device, then they will
queue up and be limited by the channel bandwidth regardless; only the
queuing now happens in CP instead of in Linux.  The only situation where
you conceivably have an advantage is if while one request is processed
at the physical device, another request can be satisfied from the
minidisk cache.

Now, how likely is this to happen?  Note that for a request to
travel this far down the stack at all, its data cannot be in Linux's
page cache, which means it hasn't been accessed by the guest for a while.
Isn't it then likely also not in VM's minidisk cache?  In particular
on read/write data volumes (which are nearly always private to one
guest), I don't see how the minidisk cache can help much here.
(Well, unless you have restricted the guest to so little memory
that it can't maintain a proper page cache ...)
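A crude way to see this effect is a toy two-level LRU model (this has nothing to do with the real page cache or MDC implementations; it only captures the filtering argument above):

```python
from collections import OrderedDict

class LRU:
    """Minimal LRU cache holding at most `size` block numbers."""
    def __init__(self, size):
        self.size, self.data = size, OrderedDict()
    def access(self, block):
        hit = block in self.data
        if hit:
            self.data.move_to_end(block)
        else:
            self.data[block] = None
            if len(self.data) > self.size:
                self.data.popitem(last=False)
        return hit

def mdc_hits(accesses, guest_pages, mdc_pages):
    """Count how often the minidisk cache helps: only guest page
    cache *misses* ever reach CP, so the MDC sees a stream that is
    already stripped of all recently used blocks."""
    guest, mdc = LRU(guest_pages), LRU(mdc_pages)
    hits = 0
    for block in accesses:
        if not guest.access(block):     # miss in the guest...
            hits += mdc.access(block)   # ...does the MDC save us?
    return hits
```

With a working set that overflows both caches (say, a cyclic scan of 100 blocks against two 50-block caches), every access misses both layers and the MDC contributes nothing.  Shrink the guest cache far below the MDC size and the MDC suddenly starts to matter -- which is exactly the "too little guest memory" case above.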

On the other hand, if you have a large number of guests using a
read-only shared device, then I guess minidisk caching becomes
certainly important.


But as always with performance questions, 'It depends!'.
For different application scenarios, different setups will be needed
to give optimal performance.  My comments were aimed at a situation
like the original poster described: running a Linux application with
a high I/O load generated to non-shared read/write devices.


Mit freundlichen Gruessen / Best Regards

Ulrich Weigand

--
  Dr. Ulrich Weigand
  Linux for S/390 Design & Development
  IBM Deutschland Entwicklung GmbH, Schoenaicher Str. 220, 71032 Boeblingen
  Phone: +49-7031/16-3727   ---   Email: [EMAIL PROTECTED]
