****************** Very long post warning *****************
I thought I'd posted this before, but didn't see it in the
archives. This was originally written as a post to IBM's
internal software packaging forum on the topic of block sizes to
recommend for different system software data sets.
It's a bit dated now (written when SLED was the order of the day
and RAID was new) but I think most of it still applies. One
thing it doesn't include that I should have mentioned back then
is the overhead of simply needing more CCWs to get the job done;
this still applies even though some of the physical delays no
longer do. But time to think about those updates and make them
below is gone for this week, so...here 'tis, as it is.
Also, thanks to Darren for his help in allowing this extremely
long post.
What's a Block, Anyway?
When doing I/O to a tape or DASD device, LRECL is irrelevant.
Only the block size matters. This is because each physical
record on tape and DASD is what we in software call a block.
(This leads to interesting conversations sometimes between
hardware and software people. "You gave me a block." "Nope,
just one record.") So when we write to one of these devices, we
only care about the characteristics of a block.
There are three kinds of blocks: Fixed, Variable, and Undefined.
When a fixed block is written, the physical record's length is
always equal to the block size, except when there aren't enough
records to fill an entire block. In this case, the last block can be a
short block. Variable blocks are written on a block-by-block
basis, and each block can be a different length.
The length of each variable block is stored in the physical
record as the BDW, or Block Descriptor Word, and when there are
variable-length records, there's a corresponding RDW, or...you
guessed it...Record Descriptor Word. The BDW's length is left as
an exercise for the Alert Reader. (Want a hint? The maximum
block length for data is 32760 in MVS, not 32768. The actual
maximum length of a block itself is 32768, and is limited by (at
least) the specification of the block size in the DEB as a signed
two-byte field. The hardware limit is established by the Count
fields in both Format 0 and Format 1 CCWs, and is 64K.)
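For the Alert Reader who'd rather peek at the answer, here's a
minimal Python sketch of how a RECFM=VB block breaks down, assuming
you already have the raw block in memory. The BDW/RDW layout in the
comments is the standard one; the sample block itself is made up.

    import struct

    def parse_vb_block(block: bytes):
        """Split one RECFM=VB block into its logical records.

        BDW: 2-byte block length (includes the 4-byte BDW), then 2 zero bytes.
        RDW: 2-byte record length (includes the 4-byte RDW), then 2 zero bytes.
        """
        (block_len,) = struct.unpack(">H", block[0:2])
        records, offset = [], 4                      # skip past the BDW
        while offset < block_len:
            (rec_len,) = struct.unpack(">H", block[offset:offset + 2])
            records.append(block[offset + 4:offset + rec_len])  # data follows the RDW
            offset += rec_len
        return records

    # Made-up example: one block holding two short records.
    rec1, rec2 = b"HELLO", b"WORLD!!"
    body = (struct.pack(">HH", 4 + len(rec1), 0) + rec1 +
            struct.pack(">HH", 4 + len(rec2), 0) + rec2)
    block = struct.pack(">HH", 4 + len(body), 0) + body
    print(parse_vb_block(block))                     # [b'HELLO', b'WORLD!!']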
Space and Block Length for FB
When fixed blocks are written to DASD, a block continues on a
partly-used track only when there is enough space left on that
track to hold the entire block; otherwise, it starts on a new
track. This means
that allocating FB data sets with block sizes above half the
track length is a guaranteed way to waste lots of space. Every
time two full-size blocks follow one another, the balance of the
track will be unused and the second full-size block written on
the next track.
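Here's a quick back-of-the-envelope Python sketch of the effect.
It treats a track as a flat 56,664-byte budget (roughly a 3390
track) and ignores per-record overhead, so the numbers are
illustrative only:

    def fb_track_usage(blksize, usable_track=56664):
        """How much of a (simplified) track full-size FB blocks occupy."""
        blocks = usable_track // blksize             # full blocks per track
        used = blocks * blksize
        return blocks, used, 100.0 * used / usable_track

    for b in (27998, 30000, 32760):                  # half-track, then two oversized picks
        blocks, used, pct = fb_track_usage(b)
        print(f"BLKSIZE={b:>5}: {blocks} block(s)/track, {used} bytes used ({pct:.0f}%)")

Anything much over half a track drops you to one block per track,
and the rest of the track just sits there.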
Space and Block Length for VB
The above is entirely true for FB, but it's a slight
simplification for VB. For VB, the actual average block length
will dictate whether space utilization gets worse as the block
size rises. This will be a function of the size of the members
and the distribution of differently-sized members within each
data set. Since every new PDS member starts a new block, if all
the members are small a high block size won't actually hurt
anything. But if some or all of the members are larger than half
a track, space utilization will get worse when the block size goes
over half a track.
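If you'd like to see the member-boundary effect concretely, here's
a toy Python sketch. The member sizes are invented, and BDW/RDW
and directory overhead are ignored:

    def member_blocks(member_sizes, blksize):
        """Lengths of the blocks written when every member starts a new
        block and a member longer than BLKSIZE spills into more blocks."""
        blocks = []
        for size in member_sizes:
            while size > blksize:
                blocks.append(blksize)
                size -= blksize
            blocks.append(size)                      # short final block for this member
        return blocks

    small = [800, 1200, 450, 2400]                   # hypothetical small members (bytes)
    print(member_blocks(small, 6160))                # [800, 1200, 450, 2400]
    print(member_blocks(small, 27998))               # same blocks either way
    print(member_blocks([60000], 27998))             # [27998, 27998, 4004]

Small members write the same short blocks no matter what the block
size is; it's the big members whose full-size blocks run into the
half-track placement problem above.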
Little Blocks are Bad
On the other hand, short block sizes are bad because space is
wasted between records on the Count and Key fields of CKD (Count
Key Data) DASD, which is what we use in MVS. To avoid wasting that
space between records, we want to use high block sizes.
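To put rough numbers on that, here's a Python sketch that charges
each block a flat overhead for its count field and gaps. The
300-byte overhead and the 56,664-byte track are stand-ins (the real
cost varies with the device and block length); the shape of the
curve is the point:

    def data_per_track(blksize, usable_track=56664, per_block_overhead=300):
        """Crude model: each block costs BLKSIZE plus a fixed overhead."""
        blocks = usable_track // (blksize + per_block_overhead)
        return blocks * blksize

    for b in (80, 800, 6160, 27998):
        print(f"BLKSIZE={b:>5}: ~{data_per_track(b):>6} data bytes per track")

Tiny blocks spend most of the track on overhead; blocks near half a
track spend almost none of it.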
Bigger Blocks are Better...to a Point
A reasonable compromise (for FB and VB) between too-short blocks
and too-long blocks is half the track length, which minimizes the
wastage on average. It's actually a bit more complicated than
that (pick up a DASD hardware book and a calculator for the gory
details), but DFSMSdfp's System Determined Blocksize, or SDB,
takes care of the complication and picks the value nearest half a
track that's right for the device and the record length and format
specified. More or less, anyway.
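As a rough sketch of the idea (not the actual SDB algorithm, which
consults the real device geometry), for RECFM=FB it amounts to
picking the largest multiple of the record length that fits in half
a track; 27998 is the usual half-track figure for a 3390:

    def sdb_like_fb_blksize(lrecl, half_track=27998, limit=32760):
        """Largest multiple of LRECL that fits in half a (3390) track."""
        return (min(half_track, limit) // lrecl) * lrecl

    print(sdb_like_fb_blksize(80))       # 27920
    print(sdb_like_fb_blksize(121))      # 27951
    print(sdb_like_fb_blksize(1024))     # 27648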
For most data sets, this comes very close to optimizing space
usage and performance. Not perfect for every data set, mind you,
but darn close for the overwhelming majority, and close enough
that trying to write code to figure out *all* the intricacies
would probably occupy someone in SVL for a lifetime (or two) and
is probably light-years from cost-justified. (For some reason,
those pesky programmers want to get *paid*. Sheesh!)
Distribution of Member Sizes and Loading Order
However, the mix of member sizes and the order the members are
loaded in can make SDB less than optimum for some data sets.
(Remember, too, that each PDS member starts a new block, there
are short blocks to think about, and some data sets are VB.) The
only consistent exception I've seen, though, is fonts (as [the
then-owner of the packaging rules] says, "Fonts are *always*
different"), and [the then-owner of the font FMIDs]'s got the
numbers to prove this to anyone who doubts it. Unless you're
really into pain, I suggest *not* asking [her] to give you the
numbers. They'll give you a headache. Really.
Use SDB...Most of the Time
So, fonts aside, SDB is *really* likely to be the right block
size to tell customers to use when allocating *almost* anything
but load libraries. There are other exceptions, too, like UADS,
but none of them are really software libraries. If you're doing
pretty standard stuff, use SDB. If you're doing something weird
(and UADS is pretty weird), check to see if one of your libraries
is an exception to the rule.
But--Wasn't This About Load Libraries?
Oh, yeah; those things! Load libraries containing load modules
have an undefined record format, RECFM=U. And their blocks are
also, well, um, undefined. They're written however the owner of
the code that writes them thinks they should be.
So there are no rules for Undefined blocks as a group. And there
are data sets using RECFM=U that aren't load libraries. I
haven't talked to the people that use such libraries, and have no
idea what block sizes might be optimum for them, individually or
as a group. (I'm not sure I even *want* to know.) Happily,
nobody is shipping any of these for system software, so I don't
have to understand them--yet.
But I did pester the owners of IEBCOPY, the linkage editor, and
Program Fetch at some length about load module block sizes.
Several times. I think I even understand most of what they've
told me now. Sorta scary, that, when I think about it.
Kinds of Load Module Records
Load modules (not Program Objects, which are stored only in
PDSEs) comprise a number of records each. There are one or more
ESD records, which are used by Program Fetch to resolve external
symbols. There are also RLD records, used to resolve relocatable
address constants. Then there are IDR records, and Control
records. These are all typically short. RLD and Control records
are interspersed throughout load modules, while ESD and IDR
records are at the beginning of load modules.
Then there are Text records, which make up the bulk of most load
modules. These contain the executable code, funny-looking
machine language stuff.
Maximum and Minimum Block Sizes for Text Records
When COPYMOD or the linkage editor writes a load module to a data
set, the allocation block size sets the *maximum* block size.
Short blocks are always written for RLD, ESD, Control, and IDR
records. More to the point, while writing Text, RLD, and Control
records, a TRACKBAL macro is issued before writing each block to
see how much space is left on the track. If there's enough space,
a block is written that's as long as the remaining space on the
track or the maximum block size, whichever is smaller.
There is also a minimum block size that will be written by either
utility; this sets the *minimum* size of a text block. The
smallest text block that the linkage editor and binder will try to
write is 1024 bytes.
Writing Text Records
When the space left on the track is more than the minimum block
length (1024 bytes), but less than the maximum block length, and
the text left to be written is more than 2048 bytes long, the
text can be split. What will fit on the track becomes the last
block of the track, and what won't fit on the track becomes the
first part of the first record, or the entire first record, on
the next track. This process is repeated for each block until
the end of the load module is reached. The next load module
starts in a new block, right after the block in which the
previous one ended.
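Here's a toy Python model of that packing logic. It charges each
block an invented 300 bytes of count-field/gap overhead, uses
56,664 bytes as a stand-in for a 3390 track, and ignores
ESD/RLD/IDR/Control records entirely, so treat the numbers as
shapes, not facts:

    def pack_text(modules, blksize, track=56664, overhead=300, min_blk=1024):
        """Toy model of COPYMOD/linkage editor text packing."""
        tracks, left, blocks = 1, track, 0
        for text in modules:                 # each module's text starts a new block
            while text > 0:
                room = left - overhead
                # Start a fresh track if even a minimum block won't fit, or the
                # leftover text is too short (<= 2048 bytes) to be worth splitting.
                if room < min_blk or (text <= 2048 and text > room):
                    tracks += 1
                    left = track
                    room = track - overhead
                chunk = min(text, blksize, room)   # TRACKBAL-style: fill what's left
                blocks += 1
                left -= chunk + overhead
                text -= chunk
        return blocks, tracks

    modules = [90_000, 4_000, 250_000, 17_000]     # hypothetical text sizes (bytes)
    for b in (6144, 32760):
        nblocks, ntracks = pack_text(modules, b)
        print(f"BLKSIZE={b:>5}: {nblocks:>3} text blocks on {ntracks} track(s)")

A smaller maximum block size means more blocks (and more per-block
overhead) for the same text, which is where both the space and the
performance costs come from.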
So COPYMOD and the linkage editor do their best to stuff every
byte that will fit onto every track. Pretty neat, huh? *Someone*
was on the ball when this code was written!
Performance and Space Utilization
How much high load library block sizes help out performance and
space usage depends on how long the load modules in the data set
are. For example, the CSSLIB library is composed of small load
modules, all of which are currently 4K or smaller in size.
Increasing the block size of this data set past 4K does no good
at all. But--neither does it hurt. The same blocks will be
written in the same spots for any block size greater than or
equal to 4K.
On the other hand, this matters a great deal for, say, LINKLIB,
which has lots of big load modules. It keeps getting better
right up to the 32760 block size limit. The same is true of lots
of load libraries. Since 32760 never hurts, and lower block
sizes can, just recommending 32760 provides a single, consistent
value customers can use that's very often right, and *never*
wrong. (I originally qualified this statement, but despite many
challenges over a 5-year period, nobody has found an exception yet.)
So What?
But why does this matter? After all, DASD is cheaper by the day
(and, Hey!, we sell that stuff, too, don't we?).
Well, OS/390 takes over four 3390-3 volumes now, and it's still
not shrinking. I tested one data set in 1999, for DFSORT, and
found a 20% reduction in the space used when the library was
blocked at 32760 instead of 6144.
20% is significant.
Program Fetch Performance
On "native" (non-emulated) DASD, another significant thing is the
corresponding 20% reduction in head switching (to read another
track, you've got to use another magnetic read head in the disk
drive), which in turn is a 1 1/3% reduction in seek time. 1 1/3%
might not sound like much, but a seek takes at least 1.5ms, which
is a Long Time to a computer. And the head switches and seeks
can take a *lot* longer than 1.5ms. Why, you ask?
Well, *since* you asked, Program Fetch tries to get a program off
DASD all at once. It doesn't know how long the module is when it
starts, so it gets the first few records in the first shot. They
tell it if there are more records, and later records can likewise
tell it about still more records to fetch. Then, on the fly, it
inserts CCWs into the channel program to read each successive
record. It does this using a Program-Controlled Interrupt (PCI)
design. If the processor is busy and the Fetch task isn't
dispatched in time to insert the next CCW into the channel
program, the channel program ends prematurely.
This is a Bad Thing. The disk won't wait, and keeps spinning. By
the time the I/O is redriven, it's probably too late to catch the
next record without
waiting...for...the...disk...to...turn...all...the...way...around.
This takes 14ms on native 3390 DASD. This is 1,673 Dog Years
to a computer.
Having to wait for the disk to turn around, taking its sweet
time, is called an RPS miss. (No, not *that* RPS. No trucks are
involved. This RPS stands for Rotational Position Sensing.)
(Note: DASD control unit cache reduces the probability of RPS
misses for data on the same track.) We really hate it when this
happens. That low rumble you hear is users grumbling about
response time.
The probability of an RPS miss goes up with the number of records
used to write a load module. Because COPYMOD and the linkage
editor will always write a record if they can, block sizes below
32760 just make them write more records than they have to. So
the performance improvements are even greater than the space
utilization improvements, when you care the most about
performance--that is, when the system is Really Busy.
When the system is lightly loaded, performance is only worsened
by the greater number of tracks to read, which is only a couple
percent...but Program Fetch gets used a *lot*.
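If you like arithmetic with your grumbling, here's a
back-of-the-envelope Python sketch. The per-block miss probability
is invented (it depends entirely on how busy the processor is), the
block counts are hypothetical, and the 14ms is the 3390 rotation
figure from above; the point is just that the expected delay scales
with the number of blocks Fetch has to chase:

    def expected_redrive_delay(blocks, miss_probability, revolution_ms=14):
        """Expected extra milliseconds if each block read is an independent
        chance to miss the PCI window and wait one full revolution."""
        return blocks * miss_probability * revolution_ms

    # Hypothetical module: 9 blocks at BLKSIZE=32760 vs. 45 at a small one.
    for nblocks in (9, 45):
        print(f"{nblocks:>2} blocks: ~{expected_redrive_delay(nblocks, 0.05):.1f} ms expected")

Fewer, bigger blocks mean fewer chances to miss.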
What about newer DASD devices? Well, they do head-switching
under the covers that is not apparent to the operating system,
and there are probably mini-RPS misses happening that are handled
by the control unit microcode. But these are not under our
control. It is still true that larger block sizes lower the
probability of missed PCI interrupts and having to redrive the
I/O. So larger block sizes still mean better performance.
--
John Eells
z/OS Technical Marketing
IBM Poughkeepsie
[EMAIL PROTECTED]