****************** Very long post warning *****************

I thought I'd posted this before, but didn't see it in the archives. This was originally written as a post to IBM's internal software packaging forum on the topic of block sizes to recommend for different system software data sets.

It's a bit dated now (written when SLED was the order of the day and RAID was new) but I think most of it still applies. One thing it doesn't include that I should have mentioned back then is the overhead of simply needing more CCWs to get the job done; this still applies even though some of the physical delays no longer do. But the time to think through those updates and work them in is gone for this week, so...here 'tis, as it is.

Also, thanks to Darren for his help in allowing this extremely long post.

                      What's a Block, Anyway?

When doing I/O to a tape or DASD device, LRECL is irrelevant. Only the block size matters. This is because each physical record on tape and DASD is what we in software call a block. (This leads to interesting conversations sometimes between hardware and software people. "You gave me a block." "Nope, just one record.") So when we write to one of these devices, we only care about the characteristics of a block.

There are three kinds of blocks: Fixed, Variable, and Undefined. When a fixed block is written, the physical record is always equal to the block size except when there aren't enough records to fill an entire block. In this case, the last block can be a short block. Variable blocks are written on a block-by-block basis, and each block can be a different length.

The length of each variable block is stored in the physical record as the BDW, or Block Descriptor Word, and when there are variable-length records, there's a corresponding RDW, or...you guessed it...Record Descriptor Word. The BDW's length is left as an exercise for the Alert Reader. (Want a hint? The maximum block length for data is 32760 in MVS, not 32768. The actual maximum length of a block itself is 32768, and is limited by (at least) the specification of the block size in the DEB as a signed two-byte field. The hardware limit is established by the Count fields in both Format 0 and Format 1 CCWs, and is 64K.)
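
To make that layout concrete, here's a small Python sketch (nothing IBM ships, just an illustration) that builds and then walks a VB-style block. It assumes the standard, non-spanned format: a 4-byte BDW whose first halfword is the block length including the BDW itself, and a 4-byte RDW in front of each record whose first halfword is the record length including the RDW; the second halfword of each descriptor is simply left zero here.

    import struct

    def build_vb_block(records):
        """Build a VB-style block: a BDW, then an RDW plus data for each record.
        Lengths are big-endian halfwords; the second halfword of each descriptor
        is left zero (it carries spanned-record segment flags, which this sketch
        ignores)."""
        body = b""
        for rec in records:
            body += struct.pack(">HH", len(rec) + 4, 0) + rec   # RDW length includes the RDW
        return struct.pack(">HH", len(body) + 4, 0) + body      # BDW length includes the BDW

    def walk_vb_block(block):
        """Yield the data portion of each logical record in a block."""
        blklen = struct.unpack(">HH", block[:4])[0]
        pos = 4
        while pos < blklen:
            reclen = struct.unpack(">HH", block[pos:pos + 4])[0]
            yield block[pos + 4:pos + reclen]
            pos += reclen

    blk = build_vb_block([b"FIRST RECORD", b"A SECOND, LONGER RECORD"])
    print(len(blk), list(walk_vb_block(blk)))

Counting those descriptor bytes should go a long way toward the Alert Reader's exercise.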

                      Space and Block Length for FB

When fixed blocks are written to DASD, a block is written on a partly-used track only when there is enough space left on that track to hold the entire block; otherwise it starts on a new track. This means that allocating FB data sets with block sizes above half the track length is a guaranteed way to waste lots of space: only one full-size block fits on each track, the balance of the track goes unused, and the next full-size block is written on the following track.
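
Here's a rough way to see the half-track effect with a little Python. The model is deliberately oversimplified: it ignores the per-block Count fields and gaps a real CKD device adds, and just divides a 3390's 56,664-byte single-record track capacity by the block size. The 27,998 figure is the largest block size for which two blocks actually fit on a 3390 track.

    def blocks_per_track(blksize, track_capacity=56_664):
        """Crude model: whole blocks per track, ignoring the per-block Count-field
        and gap overhead a real CKD device adds."""
        return track_capacity // blksize

    def utilization(blksize, track_capacity=56_664):
        return blocks_per_track(blksize, track_capacity) * blksize / track_capacity

    for blk in (27_998, 29_000, 32_760):    # half a track, just over half, the maximum
        print(blk, blocks_per_track(blk), f"{utilization(blk):.0%}")

Going from 27,998 to 29,000 drops the (modeled) track utilization from about 99% to about 51%; once the block size passes half a track, only one full-size block fits per track and the rest of the track is wasted.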

                      Space and Block Length for VB

The above is entirely true for FB, but it's a slight simplification for VB. For VB, the actual average block length will dictate whether space utilization gets worse as the block size rises. This will be a function of the size of the members and the distribution of differently-sized members within each data set. Since every new PDS member starts a new block, if all the members are small a high block size won't actually hurt anything. But if some or all the members are larger than 1/2 track, space utilization will get worse when the block size goes over half a track.

                         Little Blocks are Bad

On the other hand, small block sizes are bad because each physical record on CKD (Count Key Data) DASD, which is what we use in MVS, carries its own overhead: a Count field, an optional Key field, and the gaps between records. The more blocks we write, the more of the track goes to that overhead instead of data. To avoid wasting that space, we want to use large block sizes.

                  Bigger Blocks are Better...to a Point

A reasonable compromise (for FB and VB) between too-short blocks and too-long blocks is half the track length, which minimizes the wastage on average. It's actually a bit more complicated than that (pick up a DASD hardware book and a calculator for the gory details), but DFSMSdfp's System Determined Blocksize, or SDB, takes care of the complication and picks the value nearest half a track that's right for the device, the record format, and the record length specified. More or less, anyway.

For most data sets, this comes very close to optimizing space usage and performance. Not perfect for every data set, mind you, but darn close for the overwhelming majority, and close enough that trying to write code to figure out *all* the intricacies would probably occupy someone in SVL for a lifetime (or two) and is probably light-years from cost-justified. (For some reason, those pesky programmers want to get *paid*. Sheesh!)
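
For the curious, here's roughly what that half-track selection looks like for RECFM=FB, sketched in Python. The 27,998-byte "half track" is the 3390 two-blocks-per-track limit; real SDB works from the device geometry rather than a hard-coded constant, so treat this as an approximation of the idea, not the actual algorithm.

    MAX_BLKSIZE = 32_760    # the MVS block size limit

    def sdb_like_fb_blocksize(lrecl, half_track=27_998):
        """Approximate SDB for RECFM=FB: the largest multiple of LRECL that still
        lets two blocks fit on a track, capped at 32760."""
        limit = min(half_track, MAX_BLKSIZE)
        if lrecl > limit:
            return lrecl    # a guess: one record per block; real SDB has its own rules here
        return (limit // lrecl) * lrecl

    print(sdb_like_fb_blocksize(80))      # 27920, the familiar 3390 answer for card-image data
    print(sdb_like_fb_blocksize(1024))    # 27648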

           Distribution of Member Sizes and Loading Order

However, the mix of block sizes and the order the members are loaded in can make SDB less-than-optimum for some data sets. (Remember, too, that each PDS member starts a new block, there are short blocks to think about, and some data sets are VB.) The only consistent exception I've seen, though, is fonts (as [the then-owner of the packaging rules] says, "Fonts are *always* different"), and [the then-owner of the font FMIDs]'s got the numbers to prove this to anyone who doubts it. Unless you're really into pain, I suggest *not* asking [her] to give you the numbers. They'll give you a headache. Really.

                    Use SDB...Most of the Time

So, fonts aside, SDB is *really* likely to be the right block size to tell customers to use when allocating *almost* anything but load libraries. There are other exceptions, too, like UADS, but none of them are really software libraries. If you're doing pretty standard stuff, use SDB. If you're doing something weird (and UADS is pretty weird), check to see if one of your libraries is an exception to the rule.

               But--Wasn't This About Load Libraries?

Oh, yeah; those things! Load libraries containing load modules have an undefined record format, RECFM=U. And their blocks are also, well, um, undefined. They're written however the owner of the code that writes them thinks they should be.

So there are no rules for Undefined blocks as a group. And there are data sets using RECFM=U that aren't load libraries. I haven't talked to the people that use such libraries, and have no idea what block sizes might be optimum for them, individually or as a group. (I'm not sure I even *want* to know.) Happily, nobody is shipping any of these for system software, so I don't have to understand them--yet.

But I did pester the owners of IEBCOPY, the linkage editor, and Program Fetch at some length about load module block sizes. Several times. I think I even understand most of what they've told me now. Sorta scary, that, when I think about it.

                  Kinds of Load Module Records

Load modules (not Program Objects, which are stored only in PDSEs) are each made up of a number of records. There are one or more ESD records, which are used by Program Fetch to resolve external symbols. There are also RLD records, used to resolve relocatable address constants. Then there are IDR records, and Control records. These are all typically short. RLD and Control records are interspersed throughout load modules, while ESD and IDR records are at the beginning of load modules.

Then there are Text records, which make up the bulk of most load
modules. These contain the executable code, funny-looking machine language stuff.

         Maximum and Minimum Block Sizes for Text Records

When COPYMOD or the linkage editor writes a load module to a data set, the allocation block size sets the *maximum* block size. Short blocks are always written for RLD, ESD, Control, and IDR records. More to the point, while writing Text, RLD, and Control records, a TRACKBAL macro is issued before writing each block to see how much space is left on the track. If there's enough space, a block is written that's as long as the remaining space on the track or the maximum block size, whichever is smaller.

There is also a *minimum* size of block that either utility will write: the smallest block the linkage editor and binder will try to write for text records is 1024 bytes.

                     Writing Text Records

When the space left on the track is more than the minimum block length (1024 bytes), but less than the maximum block length, and the text left to be written is more than 2048 bytes long, the text can be split. What will fit on the track becomes the last block of the track, and what won't fit on the track becomes the first part of the first record, or the entire first record, on the next track. This process is repeated for each block until the end of the load module is reached. The next load module starts in a new block, right after the block in which the previous one ended.

So COPYMOD and the linkage editor do their best to stuff every byte that will fit onto every track. Pretty neat, huh? *Someone* was on the ball when this code was written!
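
To show the shape of that logic, here's a toy Python simulation. Everything device-specific in it is an assumption: the 55,000-byte "usable text per track" is just a round number, and the split rule is simplified. In particular, the model ignores the per-record Count-field and gap overhead, which is the real reason smaller maximum block sizes also cost space on an actual device; what it does show is how the record count balloons as the maximum block size shrinks.

    def write_text(text_len, max_blksize, track_capacity=55_000, min_block=1_024):
        """Toy model of text-block writing: check the space left on the track
        before each write (the TRACKBAL idea), write up to the smaller of that
        space and the maximum block size, and split the text across tracks when
        the leftover piece is at least the 1024-byte minimum.
        Returns (blocks written, tracks used)."""
        blocks, tracks, track_left, remaining = 0, 1, track_capacity, text_len
        while remaining > 0:
            fits = min(remaining, max_blksize, track_left)
            if fits < min(min_block, remaining):
                tracks += 1                  # too little room left; start a fresh track
                track_left = track_capacity
                continue
            blocks += 1
            remaining -= fits
            track_left -= fits
        return blocks, tracks

    for maxblk in (6_144, 32_760):           # a ~400K module at two maximum block sizes
        print(maxblk, write_text(400_000, maxblk))

With these assumptions, both block sizes fill the same eight tracks, but 6144 takes 66 records to do it where 32760 takes 15; the space penalty of the extra records only shows up once you put back the real per-record overhead the model leaves out.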

                 Performance and Space Utilization

How much large load library block sizes help performance and space usage depends on how big the load modules in the data set are. For example, the CSSLIB library is composed of small load modules, all of which are currently 4K or smaller in size. Increasing the block size of this data set past 4K does no good at all. But--neither does it hurt. The same blocks will be written in the same spots for any block size greater than or equal to 4K.

On the other hand, this matters a great deal for, say, LINKLIB, which has lots of big load modules. Space usage and performance keep getting better right up to the 32760 block size limit. The same is true of lots of load libraries. Since 32760 never hurts, and lower block sizes can, just recommending 32760 provides a single, consistent value customers can use that's very often right, and *never* wrong. (I originally qualified this statement, but despite many challenges over a 5-year period, nobody has found an exception yet.)

                           So What?

But why does this matter? After all, DASD is cheaper by the day (and, Hey!, we sell that stuff, too, don't we?).

Well, OS/390 takes up more than four 3390-3 volumes now, and it's still not shrinking. I tested one data set in 1999, for DFSORT, and found a 20% reduction in the space used when the library was blocked at 32760 vs. 6144.

20% is significant.

                    Program Fetch Performance

On "native" (non-emulated) DASD, another significant thing is the
corresponding 20% reduction in head switching (to read another track, you've got to use another magnetic read head in the disk drive), which in turn is a 1 1/3% reduction in seek time. 1 1/3% might not sound like much, but a seek takes at least 1.5ms, which is a Long Time to a computer. And the head switches and seeks can take a *lot* longer than 1.5ms. Why, you ask?

Well, *since* you asked, Program Fetch tries to get a program off DASD all at once. It doesn't know how long the module is when it starts, so it gets the first few records in the first shot. They tell it if there are more records, and later records can likewise tell it about still more records to fetch. Then, on the fly, it inserts CCWs into the channel program to read each successive record. It does this using a Program-Controlled Interrupt (PCI) design. If the processor is busy and the Fetch task isn't dispatched soon enough to insert the next CCW into the channel program, the channel program ends prematurely.

This is a Bad Thing. The disk won't wait, and keeps spinning. By the time the I/O is redriven, it's probably too late to catch the next record without waiting...for...the...disk...to...turn...all...the...way...around. This takes 14ms on native 3390 DASD. This is 1,673 Dog Years to a computer.

Having to wait for the disk to turn around, taking its sweet time, is called an RPS miss. (No, not *that* RPS. No trucks are involved. This RPS stands for Rotational Position Sensing.) (Note: DASD control unit cache reduces the probability of RPS misses for data on the same track.) We really hate it when this happens. That low rumble you hear is users grumbling about response time.

The probability of an RPS miss goes up with the number of records used to write a load module. Because COPYMOD and the linkage editor will always write a record if they can, block sizes below 32760 just make them write more records than they have to. So the performance improvements are even greater than the space utilization improvements, when you care the most about performance--that is, when the system is Really Busy.
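
As a back-of-the-envelope illustration (the miss probability here is entirely made up and just stands in for "how busy the system is"), the expected redrive delay grows linearly with the number of records; the 15-versus-66 record counts are the ones the toy track-filling sketch above produced for a roughly 400K module at the two block sizes.

    REVOLUTION_MS = 14.0    # one full turn of a native 3390, as above

    def expected_redrive_delay_ms(records, miss_probability):
        """If each record read carries some chance of a missed PCI (and so an RPS
        miss costing about one revolution), the expected extra delay grows
        linearly with the record count."""
        return records * miss_probability * REVOLUTION_MS

    for nrecs in (15, 66):    # 32760 vs. 6144 blocking, per the toy sketch above
        print(nrecs, f"{expected_redrive_delay_ms(nrecs, 0.05):.1f} ms expected extra delay")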

When the system is lightly loaded, smaller block sizes only hurt performance through the greater number of tracks to read, which costs just a couple of percent...but Program Fetch gets used a *lot*.

What about newer DASD devices? Well, they do head-switching under the covers that is not apparent to the operating system, and there are probably mini-RPS misses happening that are handled by the control unit microcode. But these are not under our control. It is still true that larger block sizes lower the probability of missed PCI interrupts and having to redrive the I/O. So larger block sizes still mean better performance.

--
John Eells
z/OS Technical Marketing
IBM Poughkeepsie
[EMAIL PROTECTED]
