The following message is a courtesy copy of an article that has been posted to bit.listserv.ibm-main,alt.folklore.computers as well.
[EMAIL PROTECTED] (Wertheim, Marty) writes:
> The discussion on high speed buffers and internal cache misses is
> right on target. I've got a set of benchmarks that are plain vanilla
> COBOL programs - no DB2, or anything. One of them steps through 8
> tables of 16 MB each (hopefully not your typical COBOL program). On a
> 2094-717, that program will run in 60 seconds of CPU time when the CEC
> is 50% busy, 180 seconds of CPU time when the CEC is 95% busy. Other
> programs using less memory have more stable CPU times, but even with a
> program using 7MB, CPU times doubled when the CEC got up to 98%. If
> anyone wants to send me an email off line, I can send you an Excel
> showing the numbers I've seen.

when caches first appeared ... they weren't large enough to contain multiple contexts ... so every interrupt that changed context produced a burst of cache misses.

one of the things that i did relatively early for large-cache 370s was to dynamically monitor the i/o interrupt rate ... and dynamically transition from running applications enabled for i/o interrupts to running disabled for i/o interrupts (whenever the interrupt rate rose above some threshold). this was a trade-off between timeliness of handling i/o interrupts (responsiveness) and total thruput. above the threshold, it was actually possible to process i/o interrupts faster when running disabled for them. running disabled for i/o interrupts tended to significantly improve application cache hit rate ... which translated into a higher mip rate (instructions executed per second) and applications finishing in less elapsed time. at the same time, i/o interrupt handling tended to be "batched" ... which improved the cache hit rate of the interrupt handlers, making them run faster as well. it was possible to get both increased aggregate thruput and faster aggregate i/o interrupt handling ... running disabled for i/o interrupts and taking interrupts at specific periods. this was part of what i got to ship in my resource manager
http://www.garlic.com/~lynn/subtopic.html#fairshare

also in the mid-70s, about the time future system got killed
http://www.garlic.com/~lynn/subtopic.html#futuresys
i was asked to do an smp project (which never shipped) involving up to a five-processor 370 configuration ... with extensive microcode capability. as part of leveraging that microcode capability, i put the multiprocessor dispatching function into microcode ... software managed the dispatch queue, putting things on in a specific order; the multiprocessor microcode would pull work off the dispatch queue, service it, and at interrupt/interval move it to a different queue (something similar was seen later in the intel 432). i also created a similar microcode-managed queue for i/o operations ... something like what later appeared in 370-xa.
http://www.garlic.com/~lynn/subtopic.html#bounce

for standard (two-way) 370 multiprocessor cache machines, i did a two-level scheme for cache "affinity" ... attempting to keep tasks executing on the same processor that they had previously been executing on (and therefore able to reuse information already loaded into that processor's cache). traditional (two-way) 370 multiprocessor cache machines slowed the processor cycle by ten percent to accommodate cross-cache chatter between the two processors ... aka a 2-processor machine started out with nominal "raw" thruput of 1.8 times a single processor. system software multiprocessor overhead then further reduced (application) thruput ... so the typical rule of thumb was that a 2-processor machine delivered 1.3-1.5 times a single processor. i did some sleight of hand in the system software for multiprocessor operation and processor affinity ... and had one two-processor configuration where one processor would chug along at about 1.5 times the mip rate of a uniprocessor (because of improved cache hit rate, despite the machine cycle being ten percent slower) and the other processor at about 0.9 times the uniprocessor mip rate. the effective aggregate application thruput was slightly over twice a uniprocessor ... because of the sleight of hand in the software implementation and cache affinity.

the change-over to 3081 was supposed to be multiprocessor-only machines ... so there was never any need to mention the uniprocessor/multiprocessor hardware thruput differences.

in the early 90s, machines started appearing with caches in the 1mbyte-4mbyte range ... larger than the real-storage sizes when i first started redoing virtual memory and page replacement algorithms as an undergraduate in the 60s ... recent post
http://www.garlic.com/~lynn/2008c.html#65 No Glory for the PDP-15
other references
http://www.garlic.com/~lynn/subtopic.html#wsclock
some of the major software vendors, like DBMS offerings, were doing the equivalent of "page squeeze" performance tests ... based on cache sizes.

this is an old post with an extract of a presentation i gave at the Aug68 SHARE meeting in boston ... it includes some results of the optimization work that i had been doing as an undergraduate. part of it was carefully redoing OS/360 stage-2 sysgen ... carefully placing datasets and PDS members on disk for optimal arm seek operation (speeding up "stand-alone" OS/360 by almost three times for a typical univ. workload). it also gives some results of rewriting cp67 virtual machine pathlengths (speeding up various kernel paths by 10 to 100 times). however, it also includes a "page squeeze" test ... showing thruput of OS/360 running in a virtual machine as the amount of real storage was reduced:
http://www.garlic.com/~lynn/94.html#18 CP/67 and OS MFT14

one of the RDBMS vendors that we worked with on our ha/cmp product
http://www.garlic.com/~lynn/subtopic.html#hacmp
a meeting mentioned here
http://www.garlic.com/~lynn/95.html#13
and some old email mentioning scaleup
http://www.garlic.com/~lynn/lhwemail.html#medusa
talked about extensive work that they were doing with processor vendors ... to make sure that server configurations included caches large enough to allow efficient thruput. at the time, they saw a significant thruput jump between a 2mbyte cache and a 4mbyte cache (similar to the thruput curve given in the old SHARE presentation for OS/360 running in a virtual machine as real storage was reduced).

more recently, i had an application where i changed the default storage accessing pattern to one tailored for cache operation and got a five-times increase in thruput (effectively increasing instructions executed per second, or mip rate, by a factor of five).

i had also been involved in the original relational/sql implementation
http://www.garlic.com/~lynn/subtopic.html#systemr
and getting it out as a product. i was also involved in some other kinds of dbms implementations that are much more knowledge and "real-world" data oriented. i have my own implementation of such a beast ... and running on a 1.7ghz processor with a 2mbyte cache, it has twice the thruput compared to running on a 3.4ghz processor with a 512kbyte cache (at least for this application, a 1.7ghz processor with 2mbyte cache has twice the effective mip rate of a 3.4ghz processor with 512kbyte cache). the penalty of a cache miss has similar effects on system thruput (and effective instructions executed per second, or mip rate) as old-time page faults.
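the interrupt-rate threshold idea described earlier can be sketched roughly as follows. this is my own illustrative construction (names and threshold values are invented, not from the actual VM/370 code): sample the i/o interrupt rate over an interval and switch between interrupt-enabled and disabled/batched operation, with a hysteresis band so the mode doesn't flap right at the threshold.

```python
# Hypothetical sketch of dynamic interrupt-mode switching based on observed
# i/o interrupt rate. All names and threshold values are assumptions for
# illustration; the real trade-off is responsiveness vs. cache hit rate.

class InterruptModeController:
    def __init__(self, disable_above=1000.0, enable_below=700.0):
        # Rates in interrupts/second. The gap between the two thresholds
        # is the hysteresis band (assumed values, purely illustrative).
        self.disable_above = disable_above
        self.enable_below = enable_below
        self.enabled = True          # start out taking i/o interrupts normally

    def observe(self, interrupts, seconds):
        """Update the mode from the interrupt rate seen in the last interval."""
        rate = interrupts / seconds
        if self.enabled and rate > self.disable_above:
            # High interrupt load: run disabled, draining interrupts in
            # batches at specific periods to preserve cache contents.
            self.enabled = False
        elif not self.enabled and rate < self.enable_below:
            # Load is light again: restore interrupt responsiveness.
            self.enabled = True
        return self.enabled
```

the hysteresis band is the important design point: with a single threshold, an interrupt rate hovering near it would cause constant mode flips, each one paying the very cache-refill cost the scheme is trying to avoid.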
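the cache-affinity dispatching described above can be sketched in a few lines. again a minimal sketch of the general technique, not the actual dispatcher: when a processor needs work, prefer a runnable task that last executed on that same processor (so it can reuse whatever is still in that processor's cache), and only fall back to the head of the queue when no such task exists.

```python
# Minimal cache-affinity dispatch sketch (my own construction). `run_queue`
# is a list of task dicts in dispatch-priority order; each task records
# `last_cpu`, the processor it most recently ran on (or None if never run).

def dispatch(run_queue, cpu):
    """Remove and return the next task for `cpu`, preferring cache affinity."""
    for i, task in enumerate(run_queue):
        if task["last_cpu"] == cpu:
            run_queue.pop(i)          # affinity hit: reuse the warm cache
            return task
    if run_queue:
        task = run_queue.pop(0)       # no affinity match: take highest priority
        task["last_cpu"] = cpu        # it will now warm this processor's cache
        return task
    return None
```

the trade-off is visible even in the sketch: honoring affinity can run a lower-priority task ahead of a higher-priority one, which is why a real implementation bounds how far down the queue the affinity search may reach.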
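the effect of storage access patterns on cache behavior (and the kind of web benchmark mentioned below that finds cache sizes by touching increasing amounts of storage) can be demonstrated with a generic probe like the following. this is not the code from the original application; it is a sketch that walks the same buffer sequentially and with a large stride, doing identical work in a different order. on most hardware the sequential walk is faster because consecutive elements share cache lines, and growing the buffer past the cache size widens the gap (in CPython the effect is muted by interpreter overhead compared to compiled code).

```python
# Generic cache-behavior probe: same total work, different access order.
import array
import time

def walk(buf, stride):
    """Sum every element of buf, visiting them stride-first."""
    n = len(buf)
    total = 0
    for start in range(stride):
        for i in range(start, n, stride):
            total += buf[i]
    return total

def time_walk(buf, stride):
    t0 = time.perf_counter()
    total = walk(buf, stride)
    return total, time.perf_counter() - t0

if __name__ == "__main__":
    buf = array.array("l", range(1 << 20))   # ~8 mbytes of longs
    seq_sum, seq_t = time_walk(buf, 1)       # sequential: cache-line friendly
    str_sum, str_t = time_walk(buf, 4096)    # large stride: a miss per touch
    assert seq_sum == str_sum                # identical work either way
    print(f"sequential {seq_t:.3f}s  strided {str_t:.3f}s")
```

running the probe over a range of buffer sizes, instead of a fixed one, shows the step in per-element cost at each cache boundary ... which is how those benchmark applications locate the cache sizes.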
for other drift, i use our "real-world" data manager
http://www.garlic.com/~lynn/index.html
for doing our ietf rfc index
http://www.garlic.com/~lynn/rfcietff.htm
and various merged taxonomies and glossaries
http://www.garlic.com/~lynn/index.html#glosnote

there are a number of benchmark applications available on the web that measure various characteristics of cache operation/efficiency ... including accessing increasing amounts of real storage to find the point where operations are no longer contained in the machine's caches.

i've mentioned before the VS/Repack product put out by the science center
http://www.garlic.com/~lynn/subtopic.html#545tech
which did detailed traces of instruction and storage use ... profiled a program's operation for a virtual memory environment ... and could do semi-automated program reorganization (for improved virtual memory operation). VS/Repack precursors were used extensively by a number of internal product groups, including IMS, helping with the transition from real-storage to virtual-memory environments. because of the operational similarities between cache and virtual memory paging ... this type of analysis is still performed for many products (optimizing both cache and virtual memory execution characteristics).

recent post mentioning vs/repack:
http://www.garlic.com/~lynn/2008c.html#24 Job ad for z/OS systems programmer trainee

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html