The following message is a courtesy copy of an article
that has been posted to bit.listserv.ibm-main,alt.folklore.computers as well.


[EMAIL PROTECTED] (Wertheim, Marty) writes:
> The discussion on high speed buffers and internal cache misses is
> right on target.  I've got a set of benchmarks that are plain vanilla
> COBOL programs - no DB2, or anything.  One of them steps through 8
> tables of 16 MB each (hopefully not your typical COBOL program).  On a
> 2094-717, that program will run in 60 seconds of CPU time when the CEC
> is 50% busy, 180 seconds of CPU time when the CEC is 95% busy.  Other
> programs using less memory have more stable CPU times, but even with a
> program using 7MB, CPU times doubled when the CEC got up to 98%.  If
> anyone wants to send me an email off line, I can send you an Excel
> showing the numbers I've seen.

when caches first appeared ... they weren't large enuf to contain
multiple contexts ... always losing on interrupts, which resulted in a
burst of high cache misses on every context change.

one of the things that I did relatively early for large cache 370s ...
was dynamically monitor the interrupt rate ... and dynamically
transition from running applications enabled for i/o interrupts to
running disabled for i/o interrupts (when the i/o interrupt rate was
higher than some threshold). this was a trade-off between timeliness of
handling i/o interrupts (responsiveness) and total thruput. above the
i/o interrupt rate threshold ... it was actually possible to process i/o
interrupts faster when running disabled for i/o interrupts.

running disabled for i/o interrupts tended to significantly improve
application cache hit rate ... which translated into higher mip rate
(instructions executed per second) and application executing in less
elapsed time. at the same time, i/o interrupt handling tended to be
"batched" ... which tended to improve the cache hit rate of the
interrupt handlers ... making them run faster. it was possible to
experience both increased aggregate thruput and faster aggregate i/o
interrupt handling ... when running disabled for i/o interrupts ... and
taking i/o interrupts at specific periods. this was part of what i got
to ship in my resource manager
http://www.garlic.com/~lynn/subtopic.html#fairshare
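
a rough sketch in C of the general idea ... the sampling period,
threshold value, and routine names are all made up for illustration
... this isn't the actual resource manager code:

  /* sketch: switch between interrupt-driven and batched ("disabled")
     i/o interrupt handling based on the observed interrupt rate.
     names and threshold are illustrative, not the original code. */

  #include <stdbool.h>

  #define RATE_THRESHOLD 500      /* interrupts per sample period (made up) */

  static unsigned long interrupts_this_period;
  static bool run_disabled;       /* true: apps run with i/o interrupts masked */

  static void drain_pending_io(void)
  {
      /* stand-in: process everything queued since the last period
         in one batch (keeps the interrupt handler's cache footprint hot) */
  }

  /* called from the i/o interrupt path (or when polling pending i/o) */
  void note_io_interrupt(void)
  {
      interrupts_this_period++;
  }

  /* called at the end of every sampling period (e.g. timer tick) */
  void end_of_sample_period(void)
  {
      /* above the threshold, batching interrupts improves cache hit
         rates for both the applications and the interrupt handlers */
      run_disabled = (interrupts_this_period > RATE_THRESHOLD);
      interrupts_this_period = 0;

      if (run_disabled)
          drain_pending_io();
  }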

also in the mid-70s, about the time future system got killed
http://www.garlic.com/~lynn/subtopic.html#futuresys

i was asked to do a smp project (which never shipped) that involved up
to a five-processor 370 configuration ... and extensive microcode
capability. as part of leveraging the microcode capability ... i put the
multiprocessor dispatching function into microcode ... basically
software managed the dispatch queue ... putting things on in a specific
order ... the multiprocessor microcode would pull stuff off the dispatch
queue ... service it, and at an interrupt or interval end ... move it to
a different queue (something similar was seen in the later intel 432). I
also created a similar microcode-managed queue for i/o operations
... something like what was later seen in 370-xa.
http://www.garlic.com/~lynn/subtopic.html#bounce
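
a small C sketch of that split between the software-ordered queue and
the (microcode) dispatcher ... the names and data structures are purely
illustrative:

  /* sketch: software builds an ordered dispatch queue; the dispatcher
     (microcode in the original) pops the head, runs it for an interval,
     then parks it on a different queue.  purely illustrative. */

  struct task {
      struct task *next;
      int          priority;       /* software-assigned ordering key */
  };

  static struct task *dispatch_q;  /* ordered by software */
  static struct task *requeue_q;   /* tasks that used their interval */

  /* software side: insert in priority order */
  void enqueue_for_dispatch(struct task *t)
  {
      struct task **pp = &dispatch_q;
      while (*pp && (*pp)->priority <= t->priority)
          pp = &(*pp)->next;
      t->next = *pp;
      *pp = t;
  }

  /* "microcode" side: take the next piece of work off the queue */
  struct task *dispatch_next(void)
  {
      struct task *t = dispatch_q;
      if (t)
          dispatch_q = t->next;
      return t;
  }

  /* "microcode" side: at interrupt / interval end, park the task */
  void interval_end(struct task *t)
  {
      t->next = requeue_q;
      requeue_q = t;
  }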

For standard 370 two-processor cache machines ... i did two-level
operations for cache "affinity" ... attempting to keep tasks executing
on the same processor that they had previously been executing on (and
therefore being able to reuse information already loaded into the
cache). Traditional 370 (two-way) multiprocessor cache machines slowed
down the processor cycle by ten percent to accommodate cross-cache
chatter between the two processors ... aka a 2-processor machine started
out with nominal "raw" thruput of 1.8 times a single processor. System
software multiprocessor overhead would then further reduce (application)
thruput ... so the typical rule-of-thumb was that a 2-processor machine
had 1.3-1.5 times the thruput of a single processor.
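
just to make the rule-of-thumb arithmetic explicit ... the software
overhead figure below is a made-up illustration somewhere in the usual
range, not a measured number:

  /* back-of-envelope: 2-way 370 MP thruput relative to a uniprocessor */
  #include <stdio.h>

  int main(void)
  {
      double cycle_slowdown = 0.10;  /* 10% slower cycle for cross-cache chatter */
      double raw = 2.0 * (1.0 - cycle_slowdown);     /* 1.8x "raw" */
      double mp_sw_overhead = 0.25;  /* illustrative only; actual varied widely */
      printf("raw: %.2fx  after MP software overhead: %.2fx\n",
             raw, raw * (1.0 - mp_sw_overhead));     /* ~1.35x */
      return 0;
  }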

I did some sleight of hand in the system software for multiprocessor and
processor affinity ... and had one two-processor configuration where one
processor would chug along at about 1.5 times the mip rate of a
uniprocessor (because of improved cache hit rate ... despite the machine
cycle being ten percent slower) and the other processor at about .9
times uniprocessor mip rate. The effective aggregate application thruput
was slightly over twice uniprocessor ... because of some sleight of hand
in the software implementation and cache affinity.
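
a minimal sketch in C of the affinity idea ... prefer the processor a
task last ran on so its cache footprint can be reused, falling back to
anything runnable ... none of this is the actual dispatcher code:

  /* sketch: soft cache affinity in a 2-way dispatcher -- prefer to run
     a task on the cpu it last ran on so it can reuse its cache
     footprint.  data structures and policy are illustrative only. */

  #define NCPU 2

  struct work {
      struct work *next;
      int          last_cpu;       /* cpu this task last executed on */
  };

  static struct work *runq;        /* single shared run queue */

  /* pick work for 'cpu': first pass looks for a task with affinity,
     second pass falls back to anything runnable */
  struct work *pick_work(int cpu)
  {
      for (int pass = 0; pass < 2; pass++) {
          for (struct work **pp = &runq; *pp; pp = &(*pp)->next) {
              if (pass == 1 || (*pp)->last_cpu == cpu) {
                  struct work *w = *pp;
                  *pp = w->next;
                  w->last_cpu = cpu;
                  return w;
              }
          }
      }
      return 0;                    /* nothing runnable */
  }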

The change-over to 3081 was supposed to be multiprocessor-only ... so
there was never going to be any need to mention the
uniprocessor/multiprocessor hardware thruput differences.

In the early 90s, machines started appearing with caches in the
1mbyte-4mbyte range ... this was larger than the real-storage sizes when
i first started redoing virtual memory and page replacement algorithms
as an undergraduate in the 60s ... recent post
http://www.garlic.com/~lynn/2008c.html#65 No Glory for the PDP-15
other references
http://www.garlic.com/~lynn/subtopic.html#wsclock

Some of the major software vendors, like those with DBMS offerings ...
were doing the equivalent of "page squeeze" performance tests ... based
on cache sizes. This is an old post with an extract of a presentation i
gave at the Aug68 share meeting in boston ... it includes some results
of the optimization work that I had been doing as an undergraduate. Part
of it was redoing OS/360 stage-2 sysgen ... carefully placing datasets
and PDS members on disk for optimal arm seek operation (speeding up
"stand-alone" os/360 by almost three times for a typical univ.
workload). It also gives some results of rewriting cp67 virtual machine
pathlengths (speeding up various kernel paths by 10 to 100 times).
However, it also includes a "page squeeze" test ... showing thruput of
OS/360 running in a virtual machine ... as the amount of real storage
was reduced:
http://www.garlic.com/~lynn/94.html#18 CP/67 and OS MFT14
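
the modern analog of that curve is easy to reproduce in simulation ... a
small C sketch that replays a made-up page reference string against LRU
replacement with progressively fewer frames ("real storage") and counts
the faults ... the reference pattern and sizes are invented for
illustration:

  /* sketch: "page squeeze" in miniature -- replay a synthetic page
     reference string against LRU with fewer and fewer frames and watch
     the fault count climb (and effective thruput fall). */

  #include <stdio.h>
  #include <stdlib.h>

  #define REFS   200000
  #define PAGES  64               /* distinct virtual pages touched */

  static int frames[PAGES];       /* kept in most-recently-used order */

  static long run_lru(int nframes, const int *refs, int nrefs)
  {
      int used = 0;
      long faults = 0;
      for (int r = 0; r < nrefs; r++) {
          int p = refs[r], i;
          for (i = 0; i < used; i++)
              if (frames[i] == p) break;
          if (i == used) {                       /* page fault */
              faults++;
              if (used < nframes) used++;
              i = used - 1;                      /* LRU slot gets evicted */
          }
          for (; i > 0; i--)                     /* move page to front */
              frames[i] = frames[i - 1];
          frames[0] = p;
      }
      return faults;
  }

  int main(void)
  {
      static int refs[REFS];
      for (int r = 0; r < REFS; r++)             /* mostly-local references */
          refs[r] = (r % 7 == 0) ? rand() % PAGES : rand() % 24;
      for (int nf = 48; nf >= 8; nf -= 8)
          printf("frames %2d -> faults %ld\n", nf, run_lru(nf, refs, REFS));
      return 0;
  }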

One of the RDBMS vendors that we worked with on our ha/cmp product
http://www.garlic.com/~lynn/subtopic.html#hacmp
a meeting mentioned here
http://www.garlic.com/~lynn/95.html#13
and some old email mentioning scaleup
http://www.garlic.com/~lynn/lhwemail.html#medusa

talked about extensive work that they were doing with processor vendors
... to make sure that server configurations included cache sizes large
enuf to allow efficient thruput. At the time, they had a significant
thruput jump between 2mbyte cache and 4mbyte cache (similar to the
thruput curve given in the old Share presentation for OS/360 running in
a virtual machine as real storage was reduced).

More recently, I had an application where i changed the default storage
accessing pattern to one tailored for cache operation and got a five
times increase in thruput (effectively increasing instructions executed
per second, or MIP rate, by a factor of five).
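
the textbook example of that kind of change (not the application
mentioned above ... just an illustration of matching the access pattern
to cache line and prefetch behavior):

  /* sketch: same work, two access patterns.  walking the array in the
     order it is laid out in storage (row-major in C) keeps cache lines
     and prefetch useful; walking it column-wise misses far more often.
     sizes are arbitrary; the speedup depends entirely on the machine. */

  #include <stdio.h>

  #define N 2048

  static double a[N][N];

  int main(void)
  {
      double sum = 0.0;

      /* cache-unfriendly: stride of N*sizeof(double) between accesses */
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              sum += a[i][j];

      /* cache-friendly: sequential accesses, one cache line at a time */
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              sum += a[i][j];

      printf("%f\n", sum);         /* keep the compiler honest */
      return 0;
  }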

I had also been involved in the original relational/sql implementation
http://www.garlic.com/~lynn/subtopic.html#systemr

and getting it out as a product. I was also involved in some other kinds
of dbms implementations that are much more knowledge and "real-world"
data oriented. I have my own implementation of such a beast ... and
running on a 1.7ghz processor with 2mbyte cache ... it has twice the
thruput compared to running on a 3.4ghz processor with a 512kbyte cache
(at least for this application ... a 1.7ghz processor with 2mbyte cache
has twice the effective MIP-rate of a 3.4ghz processor with 512kbyte
cache).

The penalty of a cache miss has similar effects on system thruput (and
effective instructions executed per second ... or MIP rate) as old-time
page faults did.

for other drift, i use our manager for "real-world" data
http://www.garlic.com/~lynn/index.html

for doing our ietf rfc index
http://www.garlic.com/~lynn/rfcietff.htm

and various merged taxonomies and glossaries
http://www.garlic.com/~lynn/index.html#glosnote

there are a number of benchmark applications available on the web that
measure various characteristics of cache operation/efficiency ... which
include accessing increasing amounts of real storage to find where
operations are no longer contained in machine caches.
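
a bare-bones version of that kind of probe ... time a fixed number of
dependent accesses over working sets of increasing size and look for the
knees where each cache level is exceeded ... a sketch, not any
particular published benchmark:

  /* sketch: pointer-chase over working sets of increasing size; the
     time per access jumps at each point where the working set no
     longer fits in a cache level. */

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define ACCESSES 10000000L

  int main(void)
  {
      for (size_t kb = 16; kb <= 65536; kb *= 2) {
          size_t n = kb * 1024 / sizeof(size_t);
          size_t *ring = malloc(n * sizeof(size_t));
          if (!ring) break;

          /* build a strided ring so the hardware can't easily prefetch */
          for (size_t i = 0; i < n; i++)
              ring[i] = (i + 4099) % n;

          size_t p = 0;
          clock_t t0 = clock();
          for (long a = 0; a < ACCESSES; a++)
              p = ring[p];                      /* dependent loads */
          double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

          printf("%6zu KB: %.1f ns/access (p=%zu)\n",
                 kb, secs * 1e9 / ACCESSES, p); /* print p so the loop isn't optimized away */
          free(ring);
      }
      return 0;
  }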

I've mentioned before the VS/Repack product put out by the science
center
http://www.garlic.com/~lynn/subtopic.html#545tech

which did a detailed trace of instruction and storage use ... and
profiled its operation for a virtual memory environment ... including
being able to do semi-automated program reorganization (for improved
virtual memory operation). VS/Repack precursors were used extensively by
a number of internal product groups ... including IMS ... helping with
the transition from real storage to a virtual memory environment.
Because of the operational similarities between cache and virtual memory
paging ... this type of
analysis is still performed by many products (for optimizing both cache
and virtual memory execution characteristics). recent post mentioning
vs/repack: 
http://www.garlic.com/~lynn/2008c.html#24 Job ad for z/OS systems programmer trainee
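
the core of that style of analysis is easy to sketch ... given a trace
of which module each reference touched, count the transitions between
modules and co-locate the pairs that reference each other most ... this
is just the flavor of the idea, not vs/repack itself:

  /* sketch: trace-driven reorganization in miniature -- given a trace
     of which module each reference touched, count transitions between
     modules and report the pairs that most deserve to share a page
     (or cache working set).  the trace below is made up. */

  #include <stdio.h>

  #define NMOD 6

  static long transitions[NMOD][NMOD];

  int main(void)
  {
      /* made-up reference trace: module ids in the order touched */
      int trace[] = { 0, 1, 0, 1, 2, 0, 1, 4, 0, 1, 2, 0, 1, 3, 5, 3, 5 };
      int n = sizeof trace / sizeof trace[0];

      for (int i = 1; i < n; i++)
          transitions[trace[i - 1]][trace[i]]++;

      /* pairs with high counts are candidates to pack together */
      for (int a = 0; a < NMOD; a++)
          for (int b = a + 1; b < NMOD; b++) {
              long c = transitions[a][b] + transitions[b][a];
              if (c)
                  printf("modules %d,%d: %ld transitions\n", a, b, c);
          }
      return 0;
  }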
