> Possibly the existence of hardware that included such an assist. BTDTGTTS.

Hard to envisage the economics of an assist for an instruction whose worst case
is a machine cycle and whose best case is pre-cycle recognition.

Most BCTRs don't consume a cycle.  Their superfluousness is recognised in the
pipe.  It's quite phenomenal how many instructions are never "executed" in that
sense, but are dealt with in the pipe pre-execution.

Hard coding instruction loops for ultimate performance is almost always a waste
of effort.

My butt got burnt many years ago with one of the first "slugged" machines (IBM
hates the term "knee-capping", so I use it whenever I can), which was the NAS
AS/9040.  This was a Hitachi S9 processor sold unknee-capped as the AS/9060.
The knee-capping was done simply by adding null cycles to the I-stage of certain
frequently-used instructions.

One of our wonderful, delightful, charming customers decided to code his own
synthetic kernel to measure "MIPS".  Most will know my opinion of MIPS.  But
this guy used a tight loop to execute a selection of "long" instructions - a
TRT, an MVCL, something floating-point, etc., and claimed to measure their
performance.

It just so happened that the instruction following his target instruction was
one of the main ones that had the I-cycles added.  But it didn't make a
difference worth a damn - his target instructions took so long in the E-stage
that the dummy cycles in the I-stage of the following instruction had already
expired by the time the target instruction terminated.

Now comes the fun bit.  He paid us $$$$$ for an upgrade to an AS/9060 and his
synthetic kernel ran at EXACTLY the same speed.  Obvious - we deleted dummy
cycles out of an instruction that was waiting for the E-stage in any case.
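
For anyone who wants to see the masking effect in numbers, here is a toy model
in Python.  It is only a sketch: the two-stage pipe, the cycle counts, the
padding of 4 and the names (total_cycles, kernel, mix) are all invented for
illustration and are not the real AS/9040 or AS/9060 timings.  It assumes the
I-stage of the next instruction overlaps the E-stage of the current one and
that an instruction sits in the I-stage until the E-stage is free.

```python
def total_cycles(program):
    """Cycles to run `program` on a toy two-stage (I then E) pipeline.

    Each instruction is a pair (i_cycles, e_cycles).  The I-stage of one
    instruction overlaps the E-stage of its predecessor, and an instruction
    holds the I-stage until the E-stage can accept it.
    """
    i_free = 0  # cycle at which the I-stage can take the next instruction
    e_free = 0  # cycle at which the E-stage becomes free
    for i_cycles, e_cycles in program:
        i_done = i_free + i_cycles        # decode finished, padding included
        e_start = max(i_done, e_free)     # may still have to wait for E
        i_free = e_start                  # I-stage released on entry to E
        e_free = e_start + e_cycles
    return e_free


PAD = 4          # hypothetical null cycles added to the "slugged" instruction
LONG = (1, 50)   # hypothetical long instruction: 1 I-cycle, 50 E-cycles
SHORT = (1, 1)   # hypothetical short instruction


def kernel(pad, iterations=100):
    """Synthetic kernel: a long instruction followed by the padded one."""
    return [LONG, (1 + pad, 1)] * iterations


def mix(pad, iterations=100):
    """Crude stand-in for real work: short instructions around the padded one."""
    return [SHORT, (1 + pad, 1)] * iterations


if __name__ == "__main__":
    for name, build in (("synthetic kernel", kernel), ("short mix", mix)):
        slugged = total_cycles(build(PAD))
        unslugged = total_cycles(build(0))
        print(f"{name}: slugged={slugged} cycles, unslugged={unslugged} cycles")
```

In this model the kernel takes the same number of cycles with or without the
padding, because the padded I-stage finishes long before the E-stage is free,
while the short mix speeds up as soon as the padding goes.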

Cue lawyers.

His production workloads showed a performance improvement actually a little bit
better than we'd promised.  But his goddamn synthetic kernel showed no change at
all - and he threatened to sue us!

And having worked for Morino Associates and having been a founder member of CMG
in the UK and a past Vice-Chairman, I have a loathing for synthetic kernels that
can be comfortably described as passionate.  If not obsessive.  Don't do it.
You know not with what you mess.

IBM knee-capped its machines of that era simply by reducing the HSB.  The 3033S
famously had a high speed buffer of 512 bytes - around a quarter the capacity of
each of the ten-cent 3270s attached to it.  IBM proposed such a machine for the
Finanzamt Charlottenburg in Berlin, and actually succeeded in writing a
benchmark that fitted in 512 bytes, so the thing ran like a 3033U.  I'd love to
know what happened when that thing met their real workload.

Modern knee-capping is more comprehensive.

Epilogue:

There are always exceptions to every rule.  One day when I was out consulting
as an ace Assembler programmer, I had an internal client turn up with an
interesting problem.  It was an in-storage search mechanism against an ordered
table.  He'd been looking at the table structure and had come to the conclusion
that classic binary chop might not be the best way - something heuristic might
be better.  Leonardo da Pisa gave us a clue, and I wrote a really tight RR
routine.  It cut an hour off the overnight batch run on a /65, but the big
surprises came later when the hardware started reassigning GPRs.  Holy moly.
Things have changed an awful lot, but that routine still runs every night
because of some side results it produces.  What used to take four hours now
takes eight minutes.
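
For readers who haven't met it: Leonardo da Pisa is Fibonacci, and the textbook
technique that carries his name is Fibonacci search.  The Python below is a
sketch of that textbook algorithm only, on the assumption that this is what the
clue points to.  It is not the original RR routine, and the function name and
the sample table are made up for illustration.  It probes an ordered table at
offsets taken from the Fibonacci sequence, so the probe positions are worked out
with nothing but additions and subtractions.

```python
def fibonacci_search(table, key):
    """Return the index of `key` in the ascending list `table`, or -1."""
    n = len(table)

    # Find the smallest Fibonacci number >= n, keeping its two predecessors.
    fib2, fib1 = 0, 1          # F(k-2), F(k-1)
    fib = fib1 + fib2          # F(k)
    while fib < n:
        fib2, fib1 = fib1, fib
        fib = fib1 + fib2

    offset = -1                # everything at or below offset is < key
    while fib > 1:
        i = min(offset + fib2, n - 1)   # probe, clamped to the table
        if table[i] < key:
            # Key lies above the probe: step the sequence down by one.
            fib, fib1, fib2 = fib1, fib2, fib1 - fib2
            offset = i
        elif table[i] > key:
            # Key lies below the probe: step the sequence down by two.
            fib, fib1, fib2 = fib2, fib1 - fib2, fib2 - (fib1 - fib2)
        else:
            return i

    # One candidate may be left just above offset.
    if fib1 and offset + 1 < n and table[offset + 1] == key:
        return offset + 1
    return -1


if __name__ == "__main__":
    table = [10, 22, 35, 40, 45, 50, 80, 82, 85, 90, 100]  # illustrative data
    print(fibonacci_search(table, 85))
    print(fibonacci_search(table, 11))
```

Whether this beats a binary chop depends entirely on the table and the machine,
which is rather the point of the story above.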

-- 
  Phil Payne
  http://www.isham-research.co.uk
  +44 7833 654 800
