Back from Christmas break to a lot of responses - thanks to all. Summarizing
the responses thus far:
* Select and/or tune your algorithm first.
Check.
* Tuning for a specific machine is a bad idea because you'll have to retune for
every new machine.
Check. This code is sufficiently critical that it's worth doing
platform-specific tuning. Otherwise I wouldn't be coding in assembler anyway.
And it's small enough to be manageable.
* The compilers are very good and can probably do better than you can.
Usually, I agree. On this small kernel, though, I'm already beating xlc and (by
a lesser margin) gcc. Looking at the code they generate has been enlightening;
gcc is especially aggressive on loop unrolling, and that's on my list of
things to try.
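To make that concrete, here is a minimal sketch of two-way unrolling on a trivial doubleword-sum loop. This is NOT what gcc literally emits for my kernel, and the register names are illustrative:

```
* Two-way unrolled sum (registers illustrative; r2 -> array,
* r0 = element count / 2, r1 = running sum)
uloop    ds    0h
         ag    r1,0(,r2)       sum += array[i]
         ag    r1,8(,r2)       sum += array[i+1]
         la    r2,16(,r2)      advance past two elements
         brctg r0,uloop        one taken branch per two iterations
```

The payoff is fewer branches retired per element and more independent work in flight between them; the cost is code size and a cleanup path for odd counts.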
* Modern machines are too complex to enable detailed timing formulae to be
published.
I'm not really after detailed timing. I'm looking for implementation details of
the same sort used by compiler writers to guide selection of instruction
sequences, where the guiding principle is, "How can I avoid pipeline stalls?"
As I noted, several SHARE presentations contain SOME of this information, which
I've already benefited from, but I'm looking for more.
* Just write some timing loops and figure it out yourself.
I've been doing that, using mostly the algorithm itself plus small timing
experiments (which can be deceiving since they aren't in the context of the
rest of the algorithm). I've already managed to squeeze out about a 15%
improvement over my initial code.
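For anyone curious, my timing experiments are roughly of this shape: bracket N iterations of the kernel with STORE CLOCK FAST and difference the TOD values. Labels and the iteration count are illustrative:

```
         stckf t0              TOD value at start
         lghi  r0,100000       iteration count (illustrative)
tloop    ds    0h
*        ... kernel under test ...
         brctg r0,tloop        decrement r0, loop while nonzero
         stckf t1              TOD value at end
         lg    r1,t1
         sg    r1,t0           elapsed, in TOD units (bit 51 = 1 usec)
* ... rest of routine ...
t0       ds    d               start TOD value
t1       ds    d               end TOD value
```

As I said, though, small loops like this can be deceiving: they don't reproduce the cache and pipeline context of the full algorithm.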
An example: which of these code sequences do you suppose runs faster?
la rX,0(rI,rBase) rX -> array[i]
lg rY,0(,rX) rY = array[i]
agsi 0(rX),1 ++array[i]
* Now do something with rY
vs:
lg rY,0(rI,rBase) rY = array[i]
la rX,1(,rY) rX = rY + 1
stg rX,0(rI,rBase) effect is ++array[i]
* Now do something with rY
The first is substantially faster. I would have GUESSED that the second would
be faster, since I need the value in rY anyway. My current theory: in the
second sequence the LA (and hence the STG) must wait on the load result in rY,
while AGSI does its increment in storage with no such load-use dependency.
(I'm in 64-bit mode, so using "LOAD ADDRESS" as a 64-bit add is safe; in 24-
or 31-bit mode it would truncate the result.)
* Suggestions regarding branch prediction.
Spot on, and I already did that, too. Luckily branch prediction is one of the
few performance characteristics that's actually semi-architectural (see the
description of "BRANCH PREDICTION PRELOAD" in the Principles of Operation).
Consider these two code sequences:
loop ds 0h
* Compute a value in r1
cgrjne r2,r1,loop continue if not done
vs:
loop ds 0h
* Compute a value in r1
cgrje r2,r1,loopexit exit if done
j loop
loopexit ds 0h
The second sequence is MUCH faster - though I haven't yet tried using "BRANCH
PREDICTION PRELOAD" to alter the default prediction.
* Cache issues
This kernel unfortunately has little (data) cache locality; it's inherent in
the problem. What I HAVE found is that using "PREFETCH DATA" and specifying
store intent before the first access DOES help a bit. I haven't fiddled with
"NEXT INSTRUCTION ACCESS INTENT" yet; that seems like a potentially enormous
rathole. Using MVCL isn't an option; my data elements are small, aligned
multiples of doublewords, and I determined very early on that a load/store loop
way outperforms both MVCL and MVC.
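A sketch of what I mean, assuming a hypothetical 32-byte element with r4 pointing at the source and r5 at the destination (code 2 on PREFETCH DATA means prefetch with store intent):

```
         pfd   2,0(,r5)        prefetch destination, store intent
         lg    r0,0(,r4)       plain load/store copy; for these
         lg    r1,8(,r4)       small aligned elements it beats
         stg   r0,0(,r5)       both MVCL and MVC
         stg   r1,8(,r5)
         lg    r0,16(,r4)
         lg    r1,24(,r4)
         stg   r0,16(,r5)
         stg   r1,24(,r5)
```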
On the I-cache side: the kernel is very small (a few dozen instructions) and
chugs along for a good long time once it's called, so it's undoubtedly in
cache. Following David Bond's suggestion to align the tops of loops on 16-byte
boundaries helped a bit.
* The usual IBM-MAIN drift, reminiscences of old machines/code/operating
systems/management practices/etc.
Check. :-)
I'm still hoping that someone will reveal the existence of a document intended
for compiler writers...
-- Jerry
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN