Re: Is there a source for detailed, instruction-level performance info?

Farley, Peter x23353 Thu, 24 Dec 2015 10:29:18 -0800

Chris and Charles,

I get that instruction-level tuning isn't even remotely feasible or productive 
any more due to pipelining and complex cache issues.  My own experience in 
performance enhancing projects has validated advice I was given in this forum 
many years ago:  The only way you can test performance is to repeat your 
before-and-after tests 5-10 times and average the results.  That is far from 
reassuring to the average programmer, especially when the development 
environment may be so constrained or loaded that it looks nothing like the 
"real" production environment in which the code to be tuned actually runs.  
There are lies, there are d***ed lies and there are statistics.  Few 
programmers I know want to bet their job on averages.
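
The repeat-and-average advice above can be sketched as a small harness. This is a minimal illustration in Python, not anything posted in the thread: `time_trials`, `summarize`, `before` and `after` are hypothetical names, and the two workloads are stand-ins for real before-and-after code. Reporting the median and the spread alongside the mean makes it easier to judge whether a difference is larger than the run-to-run noise.

```python
import statistics
import time


def time_trials(func, trials=10):
    """Run func `trials` times and return the elapsed seconds of each run."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return mean, median and spread so one average is not the whole story."""
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }


def before():
    # Hypothetical "before" version: explicit accumulation loop.
    total = 0
    for i in range(200_000):
        total += i
    return total


def after():
    # Hypothetical "after" version: same result via the built-in sum.
    return sum(range(200_000))


if __name__ == "__main__":
    for name, func in (("before", before), ("after", after)):
        stats = summarize(time_trials(func, trials=10))
        print(name, {key: round(value, 6) for key, value in stats.items()})
```

If the two ranges between `min` and `max` overlap heavily, the measured improvement is probably within the noise of the test environment.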


However, that leaves regular application programmers in an untenable position 
when management says "reduce your MIPS/MSU's to control our costs" or "the 
batch window/the peak online volume can't be met anymore because the 
application is too slow, so get in there and speed it up!"

Performance analysis tools available to ordinary application folk like Strobe 
or TriTune (just to mention two with which I have worked) can help identify 
glaring individual CPU hotspots, but of course tuning the overall application 
program design is always likely to bring more productive results than simple 
hotspot elimination.  But there aren't any *design* analysis tools that I know 
of.  Only years on the job and wide experience of "the way to do things" seem 
to be able to tackle that side of the problem.

I once thought that the "design pattern" paradigm might provide some help, but 
as far as I can see it is useless in any practical sense for regular business 
application programming practice in HLL's and assembler.

Understanding the z/Linux GNU C optimization parameters would probably give a 
careful assembler programmer strong hints on "best practice", but that seems 
untenable to do by hand, even at the level of a one-page "critical process".  
And obviously most HLL programmers don't have the option, or even want to have 
the option, to tune at that level, deferring to the compiler's capabilities.

So what is an ordinary programmer to do?

Peter

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[email protected]] On Behalf 
Of Blaicher, Christopher Y.
Sent: Thursday, December 24, 2015 12:22 PM
To: [email protected]
Subject: Re: Is there a source for detailed, instruction-level performance info?

I have looked at the public documentation on the z13 and had the privilege to 
speak to some of the people behind parts of it, and it is an amazing machine.
The reason you can't say how long an instruction takes is that in many cases 
things are happening A) out of sequence; B) at the same time and C) dependent 
on cache hit ratios.
A z13 can be looking at up to about 50 instructions to see if there is anything 
it can do.  If one of those instructions is not dependent on something yet to 
be done, it will do it and hold on to the result until needed.  It also has 
many more registers than the 16 we think of, so if you have code using R1 and 
that is followed by a LHI R1,n instruction it may use R1 for the leading-in 
code and use one of the extra registers, call it register X27, to hold the 
value of the LHI.  When that value is needed the X27 register is used instead.  
The machine remembers all this.
Also, the z13 can be working on 6 instructions in parallel.
One of the big pains for the z13 is an unpredictable branch.  That is one that 
goes one way this time and the other way the next.  The machine has a lot of 
branch prediction stuff (didn't know what else to call it) in it so that it 
tries to know where it will go, but if it predicts wrong, there is a 26 cycle 
cost, and when you consider that hits at least 6 instructions, that is a 
non-trivial expense.
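
Taking the figures quoted above at face value, the cost of a wrong prediction can be put in rough numbers. The sketch below is plain arithmetic in Python. Treating 6 instructions per cycle as what the machine would otherwise have sustained is an optimistic assumption, so the first figure is an upper bound, and the function names are made up for illustration.

```python
MISPREDICT_PENALTY_CYCLES = 26  # figure quoted in the post
ISSUE_WIDTH = 6                 # "6 instructions in parallel", assumed as a best case


def lost_slots_per_mispredict(penalty=MISPREDICT_PENALTY_CYCLES, width=ISSUE_WIDTH):
    """Upper bound on instruction slots given up by one wrong prediction."""
    return penalty * width


def average_penalty_cycles(mispredict_rate, penalty=MISPREDICT_PENALTY_CYCLES):
    """Average cycles added to each branch for a given misprediction rate."""
    return mispredict_rate * penalty


if __name__ == "__main__":
    print("slots lost per misprediction (upper bound):", lost_slots_per_mispredict())
    for rate in (0.01, 0.10, 0.50):
        cycles = average_penalty_cycles(rate)
        print(f"{rate:.0%} mispredicted -> {cycles:.2f} extra cycles per branch")
```

A branch that is wrong half the time costs 13 cycles on average under these assumptions, which is why a branch that alternates direction is so much worse than one that almost always goes the same way.
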
That brings us to cache hit ratios.  A 1/10th of a second wait may seem like 
less than a blink of an eye, but it is forever in our high speed machines.  In that 
time all your code and data has probably been purged out of level 1 cache, and 
maybe out of level 2 cache.  Bringing it back into cache takes time, a few 
cycles for level 1, and a factor more for each level away from level 1.
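
To put "forever" in numbers, the short calculation below converts the tenth-of-a-second wait into processor cycles. It assumes a 5.0 GHz clock, the rate commonly quoted for the z13; the constant and function names are illustrative only.

```python
CLOCK_HZ = 5.0e9    # assumed clock rate (5.0 GHz, commonly quoted for the z13)
WAIT_SECONDS = 0.1  # the tenth-of-a-second wait from the post


def cycles_elapsed(seconds, clock_hz=CLOCK_HZ):
    """Number of clock cycles that pass during `seconds` of wall time."""
    return seconds * clock_hz


if __name__ == "__main__":
    cycles = cycles_elapsed(WAIT_SECONDS)
    print(f"{cycles:,.0f} cycles pass during a {WAIT_SECONDS} s wait")
```

That is on the order of half a billion cycles in which other work can displace the waiting program's code and data from the caches.
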
Today it is impossible to say how long an instruction takes.  It is even 
impossible to say how long a process takes because it varies based on what is 
in cache at the time.
Another thing that affects things is whether you get dispatched on the same 
processor or not.  If not, then all the level 1 cache has to be reloaded.
Bottom line, instruction speed is almost meaningless.  You have to look at it 
from a workload perspective.

Chris Blaicher
Technical Architect
Software Development
Syncsort Incorporated
50 Tice Boulevard, Woodcliff Lake, NJ 07677
P: 201-930-8234  |  M: 512-627-3803
E: [email protected]


-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[email protected]] On Behalf 
Of Charles Mills
Sent: Thursday, December 24, 2015 9:51 AM
To: [email protected]
Subject: Re: Is there a source for detailed, instruction-level performance info?

Not so simple anymore.

"How long does a store halfword take?" used to be a question that had an 
answer. It no longer does.

My working rule of thumb (admittedly grossly oversimplified) is "instructions 
take no time, storage references take forever." I have heard it said that 
storage is the new DASD. This is so true that the z13 processors implement 
a kind of "internal multiprogramming" so that one CPU internal thread can do 
something useful while another thread is waiting for a storage reference.

Here is an example of how complex it is. I am responsible for an "event" or 
transaction driven program. I of course have test programs that will run events 
through the subject software. How many microseconds does each event consume? 
One surprising factor is how fast you push the events through.
If I max out the speed of event generation (as opposed to, say, one event every 
tenth of a second) then on a real-world shared Z the microseconds of CPU per 
event falls in HALF! Same exact sequence of instructions -- half the CPU time! 
Why? My presumption is that if the program is running flat out it "owns" the 
caches and there is much less processor "wait" (for instruction and data fetch, 
not ECB type wait) time.
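
The experiment described above can be sketched roughly as follows. This is a hedged illustration in Python, not the actual test program: `handle_event` is a hypothetical stand-in for the real per-event work, and `cpu_microseconds_per_event` is an invented name. `time.process_time` counts CPU time for the process and excludes time spent sleeping, so the pause itself is not charged to the event, although the small cost of calling `sleep` is.

```python
import time


def handle_event(payload):
    # Hypothetical stand-in for the real per-event processing.
    return sum(payload)


def cpu_microseconds_per_event(n_events, pause_seconds=0.0):
    """Drive n_events through handle_event and return CPU microseconds per event.

    pause_seconds=0.0 pushes events flat out; a positive value paces them.
    """
    payload = list(range(1000))
    start = time.process_time()
    for _ in range(n_events):
        handle_event(payload)
        if pause_seconds:
            time.sleep(pause_seconds)
    cpu_seconds = time.process_time() - start
    return cpu_seconds / n_events * 1e6


if __name__ == "__main__":
    flat_out = cpu_microseconds_per_event(2000)
    paced = cpu_microseconds_per_event(20, pause_seconds=0.1)
    print(f"flat out: {flat_out:.1f} us CPU per event")
    print(f"paced:    {paced:.1f} us CPU per event")
```

Whether the paced run shows more CPU per event, and by how much, depends on the machine and on what else is competing for its caches at the time.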

Charles
-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[email protected]] On Behalf 
Of Thomas Kern
Sent: Wednesday, December 23, 2015 5:28 PM
To: [email protected]
Subject: Re: Is there a source for detailed, instruction-level performance info?

Perhaps what might be useful would be an assembler program to run loops of 
individual instructions and output some timing information.
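
A timing loop of the kind suggested here might look like the sketch below. It is written in Python with the standard `timeit` module rather than in assembler, so it times interpreter operations, not machine instructions; the operation list and the name `time_operations` are illustrative assumptions. It takes the best of several repeats and subtracts the cost of an empty loop.

```python
import timeit

# Hypothetical operations to time; each statement runs inside timeit's loop.
OPERATIONS = {
    "integer add": "a + b",
    "integer multiply": "a * b",
    "list index": "data[5]",
}
SETUP = "a = 12345; b = 678; data = list(range(10))"


def time_operations(number=100_000, repeat=5):
    """Return approximate nanoseconds per operation, net of empty-loop overhead."""
    baseline = min(timeit.repeat("pass", setup=SETUP, repeat=repeat, number=number))
    results = {}
    for name, stmt in OPERATIONS.items():
        best = min(timeit.repeat(stmt, setup=SETUP, repeat=repeat, number=number))
        # Clamp at zero: noise can make a cheap operation look faster than "pass".
        results[name] = max(best - baseline, 0.0) / number * 1e9
    return results


if __name__ == "__main__":
    for name, nanos in time_operations().items():
        print(f"{name:18s} {nanos:8.2f} ns")
```

A tight loop like this keeps everything in cache and perfectly predicted, so it reports something close to a best case rather than what the same operation costs inside a real workload.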

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions, send email to 
[email protected] with the message: INFO IBM-MAIN




