Chris and Charles, I get that instruction-level tuning isn't even remotely feasible or productive any more due to pipelining and complex cache issues. My own experience in performance-enhancing projects has validated advice I was given in this forum many years ago: The only way you can test performance is to repeat your before-and-after tests 5-10 times and average the results. That is far from reassuring to the average programmer, especially when the development environment may be so constrained or loaded that it looks nothing like the "real" production environment in which the code to be tuned actually runs. There are lies, there are d***ed lies and there are statistics. Few programmers I know want to bet their job on averages.
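For what it is worth, the repeat-and-average advice can be sketched in a few lines. This is only an illustration in Python, not anything from the thread: `time_runs`, `summarize`, and the `before`/`after` workloads are hypothetical names and stand-ins. It reports the spread next to the mean, since the mean alone is exactly the "bet your job on averages" problem.

```python
import statistics
import time


def time_runs(func, repeats=10):
    """Run func `repeats` times and return the CPU seconds used by each run."""
    samples = []
    for _ in range(repeats):
        start = time.process_time()  # CPU time of this process, not wall clock
        func()
        samples.append(time.process_time() - start)
    return samples


def summarize(samples):
    """Report the mean together with the spread of the samples."""
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }


def before():
    # Hypothetical "before" version: builds a string by repeated concatenation.
    s = ""
    for i in range(20000):
        s += str(i)
    return s


def after():
    # Hypothetical "after" version: same result via a single join.
    return "".join(str(i) for i in range(20000))


if __name__ == "__main__":
    for name, func in (("before", before), ("after", after)):
        stats = summarize(time_runs(func))
        print(name, {k: round(v, 6) for k, v in stats.items()})
```

If the min-to-max range of the "before" runs overlaps that of the "after" runs, the difference in means probably should not be trusted without more runs on a quieter system.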
However, that leaves regular application programmers in an untenable position when management says "reduce your MIPS/MSUs to control our costs" or "the batch window/the peak online volume can't be met anymore because the application is too slow, so get in there and speed it up!" Performance analysis tools available to ordinary application folk like Strobe or TriTune (just to mention two with which I have worked) can help identify glaring individual CPU hotspots, but of course tuning the overall application program design is always likely to bring more productive results than simple hotspot elimination. But there aren't any *design* analysis tools that I know of. Only years on the job and wide experience of "the way to do things" seem to be able to tackle that side of the problem. I once thought that the "design pattern" paradigm might provide some help, but as far as I can see it is useless in any practical sense for regular business application programming practice in HLLs and assembler. Understanding the z/Linux GNU C optimization parameters would probably give a careful assembler programmer strong hints on "best practice", but that seems untenable to do by hand, even at the level of a one-page "critical process". And obviously most HLL programmers don't have the option, or even want to have the option, to tune at that level, deferring instead to the compiler's capabilities.

So what is an ordinary programmer to do?

Peter

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[email protected]] On Behalf Of Blaicher, Christopher Y.
Sent: Thursday, December 24, 2015 12:22 PM
To: [email protected]
Subject: Re: Is there a source for detailed, instruction-level performance info?

I have looked at the public documentation on the z13 and had the privilege to speak to some of the people behind parts of it, and it is an amazing machine.
The reason you can't say how long an instruction takes is that in many cases things are happening A) out of sequence, B) at the same time, and C) dependent on cache hit ratios. A z13 can be looking at up to about 50 instructions to see if there is anything it can do. If one of those instructions is not dependent on something yet to be done, it will do it and hold on to the result until needed. It also has many more registers than the 16 we think of, so if you have code using R1 and that is followed by a LHI R1,n instruction, it may use R1 for the preceding code and use one of the extra registers, call it register X27, to hold the value of the LHI. When that value is needed, the X27 register is used instead. The machine remembers all this. Also, the z13 can be working on 6 instructions in parallel.

One of the big pains for the z13 is an unpredictable branch. That is one that goes one way this time and the other way the next. The machine has a lot of branch prediction stuff (didn't know what else to call it) in it so that it tries to know where it will go, but if it predicts wrong, there is a 26-cycle cost, and when you consider that hits at least 6 instructions, that is a non-trivial expense.

That brings us to cache hit ratios. A 1/10th of a second wait may seem like less than a blink of an eye, but it is forever in our high-speed machines. In that time all your code and data has probably been purged out of level 1 cache, and maybe out of level 2 cache. Bringing it back into cache takes time, a few cycles for level 1, and a factor more for each level away from level 1.

Today it is impossible to say how long an instruction takes. It is even impossible to say how long a process takes because it varies based on what is in cache at the time. Another thing that affects things is whether you get dispatched on the same processor or not. If not, then all the level 1 cache has to be reloaded. Bottom line, instruction speed is almost meaningless.
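To put rough numbers on the misprediction cost, here is a small back-of-the-envelope sketch in Python. It uses only the two figures quoted above (a 26-cycle penalty and 6 instructions in parallel). The model itself is an assumption for illustration: it pretends the machine would otherwise sustain the full 6 instructions every cycle, which real code rarely does, so treat the result as an upper bound rather than a measurement.

```python
MISPREDICT_PENALTY_CYCLES = 26  # figure quoted in the post
ISSUE_WIDTH = 6                 # "6 instructions in parallel", per the post


def lost_issue_slots(penalty=MISPREDICT_PENALTY_CYCLES, width=ISSUE_WIDTH):
    """Upper bound on instruction slots lost to one mispredicted branch.

    Assumes (optimistically) that every cycle could otherwise have
    started `width` instructions.
    """
    return penalty * width


def average_penalty_per_branch(miss_rate, penalty=MISPREDICT_PENALTY_CYCLES):
    """Expected penalty cycles per executed branch for a given miss rate."""
    return miss_rate * penalty


if __name__ == "__main__":
    print("slots lost per mispredict (upper bound):", lost_issue_slots())
    for rate in (0.01, 0.10, 0.50):
        cycles = average_penalty_per_branch(rate)
        print(f"miss rate {rate:.0%}: about {cycles:.2f} penalty cycles per branch")
```

Under these assumptions one wrong guess can cost up to 156 instruction slots, and a branch that is wrong half the time averages 13 penalty cycles every time it executes.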
You have to look at it from a workload perspective.

Chris Blaicher
Technical Architect
Software Development
Syncsort Incorporated
50 Tice Boulevard, Woodcliff Lake, NJ 07677
P: 201-930-8234 | M: 512-627-3803
E: [email protected]

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[email protected]] On Behalf Of Charles Mills
Sent: Thursday, December 24, 2015 9:51 AM
To: [email protected]
Subject: Re: Is there a source for detailed, instruction-level performance info?

Not so simple anymore. "How long does a store halfword take?" used to be a question that had an answer. It no longer does.

My working rule of thumb (admittedly grossly oversimplified) is "instructions take no time, storage references take forever." I have heard it said that storage is the new DASD. This is so true that the z13 processors implement a kind of "internal multiprogramming" so that one CPU internal thread can do something useful while another thread is waiting for a storage reference.

Here is an example of how complex it is. I am responsible for an "event" or transaction driven program. I of course have test programs that will run events through the subject software. How many microseconds does each event consume? One surprising factor is how fast you push the events through. If I max out the speed of event generation (as opposed to, say, one event every tenth of a second) then on a real-world shared Z the microseconds of CPU per event falls in HALF! Same exact sequence of instructions -- half the CPU time! Why? My presumption is that if the program is running flat out it "owns" the caches and there is much less processor "wait" (for instruction and data fetch, not ECB type wait) time.
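The shape of that experiment can be sketched as follows. This is a hypothetical Python stand-in, not the actual test driver: `handle_event` and `cpu_per_event` are invented names, and the event handler just sums a list. It compares CPU time per event with events pushed flat out against events paced a tenth of a second apart. Whether any difference shows up, and how large it is, depends on the machine and on what else is running.

```python
import time


def handle_event(payload):
    # Hypothetical stand-in for the real event handler: touch some data.
    return sum(payload)


def cpu_per_event(n_events, gap_seconds, payload):
    """Average CPU seconds per event when events arrive gap_seconds apart."""
    total = 0.0
    for _ in range(n_events):
        start = time.process_time()  # CPU time only; sleeping is not counted
        handle_event(payload)
        total += time.process_time() - start
        if gap_seconds:
            # Idle gap between events; other work on the machine may push
            # our code and data out of cache during this time.
            time.sleep(gap_seconds)
    return total / n_events


if __name__ == "__main__":
    payload = list(range(200000))
    flat_out = cpu_per_event(50, 0.0, payload)
    paced = cpu_per_event(50, 0.1, payload)
    print(f"flat out: {flat_out * 1e6:.1f} microseconds of CPU per event")
    print(f"paced:    {paced * 1e6:.1f} microseconds of CPU per event")
```

Only the handler is timed, so the sleep itself never counts as CPU. Any gap between the two figures comes from what happened to the caches and the dispatcher while the program sat idle.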
Charles

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[email protected]] On Behalf Of Thomas Kern
Sent: Wednesday, December 23, 2015 5:28 PM
To: [email protected]
Subject: Re: Is there a source for detailed, instruction-level performance info?

Perhaps what might be useful would be an assembler program to run loops of individual instructions and output some timing information.

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions, send email to [email protected] with the message: INFO IBM-MAIN
