All of these are pipeline issues. Bob Rodgers did a good presentation at SHARE in Boston in 2010, session 7534. There are several other papers out on the web that go into the internals of the machines in great detail.
First of all, there are two types of instructions: those that are executed 'on the chip', i.e. are in the logic of the chip, and those that are millicoded, i.e. basically internal subroutines. Millicoded instructions are generally the more complex ones and are generally interruptible. Often, an open-code instruction sequence can beat a millicoded instruction.

Another factor is pipeline pauses. There are several cycles lost (3 or 4?) between when a result is put in a register and when it can be used by a follow-on instruction. There are more cycles lost (8 or 9?) between loading or modifying a register and using it in address resolution.

It gets even more convoluted when you consider that the z9 and z10 are superscalar machines, i.e. they try to process multiple instructions per cycle. The z9 is a factor 2 machine and the z10 is a factor 3 machine. The considerations for instruction sequence optimization become quite complex.

In a few words, my advice is to use your favorite performance tool that shows you hot spots and then concentrate on them. I tell the people that I work with to ignore performance in one-time code, be aware of it in record-level code, and think about it in sub-record-level code. Your best friend is the performance tool.

Christopher Y. Blaicher
Senior Software Developer
BMC Software, Inc.
Austin Development Lab
phone: 512.340.6154
mobile: 512.627.3803
fax: 512.340.6647
10431 Morado Circle
Austin, TX 78759

-----Original Message-----
From: IBM Mainframe Assembler List [mailto:[email protected]] On Behalf Of Fred van der Windt
Sent: Tuesday, August 09, 2011 6:00 AM
To: [email protected]
Subject: Re: Pipeline question

> it took me some time to actually get what you do (not BASE64 - but the
> RISBG) .... my guess is that it is the RISBG itself which is very slow and apparently
> not very pipeline-friendly (and I have no idea about the reasons).
>
> How about using this
>
>          L     R1,0(,R4)           Load 4 source bytes
>          AHI   R4,4                R4 past 4 source bytes
>          SLDL  R0,6
>          SLL   R1,2
>          SLDL  R0,6
>          SLL   R1,2
>          SLDL  R0,6
>          SLL   R1,2
>          SLDL  R0,6
>          STCM  R0,B'0111',0(R6)    store 3 result bytes
>          AHI   R6,3
>
> I bet it is faster (and not only for the person looking at it).
>
> Not that I am against new stuff. I actually have been reverting to z10
> instructions in code written for the public (from z11 instructions).
> But there must be a gain for using them.

The code you suggest is almost identical to the code I replaced with the RISBG sequence:

         DO    FROM=(R2,)
           L     R0,0(,R4)
           LA    R4,4(,R4)
           SRDL  R0,8
           SRL   R0,2
           SRDL  R0,6
           SRL   R0,2
           SRDL  R0,6
           SRL   R0,2
           SRDL  R0,6
           STCM  R1,B'1110',0(R6)
           LA    R6,3(,R6)
         ENDDO

I shifted right from R0 toward R1 and you're doing it the other way around, but apart from that it is identical. And almost 2.5 times slower than the RISBG sequence! (I hope you're not a betting man.)

I guess the pipeline doesn't like this series of shift instructions either. It is probably just the fact that we're using 7 shift instructions to create 3 data bytes while the 7 RISBG instructions create 6 data bytes (sounds like a factor of 2), and we're only doing half the number of iterations (getting to almost 2.5?).

Fred!
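
For anyone who wants to repeat a comparison like the one above, here is a minimal timing sketch. It is only an illustration and not necessarily how the figures in this thread were measured. The labels LOOP, T0, T1 and ELAPSED and the iteration count are hypothetical. It assumes a z/Architecture machine with the store-clock-fast facility (STCKF), that the data areas are addressable through an established base register, and that the sequence under test leaves R1 and R2 alone or saves them.

```hlasm
* Timing sketch (assumptions: STCKF available, base register set up,
* LOOP/T0/T1/ELAPSED are hypothetical names chosen for illustration)
         LHI   R2,10000            hypothetical iteration count
         STCKF T0                  TOD clock before the loop
LOOP     DS    0H
*        (instruction sequence under test goes here)
         BRCT  R2,LOOP             decrement R2, repeat until zero
         STCKF T1                  TOD clock after the loop
         LG    R1,T1               end time
         SG    R1,T0               minus start time, in TOD units
         SRLG  R1,R1,12            TOD bit 51 ticks once per microsecond
         STG   R1,ELAPSED          elapsed microseconds
*
T0       DS    D                   start TOD value
T1       DS    D                   end TOD value
ELAPSED  DS    D                   result in microseconds
```

This measures elapsed time, so interrupts and dispatching delays are included. Running it several times and taking the lowest figure gives a fairer comparison between two sequences, and a proper performance tool remains the better way to find the hot spots in the first place.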
