All of these are pipeline issues.  Bob Rodgers gave a good presentation at SHARE 
in Boston in 2010, session 7534.  There are several other papers on the web 
that go into the internals of the machines in great detail.

First of all, there are two types of instructions: those that are executed 'on 
the chip', i.e. are in the logic of the chip, and those that are millicoded, 
i.e. basically internal subroutines.  Millicoded instructions are generally the 
more complex ones and are generally interruptible.  Often, an open-code 
instruction sequence can beat a millicoded instruction.

Another factor is pipeline pauses.  Several cycles are lost (3 or 4?) between 
when a result is put in a register and when it can be used by a follow-on 
instruction.  More cycles are lost (8 or 9?) between loading or modifying a 
register and using it in address resolution.
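
As a rough illustration of the second kind of pause (not taken from any 
measurement, and the cycle counts above are approximate), compare a load whose 
result is used right away as a base register with the same two loads separated 
by independent work.  The register equates R2-R6 and the particular filler 
instructions are assumptions chosen only for this sketch:

```hlasm
*        Case 1: R3 is loaded, then used at once as a base.
         L     R3,0(,R2)          R3 = address fetched from storage
         L     R5,0(,R3)          Needs R3 for address generation now
*
*        Case 2: same two loads, independent work in between.
         L     R3,0(,R2)          R3 = address fetched from storage
         LA    R4,4(,R4)          Independent: step source pointer
         AHI   R6,3               Independent: step target pointer
         L     R5,0(,R3)          R3 is used a few instructions later
```

Whether the reordering actually helps depends on the machine model and the 
surrounding code, so check it with a performance tool rather than taking it on 
faith.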

It gets even more convoluted when you consider that the z9 and z10 are 
superscalar machines, i.e. they try to process multiple instructions per 
cycle.  The z9 is a factor 2 machine and the z10 is a factor 3 machine.  The 
considerations for instruction sequence optimization become quite complex.

In a few words, my advice is to use your favorite performance tool that shows 
you hot spots and then concentrate on them.  I tell the people that I work with 
to ignore performance in one-time code, be aware of it in record-level code, 
and think about it in sub-record-level code.  Your best friend is the 
performance tool.


Christopher Y. Blaicher
Senior Software Developer
BMC Software, Inc.
Austin Development Lab

phone: 512.340.6154
mobile: 512.627.3803
fax: 512.340.6647

10431 Morado Circle
Austin, TX 78759



-----Original Message-----
From: IBM Mainframe Assembler List [mailto:[email protected]] On 
Behalf Of Fred van der Windt
Sent: Tuesday, August 09, 2011 6:00 AM
To: [email protected]
Subject: Re: Pipeline question

> it took me some time to actually get what you do (not BASE64 - but the
> RISBG) .... my guess is that the RISBG itself is very slow and apparently 
> not very pipeline-friendly (and I have no idea about the reasons).
>
> How about using this
>
>       L    R1,0(,R4)       Load 4 source bytes
>       AHI  R4,4            R4 past 4 source bytes
>       SLDL R0,6
>       SLL  R1,2
>       SLDL R0,6
>       SLL  R1,2
>       SLDL R0,6
>       SLL  R1,2
>       SLDL R0,6
>       STCM R0,B'0111',0(R6)  store 3 result bytes
>       AHI  R6,3
>
> I bet it is faster (and not only for the person looking at it).
>
> Not that I am against new stuff. I actually have been reverting to z10 
> instructions in code written for the public (from z11 instructions).
> But there must be a gain for using them.

The code you suggest is almost identical to the code I replaced with the RISBG 
sequence:

  DO    FROM=(R2,)
    L     R0,0(,R4)
    LA    R4,4(,R4)
    SRDL  R0,8
    SRL   R0,2
    SRDL  R0,6
    SRL   R0,2
    SRDL  R0,6
    SRL   R0,2
    SRDL  R0,6
    STCM  R1,B'1110',0(R6)
    LA    R6,3(,R6)
  ENDDO

I shifted right from R0 toward R1 and you're doing it the other way around, but 
apart from that it is identical. And almost 2.5 times slower than the RISBG 
sequence! (I hope you're not a betting man.) I guess the pipeline doesn't like 
this series of shift instructions either. It is probably just the fact that 
we're using 7 shift instructions to create 3 data bytes while the 7 RISBG 
instructions create 6 data bytes (sounds like a factor of 2), and we're only 
doing half the number of iterations (getting to almost 2.5?).

Fred!
