On Tue, 3 Jun 2014 15:27:48 +0000, DASDBILL2 wrote:

>What if you measured the total cpu time consumed by code such as
>the following to execute a truly huge number of only XR instructions
>and then divide by the number of XR instructions executed? I would
>think that this would be the smallest possible time for one XR; i.e.,
>the maximum possible pipelining with zero stalls.

I would expect that the code that you show below would stall the
pipeline for every XR instruction. Why? Because they all use the same
register, so each XR depends on the result of the one before it.
I wouldn't consider 128 million instructions to be "truly huge".

>         LAY   R0,1000000
>         LA    R1,LOOP1
>* force alignment here to a 256-byte boundary; i.e., the length of a cache line
>LOOP1    XR    R2,R2       first of 127 such XR instructions
>         XR    R2,R2       second of 127 such XR instructions
>         ...
>         XR    R2,R2       127th and last of 127 such XR instructions
>         BCTR  R0,R1       execute the previous 127 XR instructions one million times
>* at this point, we have filled one cache line with 127 consecutive XR
>instructions followed by the BCTR, and all 128 of these instructions
>fit exactly within one cache line.
>        ... end of loop.
>When finished performing the loop, we will have executed 127,000,000
>XR instructions and 1,000,000 BCTR instructions. Ignore the time used
>by the BCTR instructions. Divide total CPU time delta by 127,000,000
>to compute the approximate minimum time possible to do one XR
>instruction.
>
>Then do the same thing for an SR, an SLR, and a LR that is loading a
>register from another register that has been previously zeroed. This
>technique could also be done with 63 consecutive LA Rx,0 instructions.

If you are going to perform this test, I suggest that you run it ten
times for each instruction. I'll bet that you get as much variation
among the XR tests as you do between XR and SR.

-- 
Tom Marchant
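
For anyone who does run the test, here is a rough sketch (in Python, just for the arithmetic) of how the results could be reduced: divide each CPU time delta by 127,000,000, then check whether the gap between two instructions is larger than the run-to-run spread. The function names and the sample numbers are made up for illustration; they are not measurements from any machine.

```python
from statistics import mean

# 127 XR instructions per pass, one million passes (from the quoted loop).
XR_PER_RUN = 127 * 1_000_000


def per_instruction_ns(cpu_seconds, count=XR_PER_RUN):
    """Approximate time per instruction in nanoseconds, ignoring the BCTR."""
    return cpu_seconds / count * 1e9


def spread(samples):
    """Run-to-run variation, taken here simply as max minus min."""
    return max(samples) - min(samples)


def difference_is_distinguishable(runs_a, runs_b):
    """True only if the gap between the two means exceeds the larger spread."""
    gap = abs(mean(runs_a) - mean(runs_b))
    return gap > max(spread(runs_a), spread(runs_b))


if __name__ == "__main__":
    # Hypothetical CPU time deltas in seconds, ten runs each. NOT real data.
    xr_seconds = [0.0312, 0.0305, 0.0331, 0.0309, 0.0318,
                  0.0327, 0.0303, 0.0315, 0.0322, 0.0310]
    sr_seconds = [0.0314, 0.0308, 0.0329, 0.0311, 0.0320,
                  0.0325, 0.0306, 0.0317, 0.0319, 0.0313]

    xr_ns = [per_instruction_ns(s) for s in xr_seconds]
    sr_ns = [per_instruction_ns(s) for s in sr_seconds]

    print("XR mean ns:", mean(xr_ns), "spread:", spread(xr_ns))
    print("SR mean ns:", mean(sr_ns), "spread:", spread(sr_ns))
    print("distinguishable:", difference_is_distinguishable(xr_ns, sr_ns))
```

Max minus min is a crude measure of variation; a standard deviation or a proper significance test would be a reasonable substitute. The point is only that a single run per instruction cannot tell you whether a difference is real.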
