John P. Baker wrote:
> Is it the fastest when run on current hardware? That may be impossible to
> determine. Even with millicode involvement, the compactness of the code
> will ensure that everything is cached, so that the resultant execution time
> "may" equal or even "better" the non-millicode solutions which make use of
> iterative loops.
The TR instruction will always be slower than equivalent instructions
for strings shorter than 'n' bytes because of the startup costs
associated with invoking a millicode instruction (subroutine) -- where
'n' might vary slightly from one hardware generation to the next.
(When I first measured this phenomenon on a z800 processor, CASE1 below
ran faster than CASE2 for 'n' <= 9.)
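|CASE1    DC    0H
|         LA    R2,9                byte count
|         LA    R3,DATA             -> first byte to translate
|         XR    R4,R4               clear index register
|CASE1L1  DS    0H
|         IC    R4,0(,R3)           fetch source byte
|         IC    R4,EBCDIC(R4)       look up its translation
|         STC   R4,0(,R3)           store translated byte
|         AHI   R3,1                advance to next byte
|         JCT   R2,CASE1L1          loop until count is exhausted
|CASE1L   EQU   *-CASE1             code length of CASE1
|CASE2    DC    0H
|         TR    DATA(9),EBCDIC      translate all 9 bytes at once
|CASE2L   EQU   *-CASE2             code length of CASE2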
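Neither case shows the DATA and EBCDIC operands it references. A
minimal sketch of what they might look like follows. The field contents
and the upper-case folding table are my own illustration, not the
definitions used in the measurements; any 256-byte table will do.
|DATA     DC    CL9'abcdefghi'      hypothetical 9-byte field to translate
|EBCDIC   DC    256AL1(*-EBCDIC)    identity table: entry n contains n
|         ORG   EBCDIC+X'81'        overlay entries for lowercase a-i
|         DC    C'ABCDEFGHI'
|         ORG   EBCDIC+X'91'        overlay entries for lowercase j-r
|         DC    C'JKLMNOPQR'
|         ORG   EBCDIC+X'A2'        overlay entries for lowercase s-z
|         DC    C'STUVWXYZ'
|         ORG   ,                   resume at the end of the table
With this table, either case would leave DATA containing 'ABCDEFGHI'.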
"Unrolling" the CASE1 loop as shown below (to avoid pipeline interlocks)
raised the 'n' threshold value to 24.
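|Stride   EQU   3                   bytes handled per loop iteration
|CASE1    DC    0H
|         LA    R0,9/Stride         iteration count
|         LA    R3,DATA             -> first byte to translate
|         XR    R4,R4               clear the three index registers
|         XR    R5,R5
|         XR    R6,R6
|CASE1L1  DS    0H
|         IC    R4,0(,R3)           fetch three source bytes
|         IC    R5,1(,R3)
|         IC    R6,2(,R3)
|         IC    R4,EBCDIC(R4)       look up their translations
|         IC    R5,EBCDIC(R5)
|         IC    R6,EBCDIC(R6)
|         STC   R4,0(,R3)           store the translated bytes
|         STC   R5,1(,R3)
|         STC   R6,2(,R3)
|         AHI   R3,Stride           advance to the next group
|         JCT   R0,CASE1L1          loop until count is exhausted
|CASE1L   EQU   *-CASE1             code length of CASE1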
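If the length is known only at run time, the threshold can be used to
choose between the two techniques. The following is a rough sketch
rather than anything I measured. The XLATE labels and the THRESH symbol
are hypothetical, Stride is the EQU from the unrolled loop, and the
routine assumes R2 = length (1 to 256), R3 -> data, and a base register
covering EBCDIC and XLATEI. The extra CHI/AHI in the loop adds
overhead, so the crossover point would need to be measured again on the
target machine.
|THRESH   EQU   24                  assumed crossover; measure your own
|XLATE    DS    0H
|         CHI   R2,THRESH           longer than the threshold?
|         JH    XLATETR             yes, let TR do the work
|         XR    R4,R4               clear the three index registers
|         XR    R5,R5
|         XR    R6,R6
|XLATE3   DS    0H
|         CHI   R2,Stride           at least one full group left?
|         JL    XLATE1              no, finish byte by byte
|         IC    R4,0(,R3)           fetch three source bytes
|         IC    R5,1(,R3)
|         IC    R6,2(,R3)
|         IC    R4,EBCDIC(R4)       look up their translations
|         IC    R5,EBCDIC(R5)
|         IC    R6,EBCDIC(R6)
|         STC   R4,0(,R3)           store the translated bytes
|         STC   R5,1(,R3)
|         STC   R6,2(,R3)
|         AHI   R3,Stride           advance to the next group
|         AHI   R2,-Stride          reduce the remaining length
|         J     XLATE3
|XLATE1   DS    0H
|         LTR   R2,R2               any residual bytes?
|         JZ    XLATEX              no, done
|XLATE1L  DS    0H
|         IC    R4,0(,R3)           translate one residual byte
|         IC    R4,EBCDIC(R4)
|         STC   R4,0(,R3)
|         AHI   R3,1
|         JCT   R2,XLATE1L
|         J     XLATEX
|XLATETR  DS    0H
|         BCTR  R2,0                machine length is length minus 1
|         EX    R2,XLATEI           execute TR with the run-time length
|XLATEX   DS    0H
|*        Place the following with the constants, outside the code path.
|XLATEI   TR    0(0,R3),EBCDIC      target of EX only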
For more information about zSeries millicode, see
http://researchweb.watson.ibm.com/journal/rd/483/heller.html.
--
Edward E Jaffe
Phoenix Software International, Inc
5200 W Century Blvd, Suite 800
Los Angeles, CA 90045
310-338-0400 x318
[EMAIL PROTECTED]
http://www.phoenixsoftware.com/