Art Celestini wrote:
It seems that the TRE instruction has been in z/Arch for at least a few
years.  If anyone is inclined to try this, it would be interesting to see
how it fares against Ed Jaffe's code:

      XR   R1,R1             Clear for insert
      L    R15,Length        Load string length
Loop  IC   R1,Input-1(R15)   Get input byte
      IC   R0,XlatTab(R1)    Get translated character ...
      STC  R0,Output-1(R15)  ... and store it in output
      BCT  R15,Loop          Decrement length & loop until done

I believe the OP said that the data to be translated had to first be moved
from one buffer to another.  The above does that, but a move of some type
needs to be added to Ed's code to make it a true comparison.
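
For a fixed-length field, adding the move ahead of an in-place TR might look like the sketch below. It assumes the code being compared translates in place with TR. Input, Output and XlatTab are the labels from the loop above, and FldLen is a hypothetical assembly-time length of at most 256 bytes.

```hlasm
FldLen   EQU   9                       Hypothetical length, 1 to 256
         MVC   Output(FldLen),Input    Copy source to output buffer
         TR    Output(FldLen),XlatTab  Translate the copy in place
```

A length known only at run time would need the MVC and TR to be driven by EX instead.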

Some years ago, on our z800 processor, we measured the performance of in-place TR against a software-coded loop. We found that the loop was faster than TR for strings shorter than nine bytes. When we spoke to IBM about this, we learned that TR had been partially moved into millicode for the z900/z800. It ran slower for short strings because of the millicode start/stop (aka "subroutine linkage") costs. For strings longer than nine bytes, TR was faster because it had access to a hardware facility that could translate two bytes per cycle. The code fragments we compared were:

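     |CASE1    DS    0H
     |         LA    R2,9               Byte count
     |         LA    R3,DATA            Point to first byte
     |         XR    R4,R4              Clear for insert
     |CASE1L1  DS    0H
     |         IC    R4,0(,R3)          Get input byte
     |         IC    R4,EBCDIC(R4)      Get translated byte
     |         STC   R4,0(,R3)          Store it back in place
     |         AHI   R3,1               Advance to next byte
     |         JCT   R2,CASE1L1         Loop until done
     |CASE1L   EQU   *-CASE1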


     |CASE2    DS    0H
     |         TR    DATA(9),EBCDIC
     |CASE2L   EQU   *-CASE2


We later "unrolled" the loop, interleaving the use of three different registers, and found it was now faster than TR for strings of 24 bytes or fewer!

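     |Stride   EQU   3
     |CASE1    DS    0H
     |         LA    R0,9/Stride        Iterations, Stride bytes each
     |         LA    R3,DATA            Point to first byte
     |         XR    R4,R4              Clear all three ...
     |         XR    R5,R5              ... work registers ...
     |         XR    R6,R6              ... for insert
     |CASE1L1  DS    0H
     |         IC    R4,0(,R3)          Get three input bytes
     |         IC    R5,1(,R3)
     |         IC    R6,2(,R3)
     |         IC    R4,EBCDIC(R4)      Get three translated bytes
     |         IC    R5,EBCDIC(R5)
     |         IC    R6,EBCDIC(R6)
     |         STC   R4,0(,R3)          Store them back in place
     |         STC   R5,1(,R3)
     |         STC   R6,2(,R3)
     |         AHI   R3,Stride          Advance to next group
     |         JCT   R0,CASE1L1         Loop until done
     |CASE1L   EQU   *-CASE1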


The results of the above experiments suggest that your loop has an excellent chance of being faster than *any* sequence involving TR or TRE, for strings shorter than some number of bytes 'n', on any given hardware generation supporting z/Architecture.
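
One way to act on such a crossover is to test the length and pick the path at run time. The sketch below reuses the labels from the quoted loop (Length, Input, Output, XlatTab). Thresh is a placeholder value, not a measurement, and the length is assumed to be between 1 and 256 so that a single MVC and TR cover the long case. Xlate, UseTR, Done, MoveIt and XlatIt are hypothetical labels.

```hlasm
Thresh   EQU   24                 Placeholder, tune per machine
Xlate    DS    0H
         L     R15,Length         Load string length (1 to 256)
         CHI   R15,Thresh         Longer than the crossover?
         JH    UseTR              Yes, let TR do the work
         XR    R1,R1              Clear for insert
Loop     IC    R1,Input-1(R15)    Get input byte
         IC    R0,XlatTab(R1)     Get translated character ...
         STC   R0,Output-1(R15)   ... and store it in output
         BCT   R15,Loop           Loop until done
         J     Done
UseTR    BCTR  R15,0              Machine length is length minus 1
         EX    R15,MoveIt         Copy Input to Output
         EX    R15,XlatIt         Translate Output in place
Done     DS    0H
*        ... rest of routine; the EX targets sit outside the flow
MoveIt   MVC   Output(0),Input    Executed via EX
XlatIt   TR    Output(0),XlatTab  Executed via EX
```

The right value for Thresh would have to come from timing both paths on the target machine, since the crossover moves with each hardware generation.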

--
Edward E Jaffe
Phoenix Software International, Inc
5200 W Century Blvd, Suite 800
Los Angeles, CA 90045
310-338-0400 x318
[EMAIL PROTECTED]
http://www.phoenixsoftware.com/

