Re: Long translate (TR) instruction?

Art Celestini Thu, 27 Mar 2008 14:05:32 -0700

William:

Thank you for taking the time to give this a try.  I had heard some horror
stories about TR performance being "disappointing" on some earlier z/Arch
machines and I was wondering if it was pervasive.  Obviously not.


Not to be a nit-picker, but the OP (Kirk Wolf) said, "I'm looking for the 
fastest way in assembler to translate data in one buffer to another using 
a 256-byte translate table," which is part of what prompted me to suggest
the open-code solution that I did, since it includes a "move" from one 
buffer to another as part of the process.  I'm convinced that TRE and TR
are faster but it seems that a truly "fair" comparison of solutions to the 
stated problem should have included equivalent moves in the TRE and TR 
solutions.  
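
To make "equivalent moves" concrete, here is an untested sketch of a
move-plus-translate TR routine.  It assumes the same linkage as the NOTR
routine quoted below (R14 = input, R15 = output, R0 = length, R1 = table,
return via R8), the usual R0-R15 register equates, and non-overlapping
buffers.  The label names are hypothetical, and it has not been timed
against the other routines.

```hlasm
MVTR     LTR   R2,R0                   COPY LENGTH AND TEST FOR ZERO
         BNPR  R8                      RETURN IF LENGTH WAS NOT > 0
         BCTR  R2,0                    -1 FOR 0-ORIGIN
MVTRL    CHI   R2,256                  256 BYTES (OR FEWER) REMAIN?
         BL    MVTRX                   YES, GO HANDLE LAST PIECE
         MVC   0(256,R15),0(R14)       NO, COPY NEXT 256 BYTES
         TR    0(256,R15),0(R1)        TRANSLATE THEM IN OUTPUT BUFFER
         AHI   R2,-256                 CALCULATE LENGTH REMAINING
         AHI   R14,256                 INCREMENT INPUT POINTER
         AHI   R15,256                 INCREMENT OUTPUT POINTER
         B     MVTRL                   LOOP BACK TO DO NEXT 256 BYTES
MVTRX    EX    R2,MVTRMVC              COPY LAST PIECE OF BUFFER
         EX    R2,MVTRTR               TRANSLATE LAST PIECE OF BUFFER
         BR    R8                      RETURN TO CALLER
MVTRMVC  MVC   0(*-*,R15),0(R14)       EXECUTED: LENGTH-1 FROM R2
MVTRTR   TR    0(*-*,R15),0(R1)        EXECUTED: LENGTH-1 FROM R2
```

Each piece is copied with MVC and then translated in place in the output
buffer, so the input buffer is never modified.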

-- Art C.



At 05:33 AM 3/26/2008, William H. Blair wrote:
  
>Edward Jaffe wrote:
>
>> The following fragment should work if you prefer looping 
>> TRE over traditional TR. TRE requires you to manually 
>> translate the so-called "stop" character with an MVC. 
>> But, at least there's no EXecute for the final segment.
>>
>>   LM   R14,R15,xxxxxx           Load string ptr and its length
>>   LA   R1,xxxxxx                Ptr to translation table
>>   XR   R0,R0                    Set stop char = x'00'
>>   DO INF                        Do for translate
>>     TRE   R14,R1                  Translate the string
>>     DOEXIT Z                      Exit if no more data
>>     IF O                          If iterate needed
>>       ITERATE ,                     Process another segment
>>     ENDIF ,                       EndIf
>>     MVC   0(1,R14),0(R1)          Translate x'00' to whatever
>>     LA    R14,1(,R14)             Advance past stop character
>>     AHI   R15,-1                 Decrement length remaining
>>     DOEXIT NP                    Exit if no more data
>>   ENDDO ,                       EndDo for translate
>
>Art Celestini wrote:
>
>> It seems that the TRE instruction has been in z/Arch for at 
>> least a few years.  If anyone is inclined to try this:
>> 
>>       XR   R1,R1             Clear for insert
>>       L    R15,Length        Load string length
>> Loop  IC   R1,Input-1(R15)   Get input byte
>>       IC   R0,XlatTab(R1)    Get translated character ...
>>       STC  R0,Output-1(R15)  ... and store it in output
>>       BCT  R15,Loop          Decrement length & loop until done
>> 
>> it would be interesting to see how it fares against 
>> Ed Jaffe's code.
>
>I did this, since I had a program I could just plug these code
>segments into without doing a lot of work.  Results are below.
>
>> I believe the OP said that the data to be translated had to 
>> first be moved from one buffer to another.  The above does 
>> that, but a move of some type needs to be added to Ed's code 
>> to make it a true comparison.
>
>Maybe, maybe not. I've got code that needs to translate stuff
>in a buffer and it does not need it moved. And I have other
>code that first moves it and then translates it, because it
>doesn't want to clobber what it's translating. But, I did it
>both ways, just to find out for sure if it made a difference.
>It does not. The TRE loop is so much faster for any substantial
>number of bytes (which I define as more than 256, since that
>number or less can be handled directly, inline, simply by using 
>the TR instruction) that the overhead of even an MVCL does not
>even begin to eat into the gain from using a TRE loop. So, the
>fact that, with whatever TRE loop subroutine or macro you might
>whip up, you first have to move the data to be translated if you
>do not want the original data clobbered is simply not relevant
>from a performance perspective. Since there is no use for the 
>non-TRE loop subroutine (because its performance is horrible
>for any substantial number of bytes), we are left with the TRE
>or TR subroutines, which translate the data directly in the 
>buffer provided, which is what most programmers would want to 
>have available to call most of the time anyway, IMHO. If not, 
>then they would first have had to move the data to some other 
>buffer before TR'ing it anyway.
>
>As you will see below, the TRE loop beat the byte-by-byte loop
>once I gave it more than 7 to 19 bytes (the breakpoint varied
>from run to run). I'd never give it that few, since for
>anything <= 256 I'd just code a TR inline. But if
>I didn't know how many bytes, then you can see that there is
>plenty of CPU time left to test for 256 or less and do a TR
>inline if so, or else call the TR[E] subroutine if I had more
>than 256.  Regardless, an ordinary TR loop is still faster 
>than using TRE. But this is what you would expect. The TR
>loop code is not any more complicated than the TRE loop code
>in the first place. It's just different. TRE does not replace
>TR. It's for another purpose, basically, not for performance.
>
>I revised the code above to suit my own personal taste and
>needs. I made an improvement in the TRE subroutine proposed 
>by Edward Jaffe to allow the caller to specify the "test" 
>character, so that performance will not suffer if the data 
>to be translated contains a lot of null bytes (as Ed's would). 
>That meant that the MVC had to become an IC + STC. 
>
>Here is the code for the subroutines I called repeatedly to
>gather the timing figures: 
>
>**------------------------------------------------------------------
>**                                                                   
>** NOTE: ENTER VIA    BAS   R8,NOTR     WITH REGS SET AS FOLLOWS:   
>**       R14 = INPUT BUFFER ADDRESS                                 
>**       R15 = OUTPUT BUFFER ADDRESS                                
>**       R0  = LENGTH OF BOTH INPUT AND OUTPUT BUFFER (MAY BE ZERO) 
>**       R1  = 256-BYTE TRANSLATE TABLE ADDRESS                     
>**                                                                   
>**------------------------------------------------------------------
>NOTR     LTR   R2,R0                COPY LENGTH AND TEST FOR ZERO       
>         BZR   R8                   RETURN IMMEDIATELY IF LENGTH=0      
>         BCTR  R15,0                1 BYTE IN FRONT OF OUTPUT BUFFER    
>         BCTR  R14,0                1 BYTE IN FRONT OF INPUT  BUFFER    
>         XR    R3,R3                CLEAR FOR IC (USED AS INDEX REG)    
>LOOP     IC    R3,0(R2,R14)         GET 1 BYTE STARTING FROM THE END    
>         IC    R0,0(R3,R1)          TRANSLATE THAT BYTE USING TABLE     
>         STC   R0,0(R2,R15)         PUT TRANSLATED BYTE IN OUTPUT BFR   
>         BCT   R2,LOOP              ADJUST LENGTH, LOOP IF MORE TO DO   
>         BR    R8                   RETURN TO CALLER                    
>                                                                        
>**------------------------------------------------------------------
>**                                                                  
>** NOTE: ENTER VIA    BAS   R8,TRE      WITH REGS SET AS FOLLOWS:   
>**       R14 = BUFFER ADDRESS                                       
>**       R15 = LENGTH OF BUFFER (MAY BE ZERO)                      
>** [LOB] R0  = TEST CHARACTER. CAN BE ANY CHARACTER. BUT FOR        
>**             PERFORMANCE REASONS, IT SHOULD BE ONE THAT IS        
>**             THE LEAST LIKELY TO APPEAR, OR NEVER APPEARS,        
>**             IN THE BUFFER.  NOTE THAT, IN MOST INSTANCES,        
>**             THE X'00' (NULL) AND X'40' (BLANK) CHARACTERS        
>**             ARE NOT LIKELY TO BE THE BEST CHOICE FOR THIS.       
>**       R1  = 256-BYTE TRANSLATE TABLE ADDRESS                     
>**                                                                  
>**------------------------------------------------------------------
>TRE      LHI   R2,X'FF'                 SET R2 = X'000000FF'            
>         NR    R2,R0                    ISOLATE STOP CHARACTER          
>TREL     TRE   R14,R1                   TRANSLATE THE STRING            
>         BZR   R8                       RETURN IF NO MORE DATA          
>         BO    TREL                     REISSUE IF MORE TO DO           
>         IC    R3,0(R2,R1)              COPY BYTE IN TRANSLATE TABLE    
>         STC   R3,0(,R14)                AT OFFSET OF TEST CHARACTER    
>         LA    R14,1(,R14)              ADVANCE PAST TEST CHARACTER     
>         AHI   R15,-1                   DECREMENT LENGTH REMAINING      
>         BP    TREL                     CONTINUE IF MORE DATA REMAINS   
>         BR    R8                       RETURN TO CALLER  
>
>**-------------------------------------------------------------------
>**                                                                   
>** NOTE: ENTER VIA    BAS   R8,TR       WITH REGS SET AS FOLLOWS:    
>**       R14 = BUFFER ADDRESS                                        
>**       R15 = LENGTH OF BUFFER (MAY BE ZERO)                        
>**       R1  = 256-BYTE TRANSLATE TABLE ADDRESS                      
>**                                                                   
>**-------------------------------------------------------------------
>TR       AHI   R15,-1                  -1 FOR 0-ORIGIN               
>         BMR   R8                      RETURN IF LENGTH WAS NOT > 0  
>TRL      CHI   R15,256                 256 CHARS (OR LESS) REMAIN?   
>         BL    TRLX                    YES, GO TRANSLATE LAST PIECE  
>         TR    0(256,R14),0(R1)        NO, TRANSLATE FIRST/NEXT 256  
>         AHI   R15,-256                CALCULATE LENGTH REMAINING    
>         AHI   R14,256                 INCREMENT BUFFER POINTER      
>         B     TRL                     LOOP BACK TO DO NEXT 256 BYTES
>TRLX     EX    R15,TREX                TRANSLATE LAST PIECE OF BUFFER
>         BR    R8                                                    
>TREX     TR    0(*-*,R14),0(R1)        TRANSLATE LAST 1 TO 256 BYTES      
> 
>
>The following are the results I obtained on a 2094-714 / z9-109:
>
>NO TR(E) USED 14.434348 SEC TO XLATE X'0800' BYTES 1,000,000 TIMES
>TRE LOOP USED 1.213316 SEC TO XLATE X'0800' BYTES 1,000,000 TIMES
>TR  LOOP USED 1.030490 SEC TO XLATE X'0800' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 7.142552 SEC TO XLATE X'0400' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.683257 SEC TO XLATE X'0400' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.477014 SEC TO XLATE X'0400' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 3.578476 SEC TO XLATE X'0200' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.445753 SEC TO XLATE X'0200' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.200682 SEC TO XLATE X'0200' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 1.797026 SEC TO XLATE X'0100' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.239344 SEC TO XLATE X'0100' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.031291 SEC TO XLATE X'0100' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 1.351607 SEC TO XLATE X'00C0' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.210312 SEC TO XLATE X'00C0' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.031306 SEC TO XLATE X'00C0' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.906534 SEC TO XLATE X'0080' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.198645 SEC TO XLATE X'0080' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.031321 SEC TO XLATE X'0080' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.462027 SEC TO XLATE X'0040' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.114426 SEC TO XLATE X'0040' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.031289 SEC TO XLATE X'0040' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.238364 SEC TO XLATE X'0020' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.102487 SEC TO XLATE X'0020' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.031304 SEC TO XLATE X'0020' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.224285 SEC TO XLATE X'001E' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.106776 SEC TO XLATE X'001E' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.031290 SEC TO XLATE X'001E' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.210406 SEC TO XLATE X'001C' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.108267 SEC TO XLATE X'001C' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.031289 SEC TO XLATE X'001C' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.196512 SEC TO XLATE X'001A' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.107565 SEC TO XLATE X'001A' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.031326 SEC TO XLATE X'001A' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.182702 SEC TO XLATE X'0018' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.101648 SEC TO XLATE X'0018' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.031298 SEC TO XLATE X'0018' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.168835 SEC TO XLATE X'0016' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.096894 SEC TO XLATE X'0016' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.031372 SEC TO XLATE X'0016' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.154890 SEC TO XLATE X'0014' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.097152 SEC TO XLATE X'0014' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.031300 SEC TO XLATE X'0014' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.140851 SEC TO XLATE X'0012' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.098532 SEC TO XLATE X'0012' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.031297 SEC TO XLATE X'0012' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.126914 SEC TO XLATE X'0010' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.069528 SEC TO XLATE X'0010' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.031306 SEC TO XLATE X'0010' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.113049 SEC TO XLATE X'000E' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.072118 SEC TO XLATE X'000E' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.031289 SEC TO XLATE X'000E' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.099111 SEC TO XLATE X'000C' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.071591 SEC TO XLATE X'000C' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.031296 SEC TO XLATE X'000C' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.085172 SEC TO XLATE X'000A' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.071743 SEC TO XLATE X'000A' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.031312 SEC TO XLATE X'000A' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.071273 SEC TO XLATE X'0008' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.068731 SEC TO XLATE X'0008' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.029558 SEC TO XLATE X'0008' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.064337 SEC TO XLATE X'0007' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.057373 SEC TO XLATE X'0007' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.029557 SEC TO XLATE X'0007' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.057381 SEC TO XLATE X'0006' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.057379 SEC TO XLATE X'0006' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.029560 SEC TO XLATE X'0006' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.050431 SEC TO XLATE X'0005' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.057365 SEC TO XLATE X'0005' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.029547 SEC TO XLATE X'0005' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.043480 SEC TO XLATE X'0004' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.057370 SEC TO XLATE X'0004' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.029558 SEC TO XLATE X'0004' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.036505 SEC TO XLATE X'0003' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.057389 SEC TO XLATE X'0003' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.029558 SEC TO XLATE X'0003' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.029555 SEC TO XLATE X'0002' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.057363 SEC TO XLATE X'0002' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.029559 SEC TO XLATE X'0002' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.022616 SEC TO XLATE X'0001' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.057450 SEC TO XLATE X'0001' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.029030 SEC TO XLATE X'0001' BYTES 1,000,000 TIMES
>
>NO TR(E) USED 0.008419 SEC TO XLATE X'0000' BYTES 1,000,000 TIMES
>TRE LOOP USED 0.016833 SEC TO XLATE X'0000' BYTES 1,000,000 TIMES
>TR  LOOP USED 0.005217 SEC TO XLATE X'0000' BYTES 1,000,000 TIMES
>
>As you can see, for fewer than about 6 characters (on other runs
>I made, the crossover was as high as 19 characters, so the
>breakpoint is anywhere between 5 and 24, I suspect), the non-TR[E]
>(that is, the byte-by-byte) version runs faster than the TRE code.
>But for more than 256 characters the TRE loop certainly runs a lot
>faster than the byte-by-byte version!  As most bright folks here
>should naturally expect, however, nothing beats an ordinary TR loop,
>in the same manner that an ordinary MVC loop usually beats an MVCL,
>except for a very large number of bytes. 
>
>If anybody wants the entire program so that they can run it on
>their machine, or change the data to be translated that I used,
>please let me know. It does not depend on any macros other than
>what is in SYS1.MACLIB.
>
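
The length test William describes (plain TR for 256 bytes or fewer, the
subroutine otherwise) might be coded along these lines.  This is an
untested sketch that assumes R14 = buffer address, R15 = length, and
R1 = table address, as in his TR routine, plus the usual R0-R15 register
equates.  The labels XLATE, XLINL, XLTR, and XLDONE are hypothetical.

```hlasm
XLATE    LTR   R15,R15                 ANY DATA TO TRANSLATE?
         BNP   XLDONE                  NO, NOTHING TO DO
         CHI   R15,256                 MORE THAN 256 BYTES?
         BNH   XLINL                   NO, HANDLE IT INLINE
         BAS   R8,TR                   YES, CALL TR LOOP SUBROUTINE
         B     XLDONE                  DONE WITH THIS BUFFER
XLINL    BCTR  R15,0                   -1 FOR 0-ORIGIN (EXECUTE)
         EX    R15,XLTR                TRANSLATE 1 TO 256 BYTES INLINE
         B     XLDONE                  BRANCH AROUND EXECUTED TR
XLTR     TR    0(*-*,R14),0(R1)        EXECUTED: LENGTH-1 FROM R15
XLDONE   DS    0H                      TRANSLATION COMPLETE
```

When the length is not known at assembly time, the inline path still
needs an EXecute, and it leaves R15 decremented by one.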



==================================================
Art Celestini       Celestini Development Services
Phone: 201-670-1674                    Wyckoff, NJ
=============  http://celestini.com  =============
Mail sent to the "From" address  used in this post
will be rejected by our server.   Please send off-
list email to:  ibmmain<at-sign>celestini<dot>com.
==================================================

