Re: Long translate (TR) instruction?

2008-03-27 Thread Art Celestini
William:

Thank you for taking the time to give this a try.  I had heard some horror
stories about TR performance being disappointing on some earlier Z/Arch
machines and I was wondering if it was pervasive.  Obviously not.

Not to be a nit-picker, but the OP (Kirk Wolf) said, "I'm looking for the
fastest way in assembler to translate data in one buffer to another using
a 256-byte translate table," which is part of what prompted me to suggest
the open-code solution that I did, since it includes a move from one
buffer to another as part of the process.  I'm convinced that TRE and TR
are faster, but it seems that a truly fair comparison of solutions to the
stated problem should have included equivalent moves in the TRE and TR
solutions.

-- Art C.




Re: Long translate (TR) instruction?

2008-03-27 Thread William H. Blair
Art Celestini wrote:

 I'm convinced that TRE and TR are faster but it seems that 
 a truly fair comparison of solutions to the stated problem 
 should have included equivalent moves in the TRE and TR 
 solutions. 

I did write and run versions with the code like that. And, I 
said so:

|   I've got code that needs to translate stuff
| in a buffer and it does not need it moved. And I have other
| code that first moves it and then translates it, because it
| doesn't want to clobber what it's translating. But, I did it
| both ways, just to find out for sure if it made a difference.
| It does not.

But since you asked, I added those into the mix, so you can 
see and judge for yourself:

 TIME (IN SECONDS) FOR 001,000,000 REPETITIONS OF:
 ---------------------------------------------------------------
 --BYTES-   NO TR(E)   TRE INPL    TRE MVC   TR  INPL    TR  MVC
 --------  ---------  ---------  ---------  ---------  ---------
 0800      14.939655   1.245189   1.642310   1.082476   1.236875
 0400       7.162529   0.731567   0.971124   0.487941   0.580783
 0200       3.593004   0.461754   0.673962   0.206117   0.268123
 0100       1.802772   0.253433   0.342846   0.032038   0.050725
 00C0       1.355390   0.240958   0.311724   0.031969   0.048488
 0080       0.909253   0.210573   0.276103   0.031942   0.046119
 0040       0.463195   0.150320   0.164585   0.032047   0.043604
 0020       0.238923   0.101492   0.113927   0.032032   0.042417
 001E       0.225827   0.111231   0.122245   0.032019   0.042544
 001C       0.210944   0.110432   0.122021   0.031966   0.042432
 001A       0.197080   0.110823   0.122119   0.031953   0.042508
 0018       0.183400   0.104318   0.116599   0.031982   0.042673
 0016       0.169207   0.099349   0.110853   0.031980   0.042465
 0014       0.155477   0.100393   0.109962   0.032081   0.042704
 0012       0.141733   0.099860   0.111362   0.031961   0.042495
 0010       0.127308   0.070471   0.083389   0.031962   0.041866
 000E       0.113336   0.074843   0.086993   0.031981   0.041867
 000C       0.099318   0.073958   0.086677   0.031962   0.041833
 000A       0.085462   0.074848   0.086733   0.032057   0.041985
 0008       0.071609   0.069932   0.081476   0.030228   0.038990
 0007       0.064623   0.058755   0.068647   0.030245   0.039025
 0006       0.057541   0.058729   0.068720   0.030278   0.038971
 0005       0.050582   0.058701   0.068568   0.030230   0.038931
 0004       0.043603   0.058764   0.068620   0.030246   0.039029
 0003       0.036664   0.058748   0.068683   0.030220   0.038934
 0002       0.029665   0.058824   0.068732   0.030386   0.039100
 0001       0.022716   0.059113   0.069109   0.029829   0.038662
            0.005250   0.016894   0.005825   0.005239   0.005835

TESTNAME  DESCRIPTION
--------  --------------------------------------------
NO TR(E)  Basic move and translate, one byte at a time
TRE INPL  TRE loop in-place
TRE MVC   TRE loop buffer-to-buffer move first
TR  INPL  TR  loop in-place
TR  MVC   TR  loop buffer-to-buffer move first

TR is always faster than TRE. Having to move the data
from an input buffer to a separate output buffer for
translation increases the CPU time required by ~15%.

That is still way less than the overhead of the basic
move and translate, which is the fastest technique 
only for 0, 1, 2, or 3 bytes (for more than 3 bytes,
the basic TR loop, or even the TR loop with the data
to be translated having to be moved to the output 
buffer first, is fastest).

The above figures include the equivalent moves to make 
it a 'truly fair comparison of solutions to the stated 
problem'. It reflects what I have always observed about
such tests: a well-coded, basic, tight MVC loop (or an
MVCL) is pretty fast compared to almost anything else
that involves a half-dozen or so instructions that do
virtually anything. Thus, counting the CPU time that is
required to move the data to a separate buffer as part
of the overhead doesn't actually add that much to the
CPU time required to get the whole job done. 

I suspect that this is simply due to the fact that MVC
and MVCL are already pretty well-optimized for the job
they do. Even a basic, tight loop will be limited by 
some performance constraint, probably by the rate at
which instructions whose execution cannot be overlapped
can be pumped through the machine (in contrast to blobs 
of data MVCing and TR[T]ing thru the wires all as part 
of one instruction).

Today, for all intents and purposes, the time required 
to execute any given standard instruction is the same 
as any other. This is because the work to be done can be
done in the available time, before another instruction
is fetched and shoved through the internal machinery.
The instructions which process more than a word or two
of data take longer, of course. But some of those are
very highly optimized in hardware. For example, the
LM and STM instructions are no longer pigs. They are,
in fact, fairly effective substitutes for MVC, except
that you toast the contents of several registers when
you use enough to make it worthwhile.

Thus, optimization in our 

Re: Long translate (TR) instruction?

2008-03-27 Thread Art Celestini
William:

Thanks (again).  I found these results most interesting. 

Art



Re: Long translate (TR) instruction?

2008-03-26 Thread William H. Blair
Edward Jaffe wrote:

 The following fragment should work if you prefer looping 
 TRE over traditional TR. TRE requires you to manually 
 translate the so-called stop character with an MVC. 
 But, at least there's no EXecute for the final segment.

   LM    R14,R15,xx          Load string ptr and its length
   LA    R1,xx               Ptr to translation table
   XR    R0,R0               Set stop char = x'00'
   DO INF                    Do for translate
     TRE   R14,R1            Translate the string
     DOEXIT Z                Exit if no more data
     IF O                    If iterate needed
       ITERATE ,             Process another segment
     ENDIF ,                 EndIf
     MVC   0(1,R14),0(R1)    Translate x'00' to whatever
     LA    R14,1(,R14)       Advance past stop character
     AHI   R15,-1            Decrement length remaining
     DOEXIT NP               Exit if no more data
   ENDDO ,                   EndDo for translate

Art Celestini wrote:

 It seems that the TRE instruction has been in z/Arch for at 
 least a few years.  If anyone is inclined to try this:
 
       XR   R1,R1               Clear for insert
       L    R15,Length          Load string length
 Loop  IC   R1,Input-1(R15)     Get input byte
       IC   R0,XlatTab(R1)      Get translated character ...
       STC  R0,Output-1(R15)    ... and store it in output
       BCT  R15,Loop            Decrement length, loop until done
 
 it would be interesting to see how it fares against 
 Ed Jaffe's code.

I did this, since I had a program I could just plug these code
segments into without doing a lot of work.  Results are below.

 I believe the OP said that the data to be translated had to 
 first be moved from one buffer to another.  The above does 
 that, but a move of some type needs to be added to Ed's code 
 to make it a true comparison.

Maybe, maybe not. I've got code that needs to translate stuff
in a buffer and it does not need it moved. And I have other
code that first moves it and then translates it, because it
doesn't want to clobber what it's translating. But, I did it
both ways, just to find out for sure if it made a difference.
It does not. The TRE loop is so much faster for any substantial
number of bytes (which I define as more than 256, since that
number or less can be handled directly, inline, simply by using 
the TR instruction) that the overhead of even a MVCL does not
even begin to eat into the gain by using a TRE loop. So, the
fact that with a TRE loop subroutine or macro you might whip
up you first have to move the data to be translated if you do
not want the original data clobbered is simply not relevant
from a performance perspective. Since there is no use for the 
non-TRE loop subroutine (because its performance is horrible
for any substantial number of bytes), we are left with the TRE
or TR subroutines, which translate the data directly in the 
buffer provided, which is what most programmers would want to 
have available to call most of the time anyway, IMHO. If not, 
then they would first have had to move the data to some other 
buffer before TR'ing it anyway.

As you will see below, the TRE loop was faster for me when I
gave it more than 7 to 19 bytes. I'd never give it that few
since for anything <= 256 I'd just code a TR inline. But if
I didn't know how many bytes, then you can see that there is
plenty of CPU time left to test for 256 or less and do a TR
inline if so, or else call the TR[E] subroutine if I had more
than 256.  Regardless, an ordinary TR loop is still faster 
than using TRE. But this is what you would expect. The TR
loop code is not any more complicated than the TRE loop code
in the first place. It's just different. TRE does not replace
TR. It's for another purpose, basically, not for performance.

I revised the code above to suit my own personal taste and
needs. I made an improvement in the TRE subroutine proposed 
by Edward Jaffe to allow the caller to specify the test 
character, so that performance will not suffer if the data 
to be translated contains a lot of null bytes (as Ed's would). 
That meant that the MVC had to become an IC + STC. 
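
To illustrate why a caller-chosen stop character rules out the one-byte
MVC: MVC 0(1,R14),0(R1) always copies table entry 0, which is correct
only when the stop character is X'00'. For any other value the table has
to be indexed by the byte itself. The lines below are a minimal sketch of
that step, not the subroutine that was timed. They assume TRE has just
ended with condition code 1, so that R14 points at the stop byte, R15
holds the remaining length and R1 still addresses the table; R2 as a free
work register is also an assumption.

```hlasm
*        After TRE ends with CC=1, R14 -> the stop byte in the buffer
         XR    R2,R2              Clear work register (assumed free)
         IC    R2,0(,R14)         Get the stop byte itself
         IC    R2,0(R2,R1)        Index the table with that byte
         STC   R2,0(,R14)         Store its translation in place
         LA    R14,1(,R14)        Advance past the stop character
         AHI   R15,-1             Decrement length remaining
```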

Here is the code for the subroutines I called repeatedly to
gather the timing figures: 

**--------------------------------------------------------------------
**
** NOTE: ENTER VIA  BAS   R8,NOTR  WITH REGS SET AS FOLLOWS:
**   R14 = INPUT BUFFER ADDRESS
**   R15 = OUTPUT BUFFER ADDRESS
**   R0  = LENGTH OF BOTH INPUT AND OUTPUT BUFFER (MAY BE ZERO)
**   R1  = 256-BYTE TRANSLATE TABLE ADDRESS
**
**--------------------------------------------------------------------
NOTR     LTR   R2,R0              COPY LENGTH AND TEST FOR ZERO
 

Re: Long translate (TR) instruction?

2008-03-26 Thread Shmuel Metz (Seymour J.)
In [EMAIL PROTECTED], on 03/24/2008
   at 12:30 PM, Kirk Wolf [EMAIL PROTECTED] said:

>I'm looking for the fastest way in assembler to translate data in one
>buffer to another using a 256-byte translate table.

The fastest way on one model may not be the fastest way on another model.

>Any advice on the fastest instruction path to do this would be
>appreciated.

Time several approaches on the specific box you're targeting.
 
-- 
 Shmuel (Seymour J.) Metz, SysProg and JOAT
 ISO position; see http://patriot.net/~shmuel/resume/brief.html 
We don't care. We don't have to care, we're Congress.
(S877: The Shut up and Eat Your spam act of 2003)

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html


Re: Long translate (TR) instruction?

2008-03-26 Thread William H. Blair
Kirk Wolf said:

 I'm looking for the fastest way in assembler to 
 translate data in one buffer to another using a 
 256-byte translate table.

Want my test program to help you decide? Let me know.
But don't waste your time. I already know the answer.

Look at my TR subroutine in a previous post as a place
to get started (if you need that).

Shmuel Metz (Seymour J.) said:

 The fastest way on one model may not be the fastest 
 way on another model.

True.  But -- I just knew you were expecting a but -- I
have been looking at this off and on for about 8 years,
and have had access to most (if not all) models of zXXX
hardware (currently I have access to a 2094, 2096, 2086
and a 2066). I have NEVER found an instruction sequence
that would run faster than a simple old-fashioned TR[T] 
(or MVC or CLC) loop on ANY z model machine - except an 
MVCL or CLCL for a very large number of bytes.  Since
very little code like this is on a performance-critical
path, I mostly just use whatever is convenient; in such
a case it does not really matter. If I believe the code
is on a performance-critical path I'll use a subroutine
that does it the old-fashioned way (TR/TRT/CLC/MVC loop
or whatever), unless I have special knowledge that lots
of bytes (more than 4KB) need to be MVCed/CLCed.  Thus,
if Mr. Wolf currently has a z box (Duh!) I can tell him
that the answer to that question -- TODAY -- is just do
an old-fashioned MVC loop (or an MVCL) to move the data
to the buffer where one will need it after translation, 
and then use an old-fashioned TR loop to actually do it
in that (output) buffer. On any z box that exists today
that is the fastest way. And I bet it stays that way in
the future, probably forever. Why? There is very little
that microcode/millicode can do faster than the current
raw, basic machine can do with these fundamental S/360-
era instructions. The same basic internal operations to
get the job done have to be done in each instance so it
does not matter whether the orders are coming from code
or millicode/microcode. Now, if the machine offered the
TR[T]L instructions, then probably -- just as it is the
case for MVCL and CLCL -- those would run just a little
faster than an old-fashioned basic TR[T] loop, but only
for large numbers of bytes. But we don't have TR[T]L so 
the System/360 instructions are still the fastest way. 
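
The move-then-translate approach described above could look roughly like
the following. This is a sketch under stated assumptions, not the
subroutine that produced the timings: the entry registers mirror the NOTE
in the earlier post (R14 = input, R15 = output, R0 = length, R1 = table,
return via R8), while the labels XLATE, XLLOOP, XLLAST, XLDONE, XLMVC and
XLTR and the use of R2 as a work register are made up for illustration.
It also assumes the two buffers do not overlap and that a base register
covers XLMVC and XLTR for the EXecutes.

```hlasm
XLATE    LTR   R2,R0              Copy length and test for zero
         JNP   XLDONE             Nothing to do
XLLOOP   CHI   R2,256             Full 256-byte chunk left?
         JL    XLLAST             No, handle the remainder
         MVC   0(256,R15),0(R14)  Move chunk to output buffer
         TR    0(256,R15),0(R1)   Translate it there
         LA    R14,256(,R14)      Advance input pointer
         LA    R15,256(,R15)      Advance output pointer
         AHI   R2,-256            Reduce remaining length
         JP    XLLOOP             Loop while bytes remain
         J     XLDONE             Length was a multiple of 256
XLLAST   BCTR  R2,0               Length minus 1 for EXecute
         EX    R2,XLMVC           Move final 1 to 255 bytes
         EX    R2,XLTR            Translate them
XLDONE   BR    R8                 Return to caller
XLMVC    MVC   0(0,R15),0(R14)    Executed only
XLTR     TR    0(0,R15),0(R1)     Executed only
```

Moving and translating chunk by chunk keeps each 256-byte block warm
between the MVC and the TR; a single MVCL up front followed by a TR-only
loop is the other obvious arrangement.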

--
WB



Re: Long translate (TR) instruction?

2008-03-25 Thread (IBM Mainframe Discussion List)
 
 
In a message dated 3/24/2008 2:10:15 P.M. Central Daylight Time,
[EMAIL PROTECTED] writes:
>Even if the z10 offered a Translate Extended instruction, the OP couldn't
>count on it being there on every Customer's machine for quite a while.

The OP can use dual paths.  If executing on machines without the newer
instruction, then use TR; if the newer instruction is available, then use
it.  But don't put the test inside the loop.
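
One way to keep the test out of the loop is to make the choice once and
save the address of the selected routine. The fragment below is only a
sketch: NEWOK is a hypothetical flag byte set by whatever facility check
the program already performs, and XLATTR, XLATNEW, SETRTN and XLATRTN are
hypothetical labels. It assumes a base register covers the storage
operands.

```hlasm
*        One-time setup: pick a routine and remember its address
         LA    R15,XLATTR         Default: plain TR loop routine
         TM    NEWOK,X'80'        Newer instruction available?
         JNO   SETRTN             No, keep the default
         LA    R15,XLATNEW        Yes, use the alternate routine
SETRTN   ST    R15,XLATRTN        Save the choice for later calls
*        ...
*        Each call: no test here, just go to the chosen routine
         L     R15,XLATRTN        Load saved routine address
         BASR  R14,R15            Call it
*        ...
XLATRTN  DS    A                  Address of selected routine
NEWOK    DS    X                  Hypothetical availability flag
```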
 
Nor is there any guarantee that IBM won't redesign the internals of whatever
today is the fastest way to do something so that on a future processor it is
slower, as in changing microcode into millicode.
 
Bill  Fairchild
Rocket Software







Re: Long translate (TR) instruction?

2008-03-25 Thread Art Celestini
It seems that the TRE instruction has been in z/Arch for at least a few
years.  If anyone is inclined to try this, it would be interesting to see
how it fares against Ed Jaffe's code:

      XR   R1,R1               Clear for insert
      L    R15,Length          Load string length
Loop  IC   R1,Input-1(R15)     Get input byte
      IC   R0,XlatTab(R1)      Get translated character ...
      STC  R0,Output-1(R15)    ... and store it in output
      BCT  R15,Loop            Decrement length, loop until done

I believe the OP said that the data to be translated had to first be moved
from one buffer to another.  The above does that, but a move of some type
needs to be added to Ed's code to make it a true comparison.

--Art C.


At 03:42 PM 3/24/2008, Edward Jaffe wrote:
  
McKown, John wrote:
I don't think you have a choice, in the general case. That is because
all the new TRxx type instructions seem to terminate when the data in
your buffer is equal to the contents of the low-order byte of general
register 0. I.e., they stop at an end-of-buffer type character, like a
null in a C string. If you can tolerate this behaviour, then I'd look
at the TRE or TROO instruction. The TRE seems easier to use, to me.
  

The following fragment should work if you prefer looping TRE over traditional 
TR. TRE requires you to manually translate the so-called stop character with 
an MVC. But, at least there's no EXecute for the final segment.

   LM    R14,R15,xx          Load string ptr and its length
   LA    R1,xx               Ptr to translation table
   XR    R0,R0               Set stop char = x'00'
   DO INF                    Do for translate
     TRE   R14,R1            Translate the string
     DOEXIT Z                Exit if no more data
     IF O                    If iterate needed
       ITERATE ,             Process another segment
     ENDIF ,                 EndIf
     MVC   0(1,R14),0(R1)    Translate x'00' to whatever
     LA    R14,1(,R14)       Advance past stop character
     AHI   R15,-1            Decrement length remaining
     DOEXIT NP               Exit if no more data
   ENDDO ,                   EndDo for translate



==
Art Celestini   Celestini Development Services
Phone: 201-670-1674Wyckoff, NJ
=  http://celestini.com  =
Mail sent to the From address  used in this post
will be rejected by our server.   Please send off-
list email to:  ibmmainat-signcelestinidotcom.
==



Re: Long translate (TR) instruction?

2008-03-25 Thread Edward Jaffe

Art Celestini wrote:

It seems that the TRE instruction has been in z/Arch for at least a few
years.  If anyone is inclined to try this, it would be interesting to see
how it fares against Ed Jaffe's code:

      XR   R1,R1               Clear for insert
      L    R15,Length          Load string length
Loop  IC   R1,Input-1(R15)     Get input byte
      IC   R0,XlatTab(R1)      Get translated character ...
      STC  R0,Output-1(R15)    ... and store it in output
      BCT  R15,Loop            Decrement length, loop until done

I believe the OP said that the data to be translated had to first be moved
from one buffer to another.  The above does that, but a move of some type
needs to be added to Ed's code to make it a true comparison.
  


Some years ago, on our z800 processor, we measured the performance of 
(in-place) TR against a software-coded loop. We found that the loop was 
faster than TR for strings shorter than nine (9) bytes in length. When 
we spoke to IBM about this, we learned that TR had been partially moved 
into millicode for the z900/z800. It ran slower for short strings 
because of the millicode start/stop (aka subroutine linkage) costs. 
For strings longer than nine bytes, TR was faster because it had access 
to a hardware facility that could translate two bytes per cycle. The 
code fragments we compared were:


CASE1    DC    0H
         LA    R2,9
         LA    R3,DATA
         XR    R4,R4
CASE1L1  DS    0H
         IC    R4,0(,R3)
         IC    R4,EBCDIC(R4)
         STC   R4,0(,R3)
         AHI   R4,1
         AHI   R3,1
         JCT   R2,CASE1L1
CASE1L   EQU   *-CASE1


CASE2    DC    0H
         TR    DATA(9),EBCDIC
CASE2L   EQU   *-CASE2


We later unrolled the loop, interleaving the use of three different 
registers, and found it was now faster than TR for strings of 24 bytes 
or fewer!


Stride   EQU   3
CASE1    DC    0H
         LA    R0,9/Stride
         LA    R3,DATA
         XR    R4,R4
         XR    R5,R5
         XR    R6,R6
CASE1L1  DS    0H
         IC    R4,0(,R3)
         IC    R5,1(,R3)
         IC    R6,2(,R3)
         IC    R4,EBCDIC(R4)
         IC    R5,EBCDIC(R5)
         IC    R6,EBCDIC(R6)
         STC   R4,0(,R3)
         STC   R5,1(,R3)
         STC   R6,2(,R3)
         AHI   R3,Stride
         JCT   R0,CASE1L1
CASE1L   EQU   *-CASE1


The results of the above experiments suggest that your loop has an 
excellent chance of being faster than *any* sequence involving TR or 
TRE, for strings shorter than some number of bytes 'n', on any given 
hardware generation supporting z/Architecture.
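
A length-based dispatch built on that observation might look like the
sketch below. It is illustrative only: SHORTLEN is a hypothetical cutoff
(9 simply echoes the z800 figure above and would need re-measuring on any
other machine), the labels BYTELP, USETR, DONE and TRINST are made up, and
the entry conditions (R3 -> data translated in place, R2 = length of at
most 256) are assumptions. DATA-style in-place translation and the EBCDIC
table name follow the fragments above; a base register is assumed to
cover EBCDIC and TRINST.

```hlasm
SHORTLEN EQU   9                  Hypothetical cutoff, tune by timing
         LTR   R2,R2              Anything to translate?
         JNP   DONE               No
         CHI   R2,SHORTLEN        Shorter than the cutoff?
         JNL   USETR              No, let TR handle it
         XR    R4,R4              Clear index register
BYTELP   IC    R4,0(,R3)          Get input byte
         IC    R4,EBCDIC(R4)      Get translated byte
         STC   R4,0(,R3)          Store it back in place
         LA    R3,1(,R3)          Point to next byte
         JCT   R2,BYTELP          Loop until count exhausted
         J     DONE
USETR    BCTR  R2,0               Length minus 1 for EXecute
         EX    R2,TRINST          Translate (length <= 256 assumed)
DONE     DS    0H
*        ... remainder of the program; TRINST sits out of line ...
TRINST   TR    0(0,R3),EBCDIC     Executed only
```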


--
Edward E Jaffe
Phoenix Software International, Inc
5200 W Century Blvd, Suite 800
Los Angeles, CA 90045
310-338-0400 x318
[EMAIL PROTECTED]
http://www.phoenixsoftware.com/



Long translate (TR) instruction?

2008-03-24 Thread Kirk Wolf
Hi,

I'm looking for the fastest way in assembler to translate data in one buffer
to another using a 256-byte translate table.
The TR instruction is only up to 256 bytes, and I can't figure out if one of
the newer instructions is a replacement for arbitrary length translations,
or if the best approach is just to loop for 256 byte chunks.  The average
length transaction is almost certainly less than 256 bytes.

Any advice on the fastest instruction path to do this would be appreciated.

Thanks,
Kirk Wolf
Dovetailed Technologies



Re: Long translate (TR) instruction?

2008-03-24 Thread john gilmore
Look at the TRTE, TRanslate and Test Extended, instruction on pp. 7-231ff of 
the current PROP.

Looping is still required when a condition-code value of 3 is set, but only a 
branch back to the same, already executed TRTE is required to accomplish it.   
In particular, there is no requirement for a running count of the number of 
bytes that remain to be translated. 

John Gilmore
Ashland, MA 01721-1817
USA




Re: Long translate (TR) instruction?

2008-03-24 Thread Art Celestini
TRT[E] is not the same as TR.  It *stops* if the translated-to byte is 
non-zero.

It's hard to say whether a TR loop or just an open code loop would be better 
on the current in-use crop of hardware.  If CPU performance is extremely 
important, I'd do some experimenting.  Even if the z10 offered a Translate 
Extended instruction, the OP couldn't count on it being there on every 
Customer's machine for quite a while.  It sounded like the data had to be moved 
in addition to translated, so an open code solution might handle both the 
move and the translate (one byte at a time).  

--Art C.




==
Art Celestini          Celestini Development Services
Phone: 201-670-1674    Wyckoff, NJ
=  http://celestini.com  =
Mail sent to the From address  used in this post
will be rejected by our server.   Please send off-
list email to:  ibmmainat-signcelestinidotcom.
==



Re: Long translate (TR) instruction?

2008-03-24 Thread McKown, John
 -Original Message-
 From: IBM Mainframe Discussion List 
 [mailto:[EMAIL PROTECTED] On Behalf Of Kirk Wolf
 Sent: Monday, March 24, 2008 12:31 PM
 To: IBM-MAIN@bama.ua.edu
 Subject: Long translate (TR) instruction?
 
 
 Hi,
 
 I'm looking for the fastest way in assembler to translate data in one buffer
 to another using a 256-byte translate table.
 The TR instruction is only up to 256 bytes, and I can't figure out if one of
 the newer instructions is a replacement for arbitrary length translations,
 or if the best approach is just to loop for 256 byte chunks.  The average
 length transaction is almost certainly less than 256 bytes.
 
 Any advice on the fastest instruction path to do this would be appreciated.
 
 Thanks,
 Kirk Wolf
 Dovetailed Technologies

I don't think you have a choice, in the general case. That is because
all the new TRxx type instructions seem to terminate when a byte in
your buffer equals the contents of the low-order byte of general
register 0. I.e. they stop at an end-of-buffer type character, like a
null in a C string. If you can tolerate this behaviour, then I'd look
at the TRE or TROO instruction. The TRE seems easier to use, to me.


--
John McKown
Senior Systems Programmer
HealthMarkets
Keeping the Promise of Affordable Coverage
Administrative Services Group
Information Technology



Re: Long translate (TR) instruction?

2008-03-24 Thread Edward Jaffe

McKown, John wrote:

I don't think you have a choice, in the general case. That is because
all the new TRxx type instructions seem to terminate when a byte in
your buffer equals the contents of the low-order byte of general
register 0. I.e. they stop at an end-of-buffer type character, like a
null in a C string. If you can tolerate this behaviour, then I'd look
at the TRE or TROO instruction. The TRE seems easier to use, to me.
  


The following fragment should work if you prefer looping TRE over 
traditional TR. TRE requires you to manually translate the so-called 
stop character with an MVC. But, at least there's no EXecute for the 
final segment.


   LM    R14,R15,xx            Load string ptr and its length
   LA    R1,xx                 Ptr to translation table
   XR    R0,R0                 Set stop char = x'00'
   DO INF                      Do for translate
     TRE   R14,R1                Translate the string
     DOEXIT Z                    Exit if no more data
     IF O                        If iterate needed
       ITERATE ,                   Process another segment
     ENDIF ,                     EndIf
     MVC   0(1,R14),0(R1)        Translate x'00' to whatever
     LA    R14,1(,R14)           Advance past stop character
     AHI   R15,-1                Decrement length remaining
     DOEXIT NP                   Exit if no more data
   ENDDO ,                     EndDo for translate
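
The control flow of that fragment may be easier to follow in a small C model. The sketch below is not the instruction itself: tre_model is a hypothetical stand-in that mimics the three outcomes the loop tests for (condition code 0 when the string is exhausted, 1 when the stop character is found, 3 when the CPU gives up part-way), and the 4096-byte per-call limit is an arbitrary assumption standing in for the CPU-determined amount.

```c
#include <stddef.h>

enum { CC_DONE = 0, CC_STOP = 1, CC_PARTIAL = 3 };

/* Hypothetical model of one TRE execution.  Translates bytes in place,
   advancing *addr and reducing *len, until the data runs out (CC 0),
   the current byte equals the stop character (CC 1, byte left
   untranslated), or max_per_call bytes have been handled (CC 3).
   max_per_call must be greater than zero. */
static int tre_model(unsigned char **addr, size_t *len,
                     const unsigned char table[256],
                     unsigned char stop, size_t max_per_call)
{
    size_t done = 0;
    while (*len > 0) {
        if (**addr == stop)
            return CC_STOP;
        if (done == max_per_call)
            return CC_PARTIAL;
        **addr = table[**addr];
        ++*addr;
        --*len;
        ++done;
    }
    return CC_DONE;
}

/* Driver loop mirroring the assembler fragment line by line. */
static void translate_in_place(unsigned char *buf, size_t len,
                               const unsigned char table[256])
{
    const unsigned char stop = 0x00;            /* XR R0,R0            */
    for (;;) {                                  /* DO INF              */
        int cc = tre_model(&buf, &len, table, stop, 4096);
        if (cc == CC_DONE)                      /* DOEXIT Z            */
            break;
        if (cc == CC_PARTIAL)                   /* IF O ... ITERATE    */
            continue;
        *buf = table[stop];                     /* MVC 0(1,R14),0(R1)  */
        buf++;                                  /* LA  R14,1(,R14)     */
        if (--len == 0)                         /* AHI R15,-1          */
            break;                              /* DOEXIT NP           */
    }
}
```

The point of the model is the stop-character case: the instruction leaves that byte untranslated, so the loop has to translate it by hand (table entry 0, since the stop character is x'00') and step past it before resuming.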

--
Edward E Jaffe
Phoenix Software International, Inc
5200 W Century Blvd, Suite 800
Los Angeles, CA 90045
310-338-0400 x318
[EMAIL PROTECTED]
http://www.phoenixsoftware.com/



Re: Long translate (TR) instruction?

2008-03-24 Thread Kirk Wolf
Thanks everyone for all the advice.

Kirk

On Mon, Mar 24, 2008 at 2:42 PM, Edward Jaffe [EMAIL PROTECTED]
wrote:

 McKown, John wrote:
  I don't think you have a choice, in the general case. That is because
  all the new TRxx type instructions seem to terminate when a byte in
  your buffer equals the contents of the low-order byte of general
  register 0. I.e. they stop at an end-of-buffer type character, like a
  null in a C string. If you can tolerate this behaviour, then I'd look
  at the TRE or TROO instruction. The TRE seems easier to use, to me.
 

 The following fragment should work if you prefer looping TRE over
 traditional TR. TRE requires you to manually translate the so-called
 stop character with an MVC. But, at least there's no EXecute for the
 final segment.

    LM    R14,R15,xx            Load string ptr and its length
    LA    R1,xx                 Ptr to translation table
    XR    R0,R0                 Set stop char = x'00'
    DO INF                      Do for translate
      TRE   R14,R1                Translate the string
      DOEXIT Z                    Exit if no more data
      IF O                        If iterate needed
        ITERATE ,                   Process another segment
      ENDIF ,                     EndIf
      MVC   0(1,R14),0(R1)        Translate x'00' to whatever
      LA    R14,1(,R14)           Advance past stop character
      AHI   R15,-1                Decrement length remaining
      DOEXIT NP                   Exit if no more data
    ENDDO ,                     EndDo for translate

 --
 Edward E Jaffe
 Phoenix Software International, Inc
 5200 W Century Blvd, Suite 800
 Los Angeles, CA 90045
 310-338-0400 x318
 [EMAIL PROTECTED]
 http://www.phoenixsoftware.com/


