Re: Long translate (TR) instruction?

William H. Blair Thu, 27 Mar 2008 20:19:58 -0700

Art Celestini wrote:

> I'm convinced that TRE and TR are faster but it seems that 
> a truly "fair" comparison of solutions to the stated problem 
> should have included equivalent moves in the TRE and TR 
> solutions.


I did write and run versions with the code like that. And, I 
said so:

|                   I've got code that needs to translate stuff
| in a buffer and it does not need it moved. And I have other
| code that first moves it and then translates it, because it
| doesn't want to clobber what it's translating. But, I did it
| both ways, just to find out for sure if it made a difference.
| It does not.

But since you asked, I added those into the mix, so you can 
see and judge for yourself:

         TIME (IN SECONDS) FOR 001,000,000 REPETITIONS OF:
         -------------------------------------------------        
--BYTES-  NO TR(E)  TRE INPL  TRE MVC   TR  INPL  TR  MVC  
======== --------- --------- --------- --------- --------- 
00000800 14.939655  1.245189  1.642310  1.082476  1.236875 
00000400  7.162529  0.731567  0.971124  0.487941  0.580783 
00000200  3.593004  0.461754  0.673962  0.206117  0.268123 
00000100  1.802772  0.253433  0.342846  0.032038  0.050725 
000000C0  1.355390  0.240958  0.311724  0.031969  0.048488 
00000080  0.909253  0.210573  0.276103  0.031942  0.046119 
00000040  0.463195  0.150320  0.164585  0.032047  0.043604 
00000020  0.238923  0.101492  0.113927  0.032032  0.042417 
0000001E  0.225827  0.111231  0.122245  0.032019  0.042544 
0000001C  0.210944  0.110432  0.122021  0.031966  0.042432 
0000001A  0.197080  0.110823  0.122119  0.031953  0.042508 
00000018  0.183400  0.104318  0.116599  0.031982  0.042673 
00000016  0.169207  0.099349  0.110853  0.031980  0.042465 
00000014  0.155477  0.100393  0.109962  0.032081  0.042704 
00000012  0.141733  0.099860  0.111362  0.031961  0.042495 
00000010  0.127308  0.070471  0.083389  0.031962  0.041866 
0000000E  0.113336  0.074843  0.086993  0.031981  0.041867 
0000000C  0.099318  0.073958  0.086677  0.031962  0.041833 
0000000A  0.085462  0.074848  0.086733  0.032057  0.041985 
00000008  0.071609  0.069932  0.081476  0.030228  0.038990 
00000007  0.064623  0.058755  0.068647  0.030245  0.039025 
00000006  0.057541  0.058729  0.068720  0.030278  0.038971 
00000005  0.050582  0.058701  0.068568  0.030230  0.038931 
00000004  0.043603  0.058764  0.068620  0.030246  0.039029 
00000003  0.036664  0.058748  0.068683  0.030220  0.038934 
00000002  0.029665  0.058824  0.068732  0.030386  0.039100 
00000001  0.022716  0.059113  0.069109  0.029829  0.038662 
00000000  0.005250  0.016894  0.005825  0.005239  0.005835 

TESTNAME  DESCRIPTION
--------  --------------------------------------------
NO TR(E)  Basic move and translate, one byte at a time
TRE INPL  TRE loop in-place
TRE MVC   TRE loop buffer-to-buffer move first
TR  INPL  TR  loop in-place
TR  MVC   TR  loop buffer-to-buffer move first  
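
For anyone without the test code in front of them, the
variants differ roughly as sketched below. This is Python
rather than assembler, and only a sketch: the function
names and the sample table are made up for illustration,
bytes.translate stands in for the TR instruction, and the
256-byte chunking mirrors the fact that one TR or MVC
handles at most 256 bytes. TRE is not modeled separately.

```python
# Hypothetical sketch of the measured variants; names are illustrative only.

CHUNK = 256  # one TR or MVC instruction handles at most 256 bytes


def make_table():
    """Build a sample 256-byte translate table: ASCII lowercase to uppercase."""
    table = bytearray(range(256))
    for c in range(ord("a"), ord("z") + 1):
        table[c] = c - 32
    return bytes(table)


def basic_move_and_translate(src, table):
    """NO TR(E): move and translate one byte at a time."""
    dst = bytearray(len(src))
    for i in range(len(src)):
        dst[i] = table[src[i]]
    return dst


def translate_in_place(buf, table):
    """TR INPL: translate a bytearray where it sits, one chunk per step."""
    for off in range(0, len(buf), CHUNK):
        buf[off:off + CHUNK] = buf[off:off + CHUNK].translate(table)


def move_then_translate(src, table):
    """TR MVC: copy to a separate output buffer first, then translate the copy."""
    dst = bytearray(len(src))
    for off in range(0, len(src), CHUNK):
        dst[off:off + CHUNK] = src[off:off + CHUNK]  # the MVC step
    translate_in_place(dst, table)                    # the TR step
    return dst
```

All three produce the same translated bytes; only the
in-place version clobbers its input.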

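TR is always faster than TRE. For the TR loop, having to
move the data from an input buffer to a separate output
buffer for translation increases the CPU time required
by ~15% at the largest size tested (2K bytes). At 256
bytes or less the percentage is larger, but the absolute
cost of the move is only roughly 0.01 to 0.02 seconds
per million repetitions.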

That is still way less than the overhead of the basic
move and translate, which is the fastest technique 
only for 0, 1, 2, or 3 bytes (for more than 3 bytes,
the basic TR loop, or even the TR loop with the data
to be translated having to be moved to the output 
buffer first, is fastest).
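
The comparison above can be checked against the table
directly. A small sketch follows; the rows are copied
from the table, while the helper names are mine and
purely illustrative.

```python
# Selected rows copied from the table above (times in seconds per
# 1,000,000 repetitions); the helper names are illustrative only.
ROWS = {
    # bytes: (NO TR(E), TR INPL, TR MVC)
    0x800: (14.939655, 1.082476, 1.236875),
    0x100: (1.802772, 0.032038, 0.050725),
    0x004: (0.043603, 0.030246, 0.039029),
    0x003: (0.036664, 0.030220, 0.038934),
    0x002: (0.029665, 0.030386, 0.039100),
    0x001: (0.022716, 0.029829, 0.038662),
}


def move_overhead_pct(nbytes):
    """Extra CPU time of TR MVC over TR INPL, as a percentage."""
    _, tr_inpl, tr_mvc = ROWS[nbytes]
    return (tr_mvc / tr_inpl - 1.0) * 100.0


def faster_when_moving(nbytes):
    """Compare only the two variants that move the data."""
    basic, _, tr_mvc = ROWS[nbytes]
    return "NO TR(E)" if basic < tr_mvc else "TR MVC"


for nbytes in sorted(ROWS, reverse=True):
    print(f"{nbytes:#06x}  move overhead {move_overhead_pct(nbytes):5.1f}%  "
          f"faster with a move: {faster_when_moving(nbytes)}")
```

On these rows the byte-at-a-time loop beats the TR loop
with a move only through 3 bytes; from 4 bytes up the
TR loop wins even with the move included.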

The above figures include the "equivalent moves" to make 
it a 'truly "fair" comparison of solutions to the stated 
problem'. They reflect what I have always observed about
such tests: a well-coded, basic, tight MVC loop (or an
MVCL) is pretty fast compared to almost anything else
that involves a half-dozen or so instructions that do
virtually anything. Thus, counting the CPU time that is
required to move the data to a separate buffer as part
of the overhead doesn't actually add that much to the
CPU time required to get the whole job done. 

I suspect that this is simply due to the fact that MVC
and MVCL are already pretty well-optimized for the job
they do. Even a basic, tight loop will be limited by 
some performance constraint, probably by the rate at
which instructions whose execution cannot be overlapped
can be pumped through the machine (in contrast to blobs 
of data MVCing and TR[T]ing thru the wires all as part 
of one instruction).

Today, for all intents and purposes, the time required 
to execute any given standard instruction is the same 
as any other. This is because the work to be done can be
done in the available time, before another instruction
is fetched and shoved through the internal machinery.
The instructions which process more than a word or two 
of data take longer, of course. But some of those are
very highly optimized in hardware. For example, the
LM and STM instructions are no longer pigs. They are,
in fact, fairly effective substitutes for MVC, except
that you toast the contents of several registers when
you use enough to make it worthwhile.

Thus, optimization in our world today, except for the
scientific and numerical programmer, has become a job
of simply cutting down the number of instructions one
executes ... that is, simply cutting the path length.

And that is the simple reason why a basic TR loop,
even with an MVC/MVCL (just) ahead of it, is still 
the quickest way to get the job done, because fewer
instructions overall are executed.
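
To put rough numbers on the path-length argument: a TR
or MVC handles up to 256 bytes per execution, while the
byte-at-a-time loop runs its whole loop body for every
byte. The per-iteration instruction counts below are
assumptions picked for illustration, not counts taken
from the test code.

```python
import math

CHUNK = 256            # maximum bytes handled by one TR or MVC
BYTE_LOOP_INSNS = 5    # assumed instructions per byte in the basic loop
CHUNK_LOOP_INSNS = 4   # assumed per chunk: MVC, TR, pointer bump, branch


def byte_loop_path_length(nbytes):
    """Instructions executed by a one-byte-at-a-time move and translate."""
    return nbytes * BYTE_LOOP_INSNS


def tr_loop_path_length(nbytes):
    """Instructions executed by an MVC + TR loop over 256-byte chunks."""
    return math.ceil(nbytes / CHUNK) * CHUNK_LOOP_INSNS


for nbytes in (0x10, 0x100, 0x800):
    print(nbytes, byte_loop_path_length(nbytes), tr_loop_path_length(nbytes))
```

Under these assumptions a 2K-byte buffer costs 10,240
instructions one way and 32 the other. The measured gap
in the table is much smaller than that ratio, which fits
the point that instructions processing a lot of data
take longer than the rest.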

-- 
WB

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html
