William: Thanks (again). I found these results most interesting.
Art

At 11:18 PM 3/27/2008, William H. Blair wrote:
>Art Celestini wrote:
>
>> I'm convinced that TRE and TR are faster but it seems that
>> a truly "fair" comparison of solutions to the stated problem
>> should have included equivalent moves in the TRE and TR
>> solutions.
>
>I did write and run versions with the code like that. And, I
>said so:
>
>| I've got code that needs to translate stuff
>| in a buffer and it does not need it moved. And I have other
>| code that first moves it and then translates it, because it
>| doesn't want to clobber what it's translating. But, I did it
>| both ways, just to find out for sure if it made a difference.
>| It does not.
>
>But since you asked, I added those into the mix, so you can
>see and judge for yourself:
>
>    TIME (IN SECONDS) FOR 001,000,000 REPETITIONS OF:
>    -------------------------------------------------
>--BYTES-  NO TR(E)  TRE INPL   TRE MVC   TR INPL    TR MVC
>======== --------- --------- --------- --------- ---------
>00000800 14.939655  1.245189  1.642310  1.082476  1.236875
>00000400  7.162529  0.731567  0.971124  0.487941  0.580783
>00000200  3.593004  0.461754  0.673962  0.206117  0.268123
>00000100  1.802772  0.253433  0.342846  0.032038  0.050725
>000000C0  1.355390  0.240958  0.311724  0.031969  0.048488
>00000080  0.909253  0.210573  0.276103  0.031942  0.046119
>00000040  0.463195  0.150320  0.164585  0.032047  0.043604
>00000020  0.238923  0.101492  0.113927  0.032032  0.042417
>0000001E  0.225827  0.111231  0.122245  0.032019  0.042544
>0000001C  0.210944  0.110432  0.122021  0.031966  0.042432
>0000001A  0.197080  0.110823  0.122119  0.031953  0.042508
>00000018  0.183400  0.104318  0.116599  0.031982  0.042673
>00000016  0.169207  0.099349  0.110853  0.031980  0.042465
>00000014  0.155477  0.100393  0.109962  0.032081  0.042704
>00000012  0.141733  0.099860  0.111362  0.031961  0.042495
>00000010  0.127308  0.070471  0.083389  0.031962  0.041866
>0000000E  0.113336  0.074843  0.086993  0.031981  0.041867
>0000000C  0.099318  0.073958  0.086677  0.031962  0.041833
>0000000A  0.085462  0.074848  0.086733  0.032057  0.041985
>00000008  0.071609  0.069932  0.081476  0.030228  0.038990
>00000007  0.064623  0.058755  0.068647  0.030245  0.039025
>00000006  0.057541  0.058729  0.068720  0.030278  0.038971
>00000005  0.050582  0.058701  0.068568  0.030230  0.038931
>00000004  0.043603  0.058764  0.068620  0.030246  0.039029
>00000003  0.036664  0.058748  0.068683  0.030220  0.038934
>00000002  0.029665  0.058824  0.068732  0.030386  0.039100
>00000001  0.022716  0.059113  0.069109  0.029829  0.038662
>00000000  0.005250  0.016894  0.005825  0.005239  0.005835
>
>TESTNAME DESCRIPTION
>-------- --------------------------------------------
>NO TR(E) Basic move and translate, one byte at a time
>TRE INPL TRE loop in-place
>TRE MVC  TRE loop buffer-to-buffer move first
>TR INPL  TR loop in-place
>TR MVC   TR loop buffer-to-buffer move first
>
>TR is always faster than TRE. Having to move the data
>from an input buffer to a separate output buffer for
>translation increases the CPU time required by ~15%.
>
>That is still way less than the overhead of the basic
>move and translate, which is the fastest technique
>only for 0, 1, 2, or 3 bytes (for more than 3 bytes,
>the basic TR loop, or even the TR loop with the data
>to be translated having to be moved to the output
>buffer first, is fastest).
>
>The above figures include the "equivalent moves" to make
>it a 'truly "fair" comparison of solutions to the stated
>problem'. It reflects what I have always observed about
>such tests: a well-coded, basic, tight MVC loop (or an
>MVCL) is pretty fast compared to almost anything else
>that involves a half-dozen or so instructions that do
>virtually anything. Thus, counting the CPU time that is
>required to move the data to a separate buffer as part
>of the overhead doesn't actually add that much to the
>CPU time required to get the whole job done.
>
>I suspect that this is simply due to the fact that MVC
>and MVCL are already pretty well-optimized for the job
>they do. Even a basic, tight loop will be limited by
>some performance constraint, probably by the rate at
>which instructions whose execution cannot be overlapped
>can be pumped through the machine (in contrast to blobs
>of data MVCing and TR[T]ing thru the wires all as part
>of one instruction).
>
>Today, for all intents and purposes, the time required
>to execute any given standard instruction is the same
>as any other. This is because the work to be done can be
>done in the available time, before another instruction
>is fetched and shoved through the internal machinery.
>The instructions which process more than a word or two
>of data take longer, of course. But some of those are
>very highly optimized in hardware -- for example, the
>LM and STM instructions are no longer pigs. They are,
>in fact, fairly effective substitutes for MVC, except
>that you toast the contents of several registers when
>you use enough to make it worthwhile.
>
>Thus, optimization in our world today, except for the
>scientific and numerical programmer, has become a job
>of simply cutting down the number of instructions one
>executes ... that is, simply cutting the path length.
>
>And that is the simple reason why a basic TR loop,
>even with an MVC/MVCL (just) ahead of it, is still
>the quickest way to get the job done, because fewer
>instructions overall are executed.
>

==================================================
Art Celestini       Celestini Development Services
Phone: 201-670-1674                    Wyckoff, NJ
=============  http://celestini.com  =============
Mail sent to the "From" address used in this post
will be rejected by our server.  Please send off-
list email to:  ibmmain<at-sign>celestini<dot>com.
==================================================
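
For anyone following the thread without the assembler source to hand, the three strategies named in the table (byte-at-a-time move and translate, translate in place, and move first then translate) can be sketched roughly in C. This is only an illustration of what each variant does, not the benchmark code from the post: the function names and the uppercase-folding table are made up for the example, and a plain C loop says nothing about how TR, TRE, MVC, or MVCL perform on real hardware.

```c
#include <stddef.h>
#include <string.h>

/* "NO TR(E)" style: move and translate one byte at a time. */
void xlate_bytewise(unsigned char *dst, const unsigned char *src,
                    const unsigned char table[256], size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = table[src[i]];
}

/* "INPL" style: translate the buffer where it sits (overwrites the data). */
void xlate_inplace(unsigned char *buf, const unsigned char table[256],
                   size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = table[buf[i]];
}

/* "MVC" style: bulk-copy to the output buffer first, then translate the
 * copy, so the input buffer is left untouched. */
void xlate_copy_first(unsigned char *dst, const unsigned char *src,
                      const unsigned char table[256], size_t n)
{
    memcpy(dst, src, n);
    xlate_inplace(dst, table, n);
}

/* Hypothetical 256-byte translate table for the example: fold lowercase
 * to uppercase. Assumes an ASCII host, where 'a'..'z' are contiguous
 * (not true of EBCDIC). */
void build_upper_table(unsigned char table[256])
{
    for (int i = 0; i < 256; i++)
        table[i] = (unsigned char)((i >= 'a' && i <= 'z') ? i - 'a' + 'A' : i);
}
```

Calling xlate_copy_first(out, in, table, len) gives the same result in out as xlate_bytewise would, and leaves in unchanged, which is the reason for the extra move in the TRE MVC and TR MVC cases.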