Re: Comparison of compiler generated code AD 1980(ish) v 2010(ish)

Miklos Szigetvari Wed, 16 May 2012 08:21:37 -0700

Hi

Have you had a chance to compare the speed of the two versions of the code?



> David,
>
> On 2012-05-16 08:23, David Crayford wrote:
>  > Robert,
>  >
>  > I'm no expert but I have read that newer hardware models (Z10 and
> above) are
>  > essentially RISC processors that run complex instructions in millicode.
> In the
>
> I may be wrong, but I think the z196 is the first OOO machine and
> Enterprise PL/I V3R9 pre-dates it
> by two years.
>
>  > case of a MVC instruction it would have to do that in a loop which
> would require
>  > branching, the enemy of pipelined execution units. It's also possible
> to run
>  > simple instructions
>  > in parallel. It's plausible an MVC instruction can be executed more
> efficiently
>  > as a sequence of LG/STG instructions.
>
> Moves are the most executed instructions, at least on x86 (see, among many
> others, <www.ijpg.org/index.php/IJACSci/article/download/118/29>), and I have
> little doubt that the same holds true for just about any other architecture.
> Given that, and given that x86 has special circuitry to optimize MOVS
> instructions, it would be highly surprising if IBM did not make MVC as fast
> as possible, millicoded or not.
>
>  > The OOO decode units do this for you with instruction cracking on a
> z196, it
>  > seems that on a z10 the optimizer is doing the same thing.
>
> Possibly, but that does not explain the 10 superfluous reloads of r1.
>
>  > See this document - page 21
>  > 
> http://www-01.ibm.com/software/htp/tpf/tpfug/tgf11/How_do_you_do_when_youre_a_z196_CPU.pdf
>  >
>  > Optimizers create arcane code. It's almost impossible to verify without
>  > understanding the secret sauce. A lot of the code the optimizers spit
> out is
>  > intractable,
>
> I don't know much about z/OS assembler, but at least I sort of managed to
> understand the code
> generated by the OS PL/I compiler. The code generated by Enterprise PL/I
> is completely unreadable,
> even some (or more than some) on this list might have trouble figuring out
> why it does what it does.
>
>  > and it's almost a paradox that a longer code path produces faster code.
>  >
>  > If you don't like it you can always compile at a different ARCH() level
> and ask
>  > IBM.
>
> Going back to ARCH(5) doesn't produce anything that seems much shorter:
> there is still the ridiculous reloading of the same register, and oodles
> and oodles of instructions which would run and take time on a definitely
> not-OOO CPU:
>
> 003A58  E300  8238  0014  003119 | LGF   r0,LINE_PTR(,r8,568)
> 003A5E  4110  E00C        003119 | LA    r1,_shadow21(,r14,12)
> 003A62  B914  00E0        003119 | LGFR  r14,r0
> 003A66  D278  B38E  6D33  003118 | MVC   LINE(121,r11,910),REPT_INIT(r6,3379)
> 003A6C  E3B0  DC20  0004  003119 | LG    r11,#SPILL17(,r13,3104)
> 003A72  50B0  D25C        003119 | ST    r11,_temp9(,r13,604)
> 003A76  DE03  D25C  1000  003119 | ED    _temp9(4,r13,604),_shadow21(r1,0)
> 003A7C  4110  E003        003119 | LA    r1,#AddressShadow(,r14,3)
> 003A80  41F0  E00A        003119 | LA    r15,#AddressShadow(,r14,10)
> 003A84  D202  1001  D25D  003119 | MVC   _shadow21(3,r1,1),_temp9(r13,605)
> 003A8A  9240  E003        003119 | MVI   _shadow21(r14,3),64
> 003A8E  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
> 003A92  50B0  D2E4        003119 | ST    r11,_temp8(,r13,740)
> 003A96  41B0  E017        003119 | LA    r11,#AddressShadow(,r14,23)
> 003A9A  4110  100E        003119 | LA    r1,_shadow21(,r1,14)
> 003A9E  DE03  D2E4  1000  003119 | ED    _temp8(4,r13,740),_shadow21(r1,0)
> 003AA4  D202  F001  D2E5  003119 | MVC   _shadow21(3,r15,1),_temp8(r13,741)
> 003AAA  9240  E00A        003119 | MVI   _shadow21(r14,10),64
> 003AAE  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
> 003AB2  E3F0  DB98  0004  003119 | LG    r15,#SPILL0(,r13,2968)
> 003AB8  D202  E011  1010  003119 | MVC   _shadow21(3,r14,17),_shadow21(r1,16)
> 003ABE  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
> 003AC2  D206  D2D4  F4A4  003119 | MVC   _temp19(7,r13,724),' ......'(r15,1188)
> 003AC8  D203  D26C  1013  003119 | MVC   _temp15(4,r13,620),_shadow18(r1,19)
> 003ACE  4110  D26C        003119 | LA    r1,_temp15(,r13,620)
> 003AD2  D202  D24C  1001  003119 | MVC   _temp11(3,r13,588),_shadow12(r1,1)
> 003AD8  4110  D24C        003119 | LA    r1,_temp11(,r13,588)
> 003ADC  DE06  D2D4  1000  003119 | ED    _temp19(7,r13,724),_temp11(r1,0)
> 003AE2  D205  B000  D2D5  003119 | MVC   _shadow21(6,r11,0),_temp19(r13,725)
> 003AE8  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
> 003AEC  D206  D2CC  F4A4  003119 | MVC   _temp21(7,r13,716),' ......'(r15,1188)
> 003AF2  D202  D249  101B  003119 | MVC   _temp18(3,r13,585),_shadow12(r1,27)
> 003AF8  D202  D246  D249  003119 | MVC   _temp20(3,r13,582),_temp18(r13,585)
> 003AFE  4110  E028        003119 | LA    r1,#AddressShadow(,r14,40)
> 003B02  E300  D246  0090  003119 | LLGC  r0,<a1:d582:l1>(,r13,582)
> 003B08  E300  3114  0080  003119 | NG    r0,=X'00000000 0000000F'
> 003B0E  41B0  D246        003119 | LA    r11,_temp20(,r13,582)
> 003B12  4200  D246        003119 | STC   r0,<a1:d582:l1>(,r13,582)
> 003B16  DE06  D2CC  B000  003119 | ED    _temp21(7,r13,716),_temp20(r11,0)
> 003B1C  D204  1000  D2CE  003119 | MVC   _shadow21(5,r1,0),_temp21(r13,718)
> 003B22  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
> 003B26  E300  1026  0014  003119 | LGF   r0,_shadow19(,r1,38)
> 003B2C  5000  E030        003119 | ST    r0,_shadow19(,r14,48)
> 003B30  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
> 003B34  E300  102A  0014  003119 | LGF   r0,_shadow19(,r1,42)
> 003B3A  5000  E036        003119 | ST    r0,_shadow19(,r14,54)
> 003B3E  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
> 003B42  E300  102E  0014  003119 | LGF   r0,_shadow19(,r1,46)
> 003B48  5000  E03D        003119 | ST    r0,_shadow19(,r14,61)
> 003B4C  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
> 003B50  4300  1036        003119 | IC    r0,_shadow21(,r1,54)
> 003B54  4200  E04B        003119 | STC   r0,_shadow21(,r14,75)
> 003B58  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
> 003B5C  E300  1043  0014  003119 | LGF   r0,_shadow19(,r1,67)
> 003B62  5000  E05F        003119 | ST    r0,_shadow19(,r14,95)
> 003B66  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
> 003B6A  4800  1047        003119 | LH    r0,_shadow20(,r1,71)
> 003B6E  B914  0000        003119 | LGFR  r0,r0
> 003B72  4000  E064        003119 | STH   r0,_shadow20(,r14,100)
> 003B76  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
> 003B7A  4800  1049        003119 | LH    r0,_shadow20(,r1,73)
> 003B7E  B914  0000        003119 | LGFR  r0,r0
> 003B82  4000  E067        003119 | STH   r0,_shadow20(,r14,103)
>
> When I started with PL/I in 1985, we were told never to initialize a
> structure multiple times with
> '', but to do it once, copy the initialized structure to a copy and
> re-initialize it with this copy,
> as the compiler would just generate a simple MVC. For arrays of structures
> the multi-init using ''
> was even worse, but by doing a
>
> array_of_structure(1) = ''
>
> followed by
>
> array_of_structure = array_of_structure(1)
>
> the code was near optimal (and with an initialized STATIC copy of the
> structure at hand to copy from, the code was for all intents and purposes
> optimal). Not so with Enterprise PL/I, although I believe (lacking access
> to EPLI 4.1 & 4.2, I cannot check) that some issues have been addressed.
> Here's an example of OS PL/I V2.3.0 versus Enterprise PL/I V3R9M0:
>
> dcl 1 rept_line(10),
>        2 z00001             char       (3),
>        2 tr                 pic     'zzz9',
>        2 z00002             char       (3),
>        2 ri                 pic     'zzz9',
>        2 z00003             char       (3),
>        2 da                 char       (3),
>        2 z00004             char       (3),
>        2 km                 pic '(3)z9v.9',
>        2 z00005             char       (3),
>        2 hh                 pic      'z9.',
>        2 mm                 pic       '99',
>        2 z00006             char       (3),
>        2 v                  pic   'zz9v.9',
>        2 z00007             char       (3),
>        2 na                 char       (4),
>        2 z00008             char       (2),
>        2 ty                 char       (4),
>        2 z00009             char       (3),
>        2 co                 char       (4),
>        2 z00010             char       (2),
>        2 wa,
>          3 whh              pic      'z9.',
>          3 wmm              pic       '99',
>        2 z00011             char       (3),
>        2 sp                 char       (1),
>        2 z00012             char     (255),
>        2 de,
>          3 dhh              pic      'z9.',
>          3 dmm              pic       '99',
>        2 z00013             char       (3),
>        2 ar,
>          3 ahh              pic      'z9.',
>          3 amm              pic       '99',
>        2 z00014             char       (3),
>        2 date,
>          3 year             pic     '9999',
>          3 z00015           char       (1),
>          3 month            pic       '99',
>          3 z00016           char       (1),
>          3 day              pic       '99';
>
> /* Just fill the first element with something totally random */
> /* Q&D, too long, not good, but that's safe in PL/I          */
> string(rept_line(1)) = '.z,dmbvn;aehj,mzncbkmsdlkjsjsndvfkl\hjsb' ||
>                         'fjhbc.blkwaioyuh.m,jdnsvkjxbvhjbzdfwtytk' ||
>                         'vkjbnsegfirahjgouegkjnzkjgh8eryghkjghxjv' ||
>                         'uye9tkjgkjvuhkjzxng-oipu8ynkjh4268srtjsc' ||
>                         'uhkdlgozdugjnrg;hzdfgi.zdlnhg;zfhjgiozdh' ||
>                         'iorjhdhgjzndg;hzdohgjdrhgjiozd-862jhaso9' ||
>                         'fhhgoishiojsdrhjdiuhz,dmbvn;aehj,mzncbkk' ||
>                         'l59uyjhlkxjbxofyixhjjhbc.blkwaioyuh.m,j0' ||
>                         'ftkjgkjvuhkjzxng-oipkjbnsegfirahjgouegkk' ||
>                         'llgozdugjnrg;hzdfgi.mbx/hjjxfhj(*^^^%$?0';
> rept_line = rept_line(1);
>
> The code generated by OS PL/I V2.3.0 - OPT(2):
>
> * STATEMENT NUMBER  15
> 0000E0  41 E0 D 0D0            LA    14,REPT_LINE.Z00001+357
> 0000E4  50 E0 D 0C8            ST    14,200(0,13)
> 0000E8  41 70 D 0C8            LA    7,200(0,13)
> 0000EC  50 70 3 57C            ST    7,1404(0,3)
> 0000F0  41 10 3 57C            LA    1,1404(0,3)
> 0000F4  58 F0 3 020            L     15,A..IBMBAPMA
> 0000F8  05 EF                  BALR  14,15
>
> Not very nice, a call to the library, but once in a program? We have to
> live with it if we choose
> this kind of initialization.
>
> * STATEMENT NUMBER  16
> 0000FA  41 90 D 0D0            LA    9,REPT_LINE.Z00001+357
> 0000FE  41 80 0 00C            LA    8,12(0,0)
> 000102  41 70 D 235            LA    7,REPT_LINE.Z00001+714
> 000106                    CL.7 EQU   *
> 000106  D2 FF 7 000 9 000      MVC   0(256,7),0(9)
> 00010C  41 70 7 100            LA    7,256(0,7)
> 000110  41 90 9 100            LA    9,256(0,9)
> 000114  46 80 2 02A            BCT   8,CL.7
> 000118  D2 8C 7 000 9 000      MVC   0(141,7),0(9)
>
> Inner loop: MVC, 2 x LA and BCT
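
The loop counts above can be checked with a little arithmetic: the nine remaining elements of 357 bytes make 3213 bytes, which is twelve 256-byte MVCs (the BCT count loaded into r8) plus one final MVC of 141 bytes. A minimal Python sketch follows; the function names and the bytearray model are my own illustration, not anything from either compiler, and the 357-byte element size is an assumption taken from the 256 + 101 MVC pairs in the Enterprise PL/I listing. It also shows why copying forward from element 1 to element 2 ends up filling the whole array.

```python
ELEMENT_SIZE = 357   # assumed bytes per rept_line element (256 + 101)
ELEMENTS = 10        # dimension of rept_line


def chunk_plan(total, chunk=256):
    """Split a byte count into full MVC-sized chunks plus a remainder."""
    return divmod(total, chunk)


def propagate_first_element(buf, size=ELEMENT_SIZE, count=ELEMENTS, chunk=256):
    """Copy forward from element 1 to element 2 in chunks of at most 256 bytes.

    The target is always `size` bytes ahead of the source, so every chunk
    reads bytes that were either in element 1 or written by an earlier chunk.
    """
    src, dst = 0, size
    remaining = size * (count - 1)
    while remaining > 0:
        n = min(chunk, remaining)
        buf[dst:dst + n] = buf[src:src + n]
        src += n
        dst += n
        remaining -= n
    return buf


full_chunks, tail = chunk_plan(ELEMENT_SIZE * (ELEMENTS - 1))
print(full_chunks, tail)
```

Because each chunk (256 bytes) is shorter than the distance between source and target (357 bytes), no single move overlaps itself, which is what lets one straight-line copy replicate element 1 nine times.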
>
> The code generated by Enterprise PL/I V3R9 OPT(3), ARCH(9) - 108 is
> statement 15 above, 119 is 16:
>
> 00008C  4110  D0CC        108 |      LA    r1,REPT_LINE(,r13,204)
> 000090  A709  0001        119 |      LGHI  r0,H'1'
>
> Optimizer seemed to have moved other code here...
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
> 000094  5020  2010        002 |      ST    r2,<s43:d16:l4>(,r2,16)
> 000098  9231  D0C1        059 |      MVI   _Sfi(r13,193),49
> 00009C  E3F0  3004  0014  059 |      LGF   r15,=A(_ON_Begin_60_Blk_2)(,r3,4)
> 0000A2  50D0  DF04        059 |      ST    r13,<a1:d3844:l4>(,r13,3844)
> 0000A6  50F0  DF00        059 |      ST    r15,<a1:d3840:l4>(,r13,3840)
> 0000AA  E3E0  DF00  0004  059 |      LG    r14,_temp1(,r13,3840)
> 0000B0  E3E0  D0C4  0024  059 |      STG   r14,_Sfi(,r13,196)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> 0000B6  C0E0  0000  00E9  108 |      LARL  r14,F'233'
> 0000BC  D2FF  1000  E1E8  108 |      MVC   _...(256,r1,0),'.z,...'(r14,488)
> 0000C2  D264  1100  E2E8  108 |      MVC   _shad...(101,r1,256),'.z,...'(r14,744)
>
> 0000C8                    119 | @1L2 DS    0H
> 0000C8  EBE0  0009  000D  119 |      SLLG  r14,r0,9      r14 = 512 x r0
> 0000CE  EB10  0007  000D  119 |      SLLG  r1,r0,7       r1  = 128 x r0
> 0000D4  EBF0  0005  000D  119 |      SLLG  r15,r0,5      r15 =  32 x r0
> 0000DA  1FE1              119 |      SLR   r14,r1        r14 = 384 x r0
> 0000DC  EB40  0002  000D  119 |      SLLG  r4,r0,2       r4  =   4 x r0
> 0000E2  1FEF              119 |      SLR   r14,r15       r14 = 352 x r0
> 0000E4  B904  0010        119 |      LGR   r1,r0         r1  = r0
> 0000E8  1EE4              119 |      ALR   r14,r4        r14 = 356 x r0
> 0000EA  A70A  0001        119 |      AHI   r0,H'1'
> 0000EE  1E1E              119 |      ALR   r1,r14        r1  = 357 x r0
> 0000F0  E311  DF67  FF71  119 |      LAY   r1,REPT_LINE(r1,r13,-153)
> 0000F6  D2FF  1000  D0CC  119 |      MVC   REPT_LINE(256,r1,0),REPT_LINE(r13,204)
> 0000FC  D264  1100  D1CC  119 |      MVC   REPT_LINE(101,r1,256),REPT_LINE(r13,460)
> 000102  EC0C  FFE3  0A7E  119 |      CIJNH r0,H'10',@1L2
>
> Inner loop: WTH! For crying out loud... Is this really a "fast" multiply
> by 357??? And why waste
> three extra registers on it??? Oh yes, because the instructions overlap...
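
For anyone who wants to verify the annotations in the listing, here is a small Python sketch that mirrors the shift/subtract/add sequence step by step. It is only a check of the arithmetic, not a model of the hardware: the function names are mine, Python integers ignore the 32-bit versus 64-bit distinction between SLR/ALR and SLLG, and the base displacement of 204 is taken from the REPT_LINE(,r13,204) operand in the listing.

```python
def times_357_shift_add(n):
    """Mirror the SLLG/SLR/ALR sequence from the ARCH(9) loop for index n."""
    r14 = n << 9      # SLLG r14,r0,9  -> 512 * n
    r1 = n << 7       # SLLG r1,r0,7   -> 128 * n
    r15 = n << 5      # SLLG r15,r0,5  ->  32 * n
    r14 -= r1         # SLR  r14,r1    -> 384 * n
    r4 = n << 2       # SLLG r4,r0,2   ->   4 * n
    r14 -= r15        # SLR  r14,r15   -> 352 * n
    r1 = n            # LGR  r1,r0
    r14 += r4         # ALR  r14,r4    -> 356 * n
    r1 += r14         # ALR  r1,r14    -> 357 * n
    return r1


def element_displacement(n, base=204, size=357):
    """Displacement from r13 of element n (1-based): 357*n + (204 - 357)."""
    return times_357_shift_add(n) + (base - size)


# The decomposition used is 512 - 128 - 32 + 4 + 1 = 357.
for n in range(1, 11):
    assert times_357_shift_add(n) == 357 * n
```

The -153 in the LAY instruction is simply 204 - 357, so element 1 lands back on displacement 204 and the first pass of the loop copies element 1 onto itself.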
>
> The equivalent inner loop using ARCH(5), the lowest level EPLI V3R9 supports:
>
> 0000D8                    000119 | @1L2 DS  0H
> 0000D8  B904  00E0        000119 |      LGR r14,r0
> 0000DC  A7EC  0165        000119 |      MHI r14,H'357'
> 0000E0  A70A  0001        000119 |      AHI r0,H'1'
> 0000E4  A70E  000A        000119 |      CHI r0,H'10'
> 0000E8  41EE  FE9A        000119 |      LA  r14,REPT_LINE(r14,r15,3738)
> 0000EC  D2FF  E000  1000  000119 |      MVC REPT_LINE(256,r14,0),REPT_LINE(r1,0)
> 0000F2  D264  E100  1100  000119 |      MVC REPT_LINE(101,r14,256),REPT_LINE(r1,256)
> 0000F8  A7D4  FFF0        000119 |      JNH @1L2
>
> OK, it contains a multiply, a "slow" instruction, although one that can be
> made pretty fast if you look at the x86 offerings from AMD & Intel (Sandy
> Bridge: 64 bit mul in 3 cycles). However, given that this is a normal
> non-interleaved array, why do you need a multiplication at all? The V2.3.0
> compiler clearly demonstrated that you don't, and did so almost three decades
> ago!!!
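
The point about not needing a multiplication is the classic strength-reduction argument: when elements are visited in order, the offset of the next element is the previous offset plus the element size. A hedged Python sketch of the two ways to produce the same offsets follows; the function names are hypothetical, the 357-byte size is taken from the listings, and this illustrates the idea only, not what either compiler actually does internally.

```python
def offsets_with_multiply(count, size=357):
    """Recompute each element offset from its index, as the MHI version does."""
    return [index * size for index in range(count)]


def offsets_with_increment(count, size=357):
    """Carry a running offset and add the element size on every iteration."""
    offsets = []
    offset = 0
    for _ in range(count):
        offsets.append(offset)
        offset += size   # one add per iteration, no multiply
    return offsets


# Both approaches visit exactly the same ten element offsets.
assert offsets_with_multiply(10) == offsets_with_increment(10)
```

The old loop goes one step further by advancing both addresses with LA and ignoring element boundaries altogether, since a contiguous array can be treated as one block of bytes.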
>
> Again, I just observe, your boss picks up the bill for the CPU cycles
> used...
>
> If your company is paying thousands of dollars per year to be able to use
> Enterprise PL/I, don't you
> think you are entitled to a compiler that generates the best possible
> code?
>
> Robert
> --
> Robert AH Prins
> robert(a)prino(d)org
>
> ----------------------------------------------------------------------
> For IBM-MAIN subscribe / signoff / archive access instructions,
> send email to [email protected] with the message: INFO IBM-MAIN
>
>

