Re: Comparison of compiler generated code AD 1980(ish) v 2010(ish)

Robert Prins Wed, 16 May 2012 08:02:37 -0700

David,

On 2012-05-16 08:23, David Crayford wrote:
> Robert,
>
> I'm no expert but I have read that newer hardware models (Z10 and above) are
> essentially RISC processors that run complex instructions in millicode. In the

I may be wrong, but I think the z196 is the first OOO machine, and Enterprise PL/I V3R9 pre-dates it by two years.


> case of a MVC instruction it would have to do that in a loop which would require
> branching, the enemy of pipelined execution units. It's also possible to run
> simple instructions in parallel. It's plausible an MVC instruction can be
> executed more efficiently as a sequence of LG/STG instructions.

Given that moves are the most executed instructions, at least on x86 (see, among many others, <www.ijpg.org/index.php/IJACSci/article/download/118/29>), that I have little doubt the same holds true for just about any other architecture, and that there is special x86 circuitry to optimize MOVS instructions, it would be highly surprising if IBM did not make MVC as fast as possible, millicoded or not.


> The OOO decode units do this for you with instruction cracking on a z196, it
> seems that on a z10 the optimizer is doing the same thing.

Possibly, but that does not explain the 10 superfluous reloads of r1.

> See this document - page 21
> http://www-01.ibm.com/software/htp/tpf/tpfug/tgf11/How_do_you_do_when_youre_a_z196_CPU.pdf
>
> Optimizers create arcane code. It's almost impossible to verify without
> understanding the secret sauce. A lot of the code the optimizers spit out is
> intractable,

I don't know much about z/OS assembler, but at least I sort of managed to understand the code generated by the OS PL/I compiler. The code generated by Enterprise PL/I is completely unreadable; even some (or more than some) on this list might have trouble figuring out why it does what it does.


> and it's almost a paradox that a longer code path produces faster code.
>
> If you don't like it you can always compile at a different ARCH() level and ask
> IBM.

Going back to ARCH(5) doesn't produce anything that seems much shorter: still the ridiculous reloading of the same register, and oodles and oodles of instructions that would run, and take time, on a definitely not-OOO CPU:


003A58  E300  8238  0014  003119 | LGF   r0,LINE_PTR(,r8,568)
003A5E  4110  E00C        003119 | LA    r1,_shadow21(,r14,12)
003A62  B914  00E0        003119 | LGFR  r14,r0
003A66  D278  B38E  6D33  003118 | MVC   LINE(121,r11,910),REPT_INIT(r6,3379)
003A6C  E3B0  DC20  0004  003119 | LG    r11,#SPILL17(,r13,3104)
003A72  50B0  D25C        003119 | ST    r11,_temp9(,r13,604)
003A76  DE03  D25C  1000  003119 | ED    _temp9(4,r13,604),_shadow21(r1,0)
003A7C  4110  E003        003119 | LA    r1,#AddressShadow(,r14,3)
003A80  41F0  E00A        003119 | LA    r15,#AddressShadow(,r14,10)
003A84  D202  1001  D25D  003119 | MVC   _shadow21(3,r1,1),_temp9(r13,605)
003A8A  9240  E003        003119 | MVI   _shadow21(r14,3),64
003A8E  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
003A92  50B0  D2E4        003119 | ST    r11,_temp8(,r13,740)
003A96  41B0  E017        003119 | LA    r11,#AddressShadow(,r14,23)
003A9A  4110  100E        003119 | LA    r1,_shadow21(,r1,14)
003A9E  DE03  D2E4  1000  003119 | ED    _temp8(4,r13,740),_shadow21(r1,0)
003AA4  D202  F001  D2E5  003119 | MVC   _shadow21(3,r15,1),_temp8(r13,741)
003AAA  9240  E00A        003119 | MVI   _shadow21(r14,10),64
003AAE  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
003AB2  E3F0  DB98  0004  003119 | LG    r15,#SPILL0(,r13,2968)
003AB8  D202  E011  1010  003119 | MVC   _shadow21(3,r14,17),_shadow21(r1,16)
003ABE  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
003AC2  D206  D2D4  F4A4  003119 | MVC   _temp19(7,r13,724),' ......'(r15,1188)
003AC8  D203  D26C  1013  003119 | MVC   _temp15(4,r13,620),_shadow18(r1,19)
003ACE  4110  D26C        003119 | LA    r1,_temp15(,r13,620)
003AD2  D202  D24C  1001  003119 | MVC   _temp11(3,r13,588),_shadow12(r1,1)
003AD8  4110  D24C        003119 | LA    r1,_temp11(,r13,588)
003ADC  DE06  D2D4  1000  003119 | ED    _temp19(7,r13,724),_temp11(r1,0)
003AE2  D205  B000  D2D5  003119 | MVC   _shadow21(6,r11,0),_temp19(r13,725)
003AE8  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
003AEC  D206  D2CC  F4A4  003119 | MVC   _temp21(7,r13,716),' ......'(r15,1188)
003AF2  D202  D249  101B  003119 | MVC   _temp18(3,r13,585),_shadow12(r1,27)
003AF8  D202  D246  D249  003119 | MVC   _temp20(3,r13,582),_temp18(r13,585)
003AFE  4110  E028        003119 | LA    r1,#AddressShadow(,r14,40)
003B02  E300  D246  0090  003119 | LLGC  r0,<a1:d582:l1>(,r13,582)
003B08  E300  3114  0080  003119 | NG    r0,=X'00000000 0000000F'
003B0E  41B0  D246        003119 | LA    r11,_temp20(,r13,582)
003B12  4200  D246        003119 | STC   r0,<a1:d582:l1>(,r13,582)
003B16  DE06  D2CC  B000  003119 | ED    _temp21(7,r13,716),_temp20(r11,0)
003B1C  D204  1000  D2CE  003119 | MVC   _shadow21(5,r1,0),_temp21(r13,718)
003B22  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
003B26  E300  1026  0014  003119 | LGF   r0,_shadow19(,r1,38)
003B2C  5000  E030        003119 | ST    r0,_shadow19(,r14,48)
003B30  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
003B34  E300  102A  0014  003119 | LGF   r0,_shadow19(,r1,42)
003B3A  5000  E036        003119 | ST    r0,_shadow19(,r14,54)
003B3E  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
003B42  E300  102E  0014  003119 | LGF   r0,_shadow19(,r1,46)
003B48  5000  E03D        003119 | ST    r0,_shadow19(,r14,61)
003B4C  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
003B50  4300  1036        003119 | IC    r0,_shadow21(,r1,54)
003B54  4200  E04B        003119 | STC   r0,_shadow21(,r14,75)
003B58  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
003B5C  E300  1043  0014  003119 | LGF   r0,_shadow19(,r1,67)
003B62  5000  E05F        003119 | ST    r0,_shadow19(,r14,95)
003B66  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
003B6A  4800  1047        003119 | LH    r0,_shadow20(,r1,71)
003B6E  B914  0000        003119 | LGFR  r0,r0
003B72  4000  E064        003119 | STH   r0,_shadow20(,r14,100)
003B76  5810  8000        003119 | L     r1,REPT_PTR(,r8,0)
003B7A  4800  1049        003119 | LH    r0,_shadow20(,r1,73)
003B7E  B914  0000        003119 | LGFR  r0,r0
003B82  4000  E067        003119 | STH   r0,_shadow20(,r14,103)

When I started with PL/I in 1985, we were told never to initialize a structure multiple times with '', but to do it once, save the initialized structure in a copy and re-initialize it from this copy, as the compiler would then just generate a simple MVC. For arrays of structures the multi-init using '' was even worse, but by doing a


array_of_structure(1) = ''

followed by

array_of_structure = array_of_structure(1)

the code was near optimal (and with an initialized STATIC copy of the structure at hand, and using that, the code was for all intents and purposes optimal). Not so with Enterprise PL/I, although I believe (lacking access to EPLI 4.1 & 4.2, I cannot check) that some issues have been addressed. Here's an example of OS PL/I V2.3.0 versus Enterprise PL/I V3R9M0:


dcl 1 rept_line(10),
      2 z00001             char       (3),
      2 tr                 pic     'zzz9',
      2 z00002             char       (3),
      2 ri                 pic     'zzz9',
      2 z00003             char       (3),
      2 da                 char       (3),
      2 z00004             char       (3),
      2 km                 pic '(3)z9v.9',
      2 z00005             char       (3),
      2 hh                 pic      'z9.',
      2 mm                 pic       '99',
      2 z00006             char       (3),
      2 v                  pic   'zz9v.9',
      2 z00007             char       (3),
      2 na                 char       (4),
      2 z00008             char       (2),
      2 ty                 char       (4),
      2 z00009             char       (3),
      2 co                 char       (4),
      2 z00010             char       (2),
      2 wa,
        3 whh              pic      'z9.',
        3 wmm              pic       '99',
      2 z00011             char       (3),
      2 sp                 char       (1),
      2 z00012             char     (255),
      2 de,
        3 dhh              pic      'z9.',
        3 dmm              pic       '99',
      2 z00013             char       (3),
      2 ar,
        3 ahh              pic      'z9.',
        3 amm              pic       '99',
      2 z00014             char       (3),
      2 date,
        3 year             pic     '9999',
        3 z00015           char       (1),
        3 month            pic       '99',
        3 z00016           char       (1),
        3 day              pic       '99';

/* Just fill the first element with something totally random */
/* Q&D, too long, not good, but that's safe in PL/I          */
string(rept_line(1)) = '.z,dmbvn;aehj,mzncbkmsdlkjsjsndvfkl\hjsb' ||
                       'fjhbc.blkwaioyuh.m,jdnsvkjxbvhjbzdfwtytk' ||
                       'vkjbnsegfirahjgouegkjnzkjgh8eryghkjghxjv' ||
                       'uye9tkjgkjvuhkjzxng-oipu8ynkjh4268srtjsc' ||
                       'uhkdlgozdugjnrg;hzdfgi.zdlnhg;zfhjgiozdh' ||
                       'iorjhdhgjzndg;hzdohgjdrhgjiozd-862jhaso9' ||
                       'fhhgoishiojsdrhjdiuhz,dmbvn;aehj,mzncbkk' ||
                       'l59uyjhlkxjbxofyixhjjhbc.blkwaioyuh.m,j0' ||
                       'ftkjgkjvuhkjzxng-oipkjbnsegfirahjgouegkk' ||
                       'llgozdugjnrg;hzdfgi.mbx/hjjxfhj(*^^^%$?0';
rept_line = rept_line(1);

The code generated by OS PL/I V2.3.0 - OPT(2):

* STATEMENT NUMBER  15
0000E0  41 E0 D 0D0            LA    14,REPT_LINE.Z00001+357
0000E4  50 E0 D 0C8            ST    14,200(0,13)
0000E8  41 70 D 0C8            LA    7,200(0,13)
0000EC  50 70 3 57C            ST    7,1404(0,3)
0000F0  41 10 3 57C            LA    1,1404(0,3)
0000F4  58 F0 3 020            L     15,A..IBMBAPMA
0000F8  05 EF                  BALR  14,15

Not very nice, a call to the library, but once in a program? We have to live with it if we choose this kind of initialization.


* STATEMENT NUMBER  16
0000FA  41 90 D 0D0            LA    9,REPT_LINE.Z00001+357
0000FE  41 80 0 00C            LA    8,12(0,0)
000102  41 70 D 235            LA    7,REPT_LINE.Z00001+714
000106                    CL.7 EQU   *
000106  D2 FF 7 000 9 000      MVC   0(256,7),0(9)
00010C  41 70 7 100            LA    7,256(0,7)
000110  41 90 9 100            LA    9,256(0,9)
000114  46 80 2 02A            BCT   8,CL.7
000118  D2 8C 7 000 9 000      MVC   0(141,7),0(9)

Inner loop: MVC, 2 x LA and BCT
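
For anyone less used to MVC, the statement 16 loop above can be modelled as a forward copy in 256-byte chunks whose destination sits one element (357 bytes) past its source, so every chunk reads bytes that are either the original first element or were written by an earlier chunk. The Python sketch below is only an illustration of that idea, not a description of the compiler's internals; the names ELEM, COUNT and propagate_first_element are made up, and the counts (12 x 256 plus a 141-byte tail, 3213 bytes = 9 x 357) are read off the listing.

```python
ELEM = 357    # bytes per rept_line element (the stride seen in the listing)
COUNT = 10    # rept_line(10)

def propagate_first_element(buf: bytearray) -> None:
    """Copy element 1 over elements 2..COUNT the way the MVC loop does."""
    src, dst = 0, ELEM                # LA 9,...+357 / LA 7,...+714, rebased to 0
    for _ in range(12):               # LA 8,12 ... BCT 8,CL.7
        buf[dst:dst + 256] = buf[src:src + 256]   # MVC 0(256,7),0(9)
        src += 256                    # LA 9,256(0,9)
        dst += 256                    # LA 7,256(0,7)
    buf[dst:dst + 141] = buf[src:src + 141]       # trailing MVC 0(141,7),0(9)

# Arbitrary test pattern in element 1, zeros everywhere else.
first = bytes((i * 7 + 3) % 251 for i in range(ELEM))
area = bytearray(first) + bytearray(ELEM * (COUNT - 1))
propagate_first_element(area)
```

Because source and destination are 357 bytes apart and each chunk is at most 256 bytes, no single chunk overlaps itself, so a plain slice copy behaves like the byte-by-byte MVC here.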

The code generated by Enterprise PL/I V3R9 OPT(3), ARCH(9) - 108 is statement 15 above, 119 is 16:

00008C  4110  D0CC        108 |      LA    r1,REPT_LINE(,r13,204)
000090  A709  0001        119 |      LGHI  r0,H'1'

The optimizer seems to have moved other code here...
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
000094  5020  2010        002 |      ST    r2,<s43:d16:l4>(,r2,16)
000098  9231  D0C1        059 |      MVI   _Sfi(r13,193),49
00009C  E3F0  3004  0014  059 |      LGF   r15,=A(_ON_Begin_60_Blk_2)(,r3,4)
0000A2  50D0  DF04        059 |      ST    r13,<a1:d3844:l4>(,r13,3844)
0000A6  50F0  DF00        059 |      ST    r15,<a1:d3840:l4>(,r13,3840)
0000AA  E3E0  DF00  0004  059 |      LG    r14,_temp1(,r13,3840)
0000B0  E3E0  D0C4  0024  059 |      STG   r14,_Sfi(,r13,196)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

0000B6  C0E0  0000  00E9  108 |      LARL  r14,F'233'
0000BC  D2FF  1000  E1E8  108 |      MVC   _...(256,r1,0),'.z,...'(r14,488)
0000C2  D264  1100  E2E8  108 |      MVC   _shad...(101,r1,256),'.z,...'(r14,744)

0000C8                    119 | @1L2 DS    0H
0000C8  EBE0  0009  000D  119 |      SLLG  r14,r0,9      r14 = 512 x r0
0000CE  EB10  0007  000D  119 |      SLLG  r1,r0,7       r1  = 128 x r0
0000D4  EBF0  0005  000D  119 |      SLLG  r15,r0,5      r15 =  32 x r0
0000DA  1FE1              119 |      SLR   r14,r1        r14 = 384 x r0
0000DC  EB40  0002  000D  119 |      SLLG  r4,r0,2       r4  =   4 x r0
0000E2  1FEF              119 |      SLR   r14,r15       r14 = 352 x r0
0000E4  B904  0010        119 |      LGR   r1,r0         r1  = r0
0000E8  1EE4              119 |      ALR   r14,r4        r14 = 356 x r0
0000EA  A70A  0001        119 |      AHI   r0,H'1'
0000EE  1E1E              119 |      ALR   r1,r14        r1  = 357 x r0
0000F0  E311  DF67  FF71  119 |      LAY   r1,REPT_LINE(r1,r13,-153)
0000F6  D2FF  1000  D0CC  119 |      MVC   REPT_LINE(256,r1,0),REPT_LINE(r13,204)
0000FC  D264  1100  D1CC  119 |      MVC   REPT_LINE(101,r1,256),REPT_LINE(r13,460)
000102  EC0C  FFE3  0A7E  119 |      CIJNH r0,H'10',@1L2

Inner loop: WTH! For crying out loud... Is this really a "fast" multiply by 357??? And why waste three extra registers on it??? Oh yes, because the instructions overlap...
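
The arithmetic behind that shift-and-add sequence is 357 = 512 - 128 - 32 + 4 + 1. A small Python sketch that mirrors the register comments in the ARCH(9) listing follows; the function name times_357 is made up, and Python integers do not wrap the way the 32-bit SLR/ALR do, which makes no difference for the loop counter values 1 to 10.

```python
def times_357(n: int) -> int:
    """Mirror the SLLG/SLR/ALR sequence: 512n - 128n - 32n + 4n + n."""
    r14 = n << 9        # SLLG r14,r0,9   r14 = 512 x n
    r1 = n << 7         # SLLG r1,r0,7    r1  = 128 x n
    r15 = n << 5        # SLLG r15,r0,5   r15 =  32 x n
    r14 -= r1           # SLR  r14,r1     r14 = 384 x n
    r4 = n << 2         # SLLG r4,r0,2    r4  =   4 x n
    r14 -= r15          # SLR  r14,r15    r14 = 352 x n
    r1 = n              # LGR  r1,r0      r1  = n
    r14 += r4           # ALR  r14,r4     r14 = 356 x n
    r1 += r14           # ALR  r1,r14     r1  = 357 x n
    return r1

# Check against a plain multiply for the subscripts the loop actually uses.
results = [times_357(n) for n in range(1, 11)]
```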


The equivalent inner loop using ARCH(5), the lowest possible by EPLI V3R9:

0000D8                    000119 | @1L2 DS  0H
0000D8  B904  00E0        000119 |      LGR r14,r0
0000DC  A7EC  0165        000119 |      MHI r14,H'357'
0000E0  A70A  0001        000119 |      AHI r0,H'1'
0000E4  A70E  000A        000119 |      CHI r0,H'10'
0000E8  41EE  FE9A        000119 |      LA  r14,REPT_LINE(r14,r15,3738)
0000EC  D2FF  E000  1000  000119 |      MVC REPT_LINE(256,r14,0),REPT_LINE(r1,0)
0000F2  D264  E100  1100  000119 |      MVC REPT_LINE(101,r14,256),REPT_LINE(r1,256)
0000F8  A7D4  FFF0        000119 |      JNH @1L2

OK, it contains a multiply, a "slow" instruction, albeit one that can be made pretty fast if you look at the x86 offerings from AMD & Intel (Sandy Bridge: 64-bit mul in 3 cycles). However, given that this is a normal non-interleaved array, why do you need a multiplication at all? The V2.3.0 compiler clearly demonstrated that you don't, and did so almost three decades ago!!!
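
One way to avoid the multiply, in the spirit of the LA increments in the V2.3.0 loop, is to keep a running offset and add the element length on each pass (often called strength reduction). The Python sketch below only illustrates that the two ways of computing element offsets agree; it does not claim to show what either compiler does internally, and ELEM and both function names are made up for the example.

```python
ELEM = 357   # element length of rept_line, as in the listings

def offsets_by_multiply(count: int) -> list[int]:
    """Offset of each element computed as index * ELEM (the MHI approach)."""
    return [i * ELEM for i in range(count)]

def offsets_by_increment(count: int) -> list[int]:
    """Same offsets from a running total, one addition per pass, no multiply."""
    offsets, off = [], 0
    for _ in range(count):
        offsets.append(off)
        off += ELEM          # analogous to bumping an address register with LA
    return offsets

same = offsets_by_multiply(10) == offsets_by_increment(10)
```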


Again, I just observe, your boss picks up the bill for the CPU cycles used...

If your company is paying thousands of dollars per year to be able to use Enterprise PL/I, don't you think you are entitled to a compiler that generates the best possible code?


Robert
--
Robert AH Prins
robert(a)prino(d)org

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
