I would assume there is some sort of compiler/hardware architecture liaison group within IBM. I would bet that if someone from that group were to put together a SHARE presentation called "Write Machine Code Like a Compiler -- How to Write the Fastest Code Possible for the z13 (z14, whatever)", it would be a big hit.

Charles

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[email protected]] On Behalf Of Jim Mulder
Sent: Monday, December 28, 2015 9:57 AM
To: [email protected]
Subject: Re: Is there a source for detailed, instruction-level performance info?

> An example: which of these code sequences do you suppose runs faster?
>
>          la    rX,0(rI,rBase)     rX -> array[i]
>          lg    rY,0(,rX)          rY = array[i]
>          agsi  0(rX),1            ++array[i]
> * Now do something with rY
>
> vs:
>          lg    rY,0(rI,rBase)     rY = array[i]
>          la    rX,1(,rY)          rX = rY + 1
>          stg   rX,0(rI,rBase)     effect is ++array[i]
> * Now do something with rY
>
> The first is substantially faster. I would have GUESSED that the
> second would be faster, since I need the value in rY anyway. (I'm in
> 64-bit mode, so using "LOAD ADDRESS" for the increment is safe...)

"Substantially faster" is probably a cache effect. Assuming a cache miss on array[i], in sequence 2, the LG will miss and install the cache line shared, and then the STG will need to do an upgrade to exclusive. The AGSI in sequence 1 will miss and install the cache line exclusive, avoiding the upgrade to exclusive. Adding a PFD 2,0(rI,rBase) before the LG in sequence 2 may make these sequences perform similarly.

Also, in sequence 1, changing lg rY,0(,rX) to lg rY,0(rI,rBase) may avoid some Address Generation Interlock (AGI) effects (although various machines have various AGI bypasses for various instructions). And it may just transfer some of the AGI effect from the LG down to the AGSI.

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
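
For reference, the PFD suggestion above, written out against sequence 2, might look like the fragment below. It is only a sketch, not a complete program: rX, rY, rI and rBase are the symbolic register names from the quoted example, and they are assumed to be defined elsewhere. PFD code 2 requests a prefetch for store access, which is what is meant to bring the cache line in exclusive before the LG touches it. Whether this actually closes the gap with sequence 1 would have to be measured on the machine in question.

```
         pfd   2,0(rI,rBase)      prefetch array[i] for store access
         lg    rY,0(rI,rBase)     rY = array[i]
         la    rX,1(,rY)          rX = rY + 1
         stg   rX,0(rI,rBase)     effect is ++array[i]
* Now do something with rY
```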
