Re: An explanation for branch performance?
On Sat, 30 Apr 2016 10:55:47 +0200, Bernd Oppolzer wrote:

>there are IBM and home grown macros which put their parameter lists
>inline and branch around them

Use the List and Execute forms. The standard forms cause the I-cache to be re-read because of the store into the instruction stream.

--
Tom Marchant

--
For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
An explanation for branch performance?
To summarize: whether using base-displacement or relative branches, the three test programs being discussed are, in effect:

(branch never taken, best)

TEST     CSECT
         LLILF 4,1000*1000*1000
         LTR   4,4
NEXT     JNP   NEXT1
NEXT1    JCT   4,NEXT
         BR    14
         END   TEST

and (conditional branch, always taken, worst)

TEST     CSECT
         LLILF 4,1000*1000*1000
         LTR   4,4
NEXT     JP    NEXT1
NEXT1    JCT   4,NEXT
         BR    14
         END   TEST

and (unconditional branch, always taken, middle)

TEST     CSECT
         LLILF 4,1000*1000*1000
         LTR   4,4
NEXT     J     NEXT1
NEXT1    JCT   4,NEXT
         BR    14
         END   TEST

Peter Relson
z/OS Core Technology Design
Re: An explanation for branch performance?
Am 29.04.2016 um 18:59 schrieb Jim Mulder:

>> No. It's the opposite which is why I originally posted. The
>> unconditional branch is slower and I want to know why.
>
> The relevant comparison is not conditional branch vs. unconditional
> branch. It is branch not taken vs. branch taken. Sequential execution
> is always best. Branch prediction tries to mitigate some of the
> effects of nonsequential execution.
>
> Jim Mulder  z/OS System Test  IBM Corp.  Poughkeepsie, NY

IMO, the interesting point about this (for ASSEMBLER programmers and legacy code) is: there are IBM and home grown macros which put their parameter lists inline and branch around them, although there are other possibilities (separating parameters from the execution path using MF=L, MF=E or register notation etc.). Such macros should be avoided in tight loops.

An example which occurred to me was a LOOP macro of an SP macro package, where the loop control variable was a packed decimal variable defined in the instruction stream and branched over. This is even worse, I believe, because the I-cache is invalidated by stores into (or computations using) the loop control variable. I changed this when I made the whole package sensitive to a global variable that signalled "no base register present" (AKA baseless); then the variable is put in a well-known DSECT for auto variables, which is defined in the module's startup macro.

Kind regards

Bernd
Re: An explanation for branch performance?
On 29 April 2016 at 12:59, Jim Mulder wrote:

> The relevant comparison is not conditional branch vs.
> unconditional branch. It is branch not taken vs. branch taken.
> Sequential execution is always best. Branch prediction tries to
> mitigate some of the effects of nonsequential execution.

Right. And presumably even an "unconditional" branch that is actually a branch on condition with a CC mask of 15 can be mispredicted, in theory. And therefore the instruction fetch stream at the address following the branch will keep on fetching, even though it fully expects to switch to the new stream at the branch target address. And then that first stream will have to be thrown away, which presumably isn't free.

In the tiny example case, the stream after the branch is not only a valid instruction (which it might or might not be in the case of branching over an eyecatcher), but it's another branch, which presumably gets predicted in its own right, even though it's at the same address as the first branch's target.

Now what about a truly unconditional branch, i.e. one that doesn't depend on the condition code? Would BRAS perform better in this regard than even J, or is the cost of saving stuff in r1 very high? Surely there is no reason to continue fetching after such an instruction. It can't not branch, and it can't program check.

Tony H.
Re: An explanation for branch performance?
>>> Good point well made but can you explain why changing a B to a BE
>>> in a tight loop results in 43% difference?
>>
>> But aren't those two completely different cases (even if it is the
>> same instruction)? The first is an unconditional branch, the second
>> one a conditional branch. That probably makes a big difference to
>> the processor. Where I expect the unconditional one to be faster
>> than the conditional one.
>>
>> I assume it is 43% faster than the conditional one? If it is the
>> other way around I will be very surprised as well.
>
> No. It's the opposite which is why I originally posted. The
> unconditional branch is slower and I want to know why.

The relevant comparison is not conditional branch vs. unconditional branch. It is branch not taken vs. branch taken. Sequential execution is always best. Branch prediction tries to mitigate some of the effects of nonsequential execution.

Jim Mulder  z/OS System Test  IBM Corp.  Poughkeepsie, NY
Re: An explanation for branch performance?
>> I assume it is 43% faster than the conditional one? If it is the other
>> way around I will be very surprised as well.
>
> No. It's the opposite which is why I originally posted. The unconditional
> branch is slower and I want to know why.

That's probably because a branch not taken is always faster than a branch taken (even if predicted correctly).

But by now I've lost the start of this discussion: there isn't really a way to choose between a B or a BE; you either need a conditional branch or you need an unconditional one. So it doesn't seem that relevant if one is faster than the other. And if we are talking about jumping over eye catchers: that is always going to be an unconditional branch. Removing the branch altogether by moving the eyecatcher or the entry point must be an improvement. But that is a different discussion. Am I missing something?

Fred!

--
ATTENTION: The information in this e-mail is confidential and only meant for the intended recipient. If you are not the intended recipient, don't use or disclose it in any way. Please let the sender know and delete the message immediately.
--
Re: An explanation for branch performance?
On 29 April 2016 at 12:06, David Crayford wrote:

> On 29/04/2016 11:55 PM, Tony Harminc wrote:
>>
>> On 29 April 2016 at 11:50, Charles Mills wrote:
>>>
>>> What about substituting a branch relative for the branch on base
>>> register? Trivial code change to make.
>>
>> I was about to suggest that too. All the IBM published material I've
>> seen on this suggests that
>
> It was a simple test case to exercise our legacy code base which issues
> non-relative unconditional branches over eye-catchers. It's non-trivial
> to change and test a huge code base and certainly you wouldn't want to
> without understanding the problem. But there's certainly an issue as our
> customers have reported.

You quoted me, but snipped out everything I actually said... Which I think does partly address your question.

Tony H.
Re: An explanation for branch performance?
On 30/04/2016 12:29 AM, Windt, W.K.F. van der (Fred) wrote:

>> Good point well made but can you explain why changing a B to a BE in a
>> tight loop results in 43% difference?
>
> But aren't those two completely different cases (even if it is the same
> instruction)? The first is an unconditional branch, the second one a
> conditional branch. That probably makes a big difference to the
> processor. Where I expect the unconditional one to be faster than the
> conditional one.
>
> I assume it is 43% faster than the conditional one? If it is the other
> way around I will be very surprised as well.

No. It's the opposite which is why I originally posted. The unconditional branch is slower and I want to know why.
Re: An explanation for branch performance?
>> Good point well made but can you explain why changing a B to a BE in a
>> tight loop results in 43% difference?

But aren't those two completely different cases (even if it is the same instruction)? The first is an unconditional branch, the second one a conditional branch. That probably makes a big difference to the processor. Where I expect the unconditional one to be faster than the conditional one.

I assume it is 43% faster than the conditional one? If it is the other way around I will be very surprised as well.

Fred!
Re: An explanation for branch performance?
On 30/04/2016 12:23 AM, Windt, W.K.F. van der (Fred) wrote:

>> On 29 Apr 2016, at 18:10, Peter Relson wrote:
>>
>> Since the origin for the starting post apparently lay in branching
>> around the eyecatcher (which really is not necessarily at all the same
>> as a branch in a 2 instruction loop), I was surprised that none of the
>> posts that I glanced at mentioned Instruction-cache misses.
>>
>> Just because you think something is high frequency does not mean that
>> the operating system or the machine agrees with you. If the module's
>> first instruction is not in I-cache, then whether that first
>> instruction is a branch or anything else, it will show up as a lot
>> hotter than a somewhat similar instruction that is in the I-cache.
>
> That's an interesting point: does this mean that every instruction at
> the start of a 256-byte boundary will probably appear hotter because it
> sits at the start of a cache line and profilers will attribute the time
> it takes to load the line into cache to this instruction? The other
> instructions in the cached line quietly benefit from this behavior.

That's a good question. Is there any doubt that profilers are unreliable? Jim Mulder gave a good explanation where that is certainly the case in a previous post.

> You could probably validate this by inserting NOPs in the code. The
> hotspots should shift by the length of the NOP.
>
> And the entry point of a module might be more likely to sit on a
> 256-byte boundary if it is also at the very start of the code of that
> module... Very interesting.
Re: An explanation for branch performance?
Sent from my new iPad

> On 29 Apr 2016, at 18:10, Peter Relson wrote:
>
> Since the origin for the starting post apparently lay in branching around
> the eyecatcher (which really is not necessarily at all the same as a
> branch in a 2 instruction loop), I was surprised that none of the posts
> that I glanced at mentioned Instruction-cache misses.
>
> Just because you think something is high frequency does not mean that the
> operating system or the machine agrees with you. If the module's first
> instruction is not in I-cache, then whether that first instruction is a
> branch or anything else, it will show up as a lot hotter than a somewhat
> similar instruction that is in the I-cache.

That's an interesting point: does this mean that every instruction at the start of a 256-byte boundary will probably appear hotter because it sits at the start of a cache line, and profilers will attribute the time it takes to load the line into cache to this instruction? The other instructions in the cached line quietly benefit from this behavior.

You could probably validate this by inserting NOPs in the code. The hotspots should shift by the length of the NOP.

And the entry point of a module might be more likely to sit on a 256-byte boundary if it is also at the very start of the code of that module... Very interesting.

Fred!
Re: An explanation for branch performance?
On 30/04/2016 12:10 AM, Peter Relson wrote:

> Since the origin for the starting post apparently lay in branching around
> the eyecatcher (which really is not necessarily at all the same as a
> branch in a 2 instruction loop), I was surprised that none of the posts
> that I glanced at mentioned Instruction-cache misses.

Good point well made, but can you explain why changing a B to a BE in a tight loop results in a 43% difference?

> Just because you think something is high frequency does not mean that the
> operating system or the machine agrees with you. If the module's first
> instruction is not in I-cache, then whether that first instruction is a
> branch or anything else, it will show up as a lot hotter than a somewhat
> similar instruction that is in the I-cache.

Indeed. But why would there be a slowdown on a z13 when compared to older hardware?
Re: An explanation for branch performance?
Since the origin for the starting post apparently lay in branching around the eyecatcher (which really is not necessarily at all the same as a branch in a 2 instruction loop), I was surprised that none of the posts that I glanced at mentioned Instruction-cache misses.

Just because you think something is high frequency does not mean that the operating system or the machine agrees with you. If the module's first instruction is not in I-cache, then whether that first instruction is a branch or anything else, it will show up as a lot hotter than a somewhat similar instruction that is in the I-cache.

Peter Relson
z/OS Core Technology Design
Re: An explanation for branch performance?
On 29/04/2016 11:55 PM, Tony Harminc wrote:

> On 29 April 2016 at 11:50, Charles Mills wrote:
>> What about substituting a branch relative for the branch on base
>> register? Trivial code change to make.
>
> I was about to suggest that too. All the IBM published material I've
> seen on this suggests that
>
> -- all else being equal (heh), untaken branches are faster than taken
> ones (even if the prediction is correct), and
>
> -- the penalty for misprediction is lower for relative branches than
> register-based ones.

It was a simple test case to exercise our legacy code base which issues non-relative unconditional branches over eye-catchers. It's non-trivial to change and test a huge code base and certainly you wouldn't want to without understanding the problem. But there's certainly an issue as our customers have reported.
Re: An explanation for branch performance?
On 29 April 2016 at 11:50, Charles Mills wrote:

> What about substituting a branch relative for the branch on base register?
> Trivial code change to make.

I was about to suggest that too. All the IBM published material I've seen on this suggests that

-- all else being equal (heh), untaken branches are faster than taken ones (even if the prediction is correct), and

-- the penalty for misprediction is lower for relative branches than register-based ones.

Tony H.
Re: An explanation for branch performance?
On Fri, Apr 29, 2016 at 9:30 AM, John McKown wrote:

> On Fri, Apr 29, 2016 at 9:27 AM, Mike Schwab wrote:
>
>> Well, the obvious solution is to code the eyecatcher literals before
>> the entry point. It will be less obvious that the eyecatcher is part
>> of the program (and not the end of the previous program) but as the
>> technique becomes more widespread it should become more trusted.
>
> IBM has a ton of recoding to do. I've seen this type of thing in a _lot_
> of IBM routines.
>
> --
> The unfacts, did we have them, are too imprecisely few to warrant our
> certitude.
>
> Maranatha! <><
> John McKown

https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe_ID=87642

--
Mike A Schwab, Springfield IL USA
Where do Forest Rangers go to get away from it all?
AW: Re: An explanation for branch performance?
>> L        R4,=A(1*1000*1000*1000)
>> LTR      R4,R4
>> J        LOOP
>> *
>> LOOP     DS  0D        .LOOP START
>>          B   NEXT
>> NEXT     JCT R4,LOOP
>
> The loop starts with a branch ... I tested it twice - when the CC is
> matched (branch happens) and when it is not matched (falls through)
>
> 1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
> 2. When the CC is not matched the code falls through, CPU TIME=1.69
>    seconds - a reduction of 42%
>
> Uhm... I don't see any conditional branch at the start of the loop that
> branches or falls through?

I'm with Fred here. Out of curiosity, the code you posted seems to be incomplete. Is it?

--
Peter Hunkeler
Re: An explanation for branch performance?
On 29/04/2016 11:40 PM, Jim Mulder wrote:

>>>> Well, the obvious solution is to code the eyecatcher literals before
>>>> the entry point. It will be less obvious that the eyecatcher is part
>>>> of the program (and not the end of the previous program) but as the
>>>> technique becomes more widespread it should become more trusted.
>>>
>>> IBM has a ton of recoding to do. I've seen this type of thing in a
>>> _lot_ of IBM routines.
>>
>> Indeed! Including SVC routines which show up in our profiling!
>
> How is the sampling done for your profiling? Keep in mind that sampling
> which is software-based and driven by external interrupts will charge
> the time spent in the SVC interrupt handler to the beginning of the SVC
> routine, since the SVC interrupt handler is disabled for external
> interrupts.

We use IBM Application Performance Analyzer. Thank you, that's very good information.
Re: An explanation for branch performance?
I think all issues of this general type are incredibly difficult to analyze reliably because the hardware is so darned complex now. There are so many more variables than back in the day when you could say "a branch consumes 'n' microseconds" or "'n' microseconds if taken, 'm' microseconds if not."

What about substituting a branch relative for the branch on base register? Trivial code change to make.

Charles

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Jim Mulder
Sent: Friday, April 29, 2016 8:41 AM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: An explanation for branch performance?

>>> Well, the obvious solution is to code the eyecatcher literals
>>> before the entry point. It will be less obvious that the
>>> eyecatcher is part of the program (and not the end of the previous
>>> program) but as the technique becomes more widespread it should
>>> become more trusted.
>>
>> IBM has a ton of recoding to do. I've seen this type of thing in a
>> _lot_ of IBM routines.
>
> Indeed! Including SVC routines which show up in our profiling!

How is the sampling done for your profiling? Keep in mind that sampling which is software-based and driven by external interrupts will charge the time spent in the SVC interrupt handler to the beginning of the SVC routine, since the SVC interrupt handler is disabled for external interrupts.
Re: An explanation for branch performance?
This is very interesting. It explains what I thought was an anomaly of the CA TriTune product, which we use here for profiling our in-house code. Strictly enforced standards here require inclusion of automatically customized code at the start of every COBOL procedure (main or subroutine) that DISPLAYs program name, version and compile date/time information on the first pass and bypasses the DISPLAY every other time using "ALTER ... GO TO" logic. The "ALTER ... GO TO" paragraphs now show up as serious hot spots in high-frequency subroutines where they never did before, since we moved to z13 CECs.

Many thanks for this investigation.

Peter

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of David Crayford
Sent: Friday, April 29, 2016 10:48 AM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: An explanation for branch performance?

On 29/04/2016 10:34 PM, Joe Testa wrote:

> There seems to be little point worrying about the time needed to branch
> past an eyecatcher at the start of a program, compared to the time used
> by the rest of the program.

Unfortunately that's not true. For high frequency subroutines it can dominate the performance profile. We have customer feedback where the code has been profiled using APA and the hot spots are clearly at the branches over eye-catchers. The reason I'm asking the question is to find out why. The customer suggested we were non re-entrant and saving registers into the instruction stream. Our code is re-entrant.

--
This message and any attachments are intended only for the use of the addressee and may contain information that is privileged and confidential. If the reader of the message is not the intended recipient or an authorized representative of the intended recipient, you are hereby notified that any dissemination of this communication is strictly prohibited. If you have received this communication in error, please notify us immediately by e-mail and delete the message and any attachments from your system.
Re: An explanation for branch performance?
Try this on your hardware and post the results. Uncomment/Comment where applicable.

BENCH    CSECT
BENCH    AMODE 31
         SAVE  (14,12)
         LR    R12,R15
         USING BENCH,R12
         L     R4,=A(1*1000*1000*1000)
         LTR   R4,R4
         J     LOOP
*
LOOP     DS    0D             .LOOP START
*        B     NEXT           .UNCONDITIONAL
         BE    NEXT           .CONDITIONAL
NEXT     JCT   R4,LOOP
         RETURN (14,12)
         YREGS ,
         END

On 29/04/2016 9:35 PM, Windt, W.K.F. van der (Fred) wrote:

>> Here's the code. I wrote a simple program - it tight loops 1 billion
>> times:
>>
>> L        R4,=A(1*1000*1000*1000)
>> LTR      R4,R4
>> J        LOOP
>> *
>> LOOP     DS  0D        .LOOP START
>>          B   NEXT
>> NEXT     JCT R4,LOOP
>>
>> The loop starts with a branch ... I tested it twice - when the CC is
>> matched (branch happens) and when it is not matched (falls through)
>>
>> 1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
>> 2. When the CC is not matched the code falls through, CPU TIME=1.69
>>    seconds - a reduction of 42%
>
> Uhm... I don't see any conditional branch at the start of the loop that
> branches or falls through?
>
> Fred!
Re: An explanation for branch performance?
Try this on your hardware and post the results. Uncomment/Comment where applicable.

BENCH    CSECT
BENCH    AMODE 31
         SAVE  (14,12)
         LR    R12,R15
         USING BENCH,R12
         L     R4,=A(1*1000*1000*1000)
         LTR   R4,R4
         J     LOOP
*
LOOP     DS    0D             .LOOP START
*        B     NEXT           .UNCONDITIONAL
         BE    NEXT           .CONDITIONAL
NEXT     JCT   R4,LOOP
         LR    R15,0
         RETURN (14,12)
         YREGS ,
         END

On 29/04/2016 9:35 PM, Windt, W.K.F. van der (Fred) wrote:

>> Here's the code. I wrote a simple program - it tight loops 1 billion
>> times:
>>
>> L        R4,=A(1*1000*1000*1000)
>> LTR      R4,R4
>> J        LOOP
>> *
>> LOOP     DS  0D        .LOOP START
>>          B   NEXT
>> NEXT     JCT R4,LOOP
>>
>> The loop starts with a branch ... I tested it twice - when the CC is
>> matched (branch happens) and when it is not matched (falls through)
>>
>> 1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
>> 2. When the CC is not matched the code falls through, CPU TIME=1.69
>>    seconds - a reduction of 42%
>
> Uhm... I don't see any conditional branch at the start of the loop that
> branches or falls through?
>
> Fred!
Re: An explanation for branch performance?
>>> Well, the obvious solution is to code the eyecatcher literals before
>>> the entry point. It will be less obvious that the eyecatcher is part
>>> of the program (and not the end of the previous program) but as the
>>> technique becomes more widespread it should become more trusted.
>>
>> IBM has a ton of recoding to do. I've seen this type of thing in a
>> _lot_ of IBM routines.
>
> Indeed! Including SVC routines which show up in our profiling!

How is the sampling done for your profiling? Keep in mind that sampling which is software-based and driven by external interrupts will charge the time spent in the SVC interrupt handler to the beginning of the SVC routine, since the SVC interrupt handler is disabled for external interrupts.

Jim Mulder  z/OS System Test  IBM Corp.  Poughkeepsie, NY
Re: An explanation for branch performance?
Plus it's really not an assembler question. The assembler is a program that turns 'B' into '47F0'. This is a hardware performance question.

Charles

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of David Crayford
Sent: Friday, April 29, 2016 8:23 AM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: An explanation for branch performance?

On 29/04/2016 11:10 PM, Lizette Koehler wrote:

> Maybe the IBM Assembler List might be helpful here?
>
> If you have not joined, use this URL:
> https://listserv.uga.edu/cgi-bin/wa?A0=ASSEMBLER-LIST
>
> Lizette

See my earlier response to Elardus. The assembler list is almost moribund. Everybody posts here with these kinds of questions because the audience is much broader. Most of the old regulars from ASSEMBLER-LIST now converse on linkedin groups.
Re: An explanation for branch performance?
On 29/04/2016 11:10 PM, Lizette Koehler wrote:

> Maybe the IBM Assembler List might be helpful here?
>
> If you have not joined, use this URL:
> https://listserv.uga.edu/cgi-bin/wa?A0=ASSEMBLER-LIST
>
> Lizette

See my earlier response to Elardus. The assembler list is almost moribund. Everybody posts here with these kinds of questions because the audience is much broader. Most of the old regulars from ASSEMBLER-LIST now converse on linkedin groups.

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of David Crayford
Sent: Friday, April 29, 2016 7:55 AM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: An explanation for branch performance?

On 29/04/2016 10:27 PM, Mike Schwab wrote:

> Well, the obvious solution is to code the eyecatcher literals before
> the entry point. It will be less obvious that the eyecatcher is part
> of the program (and not the end of the previous program) but as the
> technique becomes more widespread it should become more trusted.

Thanks! We already know the solution. I'm looking for an answer. I'm a C/C++ coder by trade and Metal/C has a neat FPB control block for the eyecatchers, which is pointed to by an offset just above the entry point.

          ENTRY @@CCN@240
@@CCN@240 AMODE 31
          DC    XL8'00C300C300D50100'  Function Entry Point Marker
          DC    A(@@FPB@4-*+8)         Signed offset to FPB
          DC    XL4''                  Reserved
@@CCN@240 DS    0F
@@LIT@4   LTORG
@@FPB@    LOCTR
@@FPB@4   DS    0F                     Function Property Block
          DC    XL2'CCD5'              Eyecatcher
          DC    BL2'0011'              Saved GPR Mask
          DC    A(@@PFD@@-@@FPB@4)     Signed Offset to Prefix Data
          DC    BL1''                  Flag Set 1
          DC    BL1'1000'              Flag Set 2
          DC    BL1''                  Flag Set 3
          DC    BL1'0001'              Flag Set 4
          DC    XL4''                  Reserved
          DC    XL4''                  Reserved
          DC    AL2(12)
          DC    C'avl_iter_cur'

On Fri, Apr 29, 2016 at 9:13 AM, David Crayford <dcrayf...@gmail.com> wrote:

> On 29/04/2016 10:09 PM, Mike Schwab wrote:
>> The pipeline is optimized for running many instructions in a row. A
>> branch is not recognized until through a good part of the pipeline.
>> Meanwhile the data to be skipped is in the instruction pipeline.
>> Results meet expectations.
>
> So branching over eyecatchers is expected to be x2 slower on a z13 than
> a z114? I was always led to believe that new hardware always ran old
> code faster unless it was doing nasty stuff like storing into the
> instruction stream.

On Fri, Apr 29, 2016 at 7:40 AM, David Crayford <dcrayf...@gmail.com> wrote:

> We're doing some performance work on our assembler code and one of my
> colleagues ran the following test which was surprising. Unconditional
> branching can add significant overhead. I always believed that
> conditional branches were expensive because the branch predictor needed
> to do more work and unconditional branches were easy to predict. Does
> anybody have an explanation for this. Our machine is a z114. It appears
> that it's even worse on a z13.
>
> Here's the code. I wrote a simple program - it tight loops 1 billion
> times:
>
> L        R4,=A(1*1000*1000*1000)
> LTR      R4,R4
> J        LOOP
> *
> LOOP     DS  0D        .LOOP START
>          B   NEXT
> NEXT     JCT R4,LOOP
>
> The loop starts with a branch ... I tested it twice - when the CC is
> matched (branch happens) and when it is not matched (falls through)
>
> 1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
> 2. When the CC is not matched the code falls through, CPU TIME=1.69
>    seconds - a reduction of 42%
Re: An explanation for branch performance?
On 29/04/2016 10:27 PM, Mike Schwab wrote:
> Well, the obvious solution is to code the eyecatcher literals before
> the entry point. It will be less obvious that the eyecatcher is part
> of the program (and not the end of the previous program) but as the
> technique becomes more widespread it should become more trusted.

Thanks! We already know the solution. I'm looking for an answer. I'm a C/C++ coder by trade and Metal/C has a neat FPB control block for the eyecatchers, which is pointed to by an offset just above the entry point.

          ENTRY @@CCN@240
@@CCN@240 AMODE 31
          DC    XL8'00C300C300D50100'   Function Entry Point Marker
          DC    A(@@FPB@4-*+8)          Signed offset to FPB
          DC    XL4''                   Reserved
@@CCN@240 DS    0F

@@LIT@4   LTORG
@@FPB@    LOCTR
@@FPB@4   DS    0F                      Function Property Block
          DC    XL2'CCD5'               Eyecatcher
          DC    BL2'0011'               Saved GPR Mask
          DC    A(@@PFD@@-@@FPB@4)      Signed Offset to Prefix Data
          DC    BL1''                   Flag Set 1
          DC    BL1'1000'               Flag Set 2
          DC    BL1''                   Flag Set 3
          DC    BL1'0001'               Flag Set 4
          DC    XL4''                   Reserved
          DC    XL4''                   Reserved
          DC    AL2(12)
          DC    C'avl_iter_cur'

On Fri, Apr 29, 2016 at 9:13 AM, David Crayford <dcrayf...@gmail.com> wrote:
> On 29/04/2016 10:09 PM, Mike Schwab wrote:
>> The pipeline is optimized for running many instructions in a row. A
>> branch is not recognized until through a good part of the pipeline.
>> Meanwhile the data to be skipped is in the instruction pipeline.
>>
>> Results meet expectations.
>
> So branching over eyecatchers is expected to be x2 slower on a z13
> than a z114? I was always led to believe that new hardware always ran
> old code faster unless it was doing nasty stuff like storing into the
> instruction stream.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: An explanation for branch performance?
Maybe the IBM Assembler List might be helpful here? If you have not joined, use this URL:

https://listserv.uga.edu/cgi-bin/wa?A0=ASSEMBLER-LIST

Lizette

> -----Original Message-----
> From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On
> Behalf Of David Crayford
> Sent: Friday, April 29, 2016 7:55 AM
> To: IBM-MAIN@LISTSERV.UA.EDU
> Subject: Re: An explanation for branch performance?
>
> On 29/04/2016 10:27 PM, Mike Schwab wrote:
> > Well, the obvious solution is to code the eyecatcher literals before
> > the entry point. It will be less obvious that the eyecatcher is part
> > of the program (and not the end of the previous program) but as the
> > technique becomes more widespread it should become more trusted.
>
> Thanks! We already know the solution. I'm looking for an answer.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: An explanation for branch performance?
On 29/04/2016 10:30 PM, John McKown wrote:
> On Fri, Apr 29, 2016 at 9:27 AM, Mike Schwab wrote:
>> Well, the obvious solution is to code the eyecatcher literals before
>> the entry point. It will be less obvious that the eyecatcher is part
>> of the program (and not the end of the previous program) but as the
>> technique becomes more widespread it should become more trusted.
>
> IBM has a ton of recoding to do. I've seen this type of thing in a
> _lot_ of IBM routines.

Indeed! Including SVC routines which show up in our profiling!

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: An explanation for branch performance?
On 29/04/2016 9:35 PM, Windt, W.K.F. van der (Fred) wrote:
>> Here's the code.
>>
>> I wrote a simple program - it tight loops 1 billion times
>>
>>          L     R4,=A(1*1000*1000*1000)
>>          LTR   R4,R4
>>          J     LOOP
>> *
>> LOOP     DS    0D                  .LOOP START
>>          B     NEXT
>> NEXT     JCT   R4,LOOP
>>
>> The loop starts with a branch ... I tested it twice - when the CC is
>> matched (branch happens) and when it is not matched (falls through)
>>
>> 1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
>> 2. When the CC is not matched the code falls through, CPU TIME=1.69
>> seconds - a reduction of 42%
>
> Uhm... I don't see any conditional branch at the start of the loop that
> branches or falls through?
>
> Fred!

Modify that snippet to BE NEXT and see what your results are.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: An explanation for branch performance?
On 29/04/2016 10:34 PM, Joe Testa wrote:
> There seems to be little point worrying about the time needed to branch
> past an eyecatcher at the start of a program, compared to the time used
> by the rest of the program.

Unfortunately that's not true. For high frequency subroutines it can dominate the performance profile. We have customer feedback where the code has been profiled using APA and the hot spots are clearly at the branches over eyecatchers. That's the reason I'm asking why. The customer suggested we were non-reentrant and saving registers into the instruction stream. Our code is re-entrant.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: An explanation for branch performance?
On Fri, Apr 29, 2016 at 9:27 AM, Mike Schwab wrote:

> Well, the obvious solution is to code the eyecatcher literals before
> the entry point. It will be less obvious that the eyecatcher is part
> of the program (and not the end of the previous program) but as the
> technique becomes more widespread it should become more trusted.

IBM has a ton of recoding to do. I've seen this type of thing in a _lot_ of IBM routines.

--
The unfacts, did we have them, are too imprecisely few to warrant our certitude.

Maranatha! <><
John McKown

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: An explanation for branch performance?
There seems to be little point worrying about the time needed to branch past an eyecatcher at the start of a program, compared to the time used by the rest of the program.

From: Mike Schwab
Sent: Friday, April 29, 2016 10:27 AM
Newsgroups: bit.listserv.ibm-main
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: An explanation for branch performance?

Well, the obvious solution is to code the eyecatcher literals before the entry point. It will be less obvious that the eyecatcher is part of the program (and not the end of the previous program) but as the technique becomes more widespread it should become more trusted.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: An explanation for branch performance?
Well, the obvious solution is to code the eyecatcher literals before the entry point. It will be less obvious that the eyecatcher is part of the program (and not the end of the previous program) but as the technique becomes more widespread it should become more trusted.

On Fri, Apr 29, 2016 at 9:13 AM, David Crayford wrote:
> On 29/04/2016 10:09 PM, Mike Schwab wrote:
>> The pipeline is optimized for running many instructions in a row. A
>> branch is not recognized until through a good part of the pipeline.
>> Meanwhile the data to be skipped is in the instruction pipeline.
>>
>> Results meet expectations.
>
> So branching over eyecatchers is expected to be x2 slower on a z13 than
> a z114? I was always led to believe that new hardware always ran old
> code faster unless it was doing nasty stuff like storing into the
> instruction stream.

--
Mike A Schwab, Springfield IL USA
Where do Forest Rangers go to get away from it all?

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: An explanation for branch performance?
On 29/04/2016 10:09 PM, Mike Schwab wrote:
> The pipeline is optimized for running many instructions in a row. A
> branch is not recognized until through a good part of the pipeline.
> Meanwhile the data to be skipped is in the instruction pipeline.
>
> Results meet expectations.

So branching over eyecatchers is expected to be x2 slower on a z13 than a z114? I was always led to believe that new hardware always ran old code faster unless it was doing nasty stuff like storing into the instruction stream.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: An explanation for branch performance?
> Here's the code.
>
> I wrote a simple program - it tight loops 1 billion times
>
>          L     R4,=A(1*1000*1000*1000)
>          LTR   R4,R4
>          J     LOOP
> *
> LOOP     DS    0D                  .LOOP START
>          B     NEXT
> NEXT     JCT   R4,LOOP
>
> The loop starts with a branch ... I tested it twice - when the CC is
> matched (branch happens) and when it is not matched (falls through)
>
> 1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
> 2. When the CC is not matched the code falls through, CPU TIME=1.69
> seconds - a reduction of 42%

Uhm... I don't see any conditional branch at the start of the loop that branches or falls through?

Fred!

--
ATTENTION: The information in this e-mail is confidential and only meant for the intended recipient. If you are not the intended recipient, don't use or disclose it in any way. Please let the sender know and delete the message immediately.
--

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: An explanation for branch performance?
The pipeline is optimized for running many instructions in a row. A branch is not recognized until through a good part of the pipeline. Meanwhile the data to be skipped is in the instruction pipeline.

Results meet expectations.

On Fri, Apr 29, 2016 at 7:40 AM, David Crayford wrote:
> We're doing some performance work on our assembler code and one of my
> colleagues ran the following test which was surprising. Unconditional
> branching can add significant overhead. I always believed that
> conditional branches were expensive because the branch predictor needed
> to do more work and unconditional branches were easy to predict. Does
> anybody have an explanation for this? Our machine is z114. It appears
> that it's even worse on a z13.
>
> Here's the code.
>
> I wrote a simple program - it tight loops 1 billion times
>
>          L     R4,=A(1*1000*1000*1000)
>          LTR   R4,R4
>          J     LOOP
> *
> LOOP     DS    0D                  .LOOP START
>          B     NEXT
> NEXT     JCT   R4,LOOP
>
> The loop starts with a branch ... I tested it twice - when the CC is
> matched (branch happens) and when it is not matched (falls through)
>
> 1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
> 2. When the CC is not matched the code falls through, CPU TIME=1.69
> seconds - a reduction of 42%

--
Mike A Schwab, Springfield IL USA
Where do Forest Rangers go to get away from it all?

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: An explanation for branch performance?
On 29/04/2016 9:08 PM, Vernooij, CP (ITOPT1) - KLM wrote:
> I wonder if it is really a 'problem'. What kind of normal program does
> this amount of branches back to where it just came from, messing up the
> entire pipeline? Is it efficient to optimize a z13 for this kind of
> program? Isn't it better to optimize for the millions of other programs
> that are executed much more often than this one?
>
> Kees.

The issue was branching over eyecatchers in frequently called service routines, which has proved to be a performance bottleneck on z13 hardware (and z114). That was a common idiom for legacy assembler code.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: An explanation for branch performance?
I wonder if it is really a 'problem'. What kind of normal program does this amount of branches back to where it just came from, messing up the entire pipeline? Is it efficient to optimize a z13 for this kind of program? Isn't it better to optimize for the millions of other programs that are executed much more often than this one?

Kees.

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of David Crayford
Sent: 29 April, 2016 14:41
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: An explanation for branch performance?

We're doing some performance work on our assembler code and one of my colleagues ran the following test which was surprising. Unconditional branching can add significant overhead. I always believed that conditional branches were expensive because the branch predictor needed to do more work and unconditional branches were easy to predict. Does anybody have an explanation for this? Our machine is z114. It appears that it's even worse on a z13.

Here's the code.

I wrote a simple program - it tight loops 1 billion times

         L     R4,=A(1*1000*1000*1000)
         LTR   R4,R4
         J     LOOP
*
LOOP     DS    0D                  .LOOP START
         B     NEXT
NEXT     JCT   R4,LOOP

The loop starts with a branch ... I tested it twice - when the CC is matched (branch happens) and when it is not matched (falls through)

1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
2. When the CC is not matched the code falls through, CPU TIME=1.69 seconds - a reduction of 42%

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN

For information, services and offers, please visit our web site: http://www.klm.com. This e-mail and any attachment may contain confidential and privileged material intended for the addressee only. If you are not the addressee, you are notified that no part of the e-mail or any attachment may be disclosed, copied or distributed, and that any other action related to this e-mail or attachment is strictly prohibited, and may be unlawful. If you have received this e-mail by error, please notify the sender immediately by return e-mail, and delete this message. Koninklijke Luchtvaart Maatschappij NV (KLM), its subsidiaries and/or its employees shall not be liable for the incorrect or incomplete transmission of this e-mail or any attachments, nor responsible for any delay in receipt. Koninklijke Luchtvaart Maatschappij N.V. (also known as KLM Royal Dutch Airlines) is registered in Amstelveen, The Netherlands, with registered number 33014286

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: An explanation for branch performance?
On 29/04/2016 8:46 PM, Elardus Engelbrecht wrote:
> David Crayford wrote:
>> We're doing some performance work on our assembler code and one of my
>> colleagues ran the following test which was surprising. Unconditional
>> branching can add significant overhead. I always believed that
>> conditional branches were expensive because the branch predictor
>> needed to do more work and unconditional branches were easy to
>> predict. Does anybody have an explanation for this? Our machine is
>> z114. It appears that it's even worse on a z13.
>
> Hmmm, interesting, but I'm clueless about this one.
>
> Perhaps someone lurking on Assembler-L can help you out? Could you post
> your message there?
>
> Groete / Greetings
> Elardus Engelbrecht

Indeed! But most of the guys in the know hang out here, which is the epicenter of knowledge in our world. I do post on MVS-OE for Unix stuff because some of the IBM developers of USS only hang out there.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: An explanation for branch performance?
David Crayford wrote:
> We're doing some performance work on our assembler code and one of my
> colleagues ran the following test which was surprising. Unconditional
> branching can add significant overhead. I always believed that
> conditional branches were expensive because the branch predictor needed
> to do more work and unconditional branches were easy to predict. Does
> anybody have an explanation for this. Our machine is z114. It appears
> that it's even worse on a z13.

Hmmm, interesting, but I'm clueless about this one.

Perhaps someone lurking on Assembler-L can help you out? Could you post your message there?

Groete / Greetings
Elardus Engelbrecht

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
An explanation for branch performance?
We're doing some performance work on our assembler code and one of my colleagues ran the following test which was surprising. Unconditional branching can add significant overhead. I always believed that conditional branches were expensive because the branch predictor needed to do more work and unconditional branches were easy to predict. Does anybody have an explanation for this? Our machine is z114. It appears that it's even worse on a z13.

Here's the code.

I wrote a simple program - it tight loops 1 billion times

         L     R4,=A(1*1000*1000*1000)
         LTR   R4,R4
         J     LOOP
*
LOOP     DS    0D                  .LOOP START
         B     NEXT
NEXT     JCT   R4,LOOP

The loop starts with a branch ... I tested it twice - when the CC is matched (branch happens) and when it is not matched (falls through)

1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
2. When the CC is not matched the code falls through, CPU TIME=1.69 seconds - a reduction of 42%

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN