Re: An explanation for branch performance?

2016-04-30 Thread Tom Marchant
On Sat, 30 Apr 2016 10:55:47 +0200, Bernd Oppolzer wrote:

>there are IBM and home grown macros which put their parameter lists
>inline and branch around them

Use the List and Execute forms. The standard forms cause the I-cache to 
be re-read because of the store into the instruction stream.
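
In macro terms, the difference looks roughly like this (a hypothetical WTO 
call for illustration; in reentrant code the MF=L list would first be copied 
to a dynamic area before the execute form touches it):

* Standard form: the parameter list is assembled inline with the
* code and, for some macros, stored into at execution time.
         WTO   'HELLO'                 standard form, inline plist
*
* List/Execute forms: the list sits in a data area, away from the
* instruction stream, and the execute form references it at run time.
HELLOL   WTO   'HELLO',MF=L            list form defines the plist
*        ... elsewhere, in the execution path ...
         WTO   MF=(E,HELLOL)          execute form, no inline plist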

-- 
Tom Marchant

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


An explanation for branch performance?

2016-04-30 Thread Peter Relson
To summarize,

Whether using base-displacement or relative branches, the three test 
programs being discussed are, in effect:

(branch never taken, best)

TEST CSECT 
 LLILF 4,1000*1000*1000 
 LTR   4,4 
NEXT JNP   NEXT1 
NEXT1 JCT   4,NEXT 
      BR    14 
 END   TEST 

and (conditional branch, always taken, worst)

TEST CSECT 
 LLILF 4,1000*1000*1000 
 LTR   4,4 
NEXT JP   NEXT1 
NEXT1 JCT   4,NEXT 
      BR    14 
 END   TEST 

and (unconditional branch, always taken, middle)

TEST CSECT 
 LLILF 4,1000*1000*1000 
 LTR   4,4 
NEXT J   NEXT1 
NEXT1 JCT   4,NEXT 
      BR    14 
 END   TEST  

Peter Relson
z/OS Core Technology Design




Re: An explanation for branch performance?

2016-04-30 Thread Bernd Oppolzer

On 29.04.2016 at 18:59, Jim Mulder wrote:

No. It's the opposite which is why I originally posted. The
unconditional branch is slower and I want to know why.

   The relevant comparison is not conditional branch vs.
unconditional branch.  It is branch not taken vs. branch taken.
Sequential execution is always best.  Branch prediction tries to
mitigate some of the effects of nonsequential execution.

Jim Mulder   z/OS System Test   IBM Corp.  Poughkeepsie,  NY


IMO, the interesting point about this (for ASSEMBLER programmers and 
legacy code) is:


there are IBM and home grown macros which put their parameter lists
inline and branch around them, although there are other possibilities
(separating parameters from the execution path using MF=L, MF=E or
register notation etc.). Such macros should be avoided in tight loops.
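
The inline pattern being described expands to something like this (a made-up 
macro body, for illustration only):

         B     SKP&SYSNDX             branch around the inline data
PRM&SYSNDX DC  A(&ARG)                parameter list in the I-stream
SKP&SYSNDX LA  1,PRM&SYSNDX           point R1 at the inline plist

Every expansion costs a taken branch, and any store into the inline list 
dirties storage that is also being fetched as instructions.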

An example that occurred to me was a LOOP macro of an SP macro package,
where the loop control variable was a packed decimal variable defined in
the instruction stream and branched over. This is even worse, I believe,
because the I-cache is invalidated by stores into (or computations using)
the loop control variable. I changed this when I made the whole package
sensitive to a global variable that signalled "no base register present"
(AKA baseless); the variable is then put in a well-known
DSECT for auto variables, which is defined in the module's startup macro.
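
A minimal sketch of that arrangement (names invented):

AUTODATA DSECT                        module's auto-variable area
LOOPCTR  DS    PL8                    loop control variable, kept out
*                                     of the instruction stream

The LOOP macro then does its packed-decimal arithmetic against LOOPCTR in 
the dynamic area rather than against a constant embedded in the code.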

Kind regards

Bernd



Re: An explanation for branch performance?

2016-04-29 Thread Tony Harminc
On 29 April 2016 at 12:59, Jim Mulder  wrote:
>   The relevant comparison is not conditional branch vs.
> unconditional branch.  It is branch not taken vs. branch taken.
> Sequential execution is always best.  Branch prediction tries to
> mitigate some of the effects of nonsequential execution.

Right. And presumably even an "unconditional" branch that is actually
a branch on condition with a CC mask of 15 can be mispredicted, in
theory. And therefore the instruction fetch stream at the address
following the branch will keep on fetching, even though it fully
expects to switch to the new stream at the branch target address. And
then that first stream will have to be thrown away, which presumably
isn't free.

In the tiny example case, the stream after the branch is not only a
valid instruction (which it might or might not be in the case of
branching over an eyecatcher), but it's another branch, which
presumably gets predicted in its own right, even though it's at the
same address as the first branch's target.

Now what about a truly unconditional branch, i.e. one that doesn't
depend on the condition code? Would BRAS perform better in this regard
than even J, or is the cost of saving stuff in r1 very high? Surely
there is no reason to continue fetching after such an instruction. It
can't not branch, and it can't program check.

Tony H.



Re: An explanation for branch performance?

2016-04-29 Thread Jim Mulder
> >> Good point well made but can you explain why changing a B to a BE
> in a tight loop results in 43% difference?
> > But aren't those two completely different cases (even if it is the
> same instruction)? The first is an unconditional branch, the second 
> one a conditional branch. That probably makes a big difference to 
> the processor. Where I expect the unconditional one to be faster 
> than the conditional one.
> >
> > I assume it is 43% faster than the conditional one? If it is the 
> other way around I will be very surprised as well.
> 
> No. It's the opposite which is why I originally posted. The 
> unconditional branch is slower and I want to know why.

  The relevant comparison is not conditional branch vs.
unconditional branch.  It is branch not taken vs. branch taken.
Sequential execution is always best.  Branch prediction tries to
mitigate some of the effects of nonsequential execution. 

Jim Mulder   z/OS System Test   IBM Corp.  Poughkeepsie,  NY





Re: An explanation for branch performance?

2016-04-29 Thread Windt, W.K.F. van der (Fred)
>> I assume it is 43% faster than the conditional one? If it is the other way 
>> around I will be very surprised as well.
>
> No. It's the opposite which is why I originally posted. The unconditional 
> branch is slower and I want to know why.

That's probably because a branch not taken is always faster than a branch 
taken (even if predicted correctly).

But by now I've lost the start of this discussion: there isn't really a way to 
choose between a B and a BE; you either need a conditional branch or you need an 
unconditional one. So it doesn't seem that relevant that one is faster than the 
other.

And if we are talking about jumping over eye catchers: that is always going to be 
an unconditional branch. Removing the branch altogether by moving the 
eyecatcher or the entry point must be an improvement. But that is a different 
discussion. Am I missing something?
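
For example (illustrative names and layout only), placing the eyecatcher ahead 
of the entry point makes the first executed instruction part of the 
straight-line path:

         DC    C'MYMOD 2016-04-29'    eyecatcher before the entry point
MYMOD    DS    0H                     entry point
         SAVE  (14,12)                first instruction; nothing to
*                                     branch over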

Fred!
--
ATTENTION:
The information in this e-mail is confidential and only meant for the intended 
recipient. If you are not the intended recipient , don't use or disclose it in 
anyway. Please let the sender know and delete the message immediately.
--



Re: An explanation for branch performance?

2016-04-29 Thread Tony Harminc
On 29 April 2016 at 12:06, David Crayford  wrote:
> On 29/04/2016 11:55 PM, Tony Harminc wrote:
>>
>> On 29 April 2016 at 11:50, Charles Mills  wrote:
>>>
>>> What about substituting a branch relative for the branch on base
>>> register? Trivial code change to make.
>>
>> I was about to suggest that too. All the IBM published material I've
>> seen on this suggests that
>
>
> It was a simple test case to exercise our legacy code base, which issues
> non-relative unconditional branches over eye-catchers. It's non-trivial to
> change and test a huge code base, and certainly you wouldn't want to without
> understanding the problem. But there's certainly an issue, as our customers
> have reported.

You quoted me, but snipped out everything I actually said... Which I
think does partly address your question.

Tony H.



Re: An explanation for branch performance?

2016-04-29 Thread David Crayford

On 30/04/2016 12:29 AM, Windt, W.K.F. van der (Fred) wrote:

Good point well made but can you explain why changing a B to a BE in a tight 
loop results in 43% difference?

But aren't those two completely different cases (even if it is the same 
instruction)? The first is an unconditional branch, the second one a 
conditional branch. That probably makes a big difference to the processor. 
Where I expect the unconditional one to be faster than the conditional one.

I assume it is 43% faster than the conditional one? If it is the other way 
around I will be very surprised as well.


No. It's the opposite which is why I originally posted. The 
unconditional branch is slower and I want to know why.



Fred!



Re: An explanation for branch performance?

2016-04-29 Thread Windt, W.K.F. van der (Fred)
>
> Good point well made but can you explain why changing a B to a BE in a tight 
> loop results in 43% difference?

But aren't those two completely different cases (even if it is the same 
instruction)? The first is an unconditional branch, the second one a 
conditional branch. That probably makes a big difference to the processor. 
Where I expect the unconditional one to be faster than the conditional one.

I assume it is 43% faster than the conditional one? If it is the other way 
around I will be very surprised as well.

Fred!



Re: An explanation for branch performance?

2016-04-29 Thread David Crayford

On 30/04/2016 12:23 AM, Windt, W.K.F. van der (Fred) wrote:


On 29 Apr 2016, at 18:10, Peter Relson  wrote:

Since the origin for the starting post apparently lay in branching around
the eyecatcher (which really is not necessarily at all the same as a
branch in a 2 instruction loop), I was surprised that none of the posts
that I glanced at mentioned Instruction-cache misses.

Just because you think something is high frequency does not mean that the
operating system or the machine agrees with you. If the module's first
instruction is not in I-cache, then whether that first instruction is a
branch or anything else, it will show up as a lot hotter than a somewhat
similar instruction that is in the I-cache.

That's an interesting point: does this mean that every instruction at the start 
of a 256-byte boundary will probably appear hotter because it sits at the 
start of a cache line and profilers will attribute the time it takes to load 
the line into cache to this instruction? The other instructions in the cached 
line quietly benefit from this behavior.


That's a good question. Is there any doubt that profilers can be unreliable? 
In a previous post, Jim Mulder gave a good explanation of a case where that 
is certainly true.



You could probably validate this by inserting NOPS in the code. The hotspots 
should shift by the length of the NOP.

And the entry point of a module might be more likely to sit on a 256 byte 
boundary if it is also at the very start of the code of that module...

Very interesting.

Fred!




Re: An explanation for branch performance?

2016-04-29 Thread Windt, W.K.F. van der (Fred)
Sent from my new iPad
> On 29 Apr 2016, at 18:10, Peter Relson  wrote:
>
> Since the origin for the starting post apparently lay in branching around
> the eyecatcher (which really is not necessarily at all the same as a
> branch in a 2 instruction loop), I was surprised that none of the posts
> that I glanced at mentioned Instruction-cache misses.
>
> Just because you think something is high frequency does not mean that the
> operating system or the machine agrees with you. If the module's first
> instruction is not in I-cache, then whether that first instruction is a
> branch or anything else, it will show up as a lot hotter than a somewhat
> similar instruction that is in the I-cache.

That's an interesting point: does this mean that every instruction at the start 
of a 256-byte boundary will probably appear hotter because it sits at the 
start of a cache line and profilers will attribute the time it takes to load 
the line into cache to this instruction? The other instructions in the cached 
line quietly benefit from this behavior.

You could probably validate this by inserting NOPS in the code. The hotspots 
should shift by the length of the NOP.
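
Such a probe might look like this (hypothetical):

         BCR   0,0                    2-byte no-op; add or remove copies
         BCR   0,0                    to slide the code across the
*                                     256-byte line

If the hot spot moves with the cache-line boundary rather than with the 
instruction, it is a cache-miss artifact of the profiler, not a cost of the 
instruction itself.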

And the entry point of a module might be more likely to sit on a 256 byte 
boundary if it is also at the very start of the code of that module...

Very interesting.

Fred!




Re: An explanation for branch performance?

2016-04-29 Thread David Crayford

On 30/04/2016 12:10 AM, Peter Relson wrote:

Since the origin for the starting post apparently lay in branching around
the eyecatcher (which really is not necessarily at all the same as a
branch in a 2 instruction loop), I was surprised that none of the posts
that I glanced at mentioned Instruction-cache misses.


Good point well made but can you explain why changing a B to a BE in a 
tight loop results in 43% difference?



Just because you think something is high frequency does not mean that the
operating system or the machine agrees with you. If the module's first
instruction is not in I-cache, then whether that first instruction is a
branch or anything else, it will show up as a lot hotter than a somewhat
similar instruction that is in the I-cache.


Indeed. But why would there be a slowdown on a z13 when compared to 
older hardware?



Peter Relson
z/OS Core Technology Design




Re: An explanation for branch performance?

2016-04-29 Thread Peter Relson
Since the origin for the starting post apparently lay in branching around 
the eyecatcher (which really is not necessarily at all the same as a 
branch in a 2 instruction loop), I was surprised that none of the posts 
that I glanced at mentioned Instruction-cache misses.

Just because you think something is high frequency does not mean that the 
operating system or the machine agrees with you. If the module's first 
instruction is not in I-cache, then whether that first instruction is a 
branch or anything else, it will show up as a lot hotter than a somewhat 
similar instruction that is in the I-cache. 

Peter Relson
z/OS Core Technology Design




Re: An explanation for branch performance?

2016-04-29 Thread David Crayford

On 29/04/2016 11:55 PM, Tony Harminc wrote:

On 29 April 2016 at 11:50, Charles Mills  wrote:

What about substituting a branch relative for the branch on base register? 
Trivial code change to make.

I was about to suggest that too. All the IBM published material I've
seen on this suggests that


It was a simple test case to exercise our legacy code base, which issues 
non-relative unconditional branches over eye-catchers. It's non-trivial 
to change and test a huge code base, and certainly you wouldn't want to 
without understanding the problem. But there's certainly an issue, as 
our customers have reported.



-- all else being equal  (heh), untaken branches are faster than taken
ones (even if the prediction is correct),
and
-- the penalty for misprediction is lower for relative branches than
register-based ones.

Tony H.



Re: An explanation for branch performance?

2016-04-29 Thread Tony Harminc
On 29 April 2016 at 11:50, Charles Mills  wrote:
> What about substituting a branch relative for the branch on base register? 
> Trivial code change to make.

I was about to suggest that too. All the IBM published material I've
seen on this suggests that

-- all else being equal  (heh), untaken branches are faster than taken
ones (even if the prediction is correct),
and
-- the penalty for misprediction is lower for relative branches than
register-based ones.
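
Concretely, where the target is within relative range, the change is a single 
mnemonic (a sketch, not taken from the code under discussion):

         B     NEXT                   BC 15: base+displacement; needs a
*                                     base register to resolve the target
         J     NEXT                   BRC 15: relative immediate; target
*                                     known from the instruction itself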

Tony H.



Re: An explanation for branch performance?

2016-04-29 Thread Mike Schwab
On Fri, Apr 29, 2016 at 9:30 AM, John McKown
 wrote:
> On Fri, Apr 29, 2016 at 9:27 AM, Mike Schwab 
> wrote:
>
>> Well, the obvious solution is to code the eyecatcher literals before
>> the entry point.  It will be less obvious that the eyecatcher is part
>> of the program (and not the end of the previous program) but as the
>> technique become more widespread it should become more trusted.
>>
>>
> IBM has a ton of recoding to do. I've seen this type of thing in a _lot_
> of IBM routines.
>
> --
> The unfacts, did we have them, are too imprecisely few to warrant our
> certitude.
>
> Maranatha! <><
> John McKown
>

https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe_ID=87642

-- 
Mike A Schwab, Springfield IL USA
Where do Forest Rangers go to get away from it all?



AW: Re: An explanation for branch performance?

2016-04-29 Thread Peter Hunkeler



>>
>>   L R4,=A(1*1000*1000*1000)
>>   LTR   R4,R4
>>   J LOOP
>> *
>> LOOP DS   0D  .LOOP START
>>   B NEXT
>>
>> NEXT JCT   R4,LOOP
>
> The loop starts with a branch ... I tested it twice - when the CC is matched
> (branch happens) and when it is not matched (falls through)
>
> 1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
> 2. When the CC is not matched the code falls through, CPU TIME=1.69
> seconds - a reduction of 42%

> Uhm... I don't see any conditional branch at the start of the loop that 
> branches or falls through?


I'm with Fred here. Also, out of curiosity: the code you posted seems to be 
incomplete. Is it?


--


Peter Hunkeler





Re: An explanation for branch performance?

2016-04-29 Thread David Crayford

On 29/04/2016 11:40 PM, Jim Mulder wrote:

Well, the obvious solution is to code the eyecatcher literals before
the entry point.  It will be less obvious that the eyecatcher is part
of the program (and not the end of the previous program) but as the
technique become more widespread it should become more trusted.



​IBM has a ton of recoding to do. I've seen this type of thing in a

_lot_

of IBM routines.​


Indeed! Including SVC routines which show up in our profiling!

   How is the sampling done for your profiling?  Keep in mind that
sampling which is software-based and driven by external interrupts
will charge the time spent in the SVC interrupt handler to the
beginning of the SVC routine, since the SVC interrupt handler is
disabled for external interrupts.



We use IBM Application Performance Analyzer. Thank you, that's very good 
information.




Jim Mulder   z/OS System Test   IBM Corp.  Poughkeepsie,  NY




Re: An explanation for branch performance?

2016-04-29 Thread Charles Mills
I think all issues of this general type are incredibly difficult to analyze 
reliably because the hardware is so darned complex now. There are so many more 
variables than back in the day when you could say "a branch consumes 'n' 
microseconds" or "'n' microseconds if taken, 'm' microseconds if not."

What about substituting a branch relative for the branch on base register? 
Trivial code change to make.

Charles

-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf 
Of Jim Mulder
Sent: Friday, April 29, 2016 8:41 AM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: An explanation for branch performance?

> >> Well, the obvious solution is to code the eyecatcher literals 
> >> before the entry point.  It will be less obvious that the 
> >> eyecatcher is part of the program (and not the end of the previous 
> >> program) but as the technique become more widespread it should become more 
> >> trusted.
> >>
> >>
> > ​IBM has a ton of recoding to do. I've seen this type of thing in a
_lot_
> > of IBM routines.​
> >
> 
> Indeed! Including SVC routines which show up in our profiling!

  How is the sampling done for your profiling?  Keep in mind that sampling 
which is software-based and driven by external interrupts will charge the time 
spent in the SVC interrupt handler to the beginning of the SVC routine, since 
the SVC interrupt handler is disabled for external interrupts. 



Re: An explanation for branch performance?

2016-04-29 Thread Farley, Peter x23353
This is very interesting.  It explains what I thought was an anomaly in the CA 
TriTune product, which we use here for profiling our in-house code.  Strictly 
enforced standards here require inclusion of automatically customized code at 
the start of every COBOL procedure (main or subroutine) that DISPLAYs program 
name, version, and compile date/time information on the first pass and 
bypasses the DISPLAY on every subsequent pass using "ALTER ... GO TO" logic.  
Since we moved to z13 CECs, the "ALTER ... GO TO" paragraphs have shown up as 
serious hot spots in high-frequency subroutines, where they never did before.

Many thanks for this investigation.

Peter

-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf 
Of David Crayford
Sent: Friday, April 29, 2016 10:48 AM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: An explanation for branch performance?

On 29/04/2016 10:34 PM, Joe Testa wrote:
> There seems to be little point worrying about the time needed to branch past 
> an eyecatcher at the start of a program, compared to the time used by the 
> rest of the program.

Unfortunately that's not true. For high-frequency subroutines it can 
dominate the performance profile. We have customer feedback where the 
code has been profiled using APA and the hot spots are clearly at the 
branches over eye-catchers. The reason I'm asking the question is to find 
out why. The customer suggested we were non-reentrant and saving 
registers into the instruction stream. Our code is re-entrant.
--


This message and any attachments are intended only for the use of the addressee 
and may contain information that is privileged and confidential. If the reader 
of the message is not the intended recipient or an authorized representative of 
the intended recipient, you are hereby notified that any dissemination of this 
communication is strictly prohibited. If you have received this communication 
in error, please notify us immediately by e-mail and delete the message and any 
attachments from your system.




Re: An explanation for branch performance?

2016-04-29 Thread David Crayford
Try this on your hardware and post the results. Uncomment/Comment where 
applicable.


BENCH    CSECT
BENCH    AMODE 31
         SAVE  (14,12)
         LR    R12,R15
         USING BENCH,R12
         L     R4,=A(1*1000*1000*1000)
         LTR   R4,R4
         J     LOOP
*
LOOP     DS    0D               .LOOP START
*        B     NEXT             .UNCONDITIONAL
         BE    NEXT             .CONDITIONAL
NEXT     JCT   R4,LOOP

         RETURN (14,12)
         YREGS ,
         END


On 29/04/2016 9:35 PM, Windt, W.K.F. van der (Fred) wrote:

Here's the code.

I wrote a simple program - it tight loops 1 billion times


   L R4,=A(1*1000*1000*1000)
   LTR   R4,R4
   J LOOP
*
LOOP DS   0D  .LOOP START
   B NEXT

NEXT JCT   R4,LOOP

The loop starts with a branch ... I tested it twice - when the CC is matched
(branch happens) and when it is not matched (falls through)

1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
2. When the CC is not matched the code falls through, CPU TIME=1.69
seconds - a reduction of 42%

Uhm... I don't see any conditional branch at the start of the loop that 
branches or falls through?

Fred!





Re: An explanation for branch performance?

2016-04-29 Thread David Crayford
Try this on your hardware and post the results. Uncomment/Comment where 
applicable.


BENCH    CSECT
BENCH    AMODE 31
         SAVE  (14,12)
         LR    R12,R15
         USING BENCH,R12
         L     R4,=A(1*1000*1000*1000)
         LTR   R4,R4
         J     LOOP
*
LOOP     DS    0D               .LOOP START
*        B     NEXT             .UNCONDITIONAL
         BE    NEXT             .CONDITIONAL
NEXT     JCT   R4,LOOP
         LR    R15,0
         RETURN (14,12)
         YREGS ,
         END


On 29/04/2016 9:35 PM, Windt, W.K.F. van der (Fred) wrote:

Here's the code.

I wrote a simple program - it tight loops 1 billion times


   L R4,=A(1*1000*1000*1000)
   LTR   R4,R4
   J LOOP
*
LOOP DS   0D  .LOOP START
   B NEXT

NEXT JCT   R4,LOOP

The loop starts with a branch ... I tested it twice - when the CC is matched
(branch happens) and when it is not matched (falls through)

1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
2. When the CC is not matched the code falls through, CPU TIME=1.69
seconds - a reduction of 42%

Uhm... I don't see any conditional branch at the start of the loop that 
branches or falls through?

Fred!





Re: An explanation for branch performance?

2016-04-29 Thread Jim Mulder
> >> Well, the obvious solution is to code the eyecatcher literals before
> >> the entry point.  It will be less obvious that the eyecatcher is part
> >> of the program (and not the end of the previous program) but as the
> >> technique become more widespread it should become more trusted.
> >>
> >>
> > ​IBM has a ton of recoding to do. I've seen this type of thing in a 
_lot_
> > of IBM routines.​
> >
> 
> Indeed! Including SVC routines which show up in our profiling!

  How is the sampling done for your profiling?  Keep in mind that
sampling which is software-based and driven by external interrupts
will charge the time spent in the SVC interrupt handler to the 
beginning of the SVC routine, since the SVC interrupt handler is 
disabled for external interrupts. 

Jim Mulder   z/OS System Test   IBM Corp.  Poughkeepsie,  NY




Re: An explanation for branch performance?

2016-04-29 Thread Charles Mills
Plus it's really not an assembler question. The assembler is a program that 
turns 'B' into '47F0'. This is a hardware performance question.

Charles

-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf 
Of David Crayford
Sent: Friday, April 29, 2016 8:23 AM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: An explanation for branch performance?

On 29/04/2016 11:10 PM, Lizette Koehler wrote:
> Maybe the IBM Assembler List might be helpful here?
>
> If you have not joined, use this URL:
>   https://listserv.uga.edu/cgi-bin/wa?A0=ASSEMBLER-LIST
>
> Lizette

See my earlier response to Elardus. The assembler list is almost moribund. 
Everybody posts here with these kinds of questions because the audience is much 
broader. Most of the old regulars from ASSEMBLER-LIST now converse on linkedin 
groups.



Re: An explanation for branch performance?

2016-04-29 Thread David Crayford

On 29/04/2016 11:10 PM, Lizette Koehler wrote:

Maybe the IBM Assembler List might be helpful here?

If you have not joined, use this URL:
https://listserv.uga.edu/cgi-bin/wa?A0=ASSEMBLER-LIST

Lizette


See my earlier response to Elardus. The assembler list is almost 
moribund. Everybody posts here with these kinds of questions because the 
audience is much broader. Most of the old regulars from ASSEMBLER-LIST 
now converse on linkedin groups.



-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On
Behalf Of David Crayford
Sent: Friday, April 29, 2016 7:55 AM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: An explanation for branch performance?

On 29/04/2016 10:27 PM, Mike Schwab wrote:

Well, the obvious solution is to code the eyecatcher literals before
the entry point.  It will be less obvious that the eyecatcher is part
of the program (and not the end of the previous program) but as the
technique become more widespread it should become more trusted.

Thanks! We already know the solution. I'm looking for an answer. I'm a C/C++
coder by trade and Metal/C has a neat FPB control block for the eyecatchers
which are pointed to by an offset just above the entry point.

          ENTRY @@CCN@240
@@CCN@240 AMODE 31
          DC    XL8'00C300C300D50100'  Function Entry Point Marker
          DC    A(@@FPB@4-*+8)         Signed offset to FPB
          DC    XL4''                  Reserved
@@CCN@240 DS    0F

@@LIT@4   LTORG
@@FPB@    LOCTR
@@FPB@4   DS    0F                     Function Property Block
          DC    XL2'CCD5'              Eyecatcher
          DC    BL2'0011'              Saved GPR Mask
          DC    A(@@PFD@@-@@FPB@4)     Signed Offset to Prefix Data
          DC    BL1''                  Flag Set 1
          DC    BL1'1000'              Flag Set 2
          DC    BL1''                  Flag Set 3
          DC    BL1'0001'              Flag Set 4
          DC    XL4''                  Reserved
          DC    XL4''                  Reserved
          DC    AL2(12)
          DC    C'avl_iter_cur'


On Fri, Apr 29, 2016 at 9:13 AM, David Crayford <dcrayf...@gmail.com> wrote:

On 29/04/2016 10:09 PM, Mike Schwab wrote:

The pipeline is optimized for running many instructions in a row.  A
branch is not recognized until through a good part of the pipeline.
Meanwhile the data to be skipped is in the instruction pipeline.

Results meet expectations.

So branching over eyecatchers is expected to be x2 slower on a z13
than a z114? I was always led to believe that new hardware always
ran old code faster unless it was doing nasty stuff like storing into
the instruction stream.



On Fri, Apr 29, 2016 at 7:40 AM, David Crayford
<dcrayf...@gmail.com>
wrote:

We're doing some performance work on our assembler code and one of
my colleagues ran the following test which was surprising.
Unconditional branching can add significant overhead. I always
believed that conditional branches were expensive because the
branch predictor needed to do more work and unconditional branches
were easy to predict. Does anybody have an explanation for this?
Our machine is z114. It appears that it's even worse on a z13.

Here's the code.

I wrote a simple program - it tight loops 1 billion times


L R4,=A(1*1000*1000*1000)
LTR   R4,R4
J LOOP
*
LOOP DS   0D  .LOOP START
B NEXT

NEXT JCT   R4,LOOP

The loop starts with a branch ... I tested it twice - when the CC
is matched (branch happens) and when it is not matched (falls
through)

1. When the CC is matched and branching happens, CPU TIME=2.94
seconds 2. When the CC is not matched the code falls through, CPU
TIME=1.69 seconds
- a reduction of 42%






Re: An explanation for branch performance?

2016-04-29 Thread David Crayford

On 29/04/2016 10:27 PM, Mike Schwab wrote:

Well, the obvious solution is to code the eyecatcher literals before
the entry point.  It will be less obvious that the eyecatcher is part
of the program (and not the end of the previous program) but as the
technique become more widespread it should become more trusted.


Thanks! We already know the solution. I'm looking for an answer. I'm a 
C/C++ coder by trade and Metal/C has a neat FPB control block for the 
eyecatchers which are pointed to by an offset just above the entry point.


          ENTRY @@CCN@240
@@CCN@240 AMODE 31
          DC    XL8'00C300C300D50100'  Function Entry Point Marker
          DC    A(@@FPB@4-*+8)         Signed offset to FPB
          DC    XL4''                  Reserved
@@CCN@240 DS    0F

@@LIT@4   LTORG
@@FPB@    LOCTR
@@FPB@4   DS    0F                     Function Property Block
          DC    XL2'CCD5'              Eyecatcher
          DC    BL2'0011'              Saved GPR Mask
          DC    A(@@PFD@@-@@FPB@4)     Signed Offset to Prefix Data
          DC    BL1''                  Flag Set 1
          DC    BL1'1000'              Flag Set 2
          DC    BL1''                  Flag Set 3
          DC    BL1'0001'              Flag Set 4
          DC    XL4''                  Reserved
          DC    XL4''                  Reserved
          DC    AL2(12)
          DC    C'avl_iter_cur'


On Fri, Apr 29, 2016 at 9:13 AM, David Crayford  wrote:

On 29/04/2016 10:09 PM, Mike Schwab wrote:

The pipeline is optimized for running many instructions in a row.  A
branch is not recognized until through a good part of the pipeline.
Meanwhile the data to be skipped is in the instruction pipeline.

Results meet expectations.


So branching over eyecatchers is expected to be x2 slower on a z13 than a
z114? I was always led to believe that new hardware always ran old code
faster unless it was doing nasty stuff like storing into the instruction
stream.



On Fri, Apr 29, 2016 at 7:40 AM, David Crayford 
wrote:

We're doing some performance work on our assembler code and one of my
colleagues ran the following test which was surprising. Unconditional
branching can add significant overhead. I always believed that
conditional
branches were expensive because the branch predictor needed to do more
work
and unconditional branches were easy to predict. Does anybody have an
explanation for this? Our machine is z114. It appears that it's even
worse
on a z13.

Here's the code.

I wrote a simple program - it tight loops 1 billion times


   L R4,=A(1*1000*1000*1000)
   LTR   R4,R4
   J LOOP
*
LOOP DS   0D  .LOOP START
   B NEXT

NEXT JCT   R4,LOOP

The loop starts with a branch ... I tested it twice - when the CC is
matched
(branch happens) and when it is not matched (falls through)

1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
2. When the CC is not matched the code falls through, CPU TIME=1.69
seconds
- a reduction of 42%












Re: An explanation for branch performance?

2016-04-29 Thread Lizette Koehler
Maybe the IBM Assembler List might be helpful here?

If you have not joined, use this URL:
https://listserv.uga.edu/cgi-bin/wa?A0=ASSEMBLER-LIST

Lizette

> -Original Message-
> From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On
> Behalf Of David Crayford
> Sent: Friday, April 29, 2016 7:55 AM
> To: IBM-MAIN@LISTSERV.UA.EDU
> Subject: Re: An explanation for branch performance?
> 
> On 29/04/2016 10:27 PM, Mike Schwab wrote:
> > Well, the obvious solution is to code the eyecatcher literals before
> > the entry point.  It will be less obvious that the eyecatcher is part
> > of the program (and not the end of the previous program) but as the
> > technique become more widespread it should become more trusted.
> 
> Thanks! We already know the solution. I'm looking for an answer. I'm a C/C++
> coder by trade and Metal/C has a neat FPB control block for the eyecatchers
> which are pointed to by an offset just above the entry point.
> 
>           ENTRY @@CCN@240
> @@CCN@240 AMODE 31
>           DC    XL8'00C300C300D50100'  Function Entry Point Marker
>           DC    A(@@FPB@4-*+8)         Signed offset to FPB
>           DC    XL4''                  Reserved
> @@CCN@240 DS    0F
> 
> @@LIT@4   LTORG
> @@FPB@    LOCTR
> @@FPB@4   DS    0F                     Function Property Block
>           DC    XL2'CCD5'              Eyecatcher
>           DC    BL2'0011'              Saved GPR Mask
>           DC    A(@@PFD@@-@@FPB@4)     Signed Offset to Prefix Data
>           DC    BL1''                  Flag Set 1
>           DC    BL1'1000'              Flag Set 2
>           DC    BL1''                  Flag Set 3
>           DC    BL1'0001'              Flag Set 4
>           DC    XL4''                  Reserved
>           DC    XL4''                  Reserved
>           DC    AL2(12)
>           DC    C'avl_iter_cur'
> 
> > On Fri, Apr 29, 2016 at 9:13 AM, David Crayford <dcrayf...@gmail.com> wrote:
> >> On 29/04/2016 10:09 PM, Mike Schwab wrote:
> >>> The pipeline is optimized for running many instructions in a row.  A
> >>> branch is not recognized until through a good part of the pipeline.
> >>> Meanwhile the data to be skipped is in the instruction pipeline.
> >>>
> >>> Results meet expectations.
> >>
> >> So branching over eyecatchers is expected to be x2 slower on a z13
> >> than a z114? I was always led to believe that new hardware always
> >> ran old code faster unless it was doing nasty stuff like storing into
> >> the instruction stream.
> >>
> >>
> >>> On Fri, Apr 29, 2016 at 7:40 AM, David Crayford
> >>> <dcrayf...@gmail.com>
> >>> wrote:
> >>>> We're doing some performance work on our assembler code and one of
> >>>> my colleagues ran the following test which was surprising.
> >>>> Unconditional branching can add significant overhead. I always
> >>>> believed that conditional branches were expensive because the
> >>>> branch predictor needed to do more work and unconditional branches
> >>>> were easy to predict. Does anybody have an explanation for this?
> >>>> Our machine is z114. It appears that it's even worse on a z13.
> >>>>
> >>>> Here's the code.
> >>>>
> >>>> I wrote a simple program - it tight loops 1 billion times
> >>>>
> >>>>
> >>>>L R4,=A(1*1000*1000*1000)
> >>>>LTR   R4,R4
> >>>>J LOOP
> >>>> *
> >>>> LOOP DS   0D  .LOOP START
> >>>>B NEXT
> >>>>
> >>>> NEXT JCT   R4,LOOP
> >>>>
> >>>> The loop starts with a branch ... I tested it twice - when the CC
> >>>> is matched (branch happens) and when it is not matched (falls
> >>>> through)
> >>>>
> >>>> 1. When the CC is matched and branching happens, CPU TIME=2.94
> >>>> seconds 2. When the CC is not matched the code falls through, CPU
> >>>> TIME=1.69 seconds
> >>>> - a reduction of 42%
> >>>>



Re: An explanation for branch performance?

2016-04-29 Thread David Crayford

On 29/04/2016 10:30 PM, John McKown wrote:

On Fri, Apr 29, 2016 at 9:27 AM, Mike Schwab 
wrote:


Well, the obvious solution is to code the eyecatcher literals before
the entry point.  It will be less obvious that the eyecatcher is part
of the program (and not the end of the previous program) but as the
technique become more widespread it should become more trusted.



IBM has a ton of recoding to do. I've seen this type of thing in a _lot_
of IBM routines.



Indeed! Including SVC routines which show up in our profiling!







Re: An explanation for branch performance?

2016-04-29 Thread David Crayford

On 29/04/2016 9:35 PM, Windt, W.K.F. van der (Fred) wrote:

Here's the code.

I wrote a simple program - it tight loops 1 billion times


   L R4,=A(1*1000*1000*1000)
   LTR   R4,R4
   J LOOP
*
LOOP DS   0D  .LOOP START
   B NEXT

NEXT JCT   R4,LOOP

The loop starts with a branch ... I tested it twice - when the CC is matched
(branch happens) and when it is not matched (falls through)

1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
2. When the CC is not matched the code falls through, CPU TIME=1.69
seconds - a reduction of 42%

Uhm... I don't see any conditional branch at the start of the loop that 
branches or falls through?


Modify that snippet to BE NEXT and see what your results are.


Fred!

--
ATTENTION:
The information in this e-mail is confidential and only meant for the intended 
recipient. If you are not the intended recipient , don't use or disclose it in 
anyway. Please let the sender know and delete the message immediately.
--






Re: An explanation for branch performance?

2016-04-29 Thread David Crayford

On 29/04/2016 10:34 PM, Joe Testa wrote:

There seems to be little point worrying about the time needed to branch past an 
eyecatcher at the start of a program, compared to the time used by the rest of 
the program.


Unfortunately that's not true. For high-frequency subroutines it can 
dominate the performance profile. We have customer feedback where the 
code has been profiled using APA and the hot spots are clearly at the 
branches over eye-catchers. The reason I'm asking the question is to 
find out why. The customer suggested we were non-reentrant and saving 
registers into the instruction stream. Our code is re-entrant.




From: Mike Schwab
Sent: Friday, April 29, 2016 10:27 AM
Newsgroups: bit.listserv.ibm-main
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: An explanation for branch performance?

Well, the obvious solution is to code the eyecatcher literals before
the entry point.  It will be less obvious that the eyecatcher is part
of the program (and not the end of the previous program) but as the
technique become more widespread it should become more trusted.

On Fri, Apr 29, 2016 at 9:13 AM, David Crayford <dcrayf...@gmail.com> wrote:

On 29/04/2016 10:09 PM, Mike Schwab wrote:

The pipeline is optimized for running many instructions in a row.  A
branch is not recognized until through a good part of the pipeline.
Meanwhile the data to be skipped is in the instruction pipeline.

Results meet expectations.


So branching over eyecatchers is expected to be x2 slower on a z13 than a
z114? I was always led to believe that new hardware always ran old code
faster unless it was doing nasty stuff like storing into the instruction
stream.



On Fri, Apr 29, 2016 at 7:40 AM, David Crayford <dcrayf...@gmail.com>
wrote:

We're doing some performance work on our assembler code and one of my
colleagues ran the following test which was surprising. Unconditional
branching can add significant overhead. I always believed that
conditional
branches were expensive because the branch predictor needed to do more
work
and unconditional branches were easy to predict. Does anybody have an
explanation for this? Our machine is z114. It appears that it's even
worse
on a z13.

Here's the code.

I wrote a simple program - it tight loops 1 billion times


   L R4,=A(1*1000*1000*1000)
   LTR   R4,R4
   J LOOP
*
LOOP DS   0D  .LOOP START
   B NEXT

NEXT JCT   R4,LOOP

The loop starts with a branch ... I tested it twice - when the CC is
matched
(branch happens) and when it is not matched (falls through)

1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
2. When the CC is not matched the code falls through, CPU TIME=1.69
seconds
- a reduction of 42%












Re: An explanation for branch performance?

2016-04-29 Thread John McKown
On Fri, Apr 29, 2016 at 9:27 AM, Mike Schwab 
wrote:

> Well, the obvious solution is to code the eyecatcher literals before
> the entry point.  It will be less obvious that the eyecatcher is part
> of the program (and not the end of the previous program) but as the
> technique become more widespread it should become more trusted.
>
>
IBM has a ton of recoding to do. I've seen this type of thing in a _lot_
of IBM routines.



-- 
The unfacts, did we have them, are too imprecisely few to warrant our
certitude.

Maranatha! <><
John McKown



Re: An explanation for branch performance?

2016-04-29 Thread Joe Testa
There seems to be little point worrying about the time needed to branch past an 
eyecatcher at the start of a program, compared to the time used by the rest of 
the program.


From: Mike Schwab 
Sent: Friday, April 29, 2016 10:27 AM
Newsgroups: bit.listserv.ibm-main
To: IBM-MAIN@LISTSERV.UA.EDU 
Subject: Re: An explanation for branch performance?

Well, the obvious solution is to code the eyecatcher literals before
the entry point.  It will be less obvious that the eyecatcher is part
of the program (and not the end of the previous program) but as the
technique become more widespread it should become more trusted.

On Fri, Apr 29, 2016 at 9:13 AM, David Crayford <dcrayf...@gmail.com> wrote:
> On 29/04/2016 10:09 PM, Mike Schwab wrote:
>>
>> The pipeline is optimized for running many instructions in a row.  A
>> branch is not recognized until through a good part of the pipeline.
>> Meanwhile the data to be skipped is in the instruction pipeline.
>>
>> Results meet expectations.
>
>
> So branching over eyecatchers is expected to be x2 slower on a z13 than a
> z114? I was always led to believe that new hardware always ran old code
> faster unless it was doing nasty stuff like storing into the instruction
> stream.
>
>
>>
>> On Fri, Apr 29, 2016 at 7:40 AM, David Crayford <dcrayf...@gmail.com>
>> wrote:
>>>
>>> We're doing some performance work on our assembler code and one of my
>>> colleagues ran the following test which was surprising. Unconditional
>>> branching can add significant overhead. I always believed that
>>> conditional
>>> branches were expensive because the branch predictor needed to do more
>>> work
>>> and unconditional branches were easy to predict. Does anybody have an
>>> explanation for this? Our machine is z114. It appears that it's even
>>> worse
>>> on a z13.
>>>
>>> Here's the code.
>>>
>>> I wrote a simple program - it tight loops 1 billion times
>>>
>>>
>>>   L R4,=A(1*1000*1000*1000)
>>>   LTR   R4,R4
>>>   J LOOP
>>> *
>>> LOOP DS   0D  .LOOP START
>>>   B NEXT
>>>
>>> NEXT JCT   R4,LOOP
>>>
>>> The loop starts with a branch ... I tested it twice - when the CC is
>>> matched
>>> (branch happens) and when it is not matched (falls through)
>>>
>>> 1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
>>> 2. When the CC is not matched the code falls through, CPU TIME=1.69
>>> seconds
>>> - a reduction of 42%
>>>
>>
>>
>>
>



-- 
Mike A Schwab, Springfield IL USA
Where do Forest Rangers go to get away from it all?




Re: An explanation for branch performance?

2016-04-29 Thread Mike Schwab
Well, the obvious solution is to code the eyecatcher literals before
the entry point.  It will be less obvious that the eyecatcher is part
of the program (and not the end of the previous program) but as the
technique become more widespread it should become more trusted.

On Fri, Apr 29, 2016 at 9:13 AM, David Crayford  wrote:
> On 29/04/2016 10:09 PM, Mike Schwab wrote:
>>
>> The pipeline is optimized for running many instructions in a row.  A
>> branch is not recognized until through a good part of the pipeline.
>> Meanwhile the data to be skipped is in the instruction pipeline.
>>
>> Results meet expectations.
>
>
> So branching over eyecatchers is expected to be x2 slower on a z13 than a
> z114? I was always led to believe that new hardware always ran old code
> faster unless it was doing nasty stuff like storing into the instruction
> stream.
>
>
>>
>> On Fri, Apr 29, 2016 at 7:40 AM, David Crayford 
>> wrote:
>>>
>>> We're doing some performance work on our assembler code and one of my
>>> colleagues ran the following test which was surprising. Unconditional
>>> branching can add significant overhead. I always believed that
>>> conditional
>>> branches were expensive because the branch predictor needed to do more
>>> work
>>> and unconditional branches were easy to predict. Does anybody have an
>>> explanation for this? Our machine is z114. It appears that it's even
>>> worse
>>> on a z13.
>>>
>>> Here's the code.
>>>
>>> I wrote a simple program - it tight loops 1 billion times
>>>
>>>
>>>   L R4,=A(1*1000*1000*1000)
>>>   LTR   R4,R4
>>>   J LOOP
>>> *
>>> LOOP DS   0D  .LOOP START
>>>   B NEXT
>>>
>>> NEXT JCT   R4,LOOP
>>>
>>> The loop starts with a branch ... I tested it twice - when the CC is
>>> matched
>>> (branch happens) and when it is not matched (falls through)
>>>
>>> 1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
>>> 2. When the CC is not matched the code falls through, CPU TIME=1.69
>>> seconds
>>> - a reduction of 42%
>>>
>>
>>
>>
>



-- 
Mike A Schwab, Springfield IL USA
Where do Forest Rangers go to get away from it all?



Re: An explanation for branch performance?

2016-04-29 Thread David Crayford

On 29/04/2016 10:09 PM, Mike Schwab wrote:

The pipeline is optimized for running many instructions in a row.  A
branch is not recognized until through a good part of the pipeline.
Meanwhile the data to be skipped is in the instruction pipeline.

Results meet expectations.


So branching over eyecatchers is expected to be x2 slower on a z13 than 
a z114? I was always led to believe that new hardware always ran old 
code faster unless it was doing nasty stuff like storing into the 
instruction stream.




On Fri, Apr 29, 2016 at 7:40 AM, David Crayford  wrote:

We're doing some performance work on our assembler code and one of my
colleagues ran the following test which was surprising. Unconditional
branching can add significant overhead. I always believed that conditional
branches were expensive because the branch predictor needed to do more work
and unconditional branches were easy to predict. Does anybody have an
explanation for this? Our machine is z114. It appears that it's even worse
on a z13.

Here's the code.

I wrote a simple program - it tight loops 1 billion times


  L R4,=A(1*1000*1000*1000)
  LTR   R4,R4
  J LOOP
*
LOOP DS   0D  .LOOP START
  B NEXT

NEXT JCT   R4,LOOP

The loop starts with a branch ... I tested it twice - when the CC is matched
(branch happens) and when it is not matched (falls through)

1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
2. When the CC is not matched the code falls through, CPU TIME=1.69 seconds
- a reduction of 42%








Re: An explanation for branch performance?

2016-04-29 Thread Windt, W.K.F. van der (Fred)
> Here's the code.
> 
> I wrote a simple program - it tight loops 1 billion times
> 
> 
>   L R4,=A(1*1000*1000*1000)
>   LTR   R4,R4
>   J LOOP
> *
> LOOP DS   0D  .LOOP START
>   B NEXT
> 
> NEXT JCT   R4,LOOP
> 
> The loop starts with a branch ... I tested it twice - when the CC is matched
> (branch happens) and when it is not matched (falls through)
> 
> 1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
> 2. When the CC is not matched the code falls through, CPU TIME=1.69
> seconds - a reduction of 42%

Uhm... I don't see any conditional branch at the start of the loop that 
branches or falls through?

Fred!





Re: An explanation for branch performance?

2016-04-29 Thread Mike Schwab
The pipeline is optimized for running many instructions in a row.  A
branch is not recognized until through a good part of the pipeline.
Meanwhile the data to be skipped is in the instruction pipeline.

Results meet expectations.

On Fri, Apr 29, 2016 at 7:40 AM, David Crayford  wrote:
> We're doing some performance work on our assembler code and one of my
> colleagues ran the following test which was surprising. Unconditional
> branching can add significant overhead. I always believed that conditional
> branches were expensive because the branch predictor needed to do more work
> and unconditional branches were easy to predict. Does anybody have an
> explanation for this? Our machine is z114. It appears that it's even worse
> on a z13.
>
> Here's the code.
>
> I wrote a simple program - it tight loops 1 billion times
>
>
>  L R4,=A(1*1000*1000*1000)
>  LTR   R4,R4
>  J LOOP
> *
> LOOP DS   0D  .LOOP START
>  B NEXT
>
> NEXT JCT   R4,LOOP
>
> The loop starts with a branch ... I tested it twice - when the CC is matched
> (branch happens) and when it is not matched (falls through)
>
> 1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
> 2. When the CC is not matched the code falls through, CPU TIME=1.69 seconds
> - a reduction of 42%
>



-- 
Mike A Schwab, Springfield IL USA
Where do Forest Rangers go to get away from it all?



Re: An explanation for branch performance?

2016-04-29 Thread David Crayford

On 29/04/2016 9:08 PM, Vernooij, CP (ITOPT1) - KLM wrote:

I wonder if it is really a 'problem'. What kind of normal program does this 
many branches back to where it just came from, messing up the entire 
pipeline? Is it efficient to optimize a z13 for this kind of program? Isn't it 
better to optimize for the millions of other programs that are executed much 
more often than this one?


The issue was branching over eyecatchers in frequently called service 
routines, which has proved to be a performance bottleneck on z13 hardware 
(and z114). That was a common idiom in legacy assembler code.



Kees.

-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf 
Of David Crayford
Sent: 29 April, 2016 14:41
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: An explanation for branch performance?

We're doing some performance work on our assembler code and one of my
colleagues ran the following test which was surprising. Unconditional
branching can add significant overhead. I always believed that
conditional branches were expensive because the branch predictor needed
to do more work and unconditional branches were easy to predict. Does
anybody have an explanation for this? Our machine is z114. It appears 
that it's even worse on a z13.

Here's the code.

I wrote a simple program - it tight loops 1 billion times


   L R4,=A(1*1000*1000*1000)
   LTR   R4,R4
   J LOOP
*
LOOP DS   0D  .LOOP START
   B NEXT

NEXT JCT   R4,LOOP

The loop starts with a branch ... I tested it twice - when the CC is
matched (branch happens) and when it is not matched (falls through)

1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
2. When the CC is not matched the code falls through, CPU TIME=1.69
seconds - a reduction of 42%


For information, services and offers, please visit our web site: 
http://www.klm.com. This e-mail and any attachment may contain confidential and 
privileged material intended for the addressee only. If you are not the 
addressee, you are notified that no part of the e-mail or any attachment may be 
disclosed, copied or distributed, and that any other action related to this 
e-mail or attachment is strictly prohibited, and may be unlawful. If you have 
received this e-mail by error, please notify the sender immediately by return 
e-mail, and delete this message.

Koninklijke Luchtvaart Maatschappij NV (KLM), its subsidiaries and/or its 
employees shall not be liable for the incorrect or incomplete transmission of 
this e-mail or any attachments, nor responsible for any delay in receipt.
Koninklijke Luchtvaart Maatschappij N.V. (also known as KLM Royal Dutch 
Airlines) is registered in Amstelveen, The Netherlands, with registered number 
33014286








Re: An explanation for branch performance?

2016-04-29 Thread Vernooij, CP (ITOPT1) - KLM
I wonder if it is really a 'problem'. What kind of normal program does this 
many branches back to where it just came from, messing up the entire 
pipeline? Is it efficient to optimize a z13 for this kind of program? Isn't it 
better to optimize for the millions of other programs that are executed much 
more often than this one?

Kees.

-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf 
Of David Crayford
Sent: 29 April, 2016 14:41
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: An explanation for branch performance?

We're doing some performance work on our assembler code and one of my 
colleagues ran the following test which was surprising. Unconditional 
branching can add significant overhead. I always believed that 
conditional branches were expensive because the branch predictor needed 
to do more work and unconditional branches were easy to predict. Does 
anybody have an explanation for this? Our machine is z114. It appears 
that it's even worse on a z13.

Here's the code.

I wrote a simple program - it tight loops 1 billion times


  L R4,=A(1*1000*1000*1000)
  LTR   R4,R4
  J LOOP
*
LOOP DS   0D  .LOOP START
  B NEXT

NEXT JCT   R4,LOOP

The loop starts with a branch ... I tested it twice - when the CC is 
matched (branch happens) and when it is not matched (falls through)

1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
2. When the CC is not matched the code falls through, CPU TIME=1.69 
seconds - a reduction of 42%








Re: An explanation for branch performance?

2016-04-29 Thread David Crayford

On 29/04/2016 8:46 PM, Elardus Engelbrecht wrote:

David Crayford wrote:


We're doing some performance work on our assembler code and one of my 
colleagues ran the following test, which was surprising. Unconditional branching 
can add significant overhead. I always believed that conditional branches were 
expensive because the branch predictor needed to do more work, and that 
unconditional branches were easy to predict. Does anybody have an explanation 
for this? Our machine is a z114. It appears that it's even worse on a z13.

Hmmm, interesting, but I'm clueless about this one.

Perhaps someone lurking on Assembler-L can help you out? Could you post your 
message there?


Indeed! But most of the guys in the know hang out here, which is the 
epicenter of knowledge in our world. I do post on MVS-OE for Unix stuff 
because some of the IBM developers of USS only hang out there.




Groete / Greetings
Elardus Engelbrecht



Re: An explanation for branch performance?

2016-04-29 Thread Elardus Engelbrecht
David Crayford wrote:

>We're doing some performance work on our assembler code and one of my 
>colleagues ran the following test, which was surprising. Unconditional 
>branching can add significant overhead. I always believed that conditional 
>branches were expensive because the branch predictor needed to do more work, 
>and that unconditional branches were easy to predict. Does anybody have an 
>explanation for this? Our machine is a z114. It appears that it's even worse 
>on a z13.

Hmmm, interesting, but I'm clueless about this one.

Perhaps someone lurking on Assembler-L can help you out? Could you post your 
message there?

Groete / Greetings
Elardus Engelbrecht



An explanation for branch performance?

2016-04-29 Thread David Crayford
We're doing some performance work on our assembler code and one of my 
colleagues ran the following test, which was surprising. Unconditional 
branching can add significant overhead. I always believed that 
conditional branches were expensive because the branch predictor needed 
to do more work, and that unconditional branches were easy to predict. Does 
anybody have an explanation for this? Our machine is a z114. It appears 
that it's even worse on a z13.


Here's the code.

I wrote a simple program: a tight loop that runs one billion times


 L R4,=A(1*1000*1000*1000)
 LTR   R4,R4
 J LOOP
*
LOOP DS   0D  .LOOP START
 B NEXT

NEXT JCT   R4,LOOP

The loop starts with a branch ... I tested it twice: when the CC is 
matched (the branch is taken) and when it is not matched (execution falls through)


1. When the CC is matched and the branch is taken: CPU TIME=2.94 seconds
2. When the CC is not matched and execution falls through: CPU TIME=1.69 
seconds - a reduction of 42%

