The 4-way unroll below was limited by the pick hardware to 2.5c/l, which
was achieved on the K10_2 but not on the K10, which runs at 5c/l. For a 7-way
unroll the pick hardware has a limit of 2.41c/l, but we can get only
2.5c/l, although we get this on both the K10_2 and the K10. Why is this
strange? In the 7-way unroll we have a string of 7 consecutive adc reg to
memory instructions, and the K10 manages the store forwarding for them, but
when we have only 4 of them (for the 4-way unroll) the K10 fails to do the
required store forwarding, and therefore it runs like a dog. I suspect the
store forwarding gets confused by the tighter loop and plays it safe.
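
For reference, the retirement-rate figures quoted further down (7 ops for 1
limb, 13 ops for 2 limbs, 27 ops for 4 limbs) can be reproduced with a rough
sketch like the one here. It assumes the K8/K10 retire at most 3 macro-ops
per cycle, and that mul counts as 2 macro-ops and every other instruction as
1. That costing is inferred only because it matches the quoted totals. The
names retirement_bound and mul1_macro_ops are made up for illustration. It
models retirement only, not the pick hardware or store forwarding, so it
says nothing about the 2.41c/l or 5c/l figures.

```python
from fractions import Fraction

# Assumption: K8/K10 retire at most 3 macro-ops per cycle.
RETIRE_WIDTH = 3


def retirement_bound(macro_ops, limbs):
    """Lower bound in cycles per limb from macro-op retirement alone."""
    return Fraction(macro_ops, RETIRE_WIDTH * limbs)


def mul1_macro_ops(unroll):
    """Macro-op count of the quoted mul_1 loop for a given unroll.

    Assumption: mul is 2 macro-ops, everything else is 1. This reproduces
    the 27 ops quoted for the 4-way unroll.
    """
    per_limb = 1 + 2 + 1 + 1      # mov load, mul, mov ax->reg, mov dx->mem
    body = per_limb * unroll - 1  # the last "mov ax,reg" is dropped
    adcs = unroll                 # one adc reg,mem per limb
    overhead = 4                  # add r12,r12 / sbb r12,r12 / add count / jne
    return body + adcs + overhead


if __name__ == "__main__":
    cases = [
        ("addmul1", 7, 1),
        ("addmul2", 13, 2),
        ("mul1 4-way", mul1_macro_ops(4), 4),
        ("mul1 7-way", mul1_macro_ops(7), 7),
    ]
    for name, ops, limbs in cases:
        bound = float(retirement_bound(ops, limbs))
        print(f"{name}: {ops} ops, {bound:.3f} c/l + loop control")
```

Under these assumptions the n-way mul_1 loop has 6n+3 macro-ops, so the
retirement bound is 2 + 1/n c/l, which is where the "2c/l+epsilon" comes
from.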

Jason

On Monday 24 August 2009 20:44:34 Bill Hart wrote:
> Cool. I had wondered if it might be related to cache or memory
> architecture. I presume that is what you are referring to?
>
> Bill.
>
> 2009/8/24 jason <[email protected]>:
> > On Jul 23, 10:59 pm, Jason Moxham <[email protected]> wrote:
> >> Hi
> >>
> >> I've been doing some preliminary experimentation for mul_basecase on
> >> Core i7 Nehalem, and of course K8 and Core 2.
> >>
> >> For the AMD chips, we are currently bound by the macro-op retirement
> >> rate, and I didn't think we could improve it.
> >> Currently for addmul1 and mul1 we have
> >>
> >> mov 0,r3
> >> mov (src),ax
> >> mul cx
> >> add r1,-1(dst)  // this is a mov for mul1
> >> adc ax,r2
> >> adc dx,r3
> >>
> >> which is 7 ops for 1 limb, which leads to 2.333c/l + loop control,
> >> and for addmul2 we have
> >>
> >> mov 0,r4
> >> mov (src),ax
> >> mul cx
> >> add r1,-1(dst)
> >> adc ax,r2
> >> adc dx,r3
> >> mov 1(src),ax
> >> mul bx
> >> add ax,r2
> >> adc dx,r3
> >> adc 0,r4
> >>
> >> which is 13 ops for 2 limbs, which leads to 2.166c/l + loop control.
> >>
> >> For addmul1 and mul1 we can get a perfect schedule, and with a 4-way
> >> unroll we get 2.5c/l. This is optimal for K8, as add reg,(dst) has a max
> >> throughput of 2.5c. On the K10 we don't have this restriction, and with a
> >> larger unroll and perfect scheduling we can improve things. I've not
> >> tried this approach, as you would have to go to a 7-way unroll to get
> >> anything better than 2.5c/l. For mul1 it is possible to reduce the
> >> instruction count down to 2c/l+epsilon like this
> >>
> >> mov (src),ax
> >> mul cx
> >> mov ax,r8
> >> mov dx,1(dst)
> >>
> >> mov 1(src),ax
> >> mul cx
> >> mov ax,r9
> >> mov dx,2(dst)
> >>
> >> mov 2(src),ax
> >> mul cx
> >> mov ax,r10
> >> mov dx,3(dst)
> >>
> >> mov 3(src),ax
> >> mul cx
> >> #mov ax,r11
> >> mov dx,4(dst)
> >>
> >> add r12,r12
> >>
> >> adc r8,(dst)
> >> adc r9,1(dst)
> >> adc r10,2(dst)
> >> adc ax,3(dst)
> >>
> >> sbb r12,r12
> >>
> >> add 4,count
> >> jne loop
> >>
> >> which is 27 ops for 4 limbs = 2.25c/l for mul_1 on K10, but the best I
> >> could get is 5c/l. It's hardly surprising given how many "rules" the
> >> above breaks.
> >
> > The new AMD Phenom II (we will call this chip K10_2) runs this code at
> > 2.5c/l. This suggests they have not changed the pick hardware but
> > have improved the store forwarding (much more like Intel's). Our
> > mpn_addadd, addsub and subadd all now run at the predicted optimal speed
> > on the new K10_2; same for addlsh1 and some others.
> >
> > Jason
>
> 


