Cool. I had wondered if it might be related to cache or memory
architecture. I presume that is what you are referring to?

Bill.

2009/8/24 jason <[email protected]>:
>
> On Jul 23, 10:59 pm, Jason Moxham <[email protected]> wrote:
>> Hi
>>
>> I've been doing some preliminary experimentation for mul_basecase on Core i7
>> Nehalem, and of course K8 and Core2.
>>
>> For the AMD chips, we are currently bound by the macro-op retirement rate,
>> and I didn't think we could improve on it.
>> Currently for addmul1 and mul1 we have
>>
>> mov 0,r3
>> mov (src),ax
>> mul cx
>> add r1,-1(dst)  // this is a mov for mul1
>> adc ax,r2
>> adc dx,r3
>>
>> which is 7 ops for 1 limb, which leads to 2.333 c/l + loop control
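
For reference, here is a minimal C sketch (not MPIR's actual code) of what one
limb of mul_1 and addmul_1 has to compute; the asm above is this recurrence
hand-scheduled, with r1/r2/r3 acting as a rotating carry pipeline. The quoted
op counts presumably treat mul as 2 macro-ops, and with the AMD chips retiring
at most 3 macro-ops per cycle, 7 ops per limb gives 7/3 = 2.333 c/l. The
mp_limb_t typedef and the __int128 product are assumptions of the sketch only.

#include <stdint.h>

typedef uint64_t mp_limb_t;

/* mul_1: dst[i] = low limb of src[i]*v plus the running carry;
   returns the final high limb */
mp_limb_t ref_mul_1(mp_limb_t *dst, const mp_limb_t *src, long n, mp_limb_t v)
{
    mp_limb_t carry = 0;
    for (long i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128) src[i] * v;   /* mul cx */
        mp_limb_t lo = (mp_limb_t) p, hi = (mp_limb_t) (p >> 64);
        lo += carry;
        hi += lo < carry;            /* propagate into the high half */
        dst[i] = lo;                 /* a plain store for mul_1 */
        carry = hi;
    }
    return carry;
}

/* addmul_1: dst[i] += src[i]*v; the extra add against the existing dst limb
   is what makes the addmul_1 body one op longer than the mul_1 body */
mp_limb_t ref_addmul_1(mp_limb_t *dst, const mp_limb_t *src, long n, mp_limb_t v)
{
    mp_limb_t carry = 0;
    for (long i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128) src[i] * v;
        mp_limb_t lo = (mp_limb_t) p, hi = (mp_limb_t) (p >> 64);
        lo += carry;   hi += lo < carry;
        lo += dst[i];  hi += lo < dst[i];    /* the addmul part */
        dst[i] = lo;
        carry = hi;
    }
    return carry;
}
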
>> and for addmul2 we have
>>
>> mov 0,r4
>> mov (src),ax
>> mul cx
>> add r1,-1(dst)
>> adc ax,r2
>> adc dx,r3
>> mov 1(src),ax
>> mul bx
>> add ax,r2
>> adc dx,r3
>> adc 0,r4
>>
>> which is 13 ops for 2 limbs, which leads to 2.166 c/l + loop control
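
Again just a sketch of the semantics, reusing ref_addmul_1 from above: with a
two-limb multiplier {v0, v1} (cx and bx), addmul_2 does the work of two
addmul_1 passes offset by one limb, fused into one loop so the src loads, the
dst updates and the loop control are shared. That sharing is why 13 ops cover
two limb-products instead of 2 x 7 = 14, giving 13/6 = 2.166 c/l at 3 macro-ops
per cycle. The two-pass formulation below is mine (the real loop interleaves
the carry chains as in the asm), and it assumes dst has n+2 limbs with
dst[n] = dst[n+1] = 0 on entry so the high limbs can simply be folded in.

/* dst[0..n+1] += src[0..n-1] * (v0 + v1*2^64) */
void sketch_addmul_2(mp_limb_t *dst, const mp_limb_t *src, long n,
                     mp_limb_t v0, mp_limb_t v1)
{
    mp_limb_t c0 = ref_addmul_1(dst,     src, n, v0);   /* src*v0 at offset 0 */
    mp_limb_t c1 = ref_addmul_1(dst + 1, src, n, v1);   /* src*v1 at offset 1 */
    dst[n] += c0;                    /* fold in the two carry-out limbs */
    c1 += dst[n] < c0;
    dst[n + 1] += c1;
}
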
>>
>> For addmul1 and mul1 we can get a perfect schedule, and with 4-way unroll we
>> get 2.5 c/l. This is optimal for K8, as add reg,(dst) has a max throughput of
>> 2.5c; on the K10 we don't have this restriction, so with a larger unroll and
>> perfect scheduling we can improve things. I've not tried this approach, as you
>> would have to go to a 7-way unroll to get anything better than 2.5 c/l.
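
A rough check on that 7-way figure (my arithmetic, assuming the limit on K10 is
again the 3-per-cycle retirement rate and that loop control costs about 3 ops
per iteration): a k-way unrolled addmul_1 retires 7k + 3 ops per k limbs, and
k = 7 is the first unroll for which that beats 2.5 c/l.

#include <stdio.h>

int main(void)
{
    /* 7 ops/limb and 3 ops/cycle are from the discussion above; the 3 ops of
       loop control per iteration are an assumption */
    const double ops_per_limb = 7.0, retire_rate = 3.0, loop_ops = 3.0;
    for (int k = 4; k <= 8; k++) {
        double cpl = (ops_per_limb * k + loop_ops) / (retire_rate * k);
        printf("%d-way unroll: %.3f c/l\n", k, cpl);
        /* prints 2.583, 2.533, 2.500, 2.476, 2.458 for k = 4..8 */
    }
    return 0;
}
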
>> For mul1 it is possible to reduce the instruction count enough to get down to
>> 2 c/l + epsilon, like this:
>>
>> mov (src),ax
>> mul cx
>> mov ax,r8
>> mov dx,1(dst)
>>
>> mov 1(src),ax
>> mul cx
>> mov ax,r9
>> mov dx,2(dst)
>>
>> mov 2(src),ax
>> mul cx
>> mov ax,r10
>> mov dx,3(dst)
>>
>> mov 3(src),ax
>> mul cx
>> #mov ax,r11
>> mov dx,4(dst)
>>
>> add r12,r12
>>
>> adc r8,(dst)
>> adc r9,1(dst)
>> adc r10,2(dst)
>> adc ax,3(dst)
>>
>> sbb r12,r12
>>
>> add 4,count
>> jne loop
>>
>> which is 27 ops for 4 limbs = 2.25 c/l for mul_1 on K10, but the best I could
>> get is 5 c/l. It's hardly surprising given how many "rules" the above breaks.
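
The trick in that unrolled loop, as far as I can see: within each 4-limb block
the high halves of the products go straight to memory at 1(dst)..4(dst) while
the low halves stay in registers, and then a single add/adc/.../sbb chain folds
the low halves into dst, each dst word already holding the previous product's
high half. sbb r12,r12 captures the carry flag as 0 or -1, and add r12,r12
re-creates it on the next pass, so the carry survives the loop-control
instructions. That drops the cost to 27 ops per 4 limbs = 2.25 c/l here, and
towards 6 ops per limb = 2 c/l with a longer unroll, hence the "2 c/l +
epsilon". A C-level rendering of the shape of it (mine, not
instruction-for-instruction what the asm does), assuming n is a multiple of 4,
dst has n+1 limbs and dst[0] is zero on entry:

void sketch_mul_1_blocked(mp_limb_t *dst, const mp_limb_t *src, long n,
                          mp_limb_t v)
{
    mp_limb_t carry = 0;                             /* plays the role of r12 */
    for (long i = 0; i < n; i += 4) {
        mp_limb_t lo[4];
        for (int j = 0; j < 4; j++) {
            unsigned __int128 p = (unsigned __int128) src[i + j] * v;
            lo[j] = (mp_limb_t) p;                   /* low half kept in a register */
            dst[i + j + 1] = (mp_limb_t) (p >> 64);  /* high half stored straight out */
        }
        for (int j = 0; j < 4; j++) {                /* the add/adc/.../sbb chain */
            mp_limb_t t = dst[i + j] + carry;
            carry = t < carry;
            dst[i + j] = t + lo[j];
            carry += dst[i + j] < lo[j];
        }
    }
    dst[n] += carry;                                 /* final carry into the top limb */
}
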
>
> The new AMD Phenom II (we will call this chip K10_2) runs this code at
> 2.5 c/l; this suggests they have not changed the pick hardware, but
> have improved the store forwarding (making it much more like Intel's). Our
> mpn_addadd, addsub and subadd all now run at the predicted optimal speed on
> the new K10_2, and the same goes for addlsh1 and some others.
>
> Jason
