Hi

I've been doing some preliminary experimentation for mul_basecase on Core i7
Nehalem, and of course K8 and Core2.

For the AMD chips, we are currently bound by the macro-op retirement rate,
and I didn't think we could improve on it.
Currently for addmul_1 and mul_1 we have

mov 0,r3
mov (src),ax
mul cx
add r1,-1(dst)  // this is a mov for mul1
adc ax,r2
adc dx,r3

which is 7 ops for 1 limb, which works out to 2.333c/l + loop control.
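
To pin down what those instructions are actually computing, here is a rough C
equivalent of one limb of the addmul_1 loop (the names are mine, mp_limb_t is
just a 64-bit limb, and unsigned __int128 stands in for the dx:ax pair; this is
only the reference semantics, not the scheduled asm):

#include <stdint.h>

typedef uint64_t mp_limb_t;

/* One limb of addmul_1: dst[i] += src[i]*v + carry, returning the carry out.
   For mul_1 the "dst[i] +" part disappears, i.e. the add to memory becomes a
   plain store.  The rotating carry registers of the asm are flattened here. */
static mp_limb_t addmul_1_step(mp_limb_t *dst, const mp_limb_t *src,
                               long i, mp_limb_t v, mp_limb_t carry)
{
    unsigned __int128 p = (unsigned __int128)src[i] * v;  /* mul: dx:ax       */
    mp_limb_t lo = (mp_limb_t)p;                          /* ax               */
    mp_limb_t hi = (mp_limb_t)(p >> 64);                  /* dx               */

    mp_limb_t t = dst[i] + carry;     /* incoming carry limb (r1 in the asm)  */
    hi += (t < carry);
    t += lo;                          /* low half of the product              */
    hi += (t < lo);
    dst[i] = t;
    return hi;                        /* becomes the next limb's carry        */
}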
For addmul_2 we have

mov 0,r4
mov (src),ax
mul cx
add r1,-1(dst)
adc ax,r2
adc dx,r3
mov 1(src),ax
mul bx
add ax,r2
adc dx,r3
adc 0,r4

which is 13 ops for 2 limbs, which works out to 2.166c/l + loop control.
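
For reference, here is what that loop computes per pass in plain C: one src limb
is multiplied by both multiplier limbs and folded into dst with a two-limb
running carry (the role of r2/r3/r4 above). The names and the __int128 usage are
mine; this is a sketch of the semantics, not the scheduled code:

#include <stdint.h>

typedef uint64_t mp_limb_t;

/* dst[0..n-1] += src[0..n-1] * (v0 + v1*2^64); the final two-limb carry is
   returned and belongs at positions n and n+1. */
static unsigned __int128 addmul_2_ref(mp_limb_t *dst, const mp_limb_t *src,
                                      long n, mp_limb_t v0, mp_limb_t v1)
{
    mp_limb_t c0 = 0, c1 = 0;                /* two-limb running carry        */
    for (long i = 0; i < n; i++) {
        unsigned __int128 s = (unsigned __int128)src[i] * v0 + dst[i] + c0;
        unsigned __int128 t = (unsigned __int128)src[i] * v1 + c1
                              + (mp_limb_t)(s >> 64);
        dst[i] = (mp_limb_t)s;               /* low limb of the result        */
        c0 = (mp_limb_t)t;                   /* carry limb for position i+1   */
        c1 = (mp_limb_t)(t >> 64);           /* carry limb for position i+2   */
    }
    return ((unsigned __int128)c1 << 64) | c0;
}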

For addmul_1 and mul_1 we can get a perfect schedule, and with a 4-way unroll we
get 2.5c/l. This is optimal for K8, as add reg,(dst) has a max throughput of one
every 2.5c; on K10 we don't have this restriction, and with a larger unroll and
perfect scheduling we can improve things. I've not tried this approach, as you
would have to go to a 7-way unroll to get anything better than 2.5c/l.
For mul_1 it is possible to reduce the instruction count down to 2c/l + epsilon,
like this:

mov (src),ax
mul cx
mov ax,r8
mov dx,1(dst)

mov 1(src),ax
mul cx
mov ax,r9
mov dx,2(dst)

mov 2(src),ax
mul cx
mov ax,r10
mov dx,3(dst)

mov 3(src),ax
mul cx
#mov ax,r11
mov dx,4(dst)

add r12,r12
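# the add above restores the carry saved in r12 (0 or -1); the adcs below
# then add the four saved low halves into the high halves already in dst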

adc r8,(dst)
adc r9,1(dst)
adc r10,2(dst)
adc ax,3(dst)

sbb r12,r12
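# the sbb above saves the carry out as 0 or -1 in r12, since the counter
# add below clobbers the carry flag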

add 4,count
jne loop

which is 27 ops for 4 limbs = 2.25c/l for mul_1 on K10, but the best I could
actually get is 5c/l. It's hardly surprising, given how many "rules" the above
breaks.
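
In plain C the idea is roughly the following (my own names, purely illustrative;
the asm keeps the low halves in registers and the block carry in the flags/r12,
which C can't express directly): per 4-limb block, store the four high halves
first, then run one short carry chain that adds the low halves in, remembering
the carry-out across blocks. First-block setup and the final wind-down are
omitted.

#include <stdint.h>

typedef uint64_t mp_limb_t;

/* One 4-limb block of the batched mul_1: on entry dst[0] holds the high half
   left over from the previous block, carry_in is 0 or 1. */
static mp_limb_t mul_1_block4(mp_limb_t *dst, const mp_limb_t *src,
                              mp_limb_t v, mp_limb_t carry_in)
{
    mp_limb_t lo[4], cy = carry_in;

    for (int i = 0; i < 4; i++) {
        unsigned __int128 p = (unsigned __int128)src[i] * v;
        lo[i] = (mp_limb_t)p;               /* kept in r8/r9/r10/ax in the asm */
        dst[i + 1] = (mp_limb_t)(p >> 64);  /* mov dx,(i+1)(dst)               */
    }
    for (int i = 0; i < 4; i++) {           /* the adc chain                   */
        mp_limb_t t = dst[i] + cy;
        cy = (t < cy);
        dst[i] = t + lo[i];
        cy += (dst[i] < lo[i]);
    }
    return cy;                              /* what sbb r12,r12 remembers      */
}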
Now, if only mul didn't change the carry flag, we could do this:

mov (src),ax
mul cx
adc ax,r8
mov r8,(dst)
mov dx,r8

for a mul_1; not restricting the product to dx:ax would help as well. Hey AMD,
for your next SSE extension, do this. This gives us a lower bound for c/l: the
minimum c/l is set by the throughput of mul and the latency of adc. This is for
a "single pass" implementation.

For mul_2, a nearly perfectly scheduled 3-way unroll gets us 2.333c/l, and a
perfectly scheduled 4-way unroll gives us 2.25c/l.
For addmul_2, a 4-way unroll gets us 2.375c/l, as we lose a cycle because of
the extra dependencies; to do better we have to unroll a bit more (if the
reorder buffer is big enough?). I've not tried it yet.

Clearly we can also try addmul_3 or addmul_4 etc.; however, we need an extra add
(I keep wondering if there is a better way), as the subproduct is bigger, and
this cancels out some of the advantage gained.

For Core2 and Penryn the bottleneck is the mul throughput of 4c. I have an
addmul_1 running at 4c/l, but it has a very deep pipeline, so for mul_basecase
it is slower than the AMD-like one; it isn't faster until you're doing about
20 limbs.  Now addmul_1 and addmul_2 also have the bottleneck of 2 adc latencies
= 4c, whereas mul_1 and mul_2 have only 1 adc + 1 add = 3c. This explains why we
already have mul_1 and mul_2 running at 4c/l but addmul is harder.

For Nehalem, I have heard that the mul throughput is 1c. It could be true; the
best I could confirm is 2c. However, it's academic, as the adc latency is still
2c. So for mul_1 or mul_2 I have 3.1c/l and the "optimal" is 3c/l, and for
addmul_1 or addmul_2 I got about 3.8c/l (so far).
For ALU ops Nehalem can complete at most 3 per cycle, and we need at least
1 mul + 1 adc + 1 add per limb, which is 3+2+1 = 6 ops = 2c/l, and our current
best addmul_X would need 1 mul + 2 adc + 1 add = 8 ops = 2.66c/l. This is
assuming mul and adc are similar to Core2.
The adc has a measured latency of 2c, and 2 micro-ops would explain it (it
would also match the best speed I got for mpn_addadd). The 3 micro-ops for
mul is the Core2 figure, but it seems a reasonable assumption here too. I would
say 2.66c/l is the best possible on current evidence. I will give a definitive
answer in 10 years :)

So for AMD it's just a matter of using the higher addmul_X's to reduce the
instruction count and finding a perfect schedule. Higher unrolling is limited
by the reorder-buffer depth.

For Core2 and Penryn we are nearly at the provable optimum, but I'm not too sure
what is stopping the last bit.

Nehalem is interesting :) but I'm a bit disappointed. Oh, and for Netburst,
who cares!

For mul_basecase the overheads are also very important: consider the current
mul_basecase (based on addmul_1, not in SVN) compared with my first attempt;
the current one is about 20% faster. The first one had just an optimized
inner loop, whereas the current one is optimized all over (it's almost a
perfect schedule).
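
To make the structure concrete, here is a skeleton of a basecase multiply built
from mul_1 and addmul_1 rows (generic C stand-ins with my own names, not the SVN
code): the first row initialises dst and every later row is an addmul_1, so each
row pays its own entry, feed-in and wind-down cost, which is why the out-of-loop
work matters so much for small operands.

#include <stdint.h>

typedef uint64_t mp_limb_t;

static mp_limb_t mul_1_ref(mp_limb_t *dst, const mp_limb_t *src, long n,
                           mp_limb_t v)
{
    mp_limb_t c = 0;
    for (long i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)src[i] * v + c;
        dst[i] = (mp_limb_t)p;
        c = (mp_limb_t)(p >> 64);
    }
    return c;
}

static mp_limb_t addmul_1_ref(mp_limb_t *dst, const mp_limb_t *src, long n,
                              mp_limb_t v)
{
    mp_limb_t c = 0;
    for (long i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)src[i] * v + dst[i] + c;
        dst[i] = (mp_limb_t)p;
        c = (mp_limb_t)(p >> 64);
    }
    return c;
}

/* dst[0..un+vn-1] = {up,un} * {vp,vn}, un >= 1 and vn >= 1; one row per limb
   of vp */
static void mul_basecase_ref(mp_limb_t *dst, const mp_limb_t *up, long un,
                             const mp_limb_t *vp, long vn)
{
    dst[un] = mul_1_ref(dst, up, un, vp[0]);
    for (long j = 1; j < vn; j++)
        dst[un + j] = addmul_1_ref(dst + j, up, un, vp[j]);
}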

Jason
