Hi, I've been doing some preliminary experimentation with mul_basecase on the Core i7 (Nehalem), and of course on K8 and Core2.
For the AMD chips we are currently bound by the macro-op retirement rate, and I didn't think we could improve on it. Currently for addmul_1 and mul_1 we have

    mov 0,r3
    mov (src),ax
    mul cx
    add r1,-1(dst)    # this is a mov for mul_1
    adc ax,r2
    adc dx,r3

which is 7 macro-ops for 1 limb, giving 2.333 c/l plus loop control. For addmul_2 we have

    mov 0,r4
    mov (src),ax
    mul cx
    add r1,-1(dst)
    adc ax,r2
    adc dx,r3
    mov 1(src),ax
    mul bx
    add ax,r2
    adc dx,r3
    adc 0,r4

which is 13 macro-ops for 2 limbs, giving 2.166 c/l plus loop control.

For addmul_1 and mul_1 we can get a perfect schedule, and with a 4-way unroll we get 2.5 c/l. This is optimal for K8, as add reg,(dst) has a maximum throughput of one per 2.5 cycles. On K10 we don't have this restriction, and with a larger unroll and perfect scheduling we can improve things. I've not tried this approach, as you would have to go to a 7-way unroll to get anything better than 2.5 c/l.

For mul_1 it is possible to reduce the instruction count down to 2 c/l + epsilon, like this:

    mov (src),ax
    mul cx
    mov ax,r8
    mov dx,1(dst)
    mov 1(src),ax
    mul cx
    mov ax,r9
    mov dx,2(dst)
    mov 2(src),ax
    mul cx
    mov ax,r10
    mov dx,3(dst)
    mov 3(src),ax
    mul cx
    # mov ax,r11
    mov dx,4(dst)
    add r12,r12       # restore the carry flag saved last iteration
    adc r8,(dst)
    adc r9,1(dst)
    adc r10,2(dst)
    adc ax,3(dst)
    sbb r12,r12       # save the carry flag for the next iteration
    add 4,count
    jne loop

which is 27 macro-ops for 4 limbs = 2.25 c/l for mul_1 on K10, but the best I could actually get is 5 c/l. That's hardly surprising, given how many "rules" the above breaks.

Now if only mul didn't change the carry flag, we could do this for a mul_1:

    mov (src),ax
    mul cx
    adc ax,r8
    mov r8,(dst)
    mov dx,r8

Not restricting the product to dx:ax would help as well. Hey AMD, for your next SSE extension, do this! This gives us a lower bound for c/l: the larger of the mul throughput and the adc latency. This is for a "single pass" implementation.
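In case the dataflow is easier to follow in a high-level form, here is a sketch of what mul_1 and addmul_1 compute, written in Python. The function names follow the GMP/MPIR mpn convention, but the 64-bit limb size and the list-based interface are my assumptions for illustration, not the actual mpn API:

```python
LIMB_BITS = 64
LIMB_MASK = (1 << LIMB_BITS) - 1

def mul_1(src, multiplier):
    """dst = src * multiplier for a vector of limbs (least-significant
    first); returns (dst, carry_out).  Each iteration is one
    mov/mul/adc group from the listings above."""
    dst, carry = [], 0
    for limb in src:
        prod = limb * multiplier + carry   # dx:ax plus the incoming carry
        dst.append(prod & LIMB_MASK)       # low half (ax) -> destination
        carry = prod >> LIMB_BITS          # high half (dx) -> next iteration
    return dst, carry

def addmul_1(dst, src, multiplier):
    """dst += src * multiplier in place; returns carry_out.  The extra
    read of dst is the 'add r1,-1(dst)' that distinguishes addmul_1
    from mul_1."""
    carry = 0
    for i, limb in enumerate(src):
        prod = limb * multiplier + dst[i] + carry
        dst[i] = prod & LIMB_MASK
        carry = prod >> LIMB_BITS
    return carry
```

The assembly above is exactly these loops, with prod split across dx:ax and the running carry kept in the adc chain instead of a variable.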
For mul_2, a nearly perfectly scheduled 3-way unroll gets us 2.333 c/l, and a perfectly scheduled 4-way unroll gives us 2.25 c/l. For addmul_2, a 4-way unroll gets us 2.375 c/l, as we lose a cycle because of the extra dependencies; to do better we have to unroll a bit more (if the reorder buffer is big enough?), which I've not tried yet. Clearly we can also try addmul_3 or addmul_4 etc., however we need an extra add (I keep wondering if there is a better way), as the subproduct is bigger, and this cancels out some of the advantage gained.

For Core2 and Penryn the bottleneck is the mul throughput of 4c. I have an addmul_1 running at 4 c/l, but it has a very deep pipeline, so for mul_basecase it is slower than the AMD-like one; it isn't faster until you're doing about 20 limbs. Now addmul_1 and addmul_2 also have the bottleneck of two adc latencies = 4c, whereas mul_1 and mul_2 have only 1 adc + 1 add = 3c. This explains why we already have mul_1 and mul_2 running at 4 c/l, but addmul is harder.

For Nehalem, I have heard that the mul throughput is 1c. It could be true; the best I could confirm is 2c. However it's academic, as the adc latency is still 2c. So for mul_1 or mul_2 I have 3.1 c/l where the "optimal" is 3 c/l, and for addmul_1 or addmul_2 I got about 3.8 c/l (so far). For ALU ops Nehalem can complete at most 3 per cycle, and we need at least 1 mul + 1 adc + 1 add per limb, which is 3+2+1 = 6 ops = 2 c/l, while our current best addmul_X would need 1 mul + 2 adc + 1 add = 8 ops = 2.66 c/l. This assumes mul and adc are similar to Core2: the adc has a measured latency of 2c, and 2 micro-ops would explain it (it would also match the best speed I got for mpn_addadd), while 3 micro-ops for mul is what Core2 has, which seems reasonable. I would say 2.66 c/l is the best possible on current evidence. I will give a definitive answer in 10 years :)

So for AMD it's just a matter of using the higher addmul_X's to reduce the instruction count and finding a perfect schedule. Higher unrolling is limited by the depth of the reorder buffer.
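All of the c/l figures above come from the same arithmetic: macro-ops per limb divided by the retire width of 3. A small sketch that reproduces them (the helper name is mine; the op counts and the Nehalem micro-op assumptions are the ones quoted above):

```python
def cycles_per_limb(macro_ops, limbs, width=3):
    """Retirement-bound limit on cycles/limb: total macro-ops divided by
    limbs processed, divided by macro-ops retired per cycle (3 here)."""
    return macro_ops / limbs / width

# K8/K10 retirement bounds quoted earlier:
print(cycles_per_limb(7, 1))    # addmul_1, 7 ops/limb      -> 2.333...
print(cycles_per_limb(13, 2))   # addmul_2, 13 ops/2 limbs  -> 2.1666...
print(cycles_per_limb(27, 4))   # unrolled mul_1, 27 ops/4  -> 2.25
# Nehalem ALU bounds (assuming mul = 3 uops, adc = 2, add = 1):
print(cycles_per_limb(3 + 2 + 1, 1))      # mul_1-style chain   -> 2.0
print(cycles_per_limb(3 + 2 + 2 + 1, 1))  # addmul_X-style chain -> 2.666...
```

Of course this is only the retirement bound; as the 5 c/l result for the unrolled mul_1 shows, scheduling and dependency limits can leave you well short of it.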
For Core2/Penryn we are nearly at the provable optimum, but I'm not too sure what is stopping the last bit. Nehalem is interesting :), but I'm a bit disappointed. Oh, and for Netburst: who cares!

For mul_basecase the overheads are also very important. Consider the current mul_basecase (based on addmul_1, not in SVN) compared with my first attempt: the current one is about 20% faster. The first one had just an optimized inner loop, whereas the current one is optimized all over (it's almost a perfect schedule).

Jason
