Just to be clear: you can use the mul_1 code that I posted; it is your
code as well as mine. If you can figure out how to get 2.5 c/l from
addmul_1 using my heuristics, go for it.

By the way, you can move the count increment up one instruction in
addmul_1 as you suggested (and modify the line that follows
appropriately), but it actually slows mul_1 down, for reasons I have
not figured out yet!
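
For anyone reading along without the assembly to hand: "c/l" is cycles
per limb, and the two loops being scheduled compute, roughly, the
following (a plain C sketch with illustrative names, not the actual
mpn_mul_1/mpn_addmul_1 routines, which are hand-written x86-64 assembly):

#include <stdint.h>
#include <stddef.h>

typedef uint64_t mp_limb_t;   /* one 64-bit limb, as on the Opteron */

/* r[0..n-1] = s[0..n-1] * v; returns the high limb carried out */
mp_limb_t mul_1_sketch(mp_limb_t *r, const mp_limb_t *s, size_t n,
                       mp_limb_t v)
{
    mp_limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128) s[i] * v + carry;
        r[i]  = (mp_limb_t) p;
        carry = (mp_limb_t) (p >> 64);
    }
    return carry;
}

/* r[0..n-1] += s[0..n-1] * v; returns the high limb carried out */
mp_limb_t addmul_1_sketch(mp_limb_t *r, const mp_limb_t *s, size_t n,
                          mp_limb_t v)
{
    mp_limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128) s[i] * v + r[i] + carry;
        r[i]  = (mp_limb_t) p;
        carry = (mp_limb_t) (p >> 64);
    }
    return carry;
}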

Bill.

2008/11/26  <[EMAIL PROTECTED]>:
>
> On Wednesday 26 November 2008 14:55:09 Bill Hart wrote:
>> Ah, this probably won't make that much difference to overall
>> performance. Here is why:
>>
>
> Gosh, everything happens at once...
>
> AMD have CodeAnalyst software which I think shows you pipeline details. It's
> a free download, but I've not been able to compile it.
>
>> In rearranging the instructions in this way we have had to mix up the
>> instructions in an unrolled loop. That means that one can't just jump
>> into the loop at the required spot as before. The wind-up and wind-down
>> code needs to be made more complex. This is fine, but it possibly
>> adds a few cycles for small sizes.
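>>
>> (To illustrate the wind-up/wind-down point at the C level, here is a
>> rough 4-way unrolled sketch, not the actual assembly: the n mod 4
>> leftover limbs have to be peeled off separately, because once the four
>> copies of the loop body are interleaved you can no longer simply jump
>> into the loop part-way through.)
>>
>> #include <stdint.h>
>> #include <stddef.h>
>>
>> typedef uint64_t mp_limb_t;
>>
>> /* one limb of r[] = s[] * v, propagating the carry */
>> #define STEP(k)                                                      \
>>     do {                                                             \
>>         unsigned __int128 p =                                        \
>>             (unsigned __int128) s[i + (k)] * v + carry;              \
>>         r[i + (k)] = (mp_limb_t) p;                                  \
>>         carry      = (mp_limb_t) (p >> 64);                          \
>>     } while (0)
>>
>> mp_limb_t mul_1_unrolled_sketch(mp_limb_t *r, const mp_limb_t *s,
>>                                 size_t n, mp_limb_t v)
>> {
>>     mp_limb_t carry = 0;
>>     size_t i = 0;
>>
>>     /* wind-up: deal with n % 4 limbs so the main loop always
>>        handles a whole multiple of 4 */
>>     for (; i < n % 4; i++)
>>         STEP(0);
>>
>>     /* main loop, unrolled 4x; in the assembly these four copies are
>>        interleaved with one another, which is what breaks the old
>>        "jump in at the right spot" entry scheme */
>>     for (; i < n; i += 4) {
>>         STEP(0); STEP(1); STEP(2); STEP(3);
>>     }
>>     return carry;
>> }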
>>
>> Large mul_1's and addmul_1's are never used by GMP for mul_n. Recall
>> that mul_basecase switches over to Karatsuba after about 30 limbs on
>> the Opteron.
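>>
>> (For context, mul_basecase is just schoolbook multiplication built from
>> these two primitives: one mul_1 pass for the first limb of the
>> multiplier, then one addmul_1 pass per remaining limb. That is why the
>> per-limb cost of mul_1/addmul_1 is what matters below the Karatsuba
>> cut-over. A rough C sketch, reusing the illustrative names from the
>> top of this thread rather than the real mpn_* routines:)
>>
>> #include <stdint.h>
>> #include <stddef.h>
>>
>> typedef uint64_t mp_limb_t;
>>
>> /* r[] = s[] * v  and  r[] += s[] * v  over n limbs, each returning
>>    the high limb carried out (see the C sketches above) */
>> mp_limb_t mul_1_sketch(mp_limb_t *r, const mp_limb_t *s,
>>                        size_t n, mp_limb_t v);
>> mp_limb_t addmul_1_sketch(mp_limb_t *r, const mp_limb_t *s,
>>                           size_t n, mp_limb_t v);
>>
>> /* rp[0..an+bn-1] = ap[0..an-1] * bp[0..bn-1], schoolbook style;
>>    assumes an, bn >= 1.  Above roughly 30 limbs the library switches
>>    to Karatsuba instead. */
>> void mul_basecase_sketch(mp_limb_t *rp,
>>                          const mp_limb_t *ap, size_t an,
>>                          const mp_limb_t *bp, size_t bn)
>> {
>>     rp[an] = mul_1_sketch(rp, ap, an, bp[0]);
>>     for (size_t i = 1; i < bn; i++)
>>         rp[an + i] = addmul_1_sketch(rp + i, ap, an, bp[i]);
>> }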
>>
>> But it also probably takes a number of iterations of the loop before
>> the hardware settles into a pattern. The data cache hardware needs to
>> prime, the branch prediction needs to prime, the instruction cache
>> needs to prime and the actual picking of instructions in the correct
>> order does not necessarily happen on the first iteration of the loop.
>>
>> I might be overstating the case a little. Perhaps by about 8 limbs you
>> win; I don't know.
>>
>> Anyhow, I believe jason (not Martin) is working on getting fully
>> working mul_1 and addmul_1 ready for inclusion into eMPIRe. Since he
>> has actually done all the really hard work here with the initial
>> scheduling to get down to 2.75 c/l, I'll let him post any performance
>> figures once he is done with the code. He deserves the credit!
>>
>
> I'll do a mul_basecase (which is what really counts) as well, by the weekend,
> and I have some other ideas which may pan out.
>
>
>> Bill.
>>
>> 2008/11/26 mabshoff <[EMAIL PROTECTED]>:
>> > On Nov 26, 6:18 am, Bill Hart <[EMAIL PROTECTED]> wrote:
>> >> Some other things I forgot to mention:
>> >>
>> >> 1) It probably wouldn't have been possible for me to get 2.5 c/l
>> >> without jason's code, in both the mul_1 and addmul_1 cases.
>> >>
>> > :)
>> >>
>> >> 2) You can often insert nops alongside a lone instruction or a pair of
>> >> instructions that do not form 3 macro-ops together, further evidence that
>> >> the above analysis is correct.
>> >>
>> >> 3) The addmul_1 code I get is very close to the code obtained by
>> >> someone else through independent means, so I won't post it here. Once
>> >> the above tricks have been validated on other code, I'll commit the
>> >> addmul_1 code I have to the repo. Or perhaps someone else will
>> >> rediscover it from what I have written above.
>> >>
>> >> In fact I was only able to find about 16 different versions of
>> >> addmul_1 that run in 2.5 c/l, all of which look very much like the
>> >> solution obtained independently. The order and location of most
>> >> instructions are fixed by the dual requirements of having triplets of
>> >> macro-ops and having almost nothing run in ALU0 other than muls. There
>> >> are very few degrees of freedom.
>> >>
>> >> Bill.
>> >
>> > This is very, very cool, and I am happy that this is being discussed in
>> > public. Any chance of seeing some performance numbers before and after
>> > the check-in?
>> >
>> > <SNIP>
>> >
>> > Cheers,
>> >
>> > Michael
