I had a go at implementing the suggestion to split the inputs to toom3 along non-limb boundaries. Basically, if n = 1 mod 3 (n being the operand length in limbs), I split along a half-limb boundary.
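
To make the limb counts concrete, here is a rough sketch in Python (not the actual MPIR C code). It assumes 64-bit limbs and the evaluation points 0, 1, -1, 2 and infinity. Those choices and all the function names are assumptions made for illustration. It splits an all-ones operand of n = 3k+1 limbs both ways and reports how many limbs the largest evaluated value needs, i.e. the size of the pointwise multiplications.

```python
LIMB_BITS = 64  # assumed limb size; MPIR also runs with 32-bit limbs


def limbs_needed(x):
    """Number of limbs needed to hold the non-negative integer x."""
    return max(1, -(-x.bit_length() // LIMB_BITS))


def split3(x, piece_bits):
    """Split x into three pieces a0, a1, a2 of piece_bits bits each (low first)."""
    mask = (1 << piece_bits) - 1
    return x & mask, (x >> piece_bits) & mask, x >> (2 * piece_bits)


def recombine(pieces, piece_bits):
    """Inverse of split3, used only as a sanity check."""
    a0, a1, a2 = pieces
    return a0 + (a1 << piece_bits) + (a2 << (2 * piece_bits))


def max_eval_limbs(x, piece_bits):
    """Limbs of the largest value of a0 + a1*t + a2*t^2 at the assumed points."""
    a0, a1, a2 = split3(x, piece_bits)
    evals = [
        a0,                    # t = 0
        a0 + a1 + a2,          # t = 1
        abs(a0 - a1 + a2),     # t = -1 (magnitude only)
        a0 + 2 * a1 + 4 * a2,  # t = 2
        a2,                    # t = infinity
    ]
    return max(limbs_needed(e) for e in evals)


k = 10
n = 3 * k + 1                            # n = 1 mod 3
x = (1 << (n * LIMB_BITS)) - 1           # worst case: every bit set

whole_bits = (k + 1) * LIMB_BITS         # pieces of k+1, k+1, k-1 limbs
half_bits = k * LIMB_BITS + LIMB_BITS // 2  # pieces of k+1/2, k+1/2, k limbs

whole = max_eval_limbs(x, whole_bits)
half = max_eval_limbs(x, half_bits)
print(f"whole-limb split: multiply {whole} limbs; half-limb split: {half} limbs")
```

For k = 10 the whole-limb split needs 12 limbs (k+2) per multiplication and the half-limb split needs 11 (k+1). The sketch only counts limbs. It says nothing about the cost of the extra copies or shifted additions that a real mpn-level implementation would need.
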
Previously we were splitting into k+1, k+1 and k-1 limbs, where k = n / 3 (rounded down). Because some additions take place, one would usually end up doing multiplications of k+2 limbs. By splitting along half-limb boundaries, one gets k+1/2, k+1/2 and k limbs, and even after the additions one only ends up multiplying k+1 limbs.

But the whole thing didn't work. It slowed things down by 5% on average. The reason is twofold. Firstly, an extra copy of the first k+1/2 limbs of each operand needs to be made. Secondly, in a number of places one is adding operands which are shifted by half a limb to operands which are not. I am assuming that this misaligned data causes a big penalty in cycles.

Bill.
