On Tuesday 21 July 2009 21:29:14 Bill Hart wrote: > I also tried simply testing for the top limb being zero, and if so to > only multiply length k operands instead of k+1 operands. I thought I > observed a small improvement when doing this, but it was hard to > measure. Then I decided to try and test if the top limb is 1, as that > can be handled with an extra addition instead of multiplying by k+1 > operands. > > The times were so nearly the same however that it was difficult to > discern anything above the noise. I therefore didn't commit these > changes. It certainly didn't change the overall bench score. > > This was surprising because at first blush a back of the envelope > calculation says that it should have made a noticeable difference. > Timing 40 limb multiplications vs 41 limb multiplications for example > does show a noticeable difference, 2x10^-6s vs 2.2x10^-6s, i.e. 10%. > Furthermore about 80% of the time in Toom3 is taken up with pointwise > mults for a multiplication of 121 limbs (which currently gets handled > as three 42 limb multiplications, one 41 limb multiplication and one > 40 limb multiplication). > > Of the five multiplications, three are likely to have zero in their > top limbs with a non-zero probability. One therefore expects a saving > of around 1.6% overall for each of the three multiplications you can > apply this trick to, minus some overheads. That's 4.8% overall. > > Of course that saving is only obtained in the case where one is > multiplying operands with n limbs where n is 1 mod 3 or 2 mod 3. > However in the 2 mod 3 case one is less likely to have zero top limbs. > > In practice however, one needs to do more careful analysis. In one > multiplication the top limb can be 0 or 1, in the next 0, 1 or 2, and > in the next 0, 1, .., 6. In the case of two of the multiplications the > probability of getting a zero in the top limbs is say 40%. So one > expects a saving overall of perhaps 1%. If one also checks for a 1 in > the top limb and uses an addition then the additions may cost about > 0.072x10^-6s, which is about 1/3 of the saving you get by using k+1 > limbs multiplications instead of k+2. If one checks for 2 in the top > limb the cost of addadds or addlsh's is about 0.110x10^-6s or about > half the expected saving. (Using addmul_1's for arbitrary values in > the top limbs costs about 0.15x10^-6s plus overheads, or little to no > saving). Anyhow, checking for 0, 1 or 2 in the top limbs should save > about 2.1% in total, minus some overheads. > > Of course this is a cherry picked example. In the case of 31 limbs vs > 32, the timings are 1.36x10^-6s vs 1.41x10^-6s or about 4%. The > additions cost 0.058x10^-6s, addadd's about 0.09x10^-6s. So here you > get a decent slowdown because the extra additions cost more than you > save. In fact only checking for zero in the top limb works for every > multiplication but nothing else does. But as mentioned you only get a > 1% saving in the best cases and a 0.4% saving in other cases, minus > overheads. The complete overall saving over all possible > multiplications is probably pretty close to 0.1-0.2%, which is almost > unmeasurable. Given that overall bench score doesn't depend too much > on toom3, the overall change in the bench is close to zero. > > Anyhow, I think this idea is dead. To speed up Toom 3 we really need a > much more significant improvement. I could submit the code which > checks for zero in the top limbs. That seems to be safe. > > Bill. > > On 19 July, 08:39, Bill Hart <[email protected]> wrote: > > I had a go at implementing the suggestion to split the inputs to toom3 > > along non-limb boundaries. Basically if n = 1 mod 3 then I split along > > a half limb boundary. > > > > Before we were splitting into k+1, k+1, k-1 limbs where k = n / 3. As > > there are some additions that take place, one would usually end up > > doing multiplications of k + 2 limbs. By splitting along half limb > > boundaries, one gets k+1/2, k+1/2, k limbs and even after additions > > one only ends up multiplying k+1 limbs. > > > > But the whole thing didn't work. It slowed it down by 5% on average. > > The reason is twofold. Firstly an extra copy of the first k+1/2 limbs > > of each of the operands needs to be made. Secondly one is adding, in a > > number of places, operands which are half a limb shifted, to operands > > which are not. I am assuming that this misaligned data causes a big > > penalty in cycles. > > > > Bill.
Misaligned data will segfault on some arches and instructions eg K10 and lshift . I assume the extra copy come from the fact that we have a mpn_addlsh1 but not a mpn_addlsh32 . Left and right shift by 32 or 8k can be done a lot faster than a general shift , and I suppose a addlsh8k would be quite a bit faster. This could also benefit our current FFT I think. Jason --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "mpir-devel" group. To post to this group, send email to [email protected] To unsubscribe from this group, send email to [email protected] For more options, visit this group at http://groups.google.com/group/mpir-devel?hl=en -~----------~----~----~----~------~----~------~--~---
