On Tuesday 21 July 2009 21:29:14 Bill Hart wrote:
> I also tried simply testing for the top limb being zero, and if so to
> only multiply length k operands instead of k+1 operands. I thought I
> observed a small improvement when doing this, but it was hard to
> measure. Then I decided to try and test if the top limb is 1, as that
> can be handled with an extra addition instead of multiplying by k+1
> operands.
>
> The times were so nearly the same however that it was difficult to
> discern anything above the noise. I therefore didn't commit these
> changes. It certainly didn't change the overall bench score.
>
> This was surprising because at first blush a back of the envelope
> calculation says that it should have made a noticeable difference.
> Timing 40 limb multiplications vs 41 limb multiplications for example
> does show a noticeable difference, 2x10^-6s vs 2.2x10^-6s, i.e. 10%.
> Furthermore about 80% of the time in Toom3 is taken up with pointwise
> mults for a multiplication of 121 limbs (which currently gets handled
> as three 42 limb multiplications, one 41 limb multiplication and one
> 40 limb multiplication).
>
> Of the five multiplications, three are likely to have zero in their
> top limbs with a non-zero probability. One therefore expects a saving
> of around 1.6% overall for each of the three multiplications you can
> apply this trick to, minus some overheads. That's 4.8% overall.
>
> Of course that saving is only obtained in the case where one is
> multiplying operands with n limbs where n is 1 mod 3 or 2 mod 3.
> However in the 2 mod 3 case one is less likely to have zero top limbs.
>
> In practice however, one needs to do more careful analysis. In one
> multiplication the top limb can be 0 or 1, in the next 0, 1 or 2, and
> in the next 0, 1, .., 6. In the case of two of the multiplications the
> probability of getting a zero in the top limbs is say 40%. So one
> expects a saving overall of perhaps 1%. If one also checks for a 1 in
> the top limb and uses an addition then  the additions may cost about
> 0.072x10^-6s, which is about 1/3 of the saving you get by using k+1
> limbs multiplications instead of k+2. If one checks for 2 in the top
> limb the cost of addadds or addlsh's is about 0.110x10^-6s or about
> half the expected saving. (Using addmul_1's for arbitrary values in
> the top limbs costs about 0.15x10^-6s plus overheads, or little to no
> saving). Anyhow, checking for 0, 1 or 2 in the top limbs should save
> about 2.1% in total, minus some overheads.
>
> Of course this is a cherry picked example. In the case of 31 limbs vs
> 32, the timings are 1.36x10^-6s vs 1.41x10^-6s or about 4%. The
> additions cost 0.058x10^-6s, addadd's about 0.09x10^-6s. So here you
> get a decent slowdown because the extra additions cost more than you
> save. In fact only checking for zero in the top limb works for every
> multiplication but nothing else does. But as mentioned you only get a
> 1% saving in the best cases and a 0.4% saving in other cases, minus
> overheads. The complete overall saving over all possible
> multiplications is probably pretty close to 0.1-0.2%, which is almost
> unmeasurable. Given that overall bench score doesn't depend too much
> on toom3, the overall change in the bench is close to zero.
>
> Anyhow, I think this idea is dead. To speed up Toom 3 we really need a
> much more significant improvement. I could submit the code which
> checks for zero in the top limbs. That seems to be safe.
>
> Bill.
>
> On 19 July, 08:39, Bill Hart <[email protected]> wrote:
> > I had a go at implementing the suggestion to split the inputs to toom3
> > along non-limb boundaries. Basically if n = 1 mod 3 then I split along
> > a half limb boundary.
> >
> > Before we were splitting into k+1, k+1, k-1 limbs where k = n / 3. As
> > there are some additions that take place, one would usually end up
> > doing multiplications of k + 2 limbs. By splitting along half limb
> > boundaries, one gets k+1/2, k+1/2, k limbs and even after additions
> > one only ends up multiplying k+1 limbs.
> >
> > But the whole thing didn't work. It slowed it down by 5% on average.
> > The reason is twofold. Firstly an extra copy of the first k+1/2 limbs
> > of each of the operands needs to be made. Secondly one is adding, in a
> > number of places, operands which are half a limb shifted, to operands
> > which are not. I am assuming that this misaligned data causes a big
> > penalty in cycles.
> >
> > Bill.

Misaligned data will segfault on some arches and instructions eg K10 and 
lshift . I assume the extra copy come from the fact  that we have a 
mpn_addlsh1 but not a mpn_addlsh32 . Left and right shift by 32 or 8k can be 
done a lot faster than a general shift , and I suppose a addlsh8k would be 
quite a bit faster. This could also benefit our current FFT I think.

Jason
 

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply via email to