I also tried simply testing for the top limb being zero, and if so to
only multiply length k operands instead of k+1 operands. I thought I
observed a small improvement when doing this, but it was hard to
measure. Then I decided to try and test if the top limb is 1, as that
can be handled with an extra addition instead of multiplying by k+1
operands.

The times were so nearly the same however that it was difficult to
discern anything above the noise. I therefore didn't commit these
changes. It certainly didn't change the overall bench score.

This was surprising because at first blush a back of the envelope
calculation says that it should have made a noticeable difference.
Timing 40 limb multiplications vs 41 limb multiplications for example
does show a noticeable difference, 2x10^-6s vs 2.2x10^-6s, i.e. 10%.
Furthermore about 80% of the time in Toom3 is taken up with pointwise
mults for a multiplication of 121 limbs (which currently gets handled
as three 42 limb multiplications, one 41 limb multiplication and one
40 limb multiplication).

Of the five multiplications, three are likely to have zero in their
top limbs with a non-zero probability. One therefore expects a saving
of around 1.6% overall for each of the three multiplications you can
apply this trick to, minus some overheads. That's 4.8% overall.

Of course that saving is only obtained in the case where one is
multiplying operands with n limbs where n is 1 mod 3 or 2 mod 3.
However in the 2 mod 3 case one is less likely to have zero top limbs.

In practice however, one needs to do more careful analysis. In one
multiplication the top limb can be 0 or 1, in the next 0, 1 or 2, and
in the next 0, 1, .., 6. In the case of two of the multiplications the
probability of getting a zero in the top limbs is say 40%. So one
expects a saving overall of perhaps 1%. If one also checks for a 1 in
the top limb and uses an addition then  the additions may cost about
0.072x10^-6s, which is about 1/3 of the saving you get by using k+1
limbs multiplications instead of k+2. If one checks for 2 in the top
limb the cost of addadds or addlsh's is about 0.110x10^-6s or about
half the expected saving. (Using addmul_1's for arbitrary values in
the top limbs costs about 0.15x10^-6s plus overheads, or little to no
saving). Anyhow, checking for 0, 1 or 2 in the top limbs should save
about 2.1% in total, minus some overheads.

Of course this is a cherry picked example. In the case of 31 limbs vs
32, the timings are 1.36x10^-6s vs 1.41x10^-6s or about 4%. The
additions cost 0.058x10^-6s, addadd's about 0.09x10^-6s. So here you
get a decent slowdown because the extra additions cost more than you
save. In fact only checking for zero in the top limb works for every
multiplication but nothing else does. But as mentioned you only get a
1% saving in the best cases and a 0.4% saving in other cases, minus
overheads. The complete overall saving over all possible
multiplications is probably pretty close to 0.1-0.2%, which is almost
unmeasurable. Given that overall bench score doesn't depend too much
on toom3, the overall change in the bench is close to zero.

Anyhow, I think this idea is dead. To speed up Toom 3 we really need a
much more significant improvement. I could submit the code which
checks for zero in the top limbs. That seems to be safe.

Bill.

On 19 July, 08:39, Bill Hart <[email protected]> wrote:
> I had a go at implementing the suggestion to split the inputs to toom3
> along non-limb boundaries. Basically if n = 1 mod 3 then I split along
> a half limb boundary.
>
> Before we were splitting into k+1, k+1, k-1 limbs where k = n / 3. As
> there are some additions that take place, one would usually end up
> doing multiplications of k + 2 limbs. By splitting along half limb
> boundaries, one gets k+1/2, k+1/2, k limbs and even after additions
> one only ends up multiplying k+1 limbs.
>
> But the whole thing didn't work. It slowed it down by 5% on average.
> The reason is twofold. Firstly an extra copy of the first k+1/2 limbs
> of each of the operands needs to be made. Secondly one is adding, in a
> number of places, operands which are half a limb shifted, to operands
> which are not. I am assuming that this misaligned data causes a big
> penalty in cycles.
>
> Bill.
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply via email to