Quoting George Woltman <[EMAIL PROTECTED]>:

> The carry-propagation step is the hardest to use threads 
> effectively.  I don't have any great ideas here -- not 
> that I've thought about it in any great detail.

From the documentation I wrote back in the F24 days. The text
deals specifically with propagating the carries from Fermat-mod
representation, but I think similar ideas can make Mersenne-mod
carries multithread-friendly.

jasonp

Integrating the Carries
-----------------------

It seems a shame that propagating the carries requires a pass of its own.
So much other work gets crammed into the other three passes that it would
be nice to integrate the carries into the rest of the arithmetic.

The straightforward way to propagate carries would be to take the n1 x n2
matrix, release carries "typewriter-style" for all the real parts, wrap
the final carry around to the first imaginary part and then repeat the
process once for the imaginary part. After all, that's how the digits in
the Fermat residue are arranged.

As an example, suppose you had to release one-digit carries in:

    12 34 55 78     87 58 73 59
    92 46 59 49 + i 49 43 23 03
    49 57 39 07     32 54 96 97

The basic method starts with 12, leaves the 2 and carries the 1. Then
add the 1 to 34 (=35), keep the 5 and carry the 3, etc. When you finish
a row, start the next row. When the real part is done (i.e. the carry
from the 07 is finished), start on the imaginary part (the 87). When
you get the last carry from *that* (the 97), flip its sign and start again.
This second pass of carries doesn't have to last long; just keep going
until the carry becomes 0.
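A minimal sketch of this typewriter-style release, in C. The base-10
digits, array length and integer digit type are illustrative assumptions
(a real FFT-based squaring releases carries out of floating-point
elements); the matrix is flattened row-major:

```c
#define BASE 10   /* one-digit carries, as in the example above */
#define N    12   /* the 3x4 matrix, flattened typewriter-style */

/* Store the low digit of t, return the outgoing carry via *c.
   Floor-divide so that negative (sign-flipped) carries work too. */
static long put_digit(long t, long *c)
{
    long q = t / BASE, r = t % BASE;
    if (r < 0) { q--; r += BASE; }
    *c = q;
    return r;
}

/* Release carries through re[], then im[]; the carry leaving im[]
   wraps back to re[0] with its sign flipped, and the second pass
   keeps going until the carry dies out (almost always within a
   couple of elements, so it never leaves the real part here). */
void release_carries(long re[N], long im[N])
{
    long c = 0;
    for (int i = 0; i < N; i++)
        re[i] = put_digit(re[i] + c, &c);
    for (int i = 0; i < N; i++)
        im[i] = put_digit(im[i] + c, &c);
    c = -c;                                   /* the sign flip */
    for (int i = 0; c != 0; i = (i + 1) % N)
        re[i] = put_digit(re[i] + c, &c);
}
```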

Unfortunately, this has to be done every squaring and involves two more
passes through the data. Think of it as two more rounds of 50-clock
cache misses.

The pain can be cut in half by releasing the real and imaginary carries
together, in parallel. They never interact with each other until you get
to the end of the array, and the computations overlap since the Ultra is a
superscalar processor.

    12 34 55 78     87 58 73 59
    92 46 59 49 + i 49 43 23 03
    49 57 39 07     32 54 96 97

       data in     carry in      data out     carry out
       -------     --------      --------     ---------
       12+i87       0+i0     ->     2+i7        1+i8
       34+i58       1+i8     ->     5+i4        3+i6     (34+1)+i(58+8)     
       55+i73       3+i6     ->     8+i9        5+i7     (55+3)+i(73+6)
        etc.

Everything's the same until you get to the last element (07+i97). Before,
the carry had only one component, which was wrapped around and propagated
until zero. Now there are *two* carries to wrap around, the original one
and another that didn't exist before (the basic method would have destroyed
the second carry by now). To be rigorous, *both* carries must be propagated
until zero, but there's a chance the second carry will need an entire
pass through the array to get killed off. For the second pass, version 1.1
re-propagated all carries through only 10 elements of the array.
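In code, the lockstep version might look like the following sketch.
As before, the base, array length and integer digit type are
illustrative assumptions; the final loop propagates *both* wrapped
carries together until they die out:

```c
#define BASE 10
#define N    12

/* Store the low digit of t, return the outgoing carry via *c.
   Floor-divide so that negative (sign-flipped) carries work too. */
static long put_digit(long t, long *c)
{
    long q = t / BASE, r = t % BASE;
    if (r < 0) { q--; r += BASE; }
    *c = q;
    return r;
}

void release_carries_parallel(long re[N], long im[N])
{
    long cr = 0, ci = 0;
    /* both lanes advance together; they never interact until the end */
    for (int i = 0; i < N; i++) {
        re[i] = put_digit(re[i] + cr, &cr);
        im[i] = put_digit(im[i] + ci, &ci);
    }
    /* wrap-around: the real carry enters the imaginary part, while
       the imaginary carry re-enters the real part with flipped sign */
    long c_im = cr, c_re = -ci;
    for (int i = 0; (c_im || c_re) && i < N; i++) {
        im[i] = put_digit(im[i] + c_im, &c_im);
        re[i] = put_digit(re[i] + c_re, &c_re);
    }
}
```

On a superscalar machine the two `put_digit` calls per iteration have
no dependence on each other, so their work can overlap.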

With 16 bits per double, each answer averages 45-50 bits. That means
most carries are about 32 bits in size and last for about two array
elements as they get propagated. The only time 10 elements wouldn't be
enough to release the carries fully would be if you had 8 elements
of 0xffff....fff, and that happens for only 1 in 2^128 carries. The
chance of this happening is vanishingly small, so the trick is safe
to use; to be sure that 10 elements really were enough, simply check
that all carries are zero after the second pass.

Practically speaking, this trick lets you start propagating carries
anywhere in the array, as long as you wrap around again. More importantly,
you can start propagating *multiple* carries in parallel, and this is the
key to mixing carry propagation into the computations. Keep an array
of carries, one for each row in the n1 x n2 matrix. Then in pass 3 the
matrix is divided into blocks like so:

      |<---             n2              --->|

      |     |     |                   |     |
      |block|block|                   |block|
      |  1  |  2  |       ...         |n2/BL|
      |     |     |                   |     |

where previously all the computations on one block were independent of
all the others. Now update the column of carries after you finish each
block:

| |     |****************|            | |     |****************|
|c|     | block i after  | propagate  |c|     | block i+1 after|
|a|     | itwiddle, IFFT | carries    |a|     | itwiddle, IFFT |
|r|-->  |    and IDWT    | for ea --> |r|-->  |    and IDWT    | --> ...
|r|     |    weights     | row        |r|     |    weights     |
|y|     |****************|            |y|     |****************|

The point is that carries can be propagated while each block is being
processed, until you finish the last block. Then all the carries have
to "carriage return" to the next row, the last carry wraps around to
the first row like before, and the carry vector propagates 10 elements
into the first block.
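One way to organize this, as a rough sketch: keep one pending carry per
matrix row, and after the per-block IFFT/IDWT work (stubbed out below),
run each row's carry through that block's columns. The dimensions, base,
digit type and the sign-flipped single-matrix wraparound are all
illustrative assumptions standing in for the full Fermat-mod layout:

```c
#define N1   4    /* rows (illustrative) */
#define N2   16   /* columns (illustrative) */
#define BL   4    /* block width */
#define BASE 10

/* Store the low digit of t, return the outgoing carry via *c. */
static long put_digit(long t, long *c)
{
    long q = t / BASE, r = t % BASE;
    if (r < 0) { q--; r += BASE; }
    *c = q;
    return r;
}

void pass3_with_carries(long a[N1][N2], long carry[N1])
{
    for (int b = 0; b < N2 / BL; b++) {
        /* ... itwiddle, IFFT and IDWT weights on block b go here,
           while the block is still hot in cache ... */
        for (int r = 0; r < N1; r++)          /* one carry per row */
            for (int j = b * BL; j < (b + 1) * BL; j++)
                a[r][j] = put_digit(a[r][j] + carry[r], &carry[r]);
    }
    /* After the last block: "carriage return" each row's carry to the
       start of the next row; the last carry wraps (sign-flipped) to
       the first row. */
    long wrapped = -carry[N1 - 1];
    for (int r = N1 - 1; r > 0; r--)
        carry[r] = carry[r - 1];
    carry[0] = wrapped;
    for (int j = 0; j < 10 && j < N2; j++)    /* ~10 elements suffice */
        for (int r = 0; r < N1; r++)
            a[r][j] = put_digit(a[r][j] + carry[r], &carry[r]);
    /* paranoia check goes here: all carries should now be zero */
}
```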

On the minus side, it's way more complicated than the basic approach.
But since everything is in cache from all the other computations in
pass 3 of the squaring, the carry propagation runs 30% faster and cuts
the total runtime by about 5%.


_______________________________________________
Prime mailing list
[email protected]
http://hogranch.com/mailman/listinfo/prime
