From: Andy Polyakov <[email protected]>
Date: Thu, 20 Sep 2012 11:23:03 +0200

> There is no need to send me personal copy.

Ok, I was simply acknowledging the author of the code I was
touching :-)

> Could you ask your contact if they could provide second copy for
> OpenSSL?

I'll see what I can do.  It took me more than a year of tireless
work and daily poking to get a copy for myself from people I've
been interacting with for a decade.

> You mentioned Montgomery BN. There will be intersections with
> other platforms. I mean there is interest to provide alternative
> framework for exponentiation that would benefit such cases and having
> look at multiple platforms including T4 would help to choose better
> strategy.

Here is how the instructions work.

The basic model is that there is a range of sizes supported by the
instruction, and all of the data is loaded into a combination of
the floating point registers and all of the register windows of
the cpu.

For example, the montmul (Montgomery Multiply) instruction simply has
a 5-bit immediate field which indicates the size of the operands.
If it is set to N, the operands are (N + 1) * 64 bits in size.
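
The size encoding above can be sketched in C as follows.  This is only
an illustration of the "(N + 1) * 64" rule; the names operand_bits(),
imm_for_bits() and T4_IMM_MAX are made up for this example and do not
come from any existing header.

```c
#include <assert.h>

/* Largest value a 5-bit immediate field can hold. */
#define T4_IMM_MAX 0x1f

/* Operand size in bits for immediate value n: (n + 1) * 64. */
unsigned operand_bits(unsigned n)
{
    assert(n <= T4_IMM_MAX);
    return (n + 1) * 64;
}

/*
 * Immediate value for an operand of `bits` bits, or -1 if that size
 * is not a multiple of 64 or does not fit in the 5-bit field.
 */
int imm_for_bits(unsigned bits)
{
    if (bits == 0 || bits % 64 != 0 || bits / 64 > T4_IMM_MAX + 1)
        return -1;
    return (int)(bits / 64) - 1;
}
```

So the largest operand a single instruction can take is 2048 bits
(N = 0x1f), and anything larger or not word-aligned would have to be
handled some other way.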

Nprime is stored in register %f60.

A[] values are stored in float and integer registers (integers go into
register window 5), in this order:

%l0,   %l1,  %l2,  %l3,  %l4,  %l5,  %l6,  %l7
%o0,   %o1,  %o2,  %o3,  %o4,  %o5, %f24, %f26
%f28, %f30, %f32, %f34, %f36, %f38, %f40, %f42
%f44, %f46, %f48, %f50, %f52, %f54, %f56, %f58

B[] values are stored in integer registers (3 register windows, 2 to 0):

%o0, %o1, %o2, %o3, %o4, %o5,           (register window 2)
%l0, %l1, %l2, %l3, %l4, %l5, %l6, %l7  (register window 1)
%o0, %o1, %o2, %o3, %o4, %o5            (register window 1)
%l0, %l1, %l2, %l3, %l4, %l5, %l6, %l7  (register window 0)
%o0, %o1, %o2, %o3                      (register window 0)

Similarly for the other inputs, you can see the pattern in use here.
The result is left in register window 5.  If an internal ECC error
occurs on the register file during the operation, %fcc3 will be set to
unordered.  This means there needs to be a limited retry loop over
this condition.

So basically the implementation starts at register window zero, loads
all the initial values of B[], does a 'save', loads the middle values
of B[], does a 'save', and loads the last values of B[].

Then it moves on to N[], which goes into register windows 2, 3, and
4.

Next comes A[], in floating point registers and register window 5.

And finally M[], in floating point registers and register window 6.

Nprime is loaded into %f60 and the montmul instruction is executed.
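
For reference, here is a minimal sketch of how a value like Nprime is
usually computed.  It assumes Nprime is the standard Montgomery
constant -N^-1 mod 2^64 (the same quantity OpenSSL passes as n0); the
text above does not spell out the exact definition the hardware
expects, so treat that as an assumption.  The function name
mont_nprime() is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return -n^-1 mod 2^64, given the least significant 64-bit word of
 * an odd modulus.  Uses Newton iteration: each step doubles the
 * number of correct low-order bits of the inverse.
 */
uint64_t mont_nprime(uint64_t n_low)
{
    uint64_t inv = n_low;   /* n * n == 1 (mod 8), so 3 bits are correct */
    int i;

    assert(n_low & 1);      /* a Montgomery modulus must be odd */

    /* 3 -> 6 -> 12 -> 24 -> 48 -> 96 correct bits after 5 steps. */
    for (i = 0; i < 5; i++)
        inv *= 2 - n_low * inv;

    return 0 - inv;         /* negate modulo 2^64 */
}
```

The defining property is that n_low * mont_nprime(n_low) is congruent
to -1 modulo 2^64.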

This instruction can essentially be used directly via the
bn_mul_mont() function signature in OpenSSL.  I don't think
any special changes are necessary to facilitate the use of these
instructions.
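
As a rough sketch of what that could look like, the wrapper below has
the same argument list as bn_mul_mont() and only forwards to a
hardware routine when the operand fits the 5-bit size field.  It
assumes 64-bit BN_ULONG words and relies on the convention, as far as
I understand it, that a zero return tells the caller to fall back to
the generic code.  The names t4_mul_mont_dispatch(), mont_fn,
T4_MAX_WORDS and fake_hw_montmul() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t BN_ULONG;      /* assumption: 64-bit words */

/* Same argument list as bn_mul_mont(). */
typedef int (*mont_fn)(BN_ULONG *rp, const BN_ULONG *ap,
                       const BN_ULONG *bp, const BN_ULONG *np,
                       const BN_ULONG *n0, int num);

#define T4_MAX_WORDS 32         /* (0x1f + 1) words of 64 bits */

/*
 * Forward to the hardware routine `hw` if the operand size is
 * supported; otherwise return 0 so the caller can fall back.
 */
int t4_mul_mont_dispatch(mont_fn hw, BN_ULONG *rp, const BN_ULONG *ap,
                         const BN_ULONG *bp, const BN_ULONG *np,
                         const BN_ULONG *n0, int num)
{
    if (hw == NULL || num < 1 || num > T4_MAX_WORDS)
        return 0;
    assert(rp != NULL && ap != NULL && bp != NULL);
    assert(np != NULL && n0 != NULL);
    return hw(rp, ap, bp, np, n0, num);
}

/* Stand-in for the real assembly routine; it only reports success. */
int fake_hw_montmul(BN_ULONG *rp, const BN_ULONG *ap, const BN_ULONG *bp,
                    const BN_ULONG *np, const BN_ULONG *n0, int num)
{
    (void)rp; (void)ap; (void)bp; (void)np; (void)n0; (void)num;
    return 1;
}
```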

The 'montsqr' (Montgomery Square) instruction uses the same scheme
and layout as 'montmul' for inputs and outputs.

Finally 'mpmul' (Multiple Precision Multiply) has a similar flavor
to montmul and montsqr, in that multiple register windows and the
floating point registers are used to load the inputs all at once for
the operation.

Again, a 5-bit immediate field 'N' encodes the size of the operands,
as "(N + 1) * 64 bits".

The multiplier goes into a mixture of float regs and integer registers
in register window 6.  The multiplicand goes into a mixture of float
regs and integer registers in register window 5, and the product goes
into integer registers in register windows 4, 3, 2, 1, and 0.
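
To make the arithmetic concrete, here is a portable reference for the
kind of result mpmul produces: a product that is twice as wide as each
input.  It is a plain schoolbook multiply, not a description of how
the hardware works internally.  It assumes words are stored least
significant first and uses the unsigned __int128 extension of GCC and
Clang; the name mp_mul_ref() is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * r[0 .. 2*words-1] = a[0 .. words-1] * b[0 .. words-1]
 * All arrays hold 64-bit words, least significant word first.
 */
void mp_mul_ref(uint64_t *r, const uint64_t *a, const uint64_t *b,
                size_t words)
{
    size_t i, j;

    assert(words >= 1 && words <= 32);  /* range of the 5-bit field */

    for (i = 0; i < 2 * words; i++)
        r[i] = 0;

    for (i = 0; i < words; i++) {
        uint64_t carry = 0;

        for (j = 0; j < words; j++) {
            /* 64x64 -> 128 bit product plus two 64-bit addends
             * always fits in 128 bits. */
            unsigned __int128 t =
                (unsigned __int128)a[i] * b[j] + r[i + j] + carry;

            r[i + j] = (uint64_t)t;
            carry = (uint64_t)(t >> 64);
        }
        r[i + words] = carry;   /* this word has not been written yet */
    }
}
```

With words = 32 the inputs are 2048 bits each and the product is 4096
bits, i.e. 64 words.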

For example, to do a 2048-bit multiply given a pointer to the
multiplier in %g1, a pointer to the multiplicand in %g2, and
a pointer to the place to store the product in %g3, one would
do:

        /* Register window 6 */
        ldd     [%g1 + 0x000], %f22
        ldd     [%g1 + 0x008], %f20
        ldd     [%g1 + 0x010], %f18
        ldd     [%g1 + 0x018], %f16
        ldd     [%g1 + 0x020], %f14
        ldd     [%g1 + 0x028], %f12
        ldd     [%g1 + 0x030], %f10
        ldd     [%g1 + 0x038], %f8
        ldd     [%g1 + 0x040], %f6
        ldd     [%g1 + 0x048], %f4
        ldx     [%g1 + 0x050], %i5
        ldx     [%g1 + 0x058], %i4
        ldx     [%g1 + 0x060], %i3
        ldx     [%g1 + 0x068], %i2
        ldx     [%g1 + 0x070], %i1
        ldx     [%g1 + 0x078], %i0
        ldx     [%g1 + 0x080], %l7
        ldx     [%g1 + 0x088], %l6
        ldx     [%g1 + 0x090], %l5
        ldx     [%g1 + 0x098], %l4
        ldx     [%g1 + 0x0a0], %l3
        ldx     [%g1 + 0x0a8], %l2
        ldx     [%g1 + 0x0b0], %l1
        ldx     [%g1 + 0x0b8], %l0
        ldd     [%g1 + 0x0c0], %f2
        ldd     [%g1 + 0x0c8], %f0
        ldx     [%g1 + 0x0d0], %o5
        ldx     [%g1 + 0x0d8], %o4
        ldx     [%g1 + 0x0e0], %o3
        ldx     [%g1 + 0x0e8], %o2
        ldx     [%g1 + 0x0f0], %o1
        ldx     [%g1 + 0x0f8], %o0

        save

        /* Register window 5 */
        ldd     [%g2 + 0x000], %f58
        ldd     [%g2 + 0x008], %f56
        ldd     [%g2 + 0x010], %f54
        ldd     [%g2 + 0x018], %f52
        ldd     [%g2 + 0x020], %f50
        ldd     [%g2 + 0x028], %f48
        ldd     [%g2 + 0x030], %f46
        ldd     [%g2 + 0x038], %f44
        ldd     [%g2 + 0x040], %f42
        ldd     [%g2 + 0x048], %f40
        ldd     [%g2 + 0x050], %f38
        ldd     [%g2 + 0x058], %f36
        ldd     [%g2 + 0x060], %f34
        ldd     [%g2 + 0x068], %f32
        ldd     [%g2 + 0x070], %f30
        ldd     [%g2 + 0x078], %f28
        ldd     [%g2 + 0x080], %f26
        ldd     [%g2 + 0x088], %f24
        ldx     [%g2 + 0x090], %o5
        ldx     [%g2 + 0x098], %o4
        ldx     [%g2 + 0x0a0], %o3
        ldx     [%g2 + 0x0a8], %o2
        ldx     [%g2 + 0x0b0], %o1
        ldx     [%g2 + 0x0b8], %o0
        ldx     [%g2 + 0x0c0], %l7
        ldx     [%g2 + 0x0c8], %l6
        ldx     [%g2 + 0x0d0], %l5
        ldx     [%g2 + 0x0d8], %l4
        ldx     [%g2 + 0x0e0], %l3
        ldx     [%g2 + 0x0e8], %l2
        ldx     [%g2 + 0x0f0], %l1
        ldx     [%g2 + 0x0f8], %l0

        save
        save
        save
        save
        save

        /* Register window 0 */
        mpmul   0x1f

        stx     %l7, [%g3 + 0x000]
        stx     %l6, [%g3 + 0x008]
        stx     %l5, [%g3 + 0x010]
        stx     %l4, [%g3 + 0x018]
        stx     %l3, [%g3 + 0x020]
        stx     %l2, [%g3 + 0x028]
        stx     %l1, [%g3 + 0x030]
        stx     %l0, [%g3 + 0x038]

        restore

        /* Register window 1 */
        stx     %o5, [%g3 + 0x040]
        stx     %o4, [%g3 + 0x048]
        stx     %o3, [%g3 + 0x050]
        stx     %o2, [%g3 + 0x058]
        stx     %o1, [%g3 + 0x060]
        stx     %o0, [%g3 + 0x068]
        stx     %l7, [%g3 + 0x070]
        stx     %l6, [%g3 + 0x078]
        stx     %l5, [%g3 + 0x080]
        stx     %l4, [%g3 + 0x088]
        stx     %l3, [%g3 + 0x090]
        stx     %l2, [%g3 + 0x098]
        stx     %l1, [%g3 + 0x0a0]
        stx     %l0, [%g3 + 0x0a8]

        restore

        /* Register window 2 */
        stx     %o5, [%g3 + 0x0b0]
        stx     %o4, [%g3 + 0x0b8]
        stx     %o3, [%g3 + 0x0c0]
        stx     %o2, [%g3 + 0x0c8]
        stx     %o1, [%g3 + 0x0d0]
        stx     %o0, [%g3 + 0x0d8]
        stx     %l7, [%g3 + 0x0e0]
        stx     %l6, [%g3 + 0x0e8]
        stx     %l5, [%g3 + 0x0f0]
        stx     %l4, [%g3 + 0x0f8]
        stx     %l3, [%g3 + 0x100]
        stx     %l2, [%g3 + 0x108]
        stx     %l1, [%g3 + 0x110]
        stx     %l0, [%g3 + 0x118]

        restore

        /* Register window 3 */
        stx     %o5, [%g3 + 0x120]
        stx     %o4, [%g3 + 0x128]
        stx     %o3, [%g3 + 0x130]
        stx     %o2, [%g3 + 0x138]
        stx     %o1, [%g3 + 0x140]
        stx     %o0, [%g3 + 0x148]
        stx     %l7, [%g3 + 0x150]
        stx     %l6, [%g3 + 0x158]
        stx     %l5, [%g3 + 0x160]
        stx     %l4, [%g3 + 0x168]
        stx     %l3, [%g3 + 0x170]
        stx     %l2, [%g3 + 0x178]
        stx     %l1, [%g3 + 0x180]
        stx     %l0, [%g3 + 0x188]

        restore

        /* Register window 4 */
        stx     %o5, [%g3 + 0x190]
        stx     %o4, [%g3 + 0x198]
        stx     %o3, [%g3 + 0x1a0]
        stx     %o2, [%g3 + 0x1a8]
        stx     %o1, [%g3 + 0x1b0]
        stx     %o0, [%g3 + 0x1b8]
        stx     %l7, [%g3 + 0x1c0]
        stx     %l6, [%g3 + 0x1c8]
        stx     %l5, [%g3 + 0x1d0]
        stx     %l4, [%g3 + 0x1d8]
        stx     %l3, [%g3 + 0x1e0]
        stx     %l2, [%g3 + 0x1e8]
        stx     %l1, [%g3 + 0x1f0]
        stx     %l0, [%g3 + 0x1f8]

        restore
        restore

Of course, you might quickly ask what happens in 32-bit mode.  If we
were to take a window save trap, it would clobber the upper 32 bits of
the 64-bit values we are loading into the register file.

You have to do a trick in this case by loading a cookie of some sort
(say, simply 0xffffffffffffffff) into one of the unused registers
in the initial register window.  If, after the instruction executes,
the top 32 bits are zeroed out, you know that a window trap happened
and therefore you must retry.

This retry logic can be combined with the tests for ECC errors on
%fcc3.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
