From: Andy Polyakov <[email protected]>
Date: Thu, 20 Sep 2012 11:23:03 +0200
> There is no need to send me personal copy.
Ok, I was simply acknowledging the author of the code I was
touching :-)
> Could you ask your contact if they could provide second copy for
> OpenSSL?
I'll see what I can do, it took me more than a year of tireless
work and daily poking to get a copy for myself from people I've
been interacting with for a decade.
> You mentioned Montgomery BN. There will be intersections with
> other platforms. I mean there is interest to provide alternative
> framework for exponentiation that would benefit such cases and having
> look at multiple platforms including T4 would help to choose better
> strategy.
Here is how the instructions work.
The basic model is that there is a range of sizes supported by the
instruction, and all of the data is loaded into a combination of
the floating point registers and all of the register windows of
the cpu.
For example, the montmul (Montgomery Multiply) instruction simply has
a 5-bit immediate field which indicates the size of the operands.
If it is set to N the operands are (N + 1) * 64-bits in size.
Nprime is stored in register %f60.
A[] values are stored in float and integer registers (integers go into
register window 5), in this order:
%l0, %l1, %l2, %l3, %l4, %l5, %l6, %l7
%o0, %o1, %o2, %o3, %o4, %o5, %f24, %f26
%f28, %f30, %f32, %f34, %f36, %f38, %f40, %f42
%f44, %f46, %f48, %f50, %f52, %f54, %f56, %f58
B[] values are stored in integer registers (3 register windows, 2 to 0):
%o0, %o1, %o2, %o3, %o4, %o5, (register window 2)
%l0, %l1, %l2, %l3, %l4, %l5, %l6, %l7 (register window 1)
%o0, %o1, %o2, %o3, %o4, %o5 (register window 1)
%l0, %l1, %l2, %l3, %l4, %l5, %l6, %l7 (register window 0)
%o0, %o1, %o2, %o3 (register window 0)
Similarly for the other inputs, you can see the pattern in use here.
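To make the A[] ordering above concrete, here is a small C lookup table sketch. The names montmul_a_regs and montmul_a_reg are made up for illustration, and it assumes slot 0 corresponds to the first register in the list exactly as written above.

```c
#include <stddef.h>

/* Register names for the 32 A[] slots, copied in the order the text
 * lists them. The integer registers live in register window 5.
 * Assumption: slot 0 is the first register listed. */
static const char *const montmul_a_regs[32] = {
    "%l0",  "%l1",  "%l2",  "%l3",  "%l4",  "%l5",  "%l6",  "%l7",
    "%o0",  "%o1",  "%o2",  "%o3",  "%o4",  "%o5",  "%f24", "%f26",
    "%f28", "%f30", "%f32", "%f34", "%f36", "%f38", "%f40", "%f42",
    "%f44", "%f46", "%f48", "%f50", "%f52", "%f54", "%f56", "%f58",
};

/* Hypothetical helper: return the register holding A[] slot `slot`,
 * or NULL if the slot is outside the 32 slots listed above. */
static const char *montmul_a_reg(size_t slot)
{
    return slot < 32 ? montmul_a_regs[slot] : NULL;
}
```

The first 14 slots are integer registers and the remaining 18 are even-numbered floating point registers, so a loader could switch from ldx to ldd at slot 14.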
The result is left in register window 5. If an internal ECC error
occurs on the register file during the operation, %fcc3 will be set to
unordered. This means there needs to be a limited retry loop over
this condition.
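A limited retry loop of that kind could look roughly like the C sketch below. This is only an outline under assumptions: montmul_attempt_fn, montmul_with_retry and fake_attempt are hypothetical names, and the callback stands in for "load the registers, execute the instruction, check %fcc3".

```c
#include <stdbool.h>

/* Hypothetical callback type: runs one attempt of the operation and
 * returns true when it completed cleanly, false when %fcc3 came back
 * unordered (ECC error on the register file). */
typedef bool (*montmul_attempt_fn)(void *ctx);

/* Try the operation at most max_tries times. Returns true on the first
 * clean attempt, false if every attempt reported an error, in which
 * case the caller would fall back to a software path. */
static bool montmul_with_retry(montmul_attempt_fn attempt, void *ctx,
                               int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (attempt(ctx))
            return true;
    }
    return false;
}

/* Stand-in for the real instruction, for illustration only: fails
 * *remaining times, then succeeds. */
static bool fake_attempt(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return false;
    }
    return true;
}
```

Bounding the loop matters because a persistent fault would otherwise spin forever; the retry count itself is a tuning choice not specified here.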
So basically the implementation starts at register window zero, loads
all the initial values of B[], does a 'save', loads the middle values
of B[], does a 'save', loads the last values of B[].
Then it moves on to N[], which goes into register windows 2, 3, and
4.
Next comes A[], in floating point registers and register window 5.
And finally M[], in floating point registers and register window 6.
Nprime is loaded into %f60 and the montmul instruction is executed.
This instruction can essentially be used directly via the
bn_mul_mont() function signature in OpenSSL. I don't think
any special amendments are necessary to facilitate the use of these
instructions.
The 'montsqr' (Montgomery Square) instruction uses the same scheme
and layout as 'montmul' for inputs and outputs.
Finally 'mpmul' (Multiple Precision Multiply) has a similar flavor
to montmul and montsqr, in that multiple register windows and the
floating point registers are used to load the inputs all at once for
the operation.
Again, a 5-bit immediate field 'N' encodes the size of the operands,
as "(N + 1) * 64-bits".
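As a quick illustration of that encoding, here is a C sketch that converts between an operand size in bits and the 5-bit immediate. The helper names mp_size_to_imm and mp_imm_to_bits are hypothetical; the only thing taken from the description is the "(N + 1) * 64" rule and the 5-bit field width.

```c
/* Hypothetical helper: map an operand size in bits to the 5-bit
 * immediate N, where the operand is (N + 1) * 64 bits wide.
 * Returns -1 if the size is zero, not a multiple of 64, or too large
 * for a 5-bit field (valid sizes are 64 through 2048 bits). */
static int mp_size_to_imm(unsigned bits)
{
    if (bits == 0 || bits % 64 != 0 || bits > 32 * 64)
        return -1;
    return (int)(bits / 64) - 1;
}

/* Inverse mapping: operand size in bits for a given immediate.
 * Only the low 5 bits of imm are considered. */
static unsigned mp_imm_to_bits(unsigned imm)
{
    return ((imm & 0x1f) + 1) * 64;
}
```

So a 2048-bit operand corresponds to N = 31 (0x1f), the largest value the field can hold, and a single 64-bit word corresponds to N = 0.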
The multiplier goes into a mixture of float regs and integer registers
in register window 6. The multiplicand goes into a mixture of float
regs and integer registers in register window 5, and the product goes
into integer registers in register windows 4, 3, 2, 1, and 0.
For example, to do a 2048 bit multiply given a pointer to the
multiplier in %g1, a pointer to the multiplicand in %g2, and
a pointer to the place to store the product in %g3 one would
go:
/* Register window 6 */
ldd [%g1 + 0x000], %f22
ldd [%g1 + 0x008], %f20
ldd [%g1 + 0x010], %f18
ldd [%g1 + 0x018], %f16
ldd [%g1 + 0x020], %f14
ldd [%g1 + 0x028], %f12
ldd [%g1 + 0x030], %f10
ldd [%g1 + 0x038], %f8
ldd [%g1 + 0x040], %f6
ldd [%g1 + 0x048], %f4
ldx [%g1 + 0x050], %i5
ldx [%g1 + 0x058], %i4
ldx [%g1 + 0x060], %i3
ldx [%g1 + 0x068], %i2
ldx [%g1 + 0x070], %i1
ldx [%g1 + 0x078], %i0
ldx [%g1 + 0x080], %l7
ldx [%g1 + 0x088], %l6
ldx [%g1 + 0x090], %l5
ldx [%g1 + 0x098], %l4
ldx [%g1 + 0x0a0], %l3
ldx [%g1 + 0x0a8], %l2
ldx [%g1 + 0x0b0], %l1
ldx [%g1 + 0x0b8], %l0
ldd [%g1 + 0x0c0], %f2
ldd [%g1 + 0x0c8], %f0
ldx [%g1 + 0x0d0], %o5
ldx [%g1 + 0x0d8], %o4
ldx [%g1 + 0x0e0], %o3
ldx [%g1 + 0x0e8], %o2
ldx [%g1 + 0x0f0], %o1
ldx [%g1 + 0x0f8], %o0
save
/* Register window 5 */
ldd [%g2 + 0x000], %f58
ldd [%g2 + 0x008], %f56
ldd [%g2 + 0x010], %f54
ldd [%g2 + 0x018], %f52
ldd [%g2 + 0x020], %f50
ldd [%g2 + 0x028], %f48
ldd [%g2 + 0x030], %f46
ldd [%g2 + 0x038], %f44
ldd [%g2 + 0x040], %f42
ldd [%g2 + 0x048], %f40
ldd [%g2 + 0x050], %f38
ldd [%g2 + 0x058], %f36
ldd [%g2 + 0x060], %f34
ldd [%g2 + 0x068], %f32
ldd [%g2 + 0x070], %f30
ldd [%g2 + 0x078], %f28
ldd [%g2 + 0x080], %f26
ldd [%g2 + 0x088], %f24
ldx [%g2 + 0x090], %o5
ldx [%g2 + 0x098], %o4
ldx [%g2 + 0x0a0], %o3
ldx [%g2 + 0x0a8], %o2
ldx [%g2 + 0x0b0], %o1
ldx [%g2 + 0x0b8], %o0
ldx [%g2 + 0x0c0], %l7
ldx [%g2 + 0x0c8], %l6
ldx [%g2 + 0x0d0], %l5
ldx [%g2 + 0x0d8], %l4
ldx [%g2 + 0x0e0], %l3
ldx [%g2 + 0x0e8], %l2
ldx [%g2 + 0x0f0], %l1
ldx [%g2 + 0x0f8], %l0
save
save
save
save
save
/* Register window 0 */
mpmul 0x1f
stx %l7, [%g3 + 0x000]
stx %l6, [%g3 + 0x008]
stx %l5, [%g3 + 0x010]
stx %l4, [%g3 + 0x018]
stx %l3, [%g3 + 0x020]
stx %l2, [%g3 + 0x028]
stx %l1, [%g3 + 0x030]
stx %l0, [%g3 + 0x038]
restore
/* Register window 1 */
stx %o5, [%g3 + 0x040]
stx %o4, [%g3 + 0x048]
stx %o3, [%g3 + 0x050]
stx %o2, [%g3 + 0x058]
stx %o1, [%g3 + 0x060]
stx %o0, [%g3 + 0x068]
stx %l7, [%g3 + 0x070]
stx %l6, [%g3 + 0x078]
stx %l5, [%g3 + 0x080]
stx %l4, [%g3 + 0x088]
stx %l3, [%g3 + 0x090]
stx %l2, [%g3 + 0x098]
stx %l1, [%g3 + 0x0a0]
stx %l0, [%g3 + 0x0a8]
restore
/* Register window 2 */
stx %o5, [%g3 + 0x0b0]
stx %o4, [%g3 + 0x0b8]
stx %o3, [%g3 + 0x0c0]
stx %o2, [%g3 + 0x0c8]
stx %o1, [%g3 + 0x0d0]
stx %o0, [%g3 + 0x0d8]
stx %l7, [%g3 + 0x0e0]
stx %l6, [%g3 + 0x0e8]
stx %l5, [%g3 + 0x0f0]
stx %l4, [%g3 + 0x0f8]
stx %l3, [%g3 + 0x100]
stx %l2, [%g3 + 0x108]
stx %l1, [%g3 + 0x110]
stx %l0, [%g3 + 0x118]
restore
/* Register window 3 */
stx %o5, [%g3 + 0x120]
stx %o4, [%g3 + 0x128]
stx %o3, [%g3 + 0x130]
stx %o2, [%g3 + 0x138]
stx %o1, [%g3 + 0x140]
stx %o0, [%g3 + 0x148]
stx %l7, [%g3 + 0x150]
stx %l6, [%g3 + 0x158]
stx %l5, [%g3 + 0x160]
stx %l4, [%g3 + 0x168]
stx %l3, [%g3 + 0x170]
stx %l2, [%g3 + 0x178]
stx %l1, [%g3 + 0x180]
stx %l0, [%g3 + 0x188]
restore
/* Register window 4 */
stx %o5, [%g3 + 0x190]
stx %o4, [%g3 + 0x198]
stx %o3, [%g3 + 0x1a0]
stx %o2, [%g3 + 0x1a8]
stx %o1, [%g3 + 0x1b0]
stx %o0, [%g3 + 0x1b8]
stx %l7, [%g3 + 0x1c0]
stx %l6, [%g3 + 0x1c8]
stx %l5, [%g3 + 0x1d0]
stx %l4, [%g3 + 0x1d8]
stx %l3, [%g3 + 0x1e0]
stx %l2, [%g3 + 0x1e8]
stx %l1, [%g3 + 0x1f0]
stx %l0, [%g3 + 0x1f8]
restore
restore
Of course, you might quickly ask what happens in 32-bit mode? If we
were to take a window save trap, it would clobber the upper 32-bits of
the 64-bit values we are loading into the register file.
You have to do a trick in this case by loading a cookie of some sort
(say, simply 0xffffffffffffffff) into one of the unused registers
in the initial register window. If, after the instruction executes,
the top 32-bits are zeroed out, you know that a window trap happened
and therefore you must retry.
This retry logic can be combined with the tests for ECC errors on
%fcc3.
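The combined check might be sketched in C as follows. This is just an illustration of the decision logic, not the actual assembly: WINDOW_COOKIE, cookie_clobbered and must_retry are hypothetical names, and the fcc3_unordered flag stands in for testing %fcc3 after the instruction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cookie value placed in an otherwise unused register of the initial
 * register window before the operation starts. */
#define WINDOW_COOKIE UINT64_C(0xffffffffffffffff)

/* True if the upper 32 bits of the cookie register were zeroed, which
 * means a window trap in 32-bit mode saved and restored only the low
 * halves of the registers. */
static bool cookie_clobbered(uint64_t cookie_after)
{
    return (cookie_after >> 32) == 0;
}

/* Combined retry predicate: retry on an ECC error (hypothetical flag
 * standing in for "%fcc3 is unordered") or on a clobbered cookie. */
static bool must_retry(uint64_t cookie_after, bool fcc3_unordered)
{
    return fcc3_unordered || cookie_clobbered(cookie_after);
}
```

Either condition means the register contents can no longer be trusted, so both lead to the same action: reload everything and run the instruction again.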