Complicated question for implementers here. I'm working on a new constant-time crypto math core in the spirit of mpfq, but hopefully faster and not vulnerable to side-channel attacks. I have a couple of questions about Montgomery reduction techniques. I'm targeting pairing-based applications, where p ~ 2^256, so imagine the modulus as being 4x64-bit words. I'm primarily targeting x86-64, but I'm also somewhat interested in ARM (and portable C). In any case, suppose we have n w-bit words, where n is small.
Just a reminder of Montgomery reduction: let R = 2^(nw) = 2^256 in our example, and k = -p^-1 mod R. Then given x < p*R (so x fits into 2n words), we can compute z = x/R mod p as

  y := (x + (x*k mod R)*p) / R
  z := y, or y - p if y >= p

Reducing mod R and dividing by R cost essentially nothing, because you just keep or drop words of the result.

There are two first steps to make this faster. The standard one is to reduce in several steps instead of all at once: just clear the bottom word of x, then the next word up, etc. The advantage is that you only need the low word of k. That makes the algorithm run with 2n^2 MULs and n IMULs (only low halves needed). Alternatively, you can note that (x*k mod R)*p need only be computed approximately, because you're about to divide by R; this way costs 2n^2 + n - 1 MULs and n IMULs.

Additionally, there is a choice of scanning method when multiplying. Product scanning computes a given place in the product in the inner loop, whereas operand scanning multiplies one input by a single word of the other input in each iteration.

So anyway, it seems to me that product scanning is preferable on x86, because MUL clobbers the carry flag, and you don't need to preserve a carry flag between MULs if you're doing product scanning: you just multiply and then accumulate the 128-bit result into a 192-bit accumulator. Intuitively, it should also pipeline better on processors such as Westmere, which have big differences in latency between the high and low halves of a MUL. It seems that this is what Crypto++ does.

However, for Montgomery multiplication, product scanning seems less than ideal. The standard (2n^2 + n)-multiply method has a dependency between the low words in each iteration, so computing these all sequentially will probably wedge the pipeline (but I'll run an experiment to verify this). So this leaves options:

1) Do product scanning anyway. I haven't tried this yet.

2) Do product scanning, but let some parts get ahead of others at the cost of some extra ADC $0's.
3) Do operand scanning, saving and restoring the carry flag, at the cost of ~2 extra operations per MUL.

4) Do operand scanning, emulating ARM's UMAAL (multiply wide, double accumulate narrow, i.e. carry a whole word instead of a bit) at the cost of ~2-3 extra operations per MUL, but perhaps with more favorable pipelining.

5) Implement the approximate technique instead. This is more complicated and costs extra registers and extra MULs, but it allows product scanning without confronting any of these issues.

6) Something else? Some kind of hybrid?

On ARM I'll probably just try (4), because it's simple and you have a native UMAAL there. Does anyone have advice on which of these techniques to apply, and how?

I've done some preliminary tests on Core2 and Sandy Bridge with (3), (4) and (5). It seems that (3)'s long dependency chain pipelines poorly; (4) is passable on Sandy Bridge (where the cheap ADC $0 makes my UMAAL emulation faster) but slow on Core2; and (5) is fine on both but seems suboptimal because of the extra MULs. But I haven't fully optimized these, and I'd like to hear some other devs' opinions.

Thanks for your time,
-- Mike Hamburg

--
You received this message because you are subscribed to the "Crypto++ Users" Google Group.
To unsubscribe, send an email to [email protected].
More information about Crypto++ and this group is available at http://www.cryptopp.com.
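P.S. For concreteness, the word-by-word reduction described above can be sketched in portable C. This is only an illustrative sketch, not the constant-time core itself: the example modulus (2^255 - 19), the names redc/neg_pinv64, and the use of GCC/Clang's unsigned __int128 are my own choices for the example, and the carry handling is left to the compiler rather than to hand-scheduled MUL/ADC.

```c
#include <stdint.h>
#include <string.h>

typedef unsigned __int128 u128;
#define N 4   /* words in the modulus */

/* Example modulus: p = 2^255 - 19, little-endian words.
   (An assumption for illustration; any odd p of up to N words works.) */
static const uint64_t P[N] = {
    0xffffffffffffffedULL, 0xffffffffffffffffULL,
    0xffffffffffffffffULL, 0x7fffffffffffffffULL
};

/* k0 = -p^-1 mod 2^64 by Newton iteration; only this low word of k
   is needed for the word-by-word method. */
static uint64_t neg_pinv64(void) {
    uint64_t x = P[0];              /* inverse of P[0] mod 2^3 for odd P[0] */
    for (int i = 0; i < 5; i++)
        x *= 2 - P[0] * x;          /* each step doubles the correct bits */
    return (uint64_t)0 - x;
}

/* Word-by-word Montgomery reduction: z = x / 2^(64N) mod p, for x < p*2^(64N).
   Each outer iteration clears one low word of t using one IMUL (m = t[i]*k0)
   and N MULs (m * P[j]). */
static void redc(uint64_t z[N], const uint64_t x[2 * N]) {
    uint64_t t[2 * N + 1];
    memcpy(t, x, 2 * N * sizeof t[0]);
    t[2 * N] = 0;
    const uint64_t k0 = neg_pinv64();

    for (int i = 0; i < N; i++) {
        uint64_t m = t[i] * k0;     /* chosen so t[i] becomes 0 below */
        uint64_t carry = 0;
        for (int j = 0; j < N; j++) {
            u128 s = (u128)m * P[j] + t[i + j] + carry;
            t[i + j] = (uint64_t)s;
            carry = (uint64_t)(s >> 64);
        }
        for (int j = i + N; carry && j <= 2 * N; j++) {  /* propagate up */
            u128 s = (u128)t[j] + carry;
            t[j] = (uint64_t)s;
            carry = (uint64_t)(s >> 64);
        }
    }

    /* y = t / 2^(64N) lives in t[N..2N]; y < 2p, so subtract p at most once.
       Done branchlessly, as a constant-time core requires. */
    uint64_t d[N], bw = 0;
    for (int j = 0; j < N; j++) {
        u128 s = (u128)t[N + j] - P[j] - bw;
        d[j] = (uint64_t)s;
        bw = (uint64_t)(s >> 64) & 1;
    }
    /* all-ones mask keeps t (y < p); all-zeros keeps d (y - p) */
    uint64_t mask = (uint64_t)0 - (bw & (t[2 * N] ^ 1));
    for (int j = 0; j < N; j++)
        z[j] = (t[N + j] & mask) | (d[j] & ~mask);
}
```

A quick sanity check: feeding in x = a * 2^256 (low half zero) must return a for any a < p, and feeding in x = p must return 0.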
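Likewise, here is what I mean by product scanning (Comba-style): finish one output column at a time, accumulating each 128-bit partial product into a three-word, 192-bit accumulator, so no carry flag has to survive from one MUL to the next. Again a sketch in C with names of my own choosing, not tuned assembly:

```c
#include <stdint.h>

typedef unsigned __int128 u128;
#define NW 4  /* words per operand */

/* Product-scanning multiply: r[0..2*NW-1] = a * b (little-endian words).
   The accumulator is 192 bits: `acc` holds the low 128, `hi` the top word. */
static void mul_comba(uint64_t r[2 * NW],
                      const uint64_t a[NW], const uint64_t b[NW]) {
    u128 acc = 0;
    uint64_t hi = 0;
    for (int k = 0; k < 2 * NW - 1; k++) {
        int lo_i = k < NW ? 0 : k - NW + 1;   /* i ranges so that */
        int hi_i = k < NW ? k : NW - 1;       /* 0 <= i, k-i < NW */
        for (int i = lo_i; i <= hi_i; i++) {
            u128 prod = (u128)a[i] * b[k - i];
            acc += prod;
            hi += (acc < prod);   /* carry out of the 128-bit add */
        }
        r[k] = (uint64_t)acc;                    /* column k is done */
        acc = (acc >> 64) | ((u128)hi << 64);    /* shift accumulator */
        hi = 0;
    }
    r[2 * NW - 1] = (uint64_t)acc;
}
```

The point of the structure is visible in the inner loop: each MUL's result is consumed immediately, and the only cross-iteration state is the accumulator, not a flag.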
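And for anyone unfamiliar with UMAAL, this C model shows the semantics that option (4) emulates. The key identity is (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1, so a*b plus two full-word addends always fits in exactly two words: you carry a whole word instead of a flag bit. The helper names (umaal, addmul1) are mine:

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Model of ARM's UMAAL: carry:lo <- a*b + lo + carry.
   The sum can never overflow 128 bits, so no flag is needed. */
static inline void umaal(uint64_t *lo, uint64_t *carry,
                         uint64_t a, uint64_t b) {
    u128 t = (u128)a * b + *lo + *carry;
    *lo = (uint64_t)t;
    *carry = (uint64_t)(t >> 64);
}

/* One operand-scanning step built on it: acc[0..n] += a[0..n-1] * b,
   i.e. multiply one operand by a single word of the other.
   (Caller must ensure acc[n] cannot overflow.) */
static void addmul1(uint64_t *acc, const uint64_t *a, int n, uint64_t b) {
    uint64_t carry = 0;
    for (int i = 0; i < n; i++)
        umaal(&acc[i], &carry, a[i], b);
    acc[n] += carry;
}
```

Even with all inputs at their maximum, the result stays in two words, which is exactly why the word-sized carry chain works.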
