Complicated question for implementers here. I'm working on a new constant-time crypto math core in the spirit of mpfq, but hopefully faster and not vulnerable to side-channel attacks. I have a couple of questions about Montgomery reduction techniques. I'm targeting pairing-based applications, where p ~ 2^256, so imagine the modulus as being 4x64-bit words. I'm primarily targeting x86-64, but I'm also somewhat interested in ARM (and portable C). In any case, suppose we have n w-bit words, where n is small.
Just a reminder of Montgomery reduction: let R = 2^(nw) = 2^256 in our example, and k = -p^-1 mod R. Then given x < p*R (so x fits into 2n words), we can compute z = x/R mod p as

  y := (x + (x*k mod R)*p) / R
  z := y, or y - p if y >= p

Reducing mod R and dividing by R cost essentially nothing, because you just keep or drop words of the result.

There are two first steps to make this faster. The standard one is to reduce in several steps instead of all at once: just clear the bottom word of x, then the next word up, etc. The advantage is that you only need the low word of k. That makes the algorithm run with 2n^2 MULs and n IMULs (only low halves needed). Alternatively, you can note that (x*k mod R)*p need only be computed approximately, because you're about to divide by R; this way costs 2n^2 + n - 1 MULs and n IMULs.

Additionally, there is a choice of scanning method when multiplying. Product scanning computes a given place in the product in the inner loop, whereas operand scanning multiplies one input by a single word of the other input in each iteration.

So anyway, it seems to me that product scanning is preferable on x86, because MUL clobbers the carry flag, and you don't need to preserve a carry flag between MULs if you're doing product scanning: you just multiply and then accumulate the 128-bit result into a 192-bit accumulator. Intuitively, it should also pipeline better on processors such as Westmere, which have big differences in latency between the high and low halves of a MUL. It seems that this is what Crypto++ does.

However, for Montgomery multiplication, product scanning seems less than ideal. The standard (2n^2 + n)-multiply method has a dependency between the low words in each iteration, so computing these all sequentially will probably wedge the pipeline (but I'll run an experiment to verify this). So this leaves options:

1) Do product scanning anyway. I haven't tried this yet.

2) Do product scanning, but let some parts get ahead of others at the cost of some extra ADC $0's.
3) Do operand scanning, saving and restoring the carry flag, at the cost of ~2 extra operations per MUL.

4) Do operand scanning, emulating ARM's UMAAL (multiply wide, double accumulate narrow, i.e. carry a whole word instead of a bit) at the cost of ~2-3 extra operations per MUL, but perhaps with more favorable pipelining.

5) Implement the approximate technique instead. This is more complicated and costs extra registers and extra MULs, but it allows product scanning without confronting any of these issues.

6) Something else? Some kind of hybrid?

On ARM I'll probably just try (4), because it's simple and you have a native UMAAL there. Does anyone have advice on which of these techniques to apply, and how?

I've done some preliminary tests on Core2 and Sandy Bridge with (3), (4) and (5). It seems that (3)'s long dependency chain pipelines poorly; (4) is passable on Sandy Bridge (where the cheap ADC $0 makes my UMAAL emulation faster) but slow on Core2; and (5) is fine on both but seems suboptimal because of the extra MULs. But I haven't fully optimized these, and I'd like to hear some other devs' opinions.

Thanks for your time,
-- Mike Hamburg

--
You received this message because you are subscribed to the "Crypto++ Users" Google Group.
To unsubscribe, send an email to [email protected].
More information about Crypto++ and this group is available at http://www.cryptopp.com.
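P.S. For concreteness, the word-by-word reduction described above can be sketched in portable C. This is only an illustrative sketch, not the constant-time core itself: the example modulus (2^255 - 19), the names redc/neg_pinv64, and the use of GCC/Clang's unsigned __int128 are my own choices for the example, and the carry handling is left to the compiler rather than to hand-scheduled MUL/ADC.

```c
#include <stdint.h>
#include <string.h>

typedef unsigned __int128 u128;
#define N 4   /* words in the modulus */

/* Example modulus: p = 2^255 - 19, little-endian words.
   (An assumption for illustration; any odd p of up to N words works.) */
static const uint64_t P[N] = {
    0xffffffffffffffedULL, 0xffffffffffffffffULL,
    0xffffffffffffffffULL, 0x7fffffffffffffffULL
};

/* k0 = -p^-1 mod 2^64 by Newton iteration; only this low word of k
   is needed for the word-by-word method. */
static uint64_t neg_pinv64(void) {
    uint64_t x = P[0];              /* inverse of P[0] mod 2^3 for odd P[0] */
    for (int i = 0; i < 5; i++)
        x *= 2 - P[0] * x;          /* each step doubles the correct bits */
    return (uint64_t)0 - x;
}

/* Word-by-word Montgomery reduction: z = x / 2^(64N) mod p, for x < p*2^(64N).
   Each outer iteration clears one low word of t using one IMUL (m = t[i]*k0)
   and N MULs (m * P[j]). */
static void redc(uint64_t z[N], const uint64_t x[2 * N]) {
    uint64_t t[2 * N + 1];
    memcpy(t, x, 2 * N * sizeof t[0]);
    t[2 * N] = 0;
    const uint64_t k0 = neg_pinv64();

    for (int i = 0; i < N; i++) {
        uint64_t m = t[i] * k0;     /* chosen so t[i] becomes 0 below */
        uint64_t carry = 0;
        for (int j = 0; j < N; j++) {
            u128 s = (u128)m * P[j] + t[i + j] + carry;
            t[i + j] = (uint64_t)s;
            carry = (uint64_t)(s >> 64);
        }
        for (int j = i + N; carry && j <= 2 * N; j++) {  /* propagate up */
            u128 s = (u128)t[j] + carry;
            t[j] = (uint64_t)s;
            carry = (uint64_t)(s >> 64);
        }
    }

    /* y = t / 2^(64N) lives in t[N..2N]; y < 2p, so subtract p at most once.
       Done branchlessly, as a constant-time core requires. */
    uint64_t d[N], bw = 0;
    for (int j = 0; j < N; j++) {
        u128 s = (u128)t[N + j] - P[j] - bw;
        d[j] = (uint64_t)s;
        bw = (uint64_t)(s >> 64) & 1;
    }
    /* all-ones mask keeps t (y < p); all-zeros keeps d (y - p) */
    uint64_t mask = (uint64_t)0 - (bw & (t[2 * N] ^ 1));
    for (int j = 0; j < N; j++)
        z[j] = (t[N + j] & mask) | (d[j] & ~mask);
}
```

A quick sanity check: feeding in x = a * 2^256 (low half zero) must return a for any a < p, and feeding in x = p must return 0.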
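Likewise, here is what I mean by product scanning (Comba-style): finish one output column at a time, accumulating each 128-bit partial product into a three-word, 192-bit accumulator, so no carry flag has to survive from one MUL to the next. Again a sketch in C with names of my own choosing, not tuned assembly:

```c
#include <stdint.h>

typedef unsigned __int128 u128;
#define NW 4  /* words per operand */

/* Product-scanning multiply: r[0..2*NW-1] = a * b (little-endian words).
   The accumulator is 192 bits: `acc` holds the low 128, `hi` the top word. */
static void mul_comba(uint64_t r[2 * NW],
                      const uint64_t a[NW], const uint64_t b[NW]) {
    u128 acc = 0;
    uint64_t hi = 0;
    for (int k = 0; k < 2 * NW - 1; k++) {
        int lo_i = k < NW ? 0 : k - NW + 1;   /* i ranges so that */
        int hi_i = k < NW ? k : NW - 1;       /* 0 <= i, k-i < NW */
        for (int i = lo_i; i <= hi_i; i++) {
            u128 prod = (u128)a[i] * b[k - i];
            acc += prod;
            hi += (acc < prod);   /* carry out of the 128-bit add */
        }
        r[k] = (uint64_t)acc;                    /* column k is done */
        acc = (acc >> 64) | ((u128)hi << 64);    /* shift accumulator */
        hi = 0;
    }
    r[2 * NW - 1] = (uint64_t)acc;
}
```

The point of the structure is visible in the inner loop: each MUL's result is consumed immediately, and the only cross-iteration state is the accumulator, not a flag.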
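And for anyone unfamiliar with UMAAL, this C model shows the semantics that option (4) emulates. The key identity is (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1, so a*b plus two full-word addends always fits in exactly two words: you carry a whole word instead of a flag bit. The helper names (umaal, addmul1) are mine:

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Model of ARM's UMAAL: carry:lo <- a*b + lo + carry.
   The sum can never overflow 128 bits, so no flag is needed. */
static inline void umaal(uint64_t *lo, uint64_t *carry,
                         uint64_t a, uint64_t b) {
    u128 t = (u128)a * b + *lo + *carry;
    *lo = (uint64_t)t;
    *carry = (uint64_t)(t >> 64);
}

/* One operand-scanning step built on it: acc[0..n] += a[0..n-1] * b,
   i.e. multiply one operand by a single word of the other.
   (Caller must ensure acc[n] cannot overflow.) */
static void addmul1(uint64_t *acc, const uint64_t *a, int n, uint64_t b) {
    uint64_t carry = 0;
    for (int i = 0; i < n; i++)
        umaal(&acc[i], &carry, a[i], b);
    acc[n] += carry;
}
```

Even with all inputs at their maximum, the result stays in two words, which is exactly why the word-sized carry chain works.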
