>> I've put together some code to do parallel 512-bit montgomery
>> multiplication for nVidia GPUs. On my 8800GTX I get about 12k of these
>> per second in 2k batches,

>What does "12k in 2k" mean? I mean given that 512 bits is 64 bytes does 2k
>mean that you process 32 vectors at once and 12k means that is takes
>1/(12000/32) seconds to perform 32 private key operations?

Worse than that. I perform montgomery exponentiation (not multiplication -
my bad) in batches of 2048 operations at a time, with each batch taking
approximately 160ms. That gives an average latency of 240ms for a fully
loaded system which processes one batch whilst building another: a request
waits on average ~80ms for its batch to fill, then 160ms for execution.
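To make the arithmetic explicit (a rough model of the double-buffered pipeline described above, nothing more):

```python
batch_size = 2048      # montgomery exponentiations per batch
batch_time = 0.160     # seconds of GPU time per batch

# Steady-state throughput of the pipeline: ~12,800 exps/second,
# i.e. the "about 12k per second" figure.
throughput = batch_size / batch_time

# With one batch executing while the next fills, an average request
# waits about half a batch interval before submission, then one full
# batch of GPU execution: 80ms + 160ms = 240ms.
avg_latency = batch_time / 2 + batch_time
```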

>> so enough to do 6k 1024-bit RSA private decrypts.

>Why do you think that it will be 6k? Complexity is n^2 so it should be
>12/4, not 12/2.

If you have the 512-bit factors p and q of the 1024-bit private key, the
majority of the time in the openssl implementation seems to go into a pair
of 512-bit mont exp operations for each private decrypt, followed by a bit
of CRT to bang them together. Obviously if you don't have p and q you need
a single 1024-bit exponentiation, which you would expect to be roughly
twice as costly, and may run into memory limitations on the GPU. I haven't
tackled that case.
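The CRT split works out as follows. This is a toy sketch with small primes standing in for the 512-bit factors, and Python's built-in pow() standing in for the mont exp kernel; the variable names (dp, dq, qinv) follow the usual RSA-CRT convention, not any particular OpenSSL internals:

```python
# Toy RSA-CRT private operation: two half-size exponentiations
# plus Garner's recombination. Small primes stand in for the
# real 512-bit factors p and q.
p, q = 61, 53
n = p * q                           # public modulus
e = 17
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent

dp, dq = d % (p - 1), d % (q - 1)   # half-size exponents
qinv = pow(q, -1, p)                # precomputed q^-1 mod p

def crt_decrypt(c):
    # The pair of half-size mont exp operations the GPU would batch:
    m1 = pow(c, dp, p)
    m2 = pow(c, dq, q)
    # "a bit of CRT to bang them together" (Garner's recombination):
    h = (qinv * (m1 - m2)) % p
    return m2 + h * q

m = 65
c = pow(m, e, n)
assert crt_decrypt(c) == m
```

Since each half-size exponentiation costs roughly a quarter of the full-size one, the CRT form is about twice as fast overall, which is where the 6k figure comes from.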

> On this basis it looks like GPUs make pretty competitive cheap
> crypto accelerators :)
>
> I don't have the time or expertise to integrate this as a patch. Is there
> anyone who would be interested in taking this on? A significant challenge
> would be to provide a sensible interface for batch processing of requests;
> the current interface doesn't look well suited to this task.

> Trouble with crypto, especially with asymmetric, is that there is no
> guaranteed data flow. I mean there is no guarantee that there will be 32
> operations every 32/12 milliseconds, and given a lower average request
> rate, batching 32 requests will incur longer average service times you
> might not be willing to tolerate: when a request is posed it's expected
> to be served instantly, not wait till 31 extra ones are posed. So you'd
> have to be adaptive, i.e. grab as many requests as currently available
> and process sometimes 1, sometimes 5, 2, 15, 9, etc. operations at once.
> Now what if the average load is less than 12000/32 requests per second?
> Then you'd go with a single request at a time practically all the time.
> But single-operation performance is *far* from impressive. So you'd have
> to play even smarter: grab the currently outstanding requests and decide
> if it would pay off to pack them together and send them to the GPU, or
> just process them on the CPU. Such dispatching per se implies certain
> requirements on application design (as it would be hardly appropriate to
> delegate this role to, say, an OpenSSL engine), so it's likely to be
> impossible to "just use the GPU with an arbitrary application."
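The adaptive dispatch being described could be sketched like this. Everything here is hypothetical: the threshold value, the drain-and-decide loop, and the gpu_batch/cpu_single callables are illustrative stand-ins, not any real OpenSSL or CUDA API:

```python
import queue

# Assumed break-even batch size below which the GPU round-trip
# isn't worth it; a real dispatcher would measure this.
GPU_THRESHOLD = 32

def drain(requests):
    """Grab as many requests as are currently available, without
    waiting for more to arrive."""
    batch = []
    while True:
        try:
            batch.append(requests.get_nowait())
        except queue.Empty:
            return batch

def dispatch_once(requests, gpu_batch, cpu_single):
    """One pass of the adaptive dispatcher: sometimes it sends 1,
    sometimes 5, 2, 15, 9, etc. operations, and picks GPU or CPU
    based on whether packing them together pays off."""
    batch = drain(requests)
    if len(batch) >= GPU_THRESHOLD:
        gpu_batch(batch)            # pack together and send to GPU
    else:
        for req in batch:
            cpu_single(req)         # below break-even: CPU wins on latency
    return len(batch)
```

As the quoted text notes, the hard part isn't this loop but the fact that callers must route requests through a dispatcher at all, which is an application-design constraint rather than something a drop-in engine can hide.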

I agree with most of that. However, based on benchmarks on my desktop (a
Core 2 Duo E6400) the 32-bit x86 assembler mont exp implementation in
OpenSSL seems a _lot_ slower than my GPU. There are plenty of applications
(web serving for example) which can tolerate 240ms of latency in return
for a fivefold increase in throughput.

I suppose my question is whether anyone has considered providing an
alternative, asynchronous, interface to the OpenSSL crypto libraries. The
current API is not a suitable abstraction for a crypto device which
exhibits latency comparable to or greater than its execution time, unless
you're prepared to fire off a lot of threads in the client. Either batched
submission or callbacks would address this.
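To illustrate the shape such an interface might take, here is a minimal sketch of batched submission with callbacks. None of these names exist in OpenSSL; the class, its methods, and the batch_fn parameter are all invented for illustration:

```python
import threading
from typing import Callable

class AsyncModExp:
    """Hypothetical async front-end: callers submit a request plus a
    callback and regain control immediately; requests accumulate and
    are flushed to the batch back-end (e.g. a GPU kernel) either when
    a full batch is ready or on an explicit flush()."""

    def __init__(self, batch_fn, max_batch=2048):
        self._batch_fn = batch_fn    # processes a list of requests
        self._max_batch = max_batch
        self._pending = []           # (request, callback) pairs
        self._lock = threading.Lock()

    def submit(self, request, callback: Callable):
        with self._lock:
            self._pending.append((request, callback))
            if len(self._pending) >= self._max_batch:
                self._flush_locked()

    def flush(self):
        with self._lock:
            self._flush_locked()

    def _flush_locked(self):
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        results = self._batch_fn([req for req, _ in batch])
        for (_, cb), res in zip(batch, results):
            cb(res)
```

A caller would do ctx.submit(request, on_done) and carry on, with on_done invoked once the batch completes; this avoids firing off a thread per outstanding operation while still letting the back-end see full batches.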

Cheers
Eben
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [EMAIL PROTECTED]
