Hello,

I'm working on developing an OpenSSL engine to take advantage of
various encryption algorithms which can be loaded into an FPGA
which resides in a host computer system (SGI Altix, FWIW).

These algorithms see an approximately 2x speedup if the FPGA
hardware can directly access the system memory which contains
the input and output buffers, rather than having the CPU push
the data into the FPGA (i.e. the FPGA device becomes a DMA
master).  To achieve this, the system memory of course needs
to be locked down (i.e. non-swappable), contiguous over
the length of the entire buffer (possibly hundreds of megabytes),
and aligned to certain boundaries (128 bytes
in our particular case).
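
As a rough sketch of the alignment and locking requirements (not the
actual engine code), a buffer could be obtained as below.  The names
DMA_ALIGN, dma_padded_len, dma_buf_alloc and dma_buf_free are
hypothetical.  Note that posix_memalign() plus mlock() only gives an
aligned, non-swappable buffer; it does not by itself make the buffer
physically contiguous.

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

#define DMA_ALIGN 128   /* alignment from the message; name is hypothetical */

/* Round len up to a whole number of DMA_ALIGN units (at least one). */
static size_t dma_padded_len(size_t len)
{
    size_t padded = (len + DMA_ALIGN - 1) / DMA_ALIGN * DMA_ALIGN;
    return padded == 0 ? DMA_ALIGN : padded;
}

/* Return a DMA_ALIGN-aligned buffer of at least len bytes, or NULL.
 * *locked is set to 1 if mlock() pinned the pages, 0 if it failed
 * (for example because of RLIMIT_MEMLOCK). */
static void *dma_buf_alloc(size_t len, int *locked)
{
    void *p = NULL;
    size_t padded = dma_padded_len(len);

    if (posix_memalign(&p, DMA_ALIGN, padded) != 0)
        return NULL;
    *locked = (mlock(p, padded) == 0);
    return p;
}

/* Unlock and release a buffer obtained from dma_buf_alloc(). */
static void dma_buf_free(void *p, size_t len)
{
    munlock(p, dma_padded_len(len));
    free(p);
}
```

A caller would check the locked flag and fall back to PIO transfers
when the pages could not be pinned.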

This can all be achieved fairly readily, as long as the application
programmer is aware of the requirements, by allocating suitable
memory, passing it as the input and output buffers to
EVP_EncryptUpdate(), and ensuring that only whole multiples of the
encryption block length are passed down to EVP_EncryptUpdate().  One
suitable way to do this on Linux is to perform the allocations
through the hugetlb filesystem.
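
A minimal sketch of such an allocation, assuming a hugetlb filesystem
is already mounted and that the caller knows the huge page size (it is
configurable on some platforms, so it is a parameter here).  The
function names round_up_to and hugetlb_map, and any path passed in,
are hypothetical.  Huge pages are not swapped, but physical contiguity
is only guaranteed within each huge page.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round len up to a whole multiple of unit (unit must be non-zero). */
static size_t round_up_to(size_t len, size_t unit)
{
    return (len + unit - 1) / unit * unit;
}

/* Map at least len bytes from a file that must live on a mounted
 * hugetlb filesystem.  On success returns the mapping and stores its
 * real length in *mapped_len; on failure returns NULL and leaves
 * *mapped_len untouched. */
static void *hugetlb_map(const char *path, size_t len,
                         size_t huge_page_size, size_t *mapped_len)
{
    size_t total = round_up_to(len, huge_page_size);
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    void *p;

    if (fd < 0)
        return NULL;
    p = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);              /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return NULL;
    *mapped_len = total;
    return p;
}
```

The mapping would later be released with munmap(p, mapped_len), and
the backing file removed with unlink().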

However, there is a problem with the tail data processing that
may occur in EVP_EncryptFinal_ex(), or when the provided data
doesn't fill an encryption block in EVP_EncryptUpdate().  In
these functions, ctx->cipher->do_cipher() is called using
ctx->buf as the input data source.  This buffer is not allocated
in any special manner to ensure the block size, alignment,
and page locking we require.
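
To make the problem concrete, here is a simplified model of the
buffering described above.  It is not OpenSSL's actual code, and the
names update_plan and plan_update are hypothetical.  For one update
call it reports how many bytes would be ciphered out of the internal
staging buffer (ctx->buf in the real code), how many directly from the
caller's buffer, and how many are left buffered for later.

```c
#include <stddef.h>

typedef struct update_plan {
    size_t staged;    /* bytes ciphered from the internal staging buffer */
    size_t direct;    /* bytes ciphered directly from the caller's buffer */
    size_t buffered;  /* bytes left in the staging buffer afterwards */
} update_plan;

/* buf_len: bytes already buffered (always less than bl),
 * inl:     bytes supplied by this call,
 * bl:      cipher block length in bytes. */
static update_plan plan_update(size_t buf_len, size_t inl, size_t bl)
{
    update_plan p = { 0, 0, 0 };

    if (buf_len != 0) {
        if (buf_len + inl < bl) {
            /* still less than one block: everything stays buffered */
            p.buffered = buf_len + inl;
            return p;
        }
        /* top up the staging buffer to one full block and cipher it */
        p.staged = bl;
        inl -= bl - buf_len;
    }
    p.buffered = inl % bl;       /* trailing partial block */
    p.direct = inl - p.buffered; /* whole blocks from the caller's buffer */
    return p;
}
```

Any call with a non-zero staged count is one where the hardware would
be handed the internal buffer rather than the application's
DMA-capable memory.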

I'm not sure who the authority on this would be, but in general
do you think it would meet with acceptance for mainline OpenSSL
inclusion if I provided an extension to the encryption engine
interface to provide a means to allocate this buffer via an
engine entrypoint (i.e. alongside do_cipher, init, finish, and ctrl)?

I'd also like to expose this as a top-level EVP_CIPHER_CTX_malloc()
function, so that applications could perform engine-optimized
allocations without needing to be aware of the specifics of how
that allocation should occur, leaving those specifics to the
engine code.
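
Purely to illustrate the shape of what I am proposing, and not any
existing OpenSSL interface, the dispatch could look something like the
sketch below.  Every name in it (demo_cipher, demo_ctx,
demo_ctx_malloc, demo_ctx_free, engine_alloc_128, engine_free_128) is
hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for a cipher method table with two extra, optional hooks. */
typedef struct demo_cipher {
    void *(*buf_alloc)(size_t len);  /* NULL: no special requirements */
    void (*buf_free)(void *p);
} demo_cipher;

/* Stand-in for a cipher context that points at its method table. */
typedef struct demo_ctx {
    const demo_cipher *cipher;
} demo_ctx;

/* Allocate through the engine hook if there is one, else malloc(). */
static void *demo_ctx_malloc(const demo_ctx *ctx, size_t len)
{
    if (ctx->cipher != NULL && ctx->cipher->buf_alloc != NULL)
        return ctx->cipher->buf_alloc(len);
    return malloc(len);
}

/* Release through the matching hook if there is one, else free(). */
static void demo_ctx_free(const demo_ctx *ctx, void *p)
{
    if (ctx->cipher != NULL && ctx->cipher->buf_free != NULL) {
        ctx->cipher->buf_free(p);
        return;
    }
    free(p);
}

/* Example engine hooks: 128-byte aligned memory.  A real engine would
 * also lock the pages and ensure contiguity. */
static void *engine_alloc_128(size_t len)
{
    void *p = NULL;
    return posix_memalign(&p, 128, len) == 0 ? p : NULL;
}

static void engine_free_128(void *p)
{
    free(p);
}
```

The application only ever calls the top-level allocate and free
functions; whether the memory ends up aligned, locked, or plain heap
memory is decided by the engine behind the context.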

If there were another way around this I would certainly use it, but
as it's a limitation of the DMA hardware, and DMA gives us such a
substantial boost in performance (again, approximately 2x compared
with PIO transfers from the CPU), it seems a reasonable thing to do.

Thanks,
Brent Casavant

-- 
Brent Casavant                          All music is folk music.  I ain't
[EMAIL PROTECTED]                        never heard a horse sing a song.
Silicon Graphics, Inc.                    -- Louis Armstrong
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [EMAIL PROTECTED]
