Hi Everyone,

We are getting ready to add benchmarks for a selection of RNGs provided by 
the library. The results are interesting.

The RDRAND and RDSEED benchmark results did not pass the sniff test. Both 
generators were under-performing, and it had gone unnoticed. RDRAND and 
RDSEED are being reworked, which should increase throughput by about 50%. 
Here are some of the details:

* RDRAND and RDSEED throughput varies wildly depending on processor family, 
CPU sub-architecture and processor manufacturer. On Intel hardware, RDSEED 
runs at anywhere from 1/2 to 1/5 of the rate of RDRAND.

* The reworked generators always fulfill a request if the function 
parameters are correct and the hardware is present. If an instruction 
fails, they retry immediately, with no user intervention required.

* The retry parameters were removed. In practice RDRAND never fails, and it 
was hard to predict how many retries were "enough" for RDSEED. Now that the 
generators retry immediately, there is no need for an external retry count.

* RDRAND and RDSEED's GenerateBlock was crippled by C++ exceptions. Setting 
up the C++ exceptions dominated the function. I estimated we needed about 5 
to 7 instructions for generation and book-keeping in a tight loop: 
generate, test for failure, write the value, increment the pointer, 
decrement the size. We were getting 50 to 70 additional instructions as the 
exception frame was set up and torn down. Keep in mind the library often 
"chunks" a request, so a 10 KB request might be broken into multiple 
512-byte or 1024-byte requests, and each one paid that overhead.

* RDRAND and RDSEED's GenerateBlock had two sources of C++ exceptions. The 
first was hardware and parameter validation, and the second was retry 
counts. Now that the exceptions are removed, GenerateBlock is an x86_64 
leaf function without a red zone.

* Internal implementations, like NASM_RDRAND_GenerateBlock and 
MASM_RDRAND_GenerateBlock, now return void (i.e., no return value). The 
return values are no longer needed since GenerateBlock either fulfills the 
request or, if the hardware is not present, crashes. It no longer throws an 
exception.

* We tested a parallelized implementation using OpenMP with a parallel 
for-loop. The generator lost roughly 7 MiB/s of total throughput for each 
OMP thread added on a particular test machine using GCC 4 and 5. That is, 
we could baseline at 70 MiB/s with a single thread, and then achieve 
63 MiB/s total using 2 OMP threads, 56 MiB/s using 3 OMP threads, and 
47 MiB/s using 4 OMP threads.
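
To make the tight loop and the chunking described above concrete, here is a 
minimal sketch. It is not the library's code. FakeHardwareStep, 
GenerateBlockSketch and GenerateChunked are hypothetical names made up for 
illustration, and the fake step only simulates an instruction that 
sometimes reports failure (it does not produce random data). The real 
generators would issue RDRAND or RDSEED in its place.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for one RDRAND/RDSEED step (an assumption, not a
// library type). It reports failure on every third call so the retry path
// is exercised. The values it writes are placeholders, not random data.
struct FakeHardwareStep {
    std::uint64_t calls = 0;
    std::uint64_t failures = 0;

    bool operator()(std::uint64_t& out) {
        ++calls;
        if (calls % 3 == 0) {
            ++failures;
            return false;  // simulated "no value available"
        }
        out = calls * 0x9E3779B97F4A7C15ULL;
        return true;
    }
};

// Fills [output, output + size) without a retry count and without
// exceptions: generate, test for failure, write the value, advance the
// pointer, decrement the size.
template <class Step>
void GenerateBlockSketch(Step& step, unsigned char* output,
                         std::size_t size) noexcept {
    while (size > 0) {
        std::uint64_t value;
        if (!step(value))
            continue;  // immediate retry, no external retry count

        // The final word may be only partially consumed.
        const std::size_t n = std::min(size, sizeof(value));
        std::memcpy(output, &value, n);
        output += n;
        size -= n;
    }
}

// Mimics a caller that breaks a large request into fixed-size chunks.
// Returns the number of GenerateBlockSketch calls that were made.
template <class Step>
std::size_t GenerateChunked(Step& step, unsigned char* output,
                            std::size_t size, std::size_t chunk) {
    std::size_t requests = 0;
    while (size > 0) {
        const std::size_t n = std::min(size, chunk);
        GenerateBlockSketch(step, output, n);
        output += n;
        size -= n;
        ++requests;
    }
    return requests;
}
```

With a 10240-byte buffer and 1024-byte chunks, GenerateChunked makes 10 
calls into the inner loop. Any fixed per-call cost, such as setting up an 
exception frame, would be paid once per chunk, which is why keeping the 
inner function free of exceptions matters.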

Jeff

-- 
-- 
You received this message because you are subscribed to the "Crypto++ Users" 
Google Group.
To unsubscribe, send an email to cryptopp-users-unsubscr...@googlegroups.com.
More information about Crypto++ and this group is available at 
http://www.cryptopp.com.