Hi Everyone,

We are getting ready to add benchmarks for a selection of the RNGs provided by the library. The results are interesting.
Running the RDRAND and RDSEED benchmarks did not pass the sniff test: the generators had been under-performing, though it went unnoticed until now. RDRAND and RDSEED are being reworked, and the rework should increase throughput by about 50%. Here are some of the details:

* RDRAND and RDSEED throughput varies wildly depending on processor family, CPU sub-architecture and manufacturer. On Intel hardware, RDSEED runs at anywhere from 1/2 to 1/5 the rate of RDRAND.

* The reworked generators always fulfill a request if the function parameters are correct and the hardware is present. On failure they immediately and automatically retry, without the need for user intervention (a sketch of the loop follows this list).

* The retry parameters were removed. RDRAND never fails in practice, and it was hard to predict how many retries were "enough" for RDSEED. Now that the generators retry immediately, there is no need for an external retry count.

* RDRAND and RDSEED's GenerateBlock was crippled by C++ exceptions: setting up the exception machinery dominated the function. I estimated we needed about 5 to 7 instructions for generation and book-keeping in a tight loop: generate, test for failure, write the value, increment the pointer, decrement the size. We were getting 50 to 70 additional instructions as the exception frame was set up and torn down. Keep in mind the library often "chunks" a request, so a 10K-byte request might be broken into multiple 512-byte or 1024-byte requests, each paying that overhead (see the chunking sketch below).

* RDRAND and RDSEED's GenerateBlock had two sources of C++ exceptions: first, hardware and parameter validation; and second, retry counts. Now that the exceptions are removed, GenerateBlock is an x86_64 leaf function without a red zone.

* Internal implementations, like NASM_RDRAND_GenerateBlock and MASM_RDRAND_GenerateBlock, now return void (i.e., no return value). The return values are no longer needed since GenerateBlock will fulfill the request or crash if the hardware is not present. It no longer throws an exception. (A sketch of the hardware check follows this list.)

* We tested a parallelized implementation using an OpenMP parallel for-loop (sketched below). The generator lost about 7 MiB/s for each additional OMP thread on a particular test machine using GCC 4 and 5: we could baseline at 70 MiB/s, and then achieved 63 MiB/s total using 2 OMP threads, 56 MiB/s using 3 OMP threads, and 47 MiB/s using 4 OMP threads.
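To make the retry behavior concrete, here is a minimal sketch of the kind of tight loop described above. This is not the library's actual implementation; it uses the compiler's RDRAND intrinsic, assumes x86_64 with RDRAND support, and borrows the function name for readability. Build with, e.g., g++ -mrdrnd:

#include <immintrin.h>
#include <cstring>
#include <cstddef>

// Sketch only: generate, test failure, write value, increment pointer,
// decrement size. On failure, retry immediately -- no retry count.
void RDRAND_GenerateBlock(unsigned char* output, size_t size)
{
    unsigned long long val;

    // Full 8-byte words.
    while (size >= sizeof(val))
    {
        if (!_rdrand64_step(&val))
            continue;                       // immediate retry
        std::memcpy(output, &val, sizeof(val));
        output += sizeof(val);
        size -= sizeof(val);
    }

    // Tail bytes, if any.
    if (size > 0)
    {
        while (!_rdrand64_step(&val))
            ;                               // immediate retry
        std::memcpy(output, &val, size);
    }
}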
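The "chunking" mentioned above works roughly like this. This is a hypothetical illustration (the real call sites vary); the point is that any per-call overhead, such as an exception frame, is paid once per chunk rather than once per request:

#include <cstddef>

void RDRAND_GenerateBlock(unsigned char* output, size_t size);  // sketch above

// Hypothetical caller-side chunking: a 10K-byte request issued as a
// series of 1024-byte requests.
void FillChunked(unsigned char* output, size_t size)
{
    const size_t chunkSize = 1024;
    while (size > 0)
    {
        const size_t n = (size < chunkSize) ? size : chunkSize;
        RDRAND_GenerateBlock(output, n);
        output += n;
        size -= n;
    }
}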
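Since GenerateBlock now assumes the hardware is present, the check belongs up front. Here is a sketch of such a check. The bit positions are from Intel's manuals, the function names are illustrative rather than the library's, and __get_cpuid_count requires a newer GCC or Clang <cpuid.h>:

#include <cpuid.h>

// RDRAND is CPUID.01H:ECX[30]; RDSEED is CPUID.(EAX=07H,ECX=0):EBX[18].
bool HasRDRAND()
{
    unsigned int a, b, c, d;
    return __get_cpuid(1, &a, &b, &c, &d) && (c & (1u << 30));
}

bool HasRDSEED()
{
    unsigned int a, b, c, d;
    return __get_cpuid_count(7, 0, &a, &b, &c, &d) && (b & (1u << 18));
}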
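Finally, the OpenMP test looked roughly like the following. This is not the exact harness we ran, and the buffer and chunk sizes here are illustrative. Build with -fopenmp in addition to -mrdrnd:

#include <vector>
#include <cstddef>

void RDRAND_GenerateBlock(unsigned char* output, size_t size);  // sketch above

int main()
{
    const size_t total = 64 * 1024 * 1024;   // 64 MiB
    const size_t chunk = 4096;
    std::vector<unsigned char> buffer(total);

    // Threads fill disjoint chunks; there is no shared software state,
    // yet throughput fell as threads were added.
    #pragma omp parallel for
    for (long long i = 0; i < (long long)(total / chunk); ++i)
        RDRAND_GenerateBlock(&buffer[i * chunk], chunk);

    return 0;
}

One plausible explanation for the losses is that the on-chip generator is a shared resource, so additional threads mostly add contention rather than throughput.

Jeff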