Hi Andy,
On 05/30/13 15:08, Ferenc Rakoczi wrote:
Hi, Andy,
Andy Polyakov wrote:
First of all, RSA512 is essentially irrelevant and no attempt was
made to optimize it. So let's just disregard RSA512 results (I have
even removed them from above quoted part). Secondly note that our RSA
verify is faster.
I never thought verify can be a bottleneck anywhere. So we always
concentrated on sign.
Verify is dominated by the "single-op" subroutine, and we've got that one "more
right". So we only have RSA sign to figure out. The first thing one
notices is the difference between your results and ours from a 2.85GHz T4
running Linux:
#                   sign     verify    sign/s  verify/s
# rsa 1024 bits 0.000341s 0.000021s   2931.5   46873.8
# rsa 2048 bits 0.001244s 0.000044s    803.9   22569.1
# rsa 4096 bits 0.006203s 0.000387s    161.2    2586.3
Yes, it's not as fast as your engine (except for RSA4096), but the
difference for the 1024- and 2048-bit results is significant enough to make
the "how come" question relevant. Is it a 32- or 64-bit build you are
referring to? If 32, can you collect results for a 64-bit build? ./Configure
solaris64-sparcv9-[g]cc. One should keep in mind that if the 32-bit
subroutine is hit by an interrupt/exception it has to be restarted.
Though it's longer keys that should be affected more... But please
test. If the 64-bit code delivers the same performance as on Linux, the
question would be why a Solaris 32-bit application is hit by
interrupts/exceptions more often than a Linux one.
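For reference, the 64-bit build and measurement Andy asks for could be
reproduced along these lines (a sketch only; the exact Configure target and
paths depend on the compiler and tree layout):

```shell
# 64-bit SPARC v9 build, as suggested above (solaris64-sparcv9-cc for Sun cc)
./Configure solaris64-sparcv9-gcc
make
# time RSA sign/verify for the key sizes quoted in the table
apps/openssl speed rsa1024 rsa2048 rsa4096
```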
Misaki ran the tests just now, but the default openssl on Solaris is
64-bit, so I think her results are 64-bit.
On the other hand, in my experience interrupts causing recomputation in the
32-bit version are very rare on an idle system.
As Ferenc noted, I used the 64-bit openssl binary to measure the performance.
Let me talk to our performance engineer to see if we can collect a
performance profile of the sign operations.
Thank you
-- misaki
As for RSA sign performance in general: OpenSSL doesn't actually use the
fastest possible algorithm for exponentiation, but rather a more
secure one, more resistant to side-channel attacks (which should be taken
very seriously on a massively threaded SMT platform such as T4). There is
also the possibility that your engine doesn't perform blinding. These are
likely to account for another part of why it's slower.
I understand that, but these don't contribute enough to explain the ~2x
speed difference. The engine code only replaces the modular exponentiation,
so blinding is decided by the openssl code; that is, either both runs are
with blinding or both are without it.
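For context, the base blinding being discussed wraps the modular
exponentiation like this (a minimal sketch with a textbook toy key;
`blinded_sign` and the parameters are illustrative, not OpenSSL's actual
implementation):

```python
# Minimal sketch of RSA base blinding, the countermeasure discussed above.
# Because blinding wraps the modular exponentiation, an engine that only
# replaces mod-exp inherits whatever blinding the calling code performs.
# Toy textbook parameters below, not a real key.
import secrets
from math import gcd

def blinded_sign(m, d, e, n):
    while True:
        r = secrets.randbelow(n - 2) + 2   # random blinding factor
        if gcd(r, n) == 1:
            break
    m_blind = (m * pow(r, e, n)) % n       # blind: m * r^e mod n
    s_blind = pow(m_blind, d, n)           # the (engine-replaceable) mod-exp
    return (s_blind * pow(r, -1, n)) % n   # unblind: multiply by r^-1

# toy RSA key: p=61, q=53 -> n=3233, e=17, d=2753
n, e, d = 3233, 17, 2753
s = blinded_sign(65, d, e, n)
assert s == pow(65, d, n)                  # same signature as unblinded m^d
```

The point of the check at the end is that blinding changes the timing
behaviour of the exponentiation, not its result, so both runs produce
identical signatures either way.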
Then there is also the risk that I was effectively blinded by the fact
that I managed to significantly improve the original result, and as a
result stopped looking for ways to improve it even further. One thing
I could/should have wondered about then, and wonder about now: I'm using
conditional fmovd instructions, but how fast are they in this context?
I asked Ferenc Rakoczi (the Oracle engineer most familiar with the
T4 instructions and crypto algorithms) to look at the code. His
response is attached below. According to Ferenc, the T4 engine code gets
rid of a lot of copying, and that probably made the difference.
Yes, OpenSSL copies data, *but* it's one copy-in and one copy-out (with
conversion in assembly) per exponentiation, and an exponentiation is on the
order of half the key's bit count in Montgomery operations. I mean, for a
1024-bit key there is one copy-in and copy-out per 512 montsqr/montmul
instructions. I find it hard to believe that would be a problem.
I was referring to the copies from the registers to memory and back
after each multiplication (a lot of which are unnecessary, because the
instructions replace the multiplicand with the result, so repeated
squarings don't need any setup, and for a multiplication step one only
has to load the new multiplier before issuing the instruction).
=== Email from Ferenc Rakoczi ===
...
This code does not have the kind of precautions against timing- and
cache-based attacks that the openssl code has - I think on the T4
the timing depends on so many factors that even if the attacker
runs on the same core, they could not get accurate enough timings
for the attacks to succeed.
I'd argue that it's easier on T4. An SMT attack works by instrumenting
memory timings. The victim thread accesses memory in a very compact
sequence and then goes on calculating without any references to
memory. A high ratio between the calculation and memory-reference phases
works in the attacker's favour. Yes, the attacker thread would have to end
up on the same core, but once it does it gets a very good chance to deduce
the access pattern.
It might work with 2 threads on a core, when you know when the other one
is doing the exponentiation. With 8 threads doing all kinds of things,
I would bet against it. But as I said, the straight-line program
operations can be modified to use scattered data without much loss of
performance.
- but for the extra paranoid those
defenses can be built in - one can change the algorithm slightly
so that there are always 5 squarings followed by a multiplication,
and one can build the apowers table and the load_b function in
such a way that multiple cache lines are touched for each load.
That's what OpenSSL currently does.
I know. And that is almost what our implementation does; the difference is
in the cases where the exponent has more than 5 consecutive 0 bits. That
can also be modified easily, without any loss of performance.
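The fixed 5-squarings-then-a-multiplication pattern Ferenc describes can be
sketched in plain Python (an illustration of the window structure only;
`fixed_window_modexp` is an assumed name, and the real code operates on
Montgomery-form operands with montsqr/montmul rather than Python integers):

```python
# Fixed-window modular exponentiation: always w squarings followed by one
# multiplication, regardless of the exponent bits, so the sequence of
# operations leaks nothing about the exponent. table[0] == 1, so all-zero
# windows (runs of 0 bits) still perform the multiply. The cache-line
# scattering of the table that the email mentions is omitted here.
def fixed_window_modexp(a, e, n, w=5):
    table = [1] * (1 << w)                     # table[i] = a^i mod n
    for i in range(1, 1 << w):
        table[i] = (table[i - 1] * a) % n
    nwin = (max(e.bit_length(), 1) + w - 1) // w
    result = 1
    for i in reversed(range(nwin)):            # windows, most significant first
        for _ in range(w):                     # always w squarings...
            result = (result * result) % n
        window = (e >> (i * w)) & ((1 << w) - 1)
        result = (result * table[window]) % n  # ...then always one multiply
    return result

assert fixed_window_modexp(65, 2753, 3233) == pow(65, 2753, 3233)
```

A sliding-window variant would skip the multiply over long runs of zero
exponent bits, which is slightly faster but makes the operation sequence
depend on the exponent - exactly the data-dependence the fixed pattern avoids.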
Thanks,
Ferenc
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List [email protected]
Automated List Manager [email protected]