Hi Andy,

On 05/30/13 15:08, Ferenc Rakoczi wrote:
Hi, Andy,

Andy Polyakov wrote:

First of all, RSA512 is essentially irrelevant and no attempt was made to optimize it. So let's just disregard RSA512 results (I have even removed them from above quoted part). Secondly note that our RSA verify is faster.
I never thought verify could be a bottleneck anywhere, so we always concentrated on sign.
Verify is dominated by the "single-op" subroutine and we've got it "more right". So we have only RSA sign to figure out. The first thing one notices is the difference between your results and ours from a 2.85GHz T4 running Linux:

# rsa 1024 bits 0.000341s 0.000021s   2931.5  46873.8
# rsa 2048 bits 0.001244s 0.000044s    803.9  22569.1
# rsa 4096 bits 0.006203s 0.000387s    161.2   2586.3

Yes, it's not as fast as your engine (except for RSA4096), but the difference in the 1024- and 2048-bit results is significant enough to make the "how come" question relevant. Is it a 32- or 64-bit build you are referring to? If 32, can you collect results for a 64-bit build? ./Configure solaris64-sparcv9-[g]cc. One should keep in mind that if the 32-bit subroutine is hit by an interrupt/exception it has to be restarted, though it's the longer keys that should be affected more... But please test. If the 64-bit code delivers the same performance as on Linux, the question would be why a Solaris 32-bit application is hit by interrupts/exceptions more than a Linux one.
Misaki is running the tests now, but the default openssl on Solaris is 64-bit, so I think her results are 64-bit. On the other hand, in my experience, on an idle system, interrupts causing recomputation in the 32-bit version are very rare.

As Ferenc noted, I used the 64-bit openssl binary to measure the performance.

Let me talk to our performance engineer to see if we can collect a performance profile of the sign operations.

Thank you

-- misaki


As for RSA sign performance in general: OpenSSL doesn't actually use the fastest possible algorithm for exponentiation, but rather a more secure one, more resistant to side-channel attacks (which should be taken very seriously on a massive SMT platform such as T4). There is also the possibility that your engine doesn't perform blinding. These are likely to be another part of the explanation for why it's slower.
I understand that, but these don't contribute enough to explain the ~2x speed difference. The engine code only replaces the modular exponentiation, so blinding is decided by the openssl code. That is, either both runs are with blinding, or both are without it.
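For readers following along: the point that blinding sits outside the engine is that OpenSSL randomizes the exponentiation input before calling whatever mod-exp routine is plugged in, and strips the randomization afterwards. A minimal sketch of that base-blinding idea (illustrative Python, not OpenSSL's actual code; `modexp` stands in for the engine-provided routine):

```python
import math
import random

def blinded_rsa_sign(m, d, e, n, modexp):
    """Sketch of RSA base blinding: the supplied modexp routine only
    ever sees r^e * m mod n for a random r, never m itself, so its
    timing cannot be correlated with the actual message."""
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:          # r must be invertible mod n
        r = random.randrange(2, n)
    blinded = (pow(r, e, n) * m) % n    # blind the input
    s_blinded = modexp(blinded, d, n)   # engine does the heavy lifting
    r_inv = pow(r, -1, n)               # unblind: (r * m^d) * r^-1
    return (s_blinded * r_inv) % n
```

Since s_blinded = (r^e * m)^d = r * m^d (mod n), multiplying by r^-1 recovers the ordinary signature, so the blinding step is transparent to whichever exponentiation engine is used.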

Then there is also the risk that I was effectively blinded by the fact that I managed to significantly improve the original result, and as a result stopped looking for ways to improve even further. One thing I could/should have wondered about, and wonder about now: I'm using conditional fmovd instructions, but how fast are they in this context?

I asked Ferenc Rakoczi (the Oracle engineer most familiar with the T4 instructions and crypto algorithms) to look at the code. Ferenc's response is attached below. According to Ferenc, the T4 engine code gets rid of a lot of copying, and that probably made the difference.

Yes, OpenSSL copies data, *but* it's one copy-in and one copy-out (with conversion in assembly) per exponentiation, and an exponentiation is half-the-number-of-bits Montgomery operations. I mean, for a 1024-bit key there is one copy-in and one copy-out per 512 montsqr/montmul instructions. I find it hard to believe that would be a problem.
I was referring to the copies from the registers to memory and back after each multiplication (much of which is unnecessary, because the instructions replace the multiplicand with the result, so repeated squarings don't need any setup, and for a multiplication step one only has to load the new multiplier before issuing the instruction).

===  Email from Ferenc Rakoczi ===

...

This code does not have the kind of precautions against timing-
and cache-based attacks that the openssl code has. I think on the T4
the timing depends on so many factors that even if the attacker
runs on the same core, they could not get accurate enough timings
for the attacks to succeed

I'd argue that it's easier on T4. An SMT attack works by instrumenting memory timings. The victim thread accesses memory in a very compact sequence and then goes on calculating without any references to memory. A high ratio between the calculation and memory-reference phases works in the attacker's favour. Yes, the attacker thread would have to end up on the same core, but once it does, it gets a very good chance to deduce the access pattern.
It might work with 2 threads on a core, when you know when the other one is doing the exponentiation. With 8 threads doing all kinds of things, I would bet against it. But as I said, the straight-line program operations can be modified to use scattered data without much loss of performance.

- but for the extra paranoid, those
defenses can be built in: one can change the algorithm slightly
so that there are always 5 squarings followed by a multiplication,
and one can build the apowers table and the load_b function in
such a way that multiple cache lines are touched for each load.

That's what OpenSSL currently does.
I know. And that is almost what our implementation does; the difference is in the cases when the exponent has more than 5 consecutive 0 bits. That can also be modified easily, without any loss of performance.
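For reference, the uniform pattern both sides are describing (always 5 squarings, then one multiplication per window, even when the window bits are all zero) might be sketched like this, illustrative Python only:

```python
def fixed_window_pow(base, exp, mod, wbits=5):
    """Left-to-right fixed-window modular exponentiation.

    Each 5-bit window costs exactly wbits squarings plus one table
    multiplication, regardless of the exponent bits: an all-zero
    window simply multiplies by table[0] == 1. This makes the
    square/multiply sequence independent of the secret exponent."""
    table = [1] * (1 << wbits)
    for i in range(1, 1 << wbits):
        table[i] = (table[i - 1] * base) % mod

    nwin = (exp.bit_length() + wbits - 1) // wbits
    result = 1
    for w in range(nwin - 1, -1, -1):
        for _ in range(wbits):                   # always wbits squarings
            result = (result * result) % mod
        digit = (exp >> (w * wbits)) & ((1 << wbits) - 1)
        result = (result * table[digit]) % mod   # always one multiply
    return result
```

This is exactly the case Ferenc mentions: a run of more than 5 consecutive 0 bits still gets its multiplication (by 1), trading a few redundant multiplies for a secret-independent operation sequence. A hardened version would additionally load table[digit] with a scattered/constant-time gather rather than a direct index.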

Thanks,
              Ferenc


______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
