Hi Andy,
On 05/30/13 15:08, Ferenc Rakoczi wrote:
Hi, Andy,
Andy Polyakov wrote:
First of all, RSA512 is essentially irrelevant and no attempt was
made to optimize it. So let's just disregard RSA512 results (I have
even removed them from above quoted part). Secondly note that our RSA
verify is faster.
I never thought verify can be a bottleneck anywhere. So we always
concentrated on sign.
Verify is dominated by the "single-op" subroutine, and we've got that one "more
right". So we only have RSA sign to figure out. The first thing one
notices is the difference between your results and ours from a 2.85GHz T4
running Linux:
#                   sign     verify    sign/s  verify/s
# rsa 1024 bits 0.000341s 0.000021s   2931.5   46873.8
# rsa 2048 bits 0.001244s 0.000044s    803.9   22569.1
# rsa 4096 bits 0.006203s 0.000387s    161.2    2586.3
Yes, it's not as fast as your engine (except for RSA4096), but the
difference for the 1024- and 2048-bit results is significant enough to make
the "how come" question relevant. Is it a 32- or 64-bit build you are
referring to? If 32, can you collect results for a 64-bit build? ./Configure
solaris64-sparcv9-[g]cc. One should keep in mind that if the 32-bit
subroutine is hit by an interrupt/exception it has to be restarted.
Though it's longer keys that should be affected more... But please
test. If the 64-bit code delivers the same performance as on Linux, the
question would be why a Solaris 32-bit application is hit by
interrupts/exceptions more often than a Linux one.
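For reference, the 64-bit build and measurement Andy asks for could be
reproduced along these lines (a sketch only; the exact Configure target and
paths depend on the compiler and tree layout):

```shell
# 64-bit SPARC v9 build, as suggested above (solaris64-sparcv9-cc for Sun cc)
./Configure solaris64-sparcv9-gcc
make
# time RSA sign/verify for the key sizes quoted in the table
apps/openssl speed rsa1024 rsa2048 rsa4096
```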
Misaki ran the tests just now, but the default openssl on Solaris is
64-bit, so I think her results are 64-bit.
On the other hand, in my experience interrupts causing recomputation in the
32-bit version are very rare on an idle system.
As Ferenc noted, I used the 64-bit openssl binary to measure the performance.
Let me talk to our performance engineer to see if we can collect a
performance profile of the sign operations.
Thank you
-- misaki
As for RSA sign performance in general: OpenSSL doesn't actually use the
fastest possible algorithm for exponentiation, but rather a more
secure one, more resistant to side-channel attacks (which should be taken
very seriously on a massively threaded SMT platform such as T4). There is
also the possibility that your engine doesn't perform blinding. These are
likely to account for another part of why it's slower.
I understand that, but these don't contribute enough to explain the ~2x
speed difference. The engine code only replaces the modular exponentiation,
so blinding is decided by the openssl code; that is, either both runs are
with blinding or both are without it.
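For context, the base blinding being discussed wraps the modular
exponentiation like this (a minimal sketch with a textbook toy key;
`blinded_sign` and the parameters are illustrative, not OpenSSL's actual
implementation):

```python
# Minimal sketch of RSA base blinding, the countermeasure discussed above.
# Because blinding wraps the modular exponentiation, an engine that only
# replaces mod-exp inherits whatever blinding the calling code performs.
# Toy textbook parameters below, not a real key.
import secrets
from math import gcd

def blinded_sign(m, d, e, n):
    while True:
        r = secrets.randbelow(n - 2) + 2   # random blinding factor
        if gcd(r, n) == 1:
            break
    m_blind = (m * pow(r, e, n)) % n       # blind: m * r^e mod n
    s_blind = pow(m_blind, d, n)           # the (engine-replaceable) mod-exp
    return (s_blind * pow(r, -1, n)) % n   # unblind: multiply by r^-1

# toy RSA key: p=61, q=53 -> n=3233, e=17, d=2753
n, e, d = 3233, 17, 2753
s = blinded_sign(65, d, e, n)
assert s == pow(65, d, n)                  # same signature as unblinded m^d
```

The point of the check at the end is that blinding changes the timing
behaviour of the exponentiation, not its result, so both runs produce
identical signatures either way.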
Then there is also the risk that I was effectively blinded by the fact
that I managed to significantly improve the original result, and as a
result stopped looking for ways to improve it even further. One thing
I could/should have wondered about then, and wonder about now: I'm using
conditional fmovd instructions, but how fast are they in this context?
I asked Ferenc Rakoczi (the Oracle engineer most familiar with the
T4 instructions and crypto algorithms) to look at the code. His
response is attached below. According to Ferenc, the T4 engine code gets
rid of a lot of copying, and that probably made the difference.
Yes, OpenSSL copies data, *but* it's one copy-in and one copy-out (with
conversion in assembly) per exponentiation, and an exponentiation is on the
order of half the key's bit count in Montgomery operations. I mean, for a
1024-bit key there is one copy-in and copy-out per 512 montsqr/montmul
instructions. I find it hard to believe that would be a problem.
I was referring to the copies from the registers to memory and back
after each multiplication (a lot of which are unnecessary, because the
instructions replace the multiplicand with the result, so repeated
squarings don't need any setup, and for a multiplication step one only
has to load the new multiplier before issuing the instruction).
=== Email from Ferenc Rakoczi ===
...
This code does not have the kind of precautions against timing- and
cache-based attacks that the openssl code has - I think on the T4
the timing depends on so many factors that even if the attacker
runs on the same core, they could not get accurate enough timings
for the attacks to succeed.
I'd argue that it's easier on T4. An SMT attack works by instrumenting
memory timings. The victim thread accesses memory in a very compact
sequence and then goes on calculating without any references to
memory. A high ratio between the calculation and memory-reference phases
works in the attacker's favour. Yes, the attacker thread would have to end
up on the same core, but once it does it gets a very good chance to deduce
the access pattern.
It might work with 2 threads on a core, when you know when the other one
is doing the exponentiation. With 8 threads doing all kinds of things,
I would bet against it. But as I said, the straight-line program
operations can be modified to use scattered data without much loss of
performance.
- but for the extra paranoid those
defenses can be built in - one can change the algorithm slightly
so that there are always 5 squarings followed by a multiplication,
and one can build the apowers table and the load_b function in
such a way that multiple cache lines are touched for each load.
That's what OpenSSL currently does.
I know. And that is almost what our implementation does; the difference is
in the cases where the exponent has more than 5 consecutive 0 bits. That
can also be modified easily, without any loss of performance.
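The fixed 5-squarings-then-a-multiplication pattern Ferenc describes can be
sketched in plain Python (an illustration of the window structure only;
`fixed_window_modexp` is an assumed name, and the real code operates on
Montgomery-form operands with montsqr/montmul rather than Python integers):

```python
# Fixed-window modular exponentiation: always w squarings followed by one
# multiplication, regardless of the exponent bits, so the sequence of
# operations leaks nothing about the exponent. table[0] == 1, so all-zero
# windows (runs of 0 bits) still perform the multiply. The cache-line
# scattering of the table that the email mentions is omitted here.
def fixed_window_modexp(a, e, n, w=5):
    table = [1] * (1 << w)                     # table[i] = a^i mod n
    for i in range(1, 1 << w):
        table[i] = (table[i - 1] * a) % n
    nwin = (max(e.bit_length(), 1) + w - 1) // w
    result = 1
    for i in reversed(range(nwin)):            # windows, most significant first
        for _ in range(w):                     # always w squarings...
            result = (result * result) % n
        window = (e >> (i * w)) & ((1 << w) - 1)
        result = (result * table[window]) % n  # ...then always one multiply
    return result

assert fixed_window_modexp(65, 2753, 3233) == pow(65, 2753, 3233)
```

A sliding-window variant would skip the multiply over long runs of zero
exponent bits, which is slightly faster but makes the operation sequence
depend on the exponent - exactly the data-dependence the fixed pattern avoids.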
Thanks,
Ferenc
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List [email protected]
Automated List Manager [email protected]