Re: [cryptography] 5x speedup for AES using SSE5?

2008-08-25 Thread Hovav Shacham

On Aug 24, 2008, at 5:20 AM, Peter Gutmann wrote:

Speaking of CPU-specific optimisations, I've seen a few algorithm  
proposals
from the last few years that assume that an algorithm can be scaled  
linearly
in the number of CPU cores, treating a multicore CPU as some kind  
of SIMD
engine with all cores operating in lock-step, or at least engaging  
in some
kind of rendezvous every couple of cycles (for example the recently- 
discussed

MD6 uses a round of 16 steps, if I read the description correctly)


My impressions from Ron's talk were different.  For multicore  
systems, the tree structure of the hash allows parallelism at a much  
higher granularity.  For hardware implementation, the feedback- 
register structure of the round function means that 16 steps can be  
computed in parallel.  I didn't get the sense that Ron intends for  
the second kind of parallelism to be used in software implementations.


Hovav.

-
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]


Re: [cryptography] 5x speedup for AES using SSE5?

2008-08-25 Thread Kevin Brock

Peter Gutmann wrote:

Is there some feature of multicore CPUs that I'm missing, or is it a case of
cryptographers abstracting a bit too much away?  And if it's the latter,
should someone tell them that multicore CPUs don't actually work that way?
  
I can't speak to the former issue, but I seem to remember that the 
numbers Rivest showed at the talk were actual performance measurements, 
not projections, and that it was basically linear.


Kevin

-
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]


multicore hash functions (was: 5x speedup for AES using SSE5?)

2008-08-25 Thread zooko

Hello Peter Gutmann.

I'm working on a contribution to the SHA-3 process, and I've been  
using exactly the sort of abstraction that you describe -- counting  
one computation of a hash compression function as a unit of work  
which could be computed concurrently by some sort of parallel computer.


I vaguely think that once I get this level of analysis done, I should  
add some terms to show how the velocity of data into the computer and  
from core to core is not infinite.


I certainly think that I should code up some actual implementations  
and benchmark them.  However, I don't have a machine available with  
lots of cores -- I'm considering requesting of Sun.com that they lend  
me a T2.  (Despite my earlier declaration to Sun that I had lost  
interest in their stupid architecture since they wouldn't release the  
source to the crypto module.)


Anyway, if you have a better way to think about parallelism of hash  
functions, I'm all ears.


Thanks,

Zooko
---
http://allmydata.org -- Tahoe, the Least-Authority Filesystem
http://allmydata.com -- back up all your files for $5/month

-
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]


Re: 5x speedup for AES using SSE5?

2008-08-25 Thread Brian Gladman
Eric Young wrote:
> Eric Young wrote:
>> I've not looked at it enough yet, but currently I'm doing an AES round
>> in about 140 cycles a block (call it 13 per round plus overhead) on a
>> AMD64, (220e6 bytes/sec on a 2ghz cpu) using normal instructions. 
> Urk, correction, I forgot I've recently upgraded from a 2ghz machine to
> 2.5ghz.
> So that should read about 182 cycles per block, and 18 cycles per round.
> I though the number seems strange :-(.  I tent to always quote numbers
> from a 2-3 second run encrypting a 4k buffer, not a machine cycle
> counter over one or two blocks, so I leave myself open to this kind of
> error :-(

The best figure I obtain on an AMD64 system is 11 cycles/byte, which
matches your results (you had me worried for a while with 9 cycles/byte!)

To go 5 times faster than this would mean close to 2 cycles/byte, a
speed that I find hard to believe without hardware acceleration

But a fully byte oriented implementation runs at about 140 cycles/byte
and here the S-Box substitution step is a significant bottleneck.  I too
think the PPERM instruction could be used for this and it seems possible
that this would produce large savings.  So 30 cycles/byte might well be
achievable in this case.

I hence wonder whether this is the comparison that AMD are making.

It is also possible that the PPERM instruction could be used to speed up
the Galois field calculations to produce the S-Box mathematically rather
than by table lookup. I have tried this in the past but it has not
proved competitive.  But PPERM looks interesting here as well.

   Brian Gladman

-
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]