On Wed, Mar 27, 2013 at 10:21:34AM +0400, Solar Designer wrote: > Another replacement I am considering is sort of "wide Blowfish" - an > extension of Blowfish to SIMD vectors - which may be shown as being > about as fast to compute as normal Blowfish (key setup) in bcrypt, but > no quicker to attack than bcrypt (including on GPUs and specialized > devices).
As it relates to attacks with ASICs, the Passwords^12 revision of scrypt slides: http://daemonology.net/papers/scrypt-2012.pdf gives a table on slide 51 comparing different crypto primitives for use in scrypt. Specifically, it compares SHA-256, Blowfish, AES-128, Salsa20/8, and Keccak - and it shows how Salsa20/8 is best among these. The score is calculated as SW^2/HW, where SW and HW are performance in software and hardware (for one crypto core), respectively. This formula is such because a faster crypto primitive lets us not only compute it more times, but also use more memory given the same target duration. The die area consumed by the crypto core is not considered because the die area cost is assumed to be almost exclusively from the memory needs. The table gives: Blowfish: SW = 800 Mbps, HW = 1000 Mbps, Score = 640 AES-128: SW = 1200 Mbps, HW = 40000 Mbps, Score = 36 Salsa20/8: SW = 2000 Mbps, HW = 2000 Mbps, Score = 2000 The HW speeds given here might not actually be directly comparable as I think the fast AES core's latency might not be proportionally lower, and for use in scrypt it matters - although it depends on how AES is used. Now let's see what happens if we manage to extend Blowfish to 128-bit SIMD while not reducing its speed per 32-bit SIMD vector element: (4*800)^2/(4*1000) = 2560 For an actual block cipher, such extension could be difficult, but in scrypt our use of Blowfish would be less demanding, so I think this is possible. Two other knobs are the rounds count and the instruction-level parallelism factor. For example, we might halve the rounds count, but mix instructions from two instances of the crypto primitive (computed on slightly different inputs), then XOR the two outputs together. Would the resulting scheme remain secure enough for our data mixing use? Perhaps it would. Assuming that our CPU can execute the two inter-mixed instances as fast as it does one instance (per round), we get: For 2*Salsa20/4: (2*2000)^2/(2*2000) = 4000 For 2*wideBlowfish/8: (2*4*800)^2/(2*4*1000) = 5120 However, as we make the crypto primitive faster, we're more likely to bump into the memory bandwidth, which would limit the actual score. Alexander
