Hi,

> This patch is a contribution to OpenSSL.
> It includes efficient implementations of Dan Bernstein's Poly1305
> (authenticator) and ChaCha20 (stream cipher).

Incidentally, I'm working on this too and already have a ChaCha module. What I've learned is that ChaCha SIMD performance is a delicate balance between instruction issue rate, latencies and register availability. The thing is that, at least on x86 processors, it is possible to achieve better performance by arranging data "vertically". By that I mean the following. Customarily, key material is loaded into registers as follows (numbers are *word* offsets, 0xN0 refers to the next key block):

    3    2    1    0
    7    6    5    4
    0xb  0xa  9    8
    0xf  0xe  0xd  0xc
    0x13 0x12 0x11 0x10
    ...

while "vertically" means:

    0x30 0x20 0x10 0
    0x31 0x21 0x11 1
    0x32 0x22 0x12 2
    0x33 0x23 0x13 3
    ...

Naturally, this can be used only for longer inputs, and the question is when it is appropriate to switch to it. The fact that the question is posed means that I use both "horizontal" and "vertical" layouts, for short and long input lengths respectively. When to switch is governed by results (quoting the attached code, attached for reference):

#               IALU/gcc 4.8(i) 1xSSSE3/SSE2    4xSSSE3     8xAVX2
#
# P4            9.48/+99%       -/22.7(ii)      -
# Core2         7.83/+55%       7.90/8.08       4.35
# Westmere      7.19/+50%       5.60/6.70       3.00
# Sandy Bridge  8.31/+42%       5.45/6.76       2.72
# Ivy Bridge    6.71/+46%       5.40/6.49       2.41
# Haswell       5.92/+43%       5.20/6.45       2.42        1.23
# Silvermont    12.0/+33%       7.75/7.40       7.03(iii)
# Opteron       7.28/+52%       -/14.2(ii)      -
# Bulldozer     9.66/+28%       9.85/11.1       3.06(iv)
# VIA Nano      10.5/+46%       6.72/8.60       6.05
#
# (i)   compared to older gcc 3.x one can observe >2x improvement on
#       most platforms;
# (ii)  as it can be seen, SSE2 performance is too low on legacy
#       processors; NxSSE2 results are naturally better, but not
#       impressively better than IALU ones, which is why you won't
#       find SSE2 code below;
# (iii) this is not optimal result for Atom because of MSROM
#       limitations, SSE2 can do better, but gain is considered too
#       low to justify the [maintenance] effort;
# (iv)  Bulldozer actually executes 4xXOP code path that delivers 2.20;

As can be seen, on most processors it makes sense to switch to the "vertical" layout when processing more than one block. Not all, but most.

Additional notes. Note that there is no AVX1 code. The rationale is that it provides only a modest improvement over SSSE3, too little to justify the maintenance costs.

You may notice that the attached code operates on a 32-bit counter. This is done because a 64-bit counter is not needed in the TLS context, while operating on a 32-bit one makes the programming effort easier. For situations where a 64-bit counter is required, it would be the caller's responsibility to track overflows (in a manner similar to the one implemented in ctr128.c). Here "caller" refers not to the application programmer but to the OpenSSL code that calls the assembly. In other words, the 32-bit counter issue won't be visible to developers.

The reason why AVX512 is not implemented yet is the following. As mentioned in the beginning, performance in this case is a matter of delicate balance between instruction issue rate, latencies and register availability. As a result, assessment and choice of approach are left until more detailed data about AVX512 hardware is available.

There are even 32-bit x86 (not final) and ARM results available:

#               IALU/gcc        4xSSSE3
# Pentium       17.5/+80%
# PIII          14.2/+60%
# P4            18.6/+84%
# Core2         9.56/+89%       4.90
# Westmere      9.50/+45%       3.50
# Sandy Bridge  10.5/+47%       3.25
# Haswell       8.15/+50%       2.85
# Silvermont    17.4/+36%       8.35
# Opteron       10.2/+54%
# Bulldozer     13.4/+50%       4.40

#               IALU/gcc        1xNEON  3xNEON+1xIALU
# Cortex-A5     19.3(*)/+130%   21.8    14.1
# Cortex-A8     10.5(*)/+110%   13.9    6.35
# Cortex-A9     12.9(**)/+170%  14.3    6.50
# Snapdragon S4 11.5/+150%      13.6    4.90
#
# (*)   most "favourable" result for aligned data on little-endian
#       processor, result for misaligned data is 10-15% lower;
# (**)  this result is a trade-off: it can be improved by 20%,
#       but then Snapdragon S4 and Cortex-A8 results get
#       20-25% worse;

As for poly1305, I'm going to look into it next. Just like with ChaCha, the scope will not be limited to just the latest x86_64 processors and ELF platforms.
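
To make the "horizontal" versus "vertical" distinction above concrete, here is a minimal portable C sketch. It is not the attached perlasm code, and it assumes nothing about how that code is organised internally; the names `vec4`, `to_vertical`, `double_round_vertical` and `layouts_agree` are made up for illustration. Each `vec4` models one 4-lane SIMD register holding the same state word from four consecutive blocks, and the loop over lanes stands in for what a SIMD unit would do in lockstep.

```c
#include <stdint.h>

#define LANES 4 /* blocks processed side by side, as in a 4x code path */

/* Models one SIMD register in the "vertical" layout:
 * lane b holds the same state word taken from block b. */
typedef struct { uint32_t lane[LANES]; } vec4;

static uint32_t rotl32(uint32_t v, int n)
{
    return (v << n) | (v >> (32 - n));
}

/* Standard ChaCha quarter-round on four 32-bit words. */
static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *a += *b; *d = rotl32(*d ^ *a, 16);
    *c += *d; *b = rotl32(*b ^ *c, 12);
    *a += *b; *d = rotl32(*d ^ *a, 8);
    *c += *d; *b = rotl32(*b ^ *c, 7);
}

/* Word indices for the four column and four diagonal quarter-rounds. */
static const int QR[8][4] = {
    {0, 4, 8, 12}, {1, 5, 9, 13}, {2, 6, 10, 14}, {3, 7, 11, 15},
    {0, 5, 10, 15}, {1, 6, 11, 12}, {2, 7, 8, 13}, {3, 4, 9, 14}
};

/* Reference: one double round on a single 16-word block. */
static void double_round_scalar(uint32_t x[16])
{
    for (int i = 0; i < 8; i++)
        quarter_round(&x[QR[i][0]], &x[QR[i][1]], &x[QR[i][2]], &x[QR[i][3]]);
}

/* Transpose: word w of block b goes to lane b of vector w. */
static void to_vertical(vec4 v[16], uint32_t blocks[LANES][16])
{
    for (int w = 0; w < 16; w++)
        for (int b = 0; b < LANES; b++)
            v[w].lane[b] = blocks[b][w];
}

static void from_vertical(uint32_t blocks[LANES][16], vec4 v[16])
{
    for (int w = 0; w < 16; w++)
        for (int b = 0; b < LANES; b++)
            blocks[b][w] = v[w].lane[b];
}

/* One double round on four blocks at once. Every step is lane-wise;
 * the diagonal rounds only pick different vectors, no lane shuffling. */
static void double_round_vertical(vec4 v[16])
{
    for (int i = 0; i < 8; i++)
        for (int l = 0; l < LANES; l++)
            quarter_round(&v[QR[i][0]].lane[l], &v[QR[i][1]].lane[l],
                          &v[QR[i][2]].lane[l], &v[QR[i][3]].lane[l]);
}

/* Returns 1 if the vertical path matches the scalar reference. */
int layouts_agree(void)
{
    uint32_t blocks[LANES][16], ref[LANES][16];
    vec4 v[16];

    /* Arbitrary test pattern, not real key material. */
    for (int b = 0; b < LANES; b++)
        for (int w = 0; w < 16; w++)
            ref[b][w] = blocks[b][w] = 0x9e3779b9u * (uint32_t)(16 * b + w + 1);

    for (int b = 0; b < LANES; b++)
        double_round_scalar(ref[b]);

    to_vertical(v, blocks);
    double_round_vertical(v);
    from_vertical(blocks, v);

    for (int b = 0; b < LANES; b++)
        for (int w = 0; w < 16; w++)
            if (blocks[b][w] != ref[b][w])
                return 0;
    return 1;
}
```

The sketch also shows why the vertical layout only pays off for longer inputs: it needs four whole blocks to fill the lanes, plus a transpose on the way in and out. A horizontal SIMD implementation works on one block at a time but typically has to rotate rows between the column and diagonal rounds.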
None of this precludes the possibility that parts of this code submission will be used.
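
On the 32-bit counter point: the following is a hedged C sketch of how a caller could track overflows around a core that only ever increments the low 32 bits. It is written in the spirit of the ctr128.c remark above but is not copied from it, and every name here (`ctr32_core_fn`, `blocks_until_wrap`, `process_with_ctr64`, `record_core`) is hypothetical. The idea is to split the request at the point where the low word would wrap, then carry into the high word by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature standing in for a core routine that
 * processes nblocks blocks and only increments the low counter word. */
typedef void (*ctr32_core_fn)(uint32_t ctr_lo, uint32_t ctr_hi,
                              size_t nblocks, void *arg);

/* How many of nblocks can be processed before ctr_lo wraps to 0. */
size_t blocks_until_wrap(uint32_t ctr_lo, size_t nblocks)
{
    uint64_t room = (UINT64_C(1) << 32) - ctr_lo;
    return nblocks < room ? nblocks : (size_t)room;
}

/* Caller-side driver: ctr[0] is the low word, ctr[1] the high word. */
void process_with_ctr64(uint32_t ctr[2], size_t nblocks,
                        ctr32_core_fn core, void *arg)
{
    while (nblocks > 0) {
        size_t n = blocks_until_wrap(ctr[0], nblocks);
        assert(n > 0);

        core(ctr[0], ctr[1], n, arg);

        ctr[0] += (uint32_t)n;  /* becomes 0 exactly when the low word wraps */
        if (ctr[0] == 0)
            ctr[1]++;           /* carry into the high word */
        nblocks -= n;
    }
}

/* Test double: records how the request was split instead of encrypting. */
struct trace { int calls; size_t nblocks[4]; };

void record_core(uint32_t ctr_lo, uint32_t ctr_hi, size_t nblocks, void *arg)
{
    struct trace *t = arg;
    (void)ctr_lo;
    (void)ctr_hi;
    if (t->calls < 4)
        t->nblocks[t->calls] = nblocks;
    t->calls++;
}
```

Starting from a low word of 0xFFFFFFFE with five blocks to process, the driver calls the core twice: once for the two blocks before the wrap and once for the remaining three, with the high word incremented in between.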
chacha-x86_64.pl
Description: Perl program