Hi,

> This patch is a contribution to OpenSSL.
> It includes efficient implementations of Dan Bernstein's Poly1305 
> (authenticator) and ChaCha20 (stream cipher).

Incidentally, I'm working on this too and already have a ChaCha module.
What I've learned is that ChaCha SIMD performance is a delicate balance
between instruction issue rate, latencies and register availability.
The thing is that, at least on x86 processors, it is possible to achieve
better performance by arranging data "vertically". I mean the following.
Customarily, key material is loaded into registers as follows (numbers
are *word* offsets; 0xN0 refers to the next key block):

   3    2    1    0
   7    6    5    4
 0xb  0xa    9    8
 0xf  0xe  0xd  0xc
0x13 0x12 0x11 0x10
...

while "vertically" means:

0x30 0x20 0x10  0
0x31 0x21 0x11  1
0x32 0x22 0x12  2
0x33 0x23 0x13  3
...

Naturally this can be used only for longer inputs, and the question is
when it is appropriate to switch. The fact that the question is posed
means that I utilize both "horizontal" and "vertical" layouts, for
short and long inputs respectively. When to switch is governed by
these results (quoted from the attached code):

#                IALU/gcc 4.8(i) 1xSSSE3/SSE2    4xSSSE3     8xAVX2
#
# P4             9.48/+99%       -/22.7(ii)      -
# Core2          7.83/+55%       7.90/8.08       4.35
# Westmere       7.19/+50%       5.60/6.70       3.00
# Sandy Bridge   8.31/+42%       5.45/6.76       2.72
# Ivy Bridge     6.71/+46%       5.40/6.49       2.41
# Haswell        5.92/+43%       5.20/6.45       2.42        1.23
# Silvermont     12.0/+33%       7.75/7.40       7.03(iii)
# Opteron        7.28/+52%       -/14.2(ii)      -
# Bulldozer      9.66/+28%       9.85/11.1       3.06(iv)
# VIA Nano       10.5/+46%       6.72/8.60       6.05
#
# (i)   compared to older gcc 3.x one can observe >2x improvement on
#       most platforms;
# (ii)  as it can be seen, SSE2 performance is too low on legacy
#       processors; NxSSE2 results are naturally better, but not
#       impressively better than IALU ones, which is why you won't
#       find SSE2 code below;
# (iii) this is not optimal result for Atom because of MSROM
#       limitations, SSE2 can do better, but gain is considered too
#       low to justify the [maintenance] effort;
# (iv)  Bulldozer actually executes 4xXOP code path that delivers 2.20;

As can be seen, on most processors it makes sense to switch to
"vertical" when processing more than one block. Not all, but most.

Additional notes.

Note that there is no AVX1 code. The rationale is that it provides only
a modest improvement over SSSE3, too little to justify the maintenance
cost.

You may notice that the attached code operates on a 32-bit counter.
This is because a 64-bit counter is not needed in the TLS context,
while operating on a 32-bit one simplifies the programming effort. In
situations where a 64-bit counter is required, it would be the caller's
responsibility to track overflows (in a manner similar to the one
implemented in ctr128.c). Here "caller" doesn't refer to the
application programmer, but to the OpenSSL code that calls the
assembly; in other words, the 32-bit counter issue won't be visible to
developers.

The reason why AVX512 is not implemented yet is the following. As
mentioned at the beginning, performance in this case is a matter of
delicate balance between instruction issue rate, latencies and register
availability. As a result, the assessment and choice of approach are
left until more detailed data about AVX512 hardware is available.

There are even 32-bit x86 (not final) and ARM results available:

#                       IALU/gcc        4xSSSE3
# Pentium               17.5/+80%
# PIII                  14.2/+60%
# P4                    18.6/+84%
# Core2                 9.56/+89%       4.90
# Westmere              9.50/+45%       3.50
# Sandy Bridge          10.5/+47%       3.25
# Haswell               8.15/+50%       2.85
# Silvermont            17.4/+36%       8.35
# Opteron               10.2/+54%
# Bulldozer             13.4/+50%       4.40

#                       IALU/gcc        1xNEON      3xNEON+1xIALU
# Cortex-A5             19.3(*)/+130%   21.8        14.1
# Cortex-A8             10.5(*)/+110%   13.9        6.35
# Cortex-A9             12.9(**)/+170%  14.3        6.50
# Snapdragon S4         11.5/+150%      13.6        4.90
#
# (*)   most "favourable" result for aligned data on little-endian
#       processor, result for misaligned data is 10-15% lower;
# (**)  this result is a trade-off: it can be improved by 20%,
#       but then Snapdragon S4 and Cortex-A8 results get
#       20-25% worse;

As for Poly1305: I'm going to look into it next. Just like with ChaCha,
the scope will not be limited to just the latest x86_64 processors and
ELF platforms. This doesn't preclude the possibility that parts of this
code submission will be used.


Attachment: chacha-x86_64.pl
Description: Perl program
