> There seems to be a bug in the AVX2 codepath in poly1305-x86.pl. I have not > attempted to debug this, but I have attached a test file which produces > different output in normal and AVX2 codepaths. Our existing poly1305 > implementation agrees with the former. > > $ OPENSSL_ia32cap=0 ./poly1305_test > PASS > $ ./poly1305_test > Poly1305 test failed.t > got: 2e65f0054e36505687d937ff5e8ed112 > expected: 69d28f73dd09d39a92aa179da354b7ea > > You may wish to generalize that Poly1305_Update pattern into your own > tests. This is what I did to catch this: > https://boringssl-review.googlesource.com/#/c/7223/
About choice of test vector lengths. From viewpoint of exercising different branches lengths 2048 through 2048+15 would exercise same vector path. What one needs is combinations of 64*n+16*m. But you do compensate for this by exercising varying what you've called excess lengths. I suppose what I'm saying is that that +15 thing is kind of redundant in the context. This is because last block is always processed separately. > By the way, this assembly code is quite complicated. Yeah, but what you gonna do, vectorizing is hard in this case... > I wasn't able to find > problems in the others (I tested armv4, armv8, x86, and x86_64), but I'm > far from confident I've covered all the cases. > > With the caveat that I'm no assembly programmer, much of the complexity > seems to come the SIMD code needing a multiple of 2 or 4 blocks and the > implementation converting internal state back and forth from base 2^26 and > 2^64 and handling excess blocks slightly differently in different cases. (I > counted nine distinct codepaths to test in the x86_64 AVX codepath alone.) While it's correct that excess blocks are handled though 2^26<>2^64 conversion in 64-bit modules, in 32-bit modules 2^32->2^26 is one way. I mean in 32-bit modules even excess blocks are handled by vector code (because it's faster than scalar code even if you have to discard some data). Though "excess blocks" can be a little bit misleading term in the context, because it kind of makes you think that processing is done as [2|4]*n + remainder, while it's actually processed as remainder + [2|4]n. [For reference, [2|4] will become [2|4|8] as soon as support for AVX512 is added]. All this is provided that input length is longer than certain limit. Vectorized procedure requires expensive setup which is postponed till we know that input is long enough to justify additional overhead (see https://www.openssl.org/blog/blog/2016/02/15/poly1305-revised/ for background information). > The C code already buffers up to 16-byte blocks. Did you consider buffering > up to 32 or 64 bytes in C when the SIMD code called for it? Yes, but here is the thing. You'd need to be able to handle all possible remainders in either case. So I figured it would be appropriate to take "straight-forward" version for a "burn-out" to real world prior next step ;-) > I haven't tried this, so perhaps the performance costs are prohibitive, but > if the costs are modest, the simplifications may be worth it. No, it's perfectly reasonable. One should recognize though that you are unlikely to gain like a lot in TLS case, because call pattern is rather predictable. The noticeable gain would be observed in "wilder" mix of calls. BTW, at some point I also plan to make e_chacha_poly1305.c process couple of 4KB chunks at a time in order to improve cache locality... -- Ticket here: http://rt.openssl.org/Ticket/Display.html?id=4346 Please log in as guest with password guest if prompted -- openssl-dev mailing list To unsubscribe: https://mta.openssl.org/mailman/listinfo/openssl-dev