On Thu, 26 Jun 2014, Joseph M. Schwartz wrote:
> > > It seems that the implementation fails just as described in the blog
> > > post: as soon as ChaCha() is called with a length which is not a
> > > multiple of 64, all further uses of the method produce incorrect
> > > results.
> >
> > Thanks for pointing this out - I've just fixed this in -current. The
> > underlying ChaCha implementation (from djb) is written for single-shot
> > use (as exposed via CRYPTO_chacha_20(), which is also used for
> > ChaCha20Poly1305). I obviously overlooked this when I added the ChaCha()
> > and EVP interfaces.
> >
> > Regress tests already existed; however, they did not trigger this
> > specific issue. They have now been extended to cover the ChaCha
> > interface (which was already tested via the EVP regress) with
> > partial/single-byte writes.
>
> Thank you for your quick fixes. I tested your changes with the other
> tests provided here too, and it all appears to be good now.

Thanks for confirming.

> > > The blog's author provided an implementation which does not suffer
> > > from this problem, along with test vectors:
> > > http://chacha20.insanecoding.org/
> > > The license on that code appears to be friendly, although I don't
> > > know if the code itself is any good.
> >
> > Performance-wise the implementation would be rather ordinary (as noted
> > in the code) - the existing implementation does 64-byte blocks in
> > 4-byte pieces, whereas this implementation does a byte at a time.
>
> Since you brought up performance, curiosity got the better of me, so I
> decided to run some quick benchmarks to see what the difference is. I
> modified test.c to change the random tests from 1000 to 1000000
> iterations per loop, and timed both implementations to see what the
> difference was.
>
> Your code on my system (amd64) compiled with gcc 4.7 and -O3 averages
> ~6.976u, whereas the other code averages ~6.344u.
>
> I found it quite surprising that the code which is advertised as
> unoptimized and operates a byte at a time runs quicker. I checked the
> assembly GCC is producing for that byte-at-a-time code in the xoring
> section, and I see GCC is producing what looks like SSE instructions! I
> didn't even know GCC was able to do that without help.
>
> Maybe something about the optimizations in your code is
> counterproductive? Or is my benchmark methodology somehow flawed?

The random tests are encrypting small blocks of data (the first between 0
and 15 bytes per write, the second between 65 and 128) using the same key
and IV - this is basically the worst-case scenario, and the byte-at-a-time
algorithm is almost certainly going to perform better there. However, this
is unlikely to be a good example of real-world use - it probably matches
SSH interactive use, although even then the IV/block counter is reset
between calls. For SSH bulk transfer you're likely to be encrypting 8-16KB
blocks at a time, for SSL up to 16KB blocks, and for file encryption
anything up to 2^70 bytes.

If you are interested in doing further benchmarks, it would be worth
measuring performance for 8KB, 16KB and larger blocks, changing the IV in
between. Encrypting/decrypting a single large file (using a single key/IV)
would also be an interesting measurement.

It is also worth noting that the performance of the block-optimised code
is still going to be well below that of a well-implemented assembly
version.
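The kind of loop I have in mind is something along these lines - a rough,
untested sketch typed straight into the mail (so check the prototype
against chacha.h before using it). It pushes 16KB blocks through the
one-shot CRYPTO_chacha_20() interface with a different IV per block; the
key/IV contents, block count and absence of a timing harness are obviously
placeholders:

/*
 * Rough benchmark sketch: encrypt a stream of 16KB blocks via the
 * one-shot CRYPTO_chacha_20() interface, changing the IV per block.
 */
#include <stdint.h>
#include <string.h>

#include <openssl/chacha.h>

#define BLOCK_SIZE      (16 * 1024)
#define NUM_BLOCKS      100000

int
main(void)
{
        static unsigned char in[BLOCK_SIZE], out[BLOCK_SIZE];
        unsigned char key[32] = { 0 };
        unsigned char iv[8];
        uint64_t i;

        for (i = 0; i < NUM_BLOCKS; i++) {
                /* Use a distinct IV for each block, as a protocol would. */
                memcpy(iv, &i, sizeof(i));
                CRYPTO_chacha_20(out, in, sizeof(in), key, iv, 0);
        }

        return (0);
}

Timing that (and the equivalent loop for the other implementation) with
time(1) should give a far more realistic comparison than the small random
writes in test.c, and running it over a single large buffer with one
key/IV would approximate the file encryption case.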
> > Additionally, the code is pretty horrific from a style perspective.
>
> I take it you mean the testing code, as that does look awful. The
> encryption code itself, IMO, looks neat and clean, and is much more
> understandable than what is currently in your chacha_encrypt_bytes()
> method.

Just to be clear, chacha_encrypt_bytes() is not my code - the original was
written by D. J. Bernstein (http://cr.yp.to/chacha.html). I just imported
it and applied KNF to it.

> Looking at this new code, I also see it has an interesting comment
> regarding incrementing the counter. Is there any validity to the approach
> it's taking? Would that be more secure for very long streams?

It should only make a difference if you encrypt more than 2^70 bytes (2^64
64-byte blocks, the point at which the 64-bit block counter wraps) using a
single key and IV. For the time being I do not see that being a problem.
Any sensible application is going to chunk the data and change the IV.

--
"Action without study is fatal. Study without action is futile."
    -- Mary Ritter Beard
