On Thu, 26 Jun 2014, Joseph M. Schwartz wrote:
> > > It seems that the implementation fails just as described in the blog
> > > post: as soon as ChaCha() is called with a length that is not a
> > > multiple of 64, all further uses of the function produce incorrect
> > > results.
> >
> > Thanks for pointing this out - I've just fixed this in -current. The
> > underlying ChaCha implementation (from djb) is written for single-shot
> > use (as exposed via CRYPTO_chacha_20(), which is also used for
> > ChaCha20Poly1305).
> > I obviously overlooked this when I added the ChaCha() and EVP interfaces.
> >
> > Regress tests already existed; however, they did not trigger this specific
> > issue. They've now been extended to cover the ChaCha() interface (which was
> > previously only exercised via the EVP regress) with partial and single-byte
> > writes.
>
> Thank you for your quick fixes. I tested your changes with the other tests
> provided here too, and it all appears to be good now.

Thanks for confirming.
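
For anyone following along, the essence of the fix is to retain any
unconsumed keystream in the context between calls, rather than assuming each
call starts on a 64-byte block boundary. A minimal sketch of the idea (this
is not the committed code - the names and the chacha_block() helper are made
up here):

    #include <stddef.h>

    #define BLOCKLEN 64

    struct stream_ctx {
        unsigned char ks[BLOCKLEN]; /* keystream for the current block */
        size_t unused;              /* keystream bytes not yet consumed */
        /* ... key/counter state used by the block function ... */
    };

    /* Hypothetical helper: writes the next 64 keystream bytes into
     * c->ks and advances the block counter. */
    void chacha_block(struct stream_ctx *c);

    void
    stream_xor(struct stream_ctx *c, unsigned char *out,
        const unsigned char *in, size_t len)
    {
        size_t i;

        for (i = 0; i < len; i++) {
            if (c->unused == 0) {
                chacha_block(c);
                c->unused = BLOCKLEN;
            }
            out[i] = in[i] ^ c->ks[BLOCKLEN - c->unused];
            c->unused--;
        }
    }

A production version would drain the leftover keystream, process whole
blocks, then stash the tail of a final partial block, but the state carried
across calls is the same.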

> > > The blog's author provided an implementation which does not suffer from
> > > this problem, along with test vectors:
> > > http://chacha20.insanecoding.org/ The license on that code appears to
> > > be friendly, although I don't know if the code itself is any good.
> >
> > Performance-wise the implementation would be rather ordinary (as noted in
> > the code) - the existing implementation processes a 64-byte block in
> > 4-byte pieces, whereas this implementation works a byte at a time.
>
> Since you brought up performance, curiosity got the better of me, so I
> ran some quick benchmarks. I modified test.c to increase the random tests
> from 1000 to 1000000 iterations per loop and timed both implementations.
>
> Your code on my system (amd64) compiled with gcc 4.7 and -O3 averages
> ~6.976u, whereas the other code averages ~6.344u.
>
> I found it quite surprising that the code which is advertised as
> unoptimized and operates a byte at a time runs quicker. I checked the
> assembly GCC produces for that byte-at-a-time code in the xoring
> section, and I see GCC emitting what look like SSE instructions! I
> didn't even know GCC was able to do that without help.
>
> Maybe something about the optimizations in your code is counterproductive?
> Or is my benchmark methodology somehow flawed?

The random tests encrypt small blocks of data (the first between 0 and 15 
bytes per write, the second between 65 and 128) using the same key and IV - 
this is basically the worst-case scenario, and the byte-at-a-time algorithm 
is almost certainly faster there.
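
To make that concrete, the timed loop has roughly this shape (a hypothetical
paraphrase of the modified test.c, not the actual code - per-write overhead
dominates at these sizes):

    /* Many short writes under a single key/IV. */
    for (i = 0; i < 1000000; i++) {
        size_t n = arc4random_uniform(16);  /* 0-15 byte write */

        ChaCha(&ctx, out, in, n);
    }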

However, this is not likely to be a good example of real-world use. It 
probably matches SSH interactive use, although even then the IV/block counter 
is reset between calls. For SSH bulk transfer you're likely to be encrypting 
8-16KB blocks at a time, for SSL up to 16KB blocks, and for file encryption 
anything up to 2^70 bytes.

If you are interested in doing further benchmarks, it would be worth 
measuring performance for 8KB, 16KB and larger blocks, changing the IV in 
between. Encrypting/decrypting a single large file (using a single key/IV) 
would also be an interesting measurement.
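
Something along these lines would do for the bulk case (a sketch against the
one-shot CRYPTO_chacha_20() interface, with the signature as I recall it from
chacha.h; the IV scheme here - a per-write sequence number - is arbitrary):

    #include <stdint.h>
    #include <string.h>

    #define BUFSZ (16 * 1024)

    void CRYPTO_chacha_20(unsigned char *out, const unsigned char *in,
        size_t len, const unsigned char key[32], const unsigned char iv[8],
        uint64_t counter);

    void
    bench_bulk(const unsigned char key[32], int iters)
    {
        static unsigned char in[BUFSZ], out[BUFSZ];
        unsigned char iv[8];
        int i;

        for (i = 0; i < iters; i++) {
            /* Fresh IV per 16KB write, the way a record
             * protocol might use a sequence number. */
            uint64_t seq = (uint64_t)i;

            memcpy(iv, &seq, sizeof(iv));
            CRYPTO_chacha_20(out, in, BUFSZ, key, iv, 0);
        }
    }

Time that with 8KB and 16KB buffers and the gap between the two
implementations should look rather different.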

It is also worth noting that the performance of the block-optimised code is 
still going to be well below that of a well-implemented assembly version.

> > Additionally, the code is
> > pretty horrific from a style perspective.
>
> I take it you mean the testing code, as that does look awful.
> The encryption code itself, IMO, looks neat and clean, and much more
> understandable than what is currently in your chacha_encrypt_bytes()
> function.

Just to be clear, chacha_encrypt_bytes() is not my code - the original was 
written by D. J. Bernstein (http://cr.yp.to/chacha.html). I just imported it 
and applied KNF to it.

> Looking at this new code, I also see it has an interesting comment
> regarding incrementing the counter. Is there any validity to the approach
> it's taking? Would that be more secure for very long streams?

It should only make a difference if you encrypt more than 2^70 bytes using a 
single key and IV - the 64-bit block counter covers 2^64 blocks of 64 bytes 
each, i.e. 2^70 bytes. For the time being I do not see that being a problem. 
Any sensible application is going to chunk the data and change the IV.
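
For reference, the reference implementation advances the counter like this
(quoting from memory of djb's chacha_encrypt_bytes(), where input[12] and
input[13] hold the 64-bit block counter):

    x->input[12] = PLUSONE(x->input[12]);
    if (!x->input[12]) {
        x->input[13] = PLUSONE(x->input[13]);
        /* stopping at 2^70 bytes per nonce is user's responsibility */
    }

So the carry between the two words is already handled; the 2^70 figure is 
simply where the 64-bit counter runs out.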
-- 

    "Action without study is fatal. Study without action is futile."
        -- Mary Ritter Beard
