With some tweaks I got a 12x speedup. Original (using Iain's bench with 
100000 iterations):

 0.639475 seconds (4.10 M allocations: 279.236 MB, 1.96% gc time) 
 0.634781 seconds (4.10 M allocations: 279.236 MB, 1.90% gc time)

With 9ab84caa046d687928642a27c30c85336efc876c 
<https://github.com/simonster/crypto/commit/9ab84caa046d687928642a27c30c85336efc876c>
 
from my fork (which avoids allocations, adds inbounds, inlines gf_mult, and 
avoids some branches):

 0.091223 seconds 
 0.090931 seconds

With 3694517e7737fe35f59172666da9971f701189ab 
<https://github.com/simonster/crypto/commit/3694517e7737fe35f59172666da9971f701189ab>,
 
which uses a lookup table for gf_mult:

 0.062077 seconds
 0.062132 seconds

With 6e05894856e2bec372b75cd52ae91f36731d2096 
<https://github.com/simonster/crypto/commit/6e05894856e2bec372b75cd52ae91f36731d2096>,
 
which uglifies shift_rows! for performance:

 0.052652 seconds 
 0.052450 seconds

There is probably a way to make gf_mult faster without using a lookup 
table, since in many cases it's probably doing the same work several times, 
but I didn't put much thought into it.

Simon

On Saturday, September 12, 2015 at 2:29:49 PM UTC-4, Kristoffer Carlsson 
wrote:
>
> I played around with devectorization and made it allocation free but only 
> got the time down by a factor of 2.
>
> Most of the time is spent in gf_mult anyway and I don't know how to 
> optimize that one. If the C library is using a similar function, maybe 
> looking at the generated code to see what is different.
>

Reply via email to