With some tweaks I got a 12x speedup. Original (using Iain's bench with 100000 iterations):
0.639475 seconds (4.10 M allocations: 279.236 MB, 1.96% gc time)
0.634781 seconds (4.10 M allocations: 279.236 MB, 1.90% gc time)

With 9ab84caa046d687928642a27c30c85336efc876c <https://github.com/simonster/crypto/commit/9ab84caa046d687928642a27c30c85336efc876c> from my fork (which avoids allocations, adds inbounds, inlines gf_mult, and avoids some branches):

0.091223 seconds
0.090931 seconds

With 3694517e7737fe35f59172666da9971f701189ab <https://github.com/simonster/crypto/commit/3694517e7737fe35f59172666da9971f701189ab>, which uses a lookup table for gf_mult:

0.062077 seconds
0.062132 seconds

With 6e05894856e2bec372b75cd52ae91f36731d2096 <https://github.com/simonster/crypto/commit/6e05894856e2bec372b75cd52ae91f36731d2096>, which uglifies shift_rows! for performance:

0.052652 seconds
0.052450 seconds

There is probably a way to make gf_mult faster without using a lookup table, since in many cases it's probably doing the same work several times, but I didn't put much thought into it.

Simon

On Saturday, September 12, 2015 at 2:29:49 PM UTC-4, Kristoffer Carlsson wrote:
>
> I played around with devectorization and made it allocation free but only
> got the time down by a factor of 2.
>
> Most of the time is spent in gf_mult anyway and I don't know how to
> optimize that one. If the C library is using a similar function, maybe
> looking at the generated code to see what is different.
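
For readers who want to see what the lookup-table idea for gf_mult looks like, here is a minimal sketch. It is not the code from the commits linked above; it only assumes that gf_mult multiplies two bytes in GF(2^8) modulo the AES polynomial 0x11b. The names gf_mult_slow and GF_TABLE are made up for illustration, and the syntax assumes Julia 1.x (⊻ for xor).

```julia
# Reference bit-by-bit multiply in GF(2^8), reducing modulo the AES
# polynomial x^8 + x^4 + x^3 + x + 1 (0x11b). Hypothetical helper name.
function gf_mult_slow(a::UInt8, b::UInt8)
    p = 0x00
    for _ in 1:8
        if (b & 0x01) != 0
            p ⊻= a              # add (xor) the current multiple of a
        end
        hi = a & 0x80           # remember whether the shift will overflow
        a <<= 1                 # UInt8 shift drops the overflowed bit
        if hi != 0
            a ⊻= 0x1b           # reduce by the low byte of 0x11b
        end
        b >>= 1
    end
    return p
end

# Precompute every product once: a 256x256 Matrix{UInt8} (64 KiB).
# Hypothetical name; a real implementation might use smaller per-constant tables.
const GF_TABLE = [gf_mult_slow(UInt8(a), UInt8(b)) for a in 0:255, b in 0:255]

# Table-driven multiply: one indexed load, no branches.
# @inbounds is safe because a UInt8 plus one is always in 1:256.
@inline gf_mult(a::UInt8, b::UInt8) = @inbounds GF_TABLE[Int(a) + 1, Int(b) + 1]
```

With these definitions, gf_mult(0x57, 0x83) == 0xc1, which matches the worked example in FIPS 197. AES only ever multiplies by a handful of constants (2 and 3 when encrypting; 9, 11, 13 and 14 when decrypting), so a few 256-entry tables would be enough and would be friendlier to the cache than the full table sketched here.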
