I played around with devectorization and made it allocation-free, but that only got the time down by a factor of 2.
Most of the time is spent in gf_mult anyway, and I don't know how to optimize that one. If the C library uses a similar function, it might be worth looking at the generated code for both to see what is different.
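For what it's worth, one common way C libraries speed up this kind of function is to replace the bitwise shift-and-add loop with log/exp table lookups. I don't know whether the library in question does this, so treat the following as a rough sketch only. It assumes GF(2^8) with the reduction polynomial 0x11d, which may not match the field used here, and the names `gf_mult_slow`, `gf_mult_table` and `gf_init_tables` are made up for illustration.

```c
#include <stdint.h>

/* Assumption: GF(2^8) with reduction polynomial x^8 + x^4 + x^3 + x^2 + 1.
   The field actually used in the thread may differ. */
#define GF_POLY 0x11d

static uint8_t gf_exp[512]; /* gf_exp[i] = 2^i in the field, repeated so indices up to 508 work */
static uint8_t gf_log[256]; /* gf_log[x] = i such that 2^i == x (undefined for x == 0) */

/* Reference version: shift-and-add multiplication, one bit of b per iteration. */
uint8_t gf_mult_slow(uint8_t a, uint8_t b) {
    unsigned acc = 0, aa = a;
    while (b) {
        if (b & 1) acc ^= aa;           /* add (XOR) the current multiple of a */
        aa <<= 1;                       /* multiply a by x */
        if (aa & 0x100) aa ^= GF_POLY;  /* reduce modulo the field polynomial */
        b >>= 1;
    }
    return (uint8_t)acc;
}

/* Fill the tables once by walking the powers of the generator 2. */
void gf_init_tables(void) {
    unsigned x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= GF_POLY;
    }
    /* Repeat the cycle so log(a) + log(b) needs no modulo 255. */
    for (int i = 255; i < 512; i++) gf_exp[i] = gf_exp[i - 255];
}

/* Table version: a * b = exp(log a + log b), with zero handled separately. */
uint8_t gf_mult_table(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}
```

`gf_init_tables()` has to be called once before `gf_mult_table` is used. Whether the lookup version actually wins depends on the field size and on how the inner loop is structured, so it would still need to be benchmarked against the current gf_mult.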
