I played around with devectorization and made it allocation-free, but that 
only got the time down by a factor of 2.

Most of the time is spent in gf_mult anyway, and I don't know how to optimize 
that one. If the C library uses a similar function, it might be worth looking 
at the generated code for both to see what is different.
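
For comparison, here is a rough sketch of two common ways a C library might implement this kind of multiply. I have not checked what the library actually does, so everything here is an assumption: the field (GF(2^8)), the reducing polynomial (0x11d), and the function names gf_mult_shift, gf_mult_table, gf_init_tables and gf_methods_agree are all made up for illustration. The point is only that a table-based version replaces a bit loop with three lookups, which is the kind of difference the generated code would show.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: GF(2^8) with reducing polynomial x^8 + x^4 + x^3 + x^2 + 1. */
#define GF_POLY 0x11d

/* Shift-and-add multiply: one loop iteration per bit of b that remains. */
uint8_t gf_mult_shift(uint8_t a, uint8_t b) {
    unsigned acc = 0, x = a;
    while (b) {
        if (b & 1) acc ^= x;           /* add (XOR) the current multiple of a */
        x <<= 1;                       /* multiply x by the field element 2 */
        if (x & 0x100) x ^= GF_POLY;   /* reduce back into 8 bits */
        b >>= 1;
    }
    return (uint8_t)acc;
}

/* Log/antilog tables; gf_exp is doubled so the index sum needs no modulo. */
uint8_t gf_exp[510];
uint8_t gf_log[256];

void gf_init_tables(void) {
    unsigned x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= GF_POLY;
    }
    assert(x == 1);                    /* 2 generates all 255 nonzero elements */
    for (int i = 255; i < 510; i++) gf_exp[i] = gf_exp[i - 255];
}

/* Table multiply: a*b = exp(log a + log b), with zero handled separately. */
uint8_t gf_mult_table(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}

/* Returns 1 if both versions agree on every pair of inputs, else 0. */
int gf_methods_agree(void) {
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (gf_mult_shift((uint8_t)a, (uint8_t)b) !=
                gf_mult_table((uint8_t)a, (uint8_t)b))
                return 0;
    return 1;
}
```

gf_init_tables() has to be called once before gf_mult_table() is used. Whether the table version is actually faster depends on cache behaviour and on how the surrounding loop is compiled, so it would need to be measured.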
