You can use the inline assembler for shufps, also for AVX.
Of course you can; I forgot to mention that. I do that in parts of pfft when it is compiled with DMD (but only for SSE). But because of the overhead of copying values from the stack to registers and back to the stack, or of calling a function, it only makes sense when the chunk of code you are replacing with inline assembly takes longer than a few cycles. This forces you to write larger chunks of code in inline assembly, which is not always practical.
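To illustrate the overhead, here is a minimal sketch of a single shufps wrapped in DMD inline assembly. It is not code from pfft; the function name shuffleSketch and the immediate 0x1B are made up for the example, and it assumes DMD targeting x86_64. Note that one useful instruction is surrounded by two loads, one store and three address computations.

```d
// Hypothetical helper, for illustration only (not taken from pfft).
// Assumes DMD with 64-bit inline assembly support.
float[4] shuffleSketch(float[4] a, float[4] b)
{
    version (D_InlineAsm_X86_64)
    {
        // Copy the arguments to locals so the asm block can take
        // their stack addresses regardless of how they were passed.
        float[4] x = a;
        float[4] y = b;
        float[4] r;
        asm
        {
            lea RAX, x;
            lea RCX, y;
            lea RDX, r;
            movups XMM0, [RAX];      // stack -> register
            movups XMM1, [RCX];      // stack -> register
            shufps XMM0, XMM1, 0x1B; // the only instruction we actually wanted
            movups [RDX], XMM0;      // register -> stack
        }
        // With immediate 0x1B the result is [x[3], x[2], y[1], y[0]].
        return r;
    }
    else
    {
        static assert(0, "this sketch assumes x86_64 inline assembly");
    }
}
```

If the values were already in registers in the surrounding code, the copies through the stack dominate the cost, which is why replacing a larger chunk at once tends to be the only way this pays off.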
