I've written a fair amount of Assembly, and it's kind of hard to optimize. Most optimizations come from taking a different approach to the problem, not from some clever SIMD instruction. In fact, a single SIMD instruction dropped in at random can really slow down your code...
I think this is what Nim does really well: because of templates and macros, you can try a ton of different approaches and see what fits best. I love optimizing stuff.
