>> I'm about to light up the build farm with a trial commit of the
>> compiler instructions stuff.
> Amazingly that seemed to work.

Thanks for committing. Sorry about missing the .h file from the patch.
The two commits look good to me.

I can confirm that compiling with CFLAGS="-O2 -march=native" will
vectorize the committed code on GCC 4.7.
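
For anyone who wants to try other compilers, here is a minimal sketch of
the general loop shape that auto-vectorizers tend to handle well. It is
not the committed code: lane_checksum, N_LANES and MIX_PRIME are
hypothetical names chosen for illustration, and the mixing step is only
an assumed FNV-style xor-and-multiply. The point is that each partial
sum depends only on its own previous value, so the inner loop has no
cross-iteration dependency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_LANES   32          /* hypothetical: number of independent partial sums */
#define MIX_PRIME 16777619u   /* 32-bit FNV prime, used here only as a multiplier */

/*
 * Illustrative checksum-like routine (an assumption, not the committed
 * code). nwords must be a multiple of N_LANES.
 */
uint32_t
lane_checksum(const uint32_t *data, size_t nwords)
{
    uint32_t sums[N_LANES] = {0};
    uint32_t result = 0;
    size_t   i, j;

    assert(nwords % N_LANES == 0);

    /* Outer loop walks the buffer one block of N_LANES words at a time. */
    for (i = 0; i < nwords; i += N_LANES)
    {
        /* Each lane j only reads and writes sums[j], so lanes are independent. */
        for (j = 0; j < N_LANES; j++)
            sums[j] = (sums[j] ^ data[i + j]) * MIX_PRIME;
    }

    /* Fold the partial sums into a single value. */
    for (j = 0; j < N_LANES; j++)
        result ^= sums[j];

    return result;
}
```

Whether a given compiler actually vectorizes this depends on its
version and flags; on GCC the loop vectorizer is controlled by
-ftree-vectorize, and inspecting the generated assembly is the most
reliable way to confirm the result.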

I also checked the situation on clang. clang-3.2 isn't able to
vectorize the loop even with vectorization options. I will check what
is stopping it. If any volunteer has a working build setup with ICC or
MSVC and is willing to run a couple of test compiles, I think we can
achieve vectorization there too.

> ISTM that we also need this patch to put memory barriers in place
> otherwise the code might be rearranged.

The compiler and CPU both have to preserve correctness when
rearranging code, so I don't think we care about it here. It might
matter if these routines could be called concurrently by multiple
backends for a single buffer, but in that case memory barriers won't
be enough; we'd need full exclusion.

