On Sunday, 23 March 2014 at 08:22:32 UTC, Vladimir Panteleev
wrote:
> I'm not familiar enough with vector instruction sets of current
> CPUs to answer this confidently. E.g. if there exists an
> integer vector multiply-and-add operation, then that could be
> used for fast software alpha blending. That operation's
> restrictions would dictate the optimal memory layout of the
> image. E.g. if the operation requires that the bytes to
> multiply and add are contiguous in memory, then it follows that
> the image should be represented with each channel as a separate
> sub-image.
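There is the PMADDWD instruction, which can be used for 8-bit
blending: it multiplies packed signed 16-bit words and adds each
adjacent pair of products, so the 8-bit channels are unpacked to
words first. I don't think it requires a particular image layout
from the implementation; blending would probably be dominated by
memory accesses anyway.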
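
To make the idea concrete, here is a rough scalar model in C of what one
32-bit lane of PMADDWD computes and how that maps onto blending a single
8-bit channel. This is only a sketch under my own assumptions: the names
pmaddwd_lane and blend_channel are made up for illustration, alpha is
taken to be 0..255, and a real SIMD version would process several lanes
at once and avoid the division.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of PMADDWD: multiply two pairs of
   signed 16-bit words and add the two products.
   (pmaddwd_lane is a hypothetical name, not a real intrinsic.) */
static int32_t pmaddwd_lane(int16_t a0, int16_t a1, int16_t b0, int16_t b1)
{
    return (int32_t)a0 * b0 + (int32_t)a1 * b1;
}

/* Blend one 8-bit channel as (src*alpha + dst*(255 - alpha)) / 255,
   rounded to nearest. The (src, dst) pair is multiplied by the
   (alpha, 255 - alpha) pair in a single multiply-and-add step.
   (blend_channel is a hypothetical helper for illustration.) */
static uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha)
{
    int32_t sum = pmaddwd_lane((int16_t)src, (int16_t)dst,
                               (int16_t)alpha, (int16_t)(255 - alpha));
    /* sum is at most 255 * 255, so the result fits in 8 bits. */
    return (uint8_t)((sum + 127) / 255);
}
```

Note that the values being paired up are a source and a destination
sample of the same channel, widened to 16 bits, rather than
neighbouring bytes of one image.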