On Thursday, 2 June 2016 at 13:32:51 UTC, ZILtoid1991 wrote:
On Thursday, 2 June 2016 at 07:17:23 UTC, Johan Engelen wrote:
Could you also paste the D version of your code? Perhaps the compiler (LDC, GDC) will generate similarly vectorized code that is inlinable, etc.

ubyte[4] dest2 = *p;
dest2[1] = to!ubyte((src[1] * (src[0] + 1) + dest2[1] * (256 - src[0]))>>8);
dest2[2] = to!ubyte((src[2] * (src[0] + 1) + dest2[2] * (256 - src[0]))>>8);
dest2[3] = to!ubyte((src[3] * (src[0] + 1) + dest2[3] * (256 - src[0]))>>8);
*p = dest2;

The main problem with this is that it's much slower, even if I calculate the alpha blending factors only once. The assembly code does not seem to cost more than the "replace if alpha = 255" algorithm:

if(src[0] == 255){
*p = src;
}

It also seems I have quite a few problems with the assembly code, mostly with the pmulhuw instruction (it returns the upper 16 bits of the result, but I need the lower 16 bits as unsigned), and also with the pointers: the reads and write-backs don't land in their correct places, sometimes resulting in a flickering screen or wrong colors affecting neighboring pixels. Current assembly code:

I'd say the major portion of your speedup comes from doing 3 things at once. More specifically, because you're working with three 8-bit colors, you have 24 bits of data to work with, and by giving each color 8 extra bits for fixed-point arithmetic you can do several small multiplies with a single multiply instruction.

You'd probably get a similar effect without MMX by bit shifting before and after the calculation: pack the 3 colors into one wide integer, multiply by the alpha factor, then shift the results back out. This reduces 6 multiplies to a mere 2.

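// Widen to ulong before shifting: a ubyte promotes to int, and shifting an int by 32 is invalid.
// Each color gets its own 16-bit field (bits 32-47, 16-31, 0-15).
ulong tmp1 = (cast(ulong) src[1] << 32) | (cast(ulong) src[2] << 16) | src[3];
ulong tmp2 = (cast(ulong) dest2[1] << 32) | (cast(ulong) dest2[2] << 16) | dest2[3];

// One multiply per operand scales all three colors at once.
// Each field stays <= 255 * 257 = 65535, so nothing carries into its neighbor.
tmp1 *= src[0]+1;
tmp1 += tmp2*(256 - src[0]);

// The blended value is the high byte of each 16-bit field.
src[3] = cast(ubyte)((tmp1 >> 8) & 0xff);
src[2] = cast(ubyte)((tmp1 >> 24) & 0xff);
src[1] = cast(ubyte)((tmp1 >> 40) & 0xff);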
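
For anyone who wants to check the packed version against the straightforward one, here is a minimal sketch wrapped into functions. The names blendPacked and blendScalar are made up for illustration, and I'm assuming the same layout as in the thread (index 0 is alpha, indices 1 to 3 are the colors). Since ubyte operands promote to int in D, the sketch widens to ulong before shifting by 32.

```d
// Hypothetical helper: packed blend of src over dest, 16 bits per color.
ubyte[4] blendPacked(ubyte[4] src, ubyte[4] dest)
{
    // Widen first, then shift, so each color lands in its own 16-bit field.
    ulong s = (cast(ulong) src[1] << 32) | (cast(ulong) src[2] << 16) | src[3];
    ulong d = (cast(ulong) dest[1] << 32) | (cast(ulong) dest[2] << 16) | dest[3];

    immutable uint a = src[0];
    // Two multiplies cover all three colors. Each field is at most
    // 255 * (a + 1) + 255 * (256 - a) = 255 * 257 = 65535, so no carry
    // crosses into the neighboring field.
    immutable ulong r = s * (a + 1) + d * (256 - a);

    ubyte[4] result = dest;            // alpha of dest is kept as-is
    result[1] = cast(ubyte)((r >> 40) & 0xff);
    result[2] = cast(ubyte)((r >> 24) & 0xff);
    result[3] = cast(ubyte)((r >> 8) & 0xff);
    return result;
}

// Hypothetical reference: the per-channel formula from the original post.
ubyte[4] blendScalar(ubyte[4] src, ubyte[4] dest)
{
    ubyte[4] result = dest;
    foreach (i; 1 .. 4)
        result[i] = cast(ubyte)((src[i] * (src[0] + 1) + dest[i] * (256 - src[0])) >> 8);
    return result;
}

unittest
{
    ubyte[4] s = [128, 200, 100, 50];
    ubyte[4] d = [255, 10, 20, 30];
    assert(blendPacked(s, d) == blendScalar(s, d));
}
```

Compiling with something like dmd -unittest -main should run the check. Whether this actually beats the MMX version is something you'd have to measure.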


You could also widen the fields so that further adds or other calculations have more room to work with, though not by much. Say you gave yourself 20 bits per field rather than 16: each field could then hold values 16 times larger, which lets you get, say, the average of x values at no extra cost (if x is a power of 2) other than a little difference in how you write it :)
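
As a rough sketch of that idea, assuming 20-bit fields and exactly 16 pixels: the function below (blendAndAverage16 is a made-up name) accumulates 16 unshifted blend results and does the divide by 16 together with the fixed-point shift at the very end. The output alpha of 255 is an arbitrary choice of mine, not something from the thread.

```d
// Hypothetical helper: blends 16 src pixels over 16 dest pixels and averages
// the results, using 20-bit fields so the running sum has room to grow.
ubyte[4] blendAndAverage16(const ubyte[4][] src, const ubyte[4][] dest)
{
    assert(src.length == 16 && dest.length == 16);

    ulong acc = 0;
    foreach (i; 0 .. 16)
    {
        // Fields sit at bits 40-59, 20-39 and 0-19.
        immutable ulong s = (cast(ulong) src[i][1] << 40)
                          | (cast(ulong) src[i][2] << 20) | src[i][3];
        immutable ulong d = (cast(ulong) dest[i][1] << 40)
                          | (cast(ulong) dest[i][2] << 20) | dest[i][3];
        immutable uint a = src[i][0];
        // Each blend is at most 65535 per field; 16 of them sum to at most
        // 1048560, which still fits in 20 bits (max 1048575).
        acc += s * (a + 1) + d * (256 - a);
    }

    // Shift by 12 inside each field: 8 bits of fixed point plus 4 bits
    // for the divide by 16.
    ubyte[4] result;
    result[0] = 255; // assumption: output is treated as opaque
    result[1] = cast(ubyte)((acc >> 52) & 0xff);
    result[2] = cast(ubyte)((acc >> 32) & 0xff);
    result[3] = cast(ubyte)((acc >> 12) & 0xff);
    return result;
}
```

Note this rounds once at the end instead of once per pixel, so the result can differ by a small amount from averaging 16 individually rounded blends.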

You might still get a better result from MMX instructions if you have them in the right order, though. Don't forget that the MMX registers share the same register space as the x87 floating-point registers, so mixing the two is a big no-no.
