Re: Can I get a more in-depth guide about the inline assembler?

Era Scarecrow via Digitalmars-d-learn Thu, 02 Jun 2016 09:41:03 -0700

On Thursday, 2 June 2016 at 13:32:51 UTC, ZILtoid1991 wrote:

On Thursday, 2 June 2016 at 07:17:23 UTC, Johan Engelen wrote:
Could you also paste the D version of your code? Perhaps the compiler (LDC, GDC) will generate similarly vectorized code that is inlinable, etc.
ubyte[4] dest2 = *p;
dest2[1] = to!ubyte((src[1] * (src[0] + 1) + dest2[1] * (256 - src[0])) >> 8);
dest2[2] = to!ubyte((src[2] * (src[0] + 1) + dest2[2] * (256 - src[0])) >> 8);
dest2[3] = to!ubyte((src[3] * (src[0] + 1) + dest2[3] * (256 - src[0])) >> 8);
*p = dest2;
The main problem with this is that it's much slower, even if I calculate the alpha blending values only once. The assembly code does not seem to have a higher impact than the "replace if alpha = 255" algorithm:
if(src[0] == 255){
*p = src;
}
It also seems I have quite a few problems with the assembly code, mostly with the pmulhuw instruction (it returns the higher 16 bits of the result, I need the lower 16 bits as unsigned), and also with the pointers, as the reads and write-backs don't land in their correct places, sometimes resulting in a flickering screen or wrong colors affecting neighboring pixels. Current assembly code:

I'd say the major portion of your speedup happens because you're doing 3 things at once. Specifically, because you're working with three 8-bit colors, you have 24 bits of data to work with, and by giving each one 8 extra bits of fixed-point headroom, a single instruction can do 4 small multiplies at once.

You'd probably get a similar effect from bit shifting before and after the calculation, since you're working with 3 colors and the alpha/multiplier. This assumes you do it without MMX. (It reduces 6 multiplies to a mere 2.)


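// Pack one channel per 16-bit lane. The casts matter: a ubyte promotes to a
// 32-bit int, and shifting an int left by 32 is not valid.
ulong tmp1 = (cast(ulong) src[1] << 32) | (cast(ulong) src[2] << 16) | src[3];
ulong tmp2 = (cast(ulong) dest2[1] << 32) | (cast(ulong) dest2[2] << 16) | dest2[3];

tmp1 *= src[0] + 1;
tmp1 += tmp2 * (256 - src[0]);

// The upper 8 bits of each lane hold the blended channel.
dest2[3] = cast(ubyte)(tmp1 >> 8);
dest2[2] = cast(ubyte)(tmp1 >> 24);
dest2[1] = cast(ubyte)(tmp1 >> 40);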
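
For anyone who wants to check the packing trick against the per-channel formula, here is a minimal self-contained sketch. The function names blendPacked and blendScalar are made up for illustration, and the pixel layout [alpha, c1, c2, c3] is assumed from the code quoted earlier in the thread. The reason the lanes never spill into each other is that the two weights always add up to 257, so a lane tops out at 255 * 257 = 65535.

```d
// Assumed pixel layout (from the quoted code): [alpha, c1, c2, c3].
// Both function names are hypothetical, chosen for this sketch only.

/// Blend the three color channels with two 64-bit multiplies.
ubyte[4] blendPacked(const ubyte[4] src, const ubyte[4] dest)
{
    // One 16-bit lane per channel: 8 bits of value, 8 bits of headroom.
    immutable ulong s = (cast(ulong) src[1] << 32) | (cast(ulong) src[2] << 16) | src[3];
    immutable ulong d = (cast(ulong) dest[1] << 32) | (cast(ulong) dest[2] << 16) | dest[3];

    immutable uint a = src[0];
    // (a + 1) + (256 - a) == 257, so every lane stays <= 255 * 257 == 65535
    // and nothing carries into the neighboring lane.
    immutable ulong t = s * (a + 1) + d * (256 - a);

    ubyte[4] result = dest;             // slot 0 (alpha) is left as it was in dest
    result[1] = cast(ubyte)(t >> 40);   // upper byte of lane 2
    result[2] = cast(ubyte)(t >> 24);   // upper byte of lane 1
    result[3] = cast(ubyte)(t >> 8);    // upper byte of lane 0
    return result;
}

/// Reference version: the same formula applied one channel at a time.
ubyte[4] blendScalar(const ubyte[4] src, const ubyte[4] dest)
{
    ubyte[4] result = dest;
    foreach (i; 1 .. 4)
        result[i] = cast(ubyte)((src[i] * (src[0] + 1) + dest[i] * (256 - src[0])) >> 8);
    return result;
}

unittest
{
    ubyte[4] src = [128, 255, 0, 100];
    ubyte[4] dest = [0, 0, 255, 100];
    assert(blendPacked(src, dest) == blendScalar(src, dest));
}
```

Whether this actually beats the plain version depends on the compiler and on how the pixels get loaded and stored, so it is worth measuring rather than assuming.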

You could also increase the bit precision, so that if you decide to do further adds or other calculations there is more room to work with, though not much. Say you gave yourself 20 bits per value rather than 16: each value can then hold up to 16x more, which lets you get, say, the average of x values at no extra cost (if x is a power of 2), other than a little difference in how you write it :)
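
A rough sketch of the 20-bit idea, under a few assumptions of mine: the helper names pack and blendAverage are hypothetical, the pixel layout is [alpha, c1, c2, c3] as in the quoted code, and the count is limited to 1, 2, 4, 8 or 16 so that the average is a plain shift. Three 20-bit lanes take 60 bits, so they still fit in one ulong, and 16 blended values of at most 65535 each sum to 1048560, just under 2^20.

```d
import core.bitop : bsr;

enum laneBits = 20;                              // 16 bits of result + 4 spare
enum ulong laneMask = (1UL << laneBits) - 1;

/// Hypothetical helper: put the three color channels into 20-bit lanes.
ulong pack(const ubyte[4] px)
{
    return (cast(ulong) px[1] << (2 * laneBits))
         | (cast(ulong) px[2] << laneBits)
         | px[3];
}

/// Blend every source pixel over dest and return the average of the results.
/// srcs.length must be 1, 2, 4, 8 or 16.
ubyte[4] blendAverage(const ubyte[4][] srcs, const ubyte[4] dest)
{
    assert(srcs.length >= 1 && srcs.length <= 16
        && (srcs.length & (srcs.length - 1)) == 0);

    immutable ulong d = pack(dest);
    ulong sum = 0;
    foreach (src; srcs)
    {
        immutable uint a = src[0];
        // Each blend adds at most 65535 per lane; 16 of them stay below 2^20.
        sum += pack(src) * (a + 1) + d * (256 - a);
    }

    // One shift removes the fixed-point scale (8 bits) and divides by the count.
    immutable shift = 8 + bsr(srcs.length);

    ubyte[4] result = dest;
    foreach (i; 0 .. 3)
    {
        immutable ulong lane = (sum >> (i * laneBits)) & laneMask;
        result[3 - i] = cast(ubyte)(lane >> shift);   // lane 0 holds channel 3
    }
    return result;
}
```

The only real change from the 16-bit version is the lane width and the extra mask when unpacking, since the spare bits now carry data.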

Although you might still get a better result from MMX instructions if you have them in the right order. Don't forget, though, that MMX uses the same register space as floating point, so mixing the two is a big no-no.
