On Wed, May 16, 2012 at 5:27 AM, Siarhei Siamashka
<siarhei.siamas...@gmail.com> wrote:
> +/*
> + * Multiply pixel (a8) with single pixel (a8r8g8b8). It requires maskLSR
> + * needed for rounding process. maskLSR must have following value:
> + *   li       maskLSR, 0x00ff00ff
> + */
> +.macro MIPS_UN8x4_MUL_UN8 s_8888,  \
> +                          m_8,     \
> +                          d_8888,  \
> +                          maskLSR, \
> +                          scratch1, scratch2, scratch3
> +    replv.ph          \m_8,      \m_8                 /* 0 | M | 0 | M */
> +    muleu_s.ph.qbl    \scratch1, \s_8888,   \m_8      /* A*M | R*M */
> +    muleu_s.ph.qbr    \scratch2, \s_8888,   \m_8      /* G*M | B*M */
> +    shra_r.ph         \scratch3, \scratch1, 8
> +    shra_r.ph         \d_8888,   \scratch2, 8
> +    and               \scratch3, \scratch3, \maskLSR  /* 0 |A*M| 0 |R*M */
> +    and               \d_8888,   \d_8888,   \maskLSR  /* 0 |G*M| 0 |B*M */
> +    addq.ph           \scratch1, \scratch1, \scratch3 /* A*M+A*M | R*M+R*M */
> +    addq.ph           \scratch2, \scratch2, \d_8888   /* G*M+G*M | B*M+B*M */
> +    shra_r.ph         \scratch1, \scratch1, 8
> +    shra_r.ph         \scratch2, \scratch2, 8
> +    precr.qb.ph       \d_8888,   \scratch1, \scratch2
> +.endm
>
> A possible alternative way is to just use a single MULQ_RS.W
> instruction for each color component. That's total 5 instructions
> because 8-bit alpha value from mask needs to be premultiplied by
> 8421504. A test program is listed below:
>
> /***********************/
>
> #include <stdio.h>
> #include <stdint.h>
>
> int mul_un8(int a, int b)
> {
> #if 1
>     int t = a * b + 0x80;
>     return (t + (t >> 8)) >> 8;
> #else
>     return (a * b + 127) / 255;
> #endif
> }
>
> int mul_un8_mips(int a, int b)
> {
>     int c;
>     b *= 8421504;
> #if 1
>     asm ("mulq_rs.w %0, %1, %2" : "=r" (c) : "r" (a), "r" (b));
> #else
>     c = ((int64_t)a * b + (1 << 30)) >> 31;
> #endif
>     return c;
> }
>
> int main()
> {
>     int a, b;
>     for (a = 0; a < 256; a++)
>     {
>         for (b = 0; b < 256; b++)
>         {
>             if (mul_un8(a, b) != mul_un8_mips(a, b))
>             {
>                 printf("test failed! a=%d b=%d\n", a, b);
>                 return 1;
>             }
>         }
>     }
>     printf("test passed\n");
>     return 0;
> }
>
> /***********************/
>
> There is only one problem with MULQ_RS.W instruction: it seems to have
> a huge ~14 (!) cycles latency. But the throughput is ok (1 cycle per
> instruction). So in order to hide this latency, some serious loop
> unrolling may be needed (possibly up to handling 4 pixels at once).
> The other MIPS DSP ASE fast path functions may also try to use
> MULQ_RS.W
Maybe a bit more explanation would be useful. When multiplying an 8-bit
color component by an 8-bit alpha for the OVER operator, pixman actually
wants to do:

    x' = (x * a + (255 / 2)) / 255;

This is a division by 255 with the result rounded to the nearest integer.
Because integer division is slow, the C implementation replaces it with
shifts and additions:
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-combine.h.template?id=pixman-0.24.4#n27

    t = x * a + 0x80;
    x' = (t + (t >> 8)) >> 8;

This method of calculation is also good because the intermediate results
fit in unsigned 16-bit variables. That in turn allows a SIMD-like trick
which processes two color components at once on 32-bit systems:
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-combine.h.template?id=pixman-0.24.4#n40

    /* xy = 0x00AB00CD, where AB is one color component, CD - another */
    t = xy * a + 0x00800080;
    xy' = ((t + ((t >> 8) & 0x00FF00FF)) >> 8) & 0x00FF00FF;

But modern processors may support some nice instructions which can do
this job even better. It makes sense to undo the C optimizations and look
at the original ((x * a + (255 / 2)) / 255) formula again. MIPS DSP ASE
has a special instruction MULQ_RS.W for rounded fixed point Q31
multiplication (((int64_t)a * b + (1 << 30)) >> 31), and it can be used
quite conveniently here because we get the shift and the rounding for
free. We just need to use the Q31 representation of 1/255 in order to
multiply by the reciprocal instead of dividing. MIPS DSP ASE also supports
Q15 fixed point multiplication and it could have been even nicer, but Q15
precision is apparently insufficient for getting bit exact results in
this case.

Looking at the original intended formula is always useful, because it
allows us to try and benchmark alternative implementations of the same
calculation and select the best one for the target hardware.
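To make the comparison above easier to check on any machine, here is a
small portable C sketch of the formulas discussed. It is only an
illustration under a few assumptions: all function names are mine and
do not come from pixman, the Q31 variant is the plain C model of the
rounded multiplication given above rather than the real MULQ_RS.W
instruction, and the Q15 variant is a simplified model of a rounded Q15
multiply (not the exact semantics of any DSP ASE instruction). For Q15
the reciprocal constant is assumed to be 128, because 1/255 is about
128.5 in Q15 and using 129 would overflow a signed 16-bit operand when
alpha is 255.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: divide by 255 and round to the nearest integer. */
int mul_un8_ref(int x, int a)
{
    assert(x >= 0 && x <= 255 && a >= 0 && a <= 255);
    return (x * a + 127) / 255;
}

/* Shift-and-add replacement; intermediates fit in 16 bits. */
int mul_un8_shift(int x, int a)
{
    int t = x * a + 0x80;
    return (t + (t >> 8)) >> 8;
}

/* Two components packed as 0x00AB00CD, processed in one go. */
uint32_t mul_un8x2_packed(uint32_t xy, uint32_t a)
{
    uint32_t t = xy * a + 0x00800080u;
    return ((t + ((t >> 8) & 0x00FF00FFu)) >> 8) & 0x00FF00FFu;
}

/* C model of rounded Q31 multiplication by (a / 255). */
int mul_un8_q31(int x, int a)
{
    int32_t b = a * 8421504;   /* 2^31 / 255, truncated; fits in int32 */
    return (int)(((int64_t)x * b + (1 << 30)) >> 31);
}

/* Assumed model of rounded Q15 multiplication by (a / 255). */
int mul_un8_q15(int x, int a)
{
    int32_t b = a * 128;       /* 129 would exceed 32767 for a = 255 */
    return (x * b + (1 << 14)) >> 15;
}

/* Count inputs where a scalar variant differs from the reference. */
int count_mismatches(int (*f)(int, int))
{
    int x, a, n = 0;
    for (x = 0; x < 256; x++)
        for (a = 0; a < 256; a++)
            if (f(x, a) != mul_un8_ref(x, a))
                n++;
    return n;
}

/* Same check for the packed variant, same value in both halves. */
int count_packed_mismatches(void)
{
    int x, a, n = 0;
    for (x = 0; x < 256; x++)
        for (a = 0; a < 256; a++)
        {
            uint32_t r = (uint32_t)mul_un8_ref(x, a);
            uint32_t xy = ((uint32_t)x << 16) | (uint32_t)x;
            if (mul_un8x2_packed(xy, (uint32_t)a) != ((r << 16) | r))
                n++;
        }
    return n;
}
```

With these models, count_mismatches(mul_un8_shift) and
count_mismatches(mul_un8_q31) should both return 0, while
count_mismatches(mul_un8_q15) returns a nonzero count. For example,
255 * 255 gives 254 instead of 255 in the Q15 model, because with the
constant 128 it effectively divides by 256.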
Maybe the MIPS r5g6b5 -> x8r8g8b8 pixel format conversion could also
borrow some ideas from the other discussion thread:
http://lists.freedesktop.org/archives/pixman/2012-May/001958.html

-- 
Best regards,
Siarhei Siamashka