Hello, Thanks for the patch, it's interesting to see that more CPU architectures are getting covered.
I have no experience with MIPS and only briefly checked some documentation on the last weekend. So I'm going to skip DSP ASE optimizations for now. But the MIPS32R2 seemed to be simple enough, so I had a look at it (just out of curiosity). This MIPS32R2 code also runs fine in qemu and passes the 'make check' test, which gives a bit more confidence. Too bad that DSP ASE instructions are apparently not supported in qemu yet, so real hardware is needed to do any work with them. On Thursday 09 September 2010 18:30:28 Georgi Beloev wrote: > +// pixman_bool_t > +// mips32r2_pixman_fill32(uint32_t *bits, int stride, int x, int y, > +// int width, int height, uint32_t xor) > + > + .global mips32r2_pixman_fill32 > + .ent mips32r2_pixman_fill32 > + > +mips32r2_pixman_fill32: This looks mostly like a simple loop unrolling without any extra tricks. Though assembly code may surely be faster than C (maybe saving one instruction per loop iteration) because gcc generally has troubles optimizing simple loops: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37734 It's not directly related to your patch. But I wonder if it makes sense to also add a manual loop unrolling to the C variant of pixman_fill32 and the other similar functions in order to get better general performance on most of SIMD-less simple processors such as MIPS32R2, older ARMs and the others (who knows, maybe even opencores.org ones)? > +// mips32r2_composite_over_n_8_8888_inner(uint32_t *dest, const uint32_t > src, +// const uint8_t *mask, int width) > + > + .global mips32r2_composite_over_n_8_8888_inner > + .ent mips32r2_composite_over_n_8_8888_inner > + > +mips32r2_composite_over_n_8_8888_inner: > + beqz $a3, 1f > + sll $a3, $a3, 2 // width <<= 2 > + > + addu $a3, $a0, $a3 // dest_end = dest + width > + > + li $t7, 0x01000100 > + li $t8, 0x00FF00FF // RB_MASK > + li $t9, 0x00800080 > + > +0: > + lbu $t2, 0($a2) // mask > + > + // in() > + > + and $t5, $a1, $t8 > + mul $t3, $t5, $t2 > + > + lw $t0, 0($a0) // dest > + addiu $a2, $a2, 1 // mask++ > + > + srl $t6, $a1, 8 > + and $t6, $t6, $t8 > + mul $t4, $t6, $t2 > + > + addu $t3, $t3, $t9 > + srl $t5, $t3, 8 > + and $t5, $t5, $t8 > + addu $t3, $t3, $t5 > + srl $t3, $t3, 8 > + and $t3, $t3, $t8 > + > + addu $t4, $t4, $t9 > + srl $t6, $t4, 8 > + and $t6, $t6, $t8 > + addu $t4, $t4, $t6 > + srl $t4, $t4, 8 > + and $t4, $t4, $t8 > + > + sll $t4, $t4, 8 > + or $t1, $t3, $t4 > + > + > + not $t2, $t1 // ~in() > + srl $t2, $t2, 24 > + > + // over(): UN8_rb_MUL_UN8() and UN8_rb_ADD_UN8_rb() > + > + and $t5, $t0, $t8 > + mul $t3, $t5, $t2 > + > + addiu $a0, $a0, 4 // dest++ > + > + srl $t6, $t0, 8 > + and $t6, $t6, $t8 > + mul $t4, $t6, $t2 > + > + addu $t3, $t3, $t9 > + srl $t5, $t3, 8 > + and $t5, $t5, $t8 > + addu $t3, $t3, $t5 > + srl $t3, $t3, 8 > + and $t3, $t3, $t8 > + > + and $t5, $t1, $t8 > + addu $t3, $t3, $t5 > + srl $t5, $t3, 8 > + and $t5, $t5, $t8 > + subu $t5, $t7, $t5 > + or $t3, $t3, $t5 > + and $t3, $t3, $t8 > + > + addu $t4, $t4, $t9 > + srl $t6, $t4, 8 > + and $t6, $t6, $t8 > + addu $t4, $t4, $t6 > + srl $t4, $t4, 8 > + and $t4, $t4, $t8 > + > + srl $t6, $t1, 8 > + and $t6, $t6, $t8 > + addu $t4, $t4, $t6 > + srl $t6, $t4, 8 > + and $t6, $t6, $t8 > + subu $t6, $t7, $t6 > + or $t4, $t4, $t6 > + and $t4, $t4, $t8 > + > + sll $t4, $t4, 8 > + or $t3, $t3, $t4 > + > + bne $a0, $a3, 0b > + sw $t3, -4($a0) // dest > + > +1: > + jr $ra > + nop > + > + .end mips32r2_composite_over_n_8_8888_inner [...] > + while (height--) > + { > + dst = dst_line; > + dst_line += dst_stride; > + mask = mask_line; > + mask_line += mask_stride; > + > + mips32r2_composite_over_n_8_8888_inner(dst, src, mask, width); > + } > +} The over_n_8_8888 operation is primarily used for rendering text glyphs, so it: - typically works with small images, maybe having size somewhere around 10x20 - typically has a lot of 0x00 and 0xFF values in the mask - quite often uses opaque source Considering all of this, I really wonder about the performance of your assembly optimized function. Because unlike C code, it does not check for all those fully opaque/transparent special cases, but calculates everything taking the longest path without any shortcuts. Additionally, due to very small image widths, having call overhead per each scanline may affect performance noticeably. Did you benchmark the impact of your optimization on some real use case? The cairo-trace tool is quite useful for recording traces of applications and then playing them back for benchmarking purposes. Various profilers (oprofile, perf, ...) also can be useful for getting more or less relevant information about the performance of real applications. I think a good implementation of this function could do something like this: http://lists.freedesktop.org/archives/pixman/2010-September/000494.html The assembly code can provide 3 alternative code paths, all handling different special cases. This can significantly reduce the amount of work to be done per pixel in the majority of cases. Just because MIPS32R2 does not have a special instruction for saturated addition, the part implementing it looks very painful and slow in your code. It is possible to avoid saturated addition altogether at least for this function on the most common code paths. I'm not going to demand the absolutely best possible quality of MIPS32R2 optimizations as a requirement for approving patches (the "all or nothing" approach). But I guess, optimistically, getting more performance is in your best interests anyway. I just hope that my reply was somewhat useful. -- Best regards, Siarhei Siamashka
signature.asc
Description: This is a digitally signed message part.
_______________________________________________ Pixman mailing list [email protected] http://lists.freedesktop.org/mailman/listinfo/pixman
