Re: [Pixman] [cairo] [PATCH] Added MIPS32R2 and MIPS DSP ASE optimized functions

Siarhei Siamashka Mon, 13 Sep 2010 06:50:08 -0700

Hello,

Thanks for the patch, it's interesting to see that more CPU architectures are
getting covered.

I have no experience with MIPS and only briefly checked some documentation on
the last weekend. So I'm going to skip DSP ASE optimizations for now. But the
MIPS32R2 seemed to be simple enough, so I had a look at it (just out of
curiosity). This MIPS32R2 code also runs fine in qemu and passes the
'make check' test, which gives a bit more confidence. Too bad that DSP ASE
instructions are apparently not supported in qemu yet, so real hardware is
needed to do any work with them.

On Thursday 09 September 2010 18:30:28 Georgi Beloev wrote:
> +// pixman_bool_t
> +// mips32r2_pixman_fill32(uint32_t *bits, int stride, int x, int y,
> +//   int width, int height, uint32_t  xor)
> +
> +     .global         mips32r2_pixman_fill32
> +     .ent            mips32r2_pixman_fill32
> +
> +mips32r2_pixman_fill32:

This looks mostly like a simple loop unrolling without any extra tricks. Though
assembly code may surely be faster than C (maybe saving one instruction per
loop iteration) because gcc generally has troubles optimizing simple loops:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37734

It's not directly related to your patch. But I wonder if it makes sense to also
add a manual loop unrolling to the C variant of pixman_fill32 and the other
similar functions in order to get better general performance on most of
SIMD-less simple processors such as MIPS32R2, older ARMs and the others (who
knows, maybe even opencores.org ones)?

> +// mips32r2_composite_over_n_8_8888_inner(uint32_t *dest, const uint32_t
> src, +//      const uint8_t *mask, int width)
> +
> +     .global         mips32r2_composite_over_n_8_8888_inner
> +     .ent            mips32r2_composite_over_n_8_8888_inner
> +
> +mips32r2_composite_over_n_8_8888_inner:
> +     beqz            $a3, 1f
> +     sll             $a3, $a3, 2     // width <<= 2
> +
> +     addu            $a3, $a0, $a3   // dest_end = dest + width
> +
> +     li              $t7, 0x01000100
> +     li              $t8, 0x00FF00FF // RB_MASK
> +     li              $t9, 0x00800080
> +
> +0:
> +     lbu             $t2, 0($a2)     // mask
> +
> +     // in()
> +
> +     and             $t5, $a1, $t8
> +     mul             $t3, $t5, $t2
> +
> +     lw              $t0, 0($a0)     // dest
> +     addiu           $a2, $a2, 1     // mask++
> +
> +     srl             $t6, $a1, 8
> +     and             $t6, $t6, $t8
> +     mul             $t4, $t6, $t2
> +
> +     addu            $t3, $t3, $t9
> +     srl             $t5, $t3, 8
> +     and             $t5, $t5, $t8
> +     addu            $t3, $t3, $t5
> +     srl             $t3, $t3, 8
> +     and             $t3, $t3, $t8
> +
> +     addu            $t4, $t4, $t9
> +     srl             $t6, $t4, 8
> +     and             $t6, $t6, $t8
> +     addu            $t4, $t4, $t6
> +     srl             $t4, $t4, 8
> +     and             $t4, $t4, $t8
> +
> +     sll             $t4, $t4, 8
> +     or              $t1, $t3, $t4
> +
> +
> +     not             $t2, $t1        // ~in()
> +     srl             $t2, $t2, 24
> +
> +     // over(): UN8_rb_MUL_UN8() and UN8_rb_ADD_UN8_rb()
> +
> +     and             $t5, $t0, $t8
> +     mul             $t3, $t5, $t2
> +
> +     addiu           $a0, $a0, 4     // dest++
> +
> +     srl             $t6, $t0, 8
> +     and             $t6, $t6, $t8
> +     mul             $t4, $t6, $t2
> +
> +     addu            $t3, $t3, $t9
> +     srl             $t5, $t3, 8
> +     and             $t5, $t5, $t8
> +     addu            $t3, $t3, $t5
> +     srl             $t3, $t3, 8
> +     and             $t3, $t3, $t8
> +
> +     and             $t5, $t1, $t8
> +     addu            $t3, $t3, $t5
> +     srl             $t5, $t3, 8
> +     and             $t5, $t5, $t8
> +     subu            $t5, $t7, $t5
> +     or              $t3, $t3, $t5
> +     and             $t3, $t3, $t8
> +
> +     addu            $t4, $t4, $t9
> +     srl             $t6, $t4, 8
> +     and             $t6, $t6, $t8
> +     addu            $t4, $t4, $t6
> +     srl             $t4, $t4, 8
> +     and             $t4, $t4, $t8
> +
> +     srl             $t6, $t1, 8
> +     and             $t6, $t6, $t8
> +     addu            $t4, $t4, $t6
> +     srl             $t6, $t4, 8
> +     and             $t6, $t6, $t8
> +     subu            $t6, $t7, $t6
> +     or              $t4, $t4, $t6
> +     and             $t4, $t4, $t8
> +
> +     sll             $t4, $t4, 8
> +     or              $t3, $t3, $t4
> +
> +     bne             $a0, $a3, 0b
> +     sw              $t3, -4($a0)    // dest
> +
> +1:
> +     jr              $ra
> +     nop
> +
> +     .end            mips32r2_composite_over_n_8_8888_inner

[...]

> +    while (height--)
> +    {
> +             dst = dst_line;
> +             dst_line += dst_stride;
> +             mask = mask_line;
> +             mask_line += mask_stride;
> +
> +             mips32r2_composite_over_n_8_8888_inner(dst, src, mask, width);
> +    }
> +}

The over_n_8_8888 operation is primarily used for rendering text glyphs, so it:
- typically works with small images, maybe having size somewhere around 10x20
- typically has a lot of 0x00 and 0xFF values in the mask
- quite often uses opaque source

Considering all of this, I really wonder about the performance of your assembly
optimized function. Because unlike C code, it does not check for all those 
fully opaque/transparent special cases, but calculates everything taking the
longest path without any shortcuts. Additionally, due to very small image
widths, having call overhead per each scanline may affect performance
noticeably. Did you benchmark the impact of your optimization on some real use
case? The cairo-trace tool is quite useful for recording traces of applications 
and then playing them back for benchmarking purposes. Various profilers
(oprofile, perf, ...) also can be useful for getting more or less
relevant information about the performance of real applications.

I think a good implementation of this function could do something like this:
http://lists.freedesktop.org/archives/pixman/2010-September/000494.html

The assembly code can provide 3 alternative code paths, all handling different
special cases. This can significantly reduce the amount of work to be done per
pixel in the majority of cases. Just because MIPS32R2 does not have a special
instruction for saturated addition, the part implementing it looks very painful
and slow in your code. It is possible to avoid saturated addition altogether at
least for this function on the most common code paths.

I'm not going to demand the absolutely best possible quality of MIPS32R2
optimizations as a requirement for approving patches (the "all or nothing"
approach). But I guess, optimistically, getting more performance is in your
best interests anyway. I just hope that my reply was somewhat useful.

-- 
Best regards,
Siarhei Siamashka

signature.asc
Description: This is a digitally signed message part.

_______________________________________________
Pixman mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/pixman

Re: [Pixman] [cairo] [PATCH] Added MIPS32R2 and MIPS DSP ASE optimized functions

Reply via email to