Re: [Pixman] [PATCH] vmx: implement fast path vmx_composite_over_n_8888

2015-09-29 Thread Oded Gabbay
On Mon, Sep 28, 2015 at 8:52 AM, Siarhei Siamashka
 wrote:
> On Tue, 22 Sep 2015 16:03:22 +0300
> Oded Gabbay  wrote:
>
>> On Tue, Sep 22, 2015 at 3:59 PM, Pekka Paalanen  wrote:
>> > On Mon, 21 Sep 2015 14:22:53 +0300
>> > Oded Gabbay  wrote:
>> >
>> >> On Thu, Sep 10, 2015 at 7:16 PM, Siarhei Siamashka
>> >>  wrote:
>> >> >
>> >> > On Thu, 10 Sep 2015 12:27:18 +0300
>> >> > Oded Gabbay  wrote:
>> >> >
>> >> > > On Sat, Sep 5, 2015 at 10:03 PM, Oded Gabbay  
>> >> > > wrote:
>> >> > > >
>> >> > > > On Fri, Sep 4, 2015 at 3:39 PM, Siarhei Siamashka
>> >> > > >  wrote:
>> >> > > > > Running "lowlevel-blt-bench over_n_" on Playstation3 3.2GHz,
>> >> > > > > Gentoo ppc (32-bit userland) gave the following results:
>> >> > > > >
>> >> > > > > before:  over_n_ =  L1: 147.47  L2: 205.86  M:121.07
>> >> > > > > after:   over_n_ =  L1: 287.27  L2: 261.09  M:133.48
>> >> > > > >
>> >> > > > > Signed-off-by: Siarhei Siamashka 
>> >> > > > > ---
>> >> > > > >  pixman/pixman-vmx.c |   54 
>> >> > > > > +++
>> >> > > > >  1 files changed, 54 insertions(+), 0 deletions(-)
>> >> > > > >
>> >> > > > > diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
>> >> > > > > index a9bd024..9e551b3 100644
>> >> > > > > --- a/pixman/pixman-vmx.c
>> >> > > > > +++ b/pixman/pixman-vmx.c
>> >> > > > > @@ -2745,6 +2745,58 @@ vmx_composite_src_x888_ 
>> >> > > > > (pixman_implementation_t *imp,
>> >> > > > >  }
>> >> > > > >
>> >> > > > >  static void
>> >> > > > > +vmx_composite_over_n_ (pixman_implementation_t *imp,
>> >> > > > > +   pixman_composite_info_t *info)
>> >> > > > > +{
>> >> > > > > +PIXMAN_COMPOSITE_ARGS (info);
>> >> > > > > +uint32_t *dst_line, *dst;
>> >> > > > > +uint32_t src, ia;
>> >> > > > > +int  i, w, dst_stride;
>> >> > > > > +vector unsigned int vdst, vsrc, via;
>> >> > > > > +
>> >> > > > > +src = _pixman_image_get_solid (imp, src_image, 
>> >> > > > > dest_image->bits.format);
>> >> > > > > +
>> >> > > > > +if (src == 0)
>> >> > > > > +   return;
>> >> > > > > +
>> >> > > > > +PIXMAN_IMAGE_GET_LINE (
>> >> > > > > +   dest_image, dest_x, dest_y, uint32_t, dst_stride, 
>> >> > > > > dst_line, 1);
>> >> > > > > +
>> >> > > > > +vsrc = (vector unsigned int){src, src, src, src};
>> >> > > > > +via = negate (splat_alpha (vsrc));
>> >> > > > If we will use the over function (see my next comment), we need to
>> >> > > > remove the negate() from the above statement, as it is done in the
>> >> > > > over function.
>> >> > > >
>> >> > > > > +ia = ALPHA_8 (~src);
>> >> > > > > +
>> >> > > > > +while (height--)
>> >> > > > > +{
>> >> > > > > +   dst = dst_line;
>> >> > > > > +   dst_line += dst_stride;
>> >> > > > > +   w = width;
>> >> > > > > +
>> >> > > > > +   while (w && ((uintptr_t)dst & 15))
>> >> > > > > +   {
>> >> > > > > +   uint32_t d = *dst;
>> >> > > > > +   UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
>> >> > > > > +   *dst++ = d;
>> >> > > > > +   w--;
>> >> > > > > +   }
>> >> > > > > +
>> >> > > > > +   for (i = w / 4; i > 0; i--)
>> >> > > > > +   {
>> >> > > > > +   vdst = pix_multiply (load_128_aligned (dst), via);
>> >> > > > > +   save_128_aligned (dst, pix_add (vsrc, vdst));
>> >> > > >
>> >> > > > Instead of the above two lines, I would simply use the over function
>> >> > > > in vmx, which does exactly that. So:
>> >> > > > vdst = over(vsrc, via, load_128_aligned(dst))
>> >> > > > save_128_aligned (dst, vdst);
>> >> > > >
>> >> > > > I prefer this as it reuses an existing function which helps
>> >> > > > maintainability, and using it has no impact on performance.
>
> The only difference between these variants is that my variant of the
> patch explicitly moved the calculation of the negated alpha out of the
> loop and is a more accurate representation of the assembly code, which
> is expected to be generated. Your variant relies on the compiler to
> do this job properly. As far as the readability/maintainability is
> concerned, there is no significant difference. There is the same
> number of lines of code. Also my variant of the patch is handling
> the main loop and the leading/trailing pixels in a more uniform
> way. Just compare
>
> via = negate (splat_alpha (vsrc));
> with
> ia = ALPHA_8 (~src);
>
> and
>
> vdst = pix_multiply (load_128_aligned (dst), via);
> save_128_aligned (dst, pix_add (vsrc, vdst));
> with
> uint32_t d = *dst;
> UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> *dst++ = d;
>
>> >> > > > > +   dst += 4;
>> >> > > > > +   }
>> >> > > > > +
>> >> > > > > +  

Re: [Pixman] [PATCH] vmx: implement fast path vmx_composite_over_n_8888

2015-09-27 Thread Siarhei Siamashka
On Tue, 22 Sep 2015 16:03:22 +0300
Oded Gabbay  wrote:

> On Tue, Sep 22, 2015 at 3:59 PM, Pekka Paalanen  wrote:
> > On Mon, 21 Sep 2015 14:22:53 +0300
> > Oded Gabbay  wrote:
> >
> >> On Thu, Sep 10, 2015 at 7:16 PM, Siarhei Siamashka
> >>  wrote:
> >> >
> >> > On Thu, 10 Sep 2015 12:27:18 +0300
> >> > Oded Gabbay  wrote:
> >> >
> >> > > On Sat, Sep 5, 2015 at 10:03 PM, Oded Gabbay  
> >> > > wrote:
> >> > > >
> >> > > > On Fri, Sep 4, 2015 at 3:39 PM, Siarhei Siamashka
> >> > > >  wrote:
> >> > > > > Running "lowlevel-blt-bench over_n_" on Playstation3 3.2GHz,
> >> > > > > Gentoo ppc (32-bit userland) gave the following results:
> >> > > > >
> >> > > > > before:  over_n_ =  L1: 147.47  L2: 205.86  M:121.07
> >> > > > > after:   over_n_ =  L1: 287.27  L2: 261.09  M:133.48
> >> > > > >
> >> > > > > Signed-off-by: Siarhei Siamashka 
> >> > > > > ---
> >> > > > >  pixman/pixman-vmx.c |   54 
> >> > > > > +++
> >> > > > >  1 files changed, 54 insertions(+), 0 deletions(-)
> >> > > > >
> >> > > > > diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> >> > > > > index a9bd024..9e551b3 100644
> >> > > > > --- a/pixman/pixman-vmx.c
> >> > > > > +++ b/pixman/pixman-vmx.c
> >> > > > > @@ -2745,6 +2745,58 @@ vmx_composite_src_x888_ 
> >> > > > > (pixman_implementation_t *imp,
> >> > > > >  }
> >> > > > >
> >> > > > >  static void
> >> > > > > +vmx_composite_over_n_ (pixman_implementation_t *imp,
> >> > > > > +   pixman_composite_info_t *info)
> >> > > > > +{
> >> > > > > +PIXMAN_COMPOSITE_ARGS (info);
> >> > > > > +uint32_t *dst_line, *dst;
> >> > > > > +uint32_t src, ia;
> >> > > > > +int  i, w, dst_stride;
> >> > > > > +vector unsigned int vdst, vsrc, via;
> >> > > > > +
> >> > > > > +src = _pixman_image_get_solid (imp, src_image, 
> >> > > > > dest_image->bits.format);
> >> > > > > +
> >> > > > > +if (src == 0)
> >> > > > > +   return;
> >> > > > > +
> >> > > > > +PIXMAN_IMAGE_GET_LINE (
> >> > > > > +   dest_image, dest_x, dest_y, uint32_t, dst_stride, 
> >> > > > > dst_line, 1);
> >> > > > > +
> >> > > > > +vsrc = (vector unsigned int){src, src, src, src};
> >> > > > > +via = negate (splat_alpha (vsrc));
> >> > > > If we will use the over function (see my next comment), we need to
> >> > > > remove the negate() from the above statement, as it is done in the
> >> > > > over function.
> >> > > >
> >> > > > > +ia = ALPHA_8 (~src);
> >> > > > > +
> >> > > > > +while (height--)
> >> > > > > +{
> >> > > > > +   dst = dst_line;
> >> > > > > +   dst_line += dst_stride;
> >> > > > > +   w = width;
> >> > > > > +
> >> > > > > +   while (w && ((uintptr_t)dst & 15))
> >> > > > > +   {
> >> > > > > +   uint32_t d = *dst;
> >> > > > > +   UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> >> > > > > +   *dst++ = d;
> >> > > > > +   w--;
> >> > > > > +   }
> >> > > > > +
> >> > > > > +   for (i = w / 4; i > 0; i--)
> >> > > > > +   {
> >> > > > > +   vdst = pix_multiply (load_128_aligned (dst), via);
> >> > > > > +   save_128_aligned (dst, pix_add (vsrc, vdst));
> >> > > >
> >> > > > Instead of the above two lines, I would simply use the over function
> >> > > > in vmx, which does exactly that. So:
> >> > > > vdst = over(vsrc, via, load_128_aligned(dst))
> >> > > > save_128_aligned (dst, vdst);
> >> > > >
> >> > > > I prefer this as it reuses an existing function which helps
> >> > > > maintainability, and using it has no impact on performance.

The only difference between these variants is that my variant of the
patch explicitly moved the calculation of the negated alpha out of the
loop and is a more accurate representation of the assembly code, which
is expected to be generated. Your variant relies on the compiler to
do this job properly. As far as the readability/maintainability is
concerned, there is no significant difference. There is the same
number of lines of code. Also my variant of the patch is handling
the main loop and the leading/trailing pixels in a more uniform
way. Just compare

via = negate (splat_alpha (vsrc));
with
ia = ALPHA_8 (~src);

and

vdst = pix_multiply (load_128_aligned (dst), via);
save_128_aligned (dst, pix_add (vsrc, vdst));
with
uint32_t d = *dst;
UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
*dst++ = d;

> >> > > > > +   dst += 4;
> >> > > > > +   }
> >> > > > > +
> >> > > > > +   for (i = w % 4; --i >= 0;)
> >> > > > > +   {
> >> > > > > +   uint32_t d = dst[i];
> >> > > > > +   UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> >> > > > > +   dst[i] = d;
> >> > > > > +   }
> >> > > 

Re: [Pixman] [PATCH] vmx: implement fast path vmx_composite_over_n_8888

2015-09-22 Thread Pekka Paalanen
On Mon, 21 Sep 2015 14:22:53 +0300
Oded Gabbay  wrote:

> On Thu, Sep 10, 2015 at 7:16 PM, Siarhei Siamashka
>  wrote:
> >
> > On Thu, 10 Sep 2015 12:27:18 +0300
> > Oded Gabbay  wrote:
> >
> > > On Sat, Sep 5, 2015 at 10:03 PM, Oded Gabbay  
> > > wrote:
> > > >
> > > > On Fri, Sep 4, 2015 at 3:39 PM, Siarhei Siamashka
> > > >  wrote:
> > > > > Running "lowlevel-blt-bench over_n_" on Playstation3 3.2GHz,
> > > > > Gentoo ppc (32-bit userland) gave the following results:
> > > > >
> > > > > before:  over_n_ =  L1: 147.47  L2: 205.86  M:121.07
> > > > > after:   over_n_ =  L1: 287.27  L2: 261.09  M:133.48
> > > > >
> > > > > Signed-off-by: Siarhei Siamashka 
> > > > > ---
> > > > >  pixman/pixman-vmx.c |   54 
> > > > > +++
> > > > >  1 files changed, 54 insertions(+), 0 deletions(-)
> > > > >
> > > > > diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> > > > > index a9bd024..9e551b3 100644
> > > > > --- a/pixman/pixman-vmx.c
> > > > > +++ b/pixman/pixman-vmx.c
> > > > > @@ -2745,6 +2745,58 @@ vmx_composite_src_x888_ 
> > > > > (pixman_implementation_t *imp,
> > > > >  }
> > > > >
> > > > >  static void
> > > > > +vmx_composite_over_n_ (pixman_implementation_t *imp,
> > > > > +   pixman_composite_info_t *info)
> > > > > +{
> > > > > +PIXMAN_COMPOSITE_ARGS (info);
> > > > > +uint32_t *dst_line, *dst;
> > > > > +uint32_t src, ia;
> > > > > +int  i, w, dst_stride;
> > > > > +vector unsigned int vdst, vsrc, via;
> > > > > +
> > > > > +src = _pixman_image_get_solid (imp, src_image, 
> > > > > dest_image->bits.format);
> > > > > +
> > > > > +if (src == 0)
> > > > > +   return;
> > > > > +
> > > > > +PIXMAN_IMAGE_GET_LINE (
> > > > > +   dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 
> > > > > 1);
> > > > > +
> > > > > +vsrc = (vector unsigned int){src, src, src, src};
> > > > > +via = negate (splat_alpha (vsrc));
> > > > If we will use the over function (see my next comment), we need to
> > > > remove the negate() from the above statement, as it is done in the
> > > > over function.
> > > >
> > > > > +ia = ALPHA_8 (~src);
> > > > > +
> > > > > +while (height--)
> > > > > +{
> > > > > +   dst = dst_line;
> > > > > +   dst_line += dst_stride;
> > > > > +   w = width;
> > > > > +
> > > > > +   while (w && ((uintptr_t)dst & 15))
> > > > > +   {
> > > > > +   uint32_t d = *dst;
> > > > > +   UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> > > > > +   *dst++ = d;
> > > > > +   w--;
> > > > > +   }
> > > > > +
> > > > > +   for (i = w / 4; i > 0; i--)
> > > > > +   {
> > > > > +   vdst = pix_multiply (load_128_aligned (dst), via);
> > > > > +   save_128_aligned (dst, pix_add (vsrc, vdst));
> > > >
> > > > Instead of the above two lines, I would simply use the over function
> > > > in vmx, which does exactly that. So:
> > > > vdst = over(vsrc, via, load_128_aligned(dst))
> > > > save_128_aligned (dst, vdst);
> > > >
> > > > I prefer this as it reuses an existing function which helps
> > > > maintainability, and using it has no impact on performance.
> > > >
> > > > > +   dst += 4;
> > > > > +   }
> > > > > +
> > > > > +   for (i = w % 4; --i >= 0;)
> > > > > +   {
> > > > > +   uint32_t d = dst[i];
> > > > > +   UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> > > > > +   dst[i] = d;
> > > > > +   }
> > > > > +}
> > > > > +}
> > > > > +
> > > > > +static void
> > > > >  vmx_composite_over__ (pixman_implementation_t *imp,
> > > > > pixman_composite_info_t *info)
> > > > >  {
> > > > > @@ -3079,6 +3131,8 @@ FAST_NEAREST_MAINLOOP 
> > > > > (vmx___normal_OVER,
> > > > >
> > > > >  static const pixman_fast_path_t vmx_fast_paths[] =
> > > > >  {
> > > > > +PIXMAN_STD_FAST_PATH (OVER, solid,null, a8r8g8b8, 
> > > > > vmx_composite_over_n_),
> > > > > +PIXMAN_STD_FAST_PATH (OVER, solid,null, x8r8g8b8, 
> > > > > vmx_composite_over_n_),
> > > > >  PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, 
> > > > > vmx_composite_over__),
> > > > >  PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, 
> > > > > vmx_composite_over__),
> > > > >  PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, 
> > > > > vmx_composite_over__),
> > > > > --
> > > > > 1.7.8.6
> > > > >

> 
> I took your advice and run benchmarks with the *non* trimmed traces.
> It takes 144 minutes to run a benchmark on POWER8, 3.4GHz 8 Cores (a
> raw machine, not VM).
> 
> I compared with and without this patch and I got:
> 
> image  ocitysmap 659.69-> 611.71

Re: [Pixman] [PATCH] vmx: implement fast path vmx_composite_over_n_8888

2015-09-22 Thread Oded Gabbay
On Tue, Sep 22, 2015 at 3:59 PM, Pekka Paalanen  wrote:
> On Mon, 21 Sep 2015 14:22:53 +0300
> Oded Gabbay  wrote:
>
>> On Thu, Sep 10, 2015 at 7:16 PM, Siarhei Siamashka
>>  wrote:
>> >
>> > On Thu, 10 Sep 2015 12:27:18 +0300
>> > Oded Gabbay  wrote:
>> >
>> > > On Sat, Sep 5, 2015 at 10:03 PM, Oded Gabbay  
>> > > wrote:
>> > > >
>> > > > On Fri, Sep 4, 2015 at 3:39 PM, Siarhei Siamashka
>> > > >  wrote:
>> > > > > Running "lowlevel-blt-bench over_n_" on Playstation3 3.2GHz,
>> > > > > Gentoo ppc (32-bit userland) gave the following results:
>> > > > >
>> > > > > before:  over_n_ =  L1: 147.47  L2: 205.86  M:121.07
>> > > > > after:   over_n_ =  L1: 287.27  L2: 261.09  M:133.48
>> > > > >
>> > > > > Signed-off-by: Siarhei Siamashka 
>> > > > > ---
>> > > > >  pixman/pixman-vmx.c |   54 
>> > > > > +++
>> > > > >  1 files changed, 54 insertions(+), 0 deletions(-)
>> > > > >
>> > > > > diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
>> > > > > index a9bd024..9e551b3 100644
>> > > > > --- a/pixman/pixman-vmx.c
>> > > > > +++ b/pixman/pixman-vmx.c
>> > > > > @@ -2745,6 +2745,58 @@ vmx_composite_src_x888_ 
>> > > > > (pixman_implementation_t *imp,
>> > > > >  }
>> > > > >
>> > > > >  static void
>> > > > > +vmx_composite_over_n_ (pixman_implementation_t *imp,
>> > > > > +   pixman_composite_info_t *info)
>> > > > > +{
>> > > > > +PIXMAN_COMPOSITE_ARGS (info);
>> > > > > +uint32_t *dst_line, *dst;
>> > > > > +uint32_t src, ia;
>> > > > > +int  i, w, dst_stride;
>> > > > > +vector unsigned int vdst, vsrc, via;
>> > > > > +
>> > > > > +src = _pixman_image_get_solid (imp, src_image, 
>> > > > > dest_image->bits.format);
>> > > > > +
>> > > > > +if (src == 0)
>> > > > > +   return;
>> > > > > +
>> > > > > +PIXMAN_IMAGE_GET_LINE (
>> > > > > +   dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 
>> > > > > 1);
>> > > > > +
>> > > > > +vsrc = (vector unsigned int){src, src, src, src};
>> > > > > +via = negate (splat_alpha (vsrc));
>> > > > If we will use the over function (see my next comment), we need to
>> > > > remove the negate() from the above statement, as it is done in the
>> > > > over function.
>> > > >
>> > > > > +ia = ALPHA_8 (~src);
>> > > > > +
>> > > > > +while (height--)
>> > > > > +{
>> > > > > +   dst = dst_line;
>> > > > > +   dst_line += dst_stride;
>> > > > > +   w = width;
>> > > > > +
>> > > > > +   while (w && ((uintptr_t)dst & 15))
>> > > > > +   {
>> > > > > +   uint32_t d = *dst;
>> > > > > +   UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
>> > > > > +   *dst++ = d;
>> > > > > +   w--;
>> > > > > +   }
>> > > > > +
>> > > > > +   for (i = w / 4; i > 0; i--)
>> > > > > +   {
>> > > > > +   vdst = pix_multiply (load_128_aligned (dst), via);
>> > > > > +   save_128_aligned (dst, pix_add (vsrc, vdst));
>> > > >
>> > > > Instead of the above two lines, I would simply use the over function
>> > > > in vmx, which does exactly that. So:
>> > > > vdst = over(vsrc, via, load_128_aligned(dst))
>> > > > save_128_aligned (dst, vdst);
>> > > >
>> > > > I prefer this as it reuses an existing function which helps
>> > > > maintainability, and using it has no impact on performance.
>> > > >
>> > > > > +   dst += 4;
>> > > > > +   }
>> > > > > +
>> > > > > +   for (i = w % 4; --i >= 0;)
>> > > > > +   {
>> > > > > +   uint32_t d = dst[i];
>> > > > > +   UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
>> > > > > +   dst[i] = d;
>> > > > > +   }
>> > > > > +}
>> > > > > +}
>> > > > > +
>> > > > > +static void
>> > > > >  vmx_composite_over__ (pixman_implementation_t *imp,
>> > > > > pixman_composite_info_t *info)
>> > > > >  {
>> > > > > @@ -3079,6 +3131,8 @@ FAST_NEAREST_MAINLOOP 
>> > > > > (vmx___normal_OVER,
>> > > > >
>> > > > >  static const pixman_fast_path_t vmx_fast_paths[] =
>> > > > >  {
>> > > > > +PIXMAN_STD_FAST_PATH (OVER, solid,null, a8r8g8b8, 
>> > > > > vmx_composite_over_n_),
>> > > > > +PIXMAN_STD_FAST_PATH (OVER, solid,null, x8r8g8b8, 
>> > > > > vmx_composite_over_n_),
>> > > > >  PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, 
>> > > > > vmx_composite_over__),
>> > > > >  PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, 
>> > > > > vmx_composite_over__),
>> > > > >  PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, 
>> > > > > vmx_composite_over__),
>> > > > > --
>> > > > > 1.7.8.6
>> > > > >
>
>>
>> I took your advice and run benchmarks with the *non* trimmed traces.
>> It 

Re: [Pixman] [PATCH] vmx: implement fast path vmx_composite_over_n_8888

2015-09-21 Thread Oded Gabbay
On Thu, Sep 10, 2015 at 7:16 PM, Siarhei Siamashka
 wrote:
>
> On Thu, 10 Sep 2015 12:27:18 +0300
> Oded Gabbay  wrote:
>
> > On Sat, Sep 5, 2015 at 10:03 PM, Oded Gabbay  wrote:
> > >
> > > On Fri, Sep 4, 2015 at 3:39 PM, Siarhei Siamashka
> > >  wrote:
> > > > Running "lowlevel-blt-bench over_n_" on Playstation3 3.2GHz,
> > > > Gentoo ppc (32-bit userland) gave the following results:
> > > >
> > > > before:  over_n_ =  L1: 147.47  L2: 205.86  M:121.07
> > > > after:   over_n_ =  L1: 287.27  L2: 261.09  M:133.48
> > > >
> > > > Signed-off-by: Siarhei Siamashka 
> > > > ---
> > > >  pixman/pixman-vmx.c |   54 
> > > > +++
> > > >  1 files changed, 54 insertions(+), 0 deletions(-)
> > > >
> > > > diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> > > > index a9bd024..9e551b3 100644
> > > > --- a/pixman/pixman-vmx.c
> > > > +++ b/pixman/pixman-vmx.c
> > > > @@ -2745,6 +2745,58 @@ vmx_composite_src_x888_ 
> > > > (pixman_implementation_t *imp,
> > > >  }
> > > >
> > > >  static void
> > > > +vmx_composite_over_n_ (pixman_implementation_t *imp,
> > > > +   pixman_composite_info_t *info)
> > > > +{
> > > > +PIXMAN_COMPOSITE_ARGS (info);
> > > > +uint32_t *dst_line, *dst;
> > > > +uint32_t src, ia;
> > > > +int  i, w, dst_stride;
> > > > +vector unsigned int vdst, vsrc, via;
> > > > +
> > > > +src = _pixman_image_get_solid (imp, src_image, 
> > > > dest_image->bits.format);
> > > > +
> > > > +if (src == 0)
> > > > +   return;
> > > > +
> > > > +PIXMAN_IMAGE_GET_LINE (
> > > > +   dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
> > > > +
> > > > +vsrc = (vector unsigned int){src, src, src, src};
> > > > +via = negate (splat_alpha (vsrc));
> > > If we will use the over function (see my next comment), we need to
> > > remove the negate() from the above statement, as it is done in the
> > > over function.
> > >
> > > > +ia = ALPHA_8 (~src);
> > > > +
> > > > +while (height--)
> > > > +{
> > > > +   dst = dst_line;
> > > > +   dst_line += dst_stride;
> > > > +   w = width;
> > > > +
> > > > +   while (w && ((uintptr_t)dst & 15))
> > > > +   {
> > > > +   uint32_t d = *dst;
> > > > +   UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> > > > +   *dst++ = d;
> > > > +   w--;
> > > > +   }
> > > > +
> > > > +   for (i = w / 4; i > 0; i--)
> > > > +   {
> > > > +   vdst = pix_multiply (load_128_aligned (dst), via);
> > > > +   save_128_aligned (dst, pix_add (vsrc, vdst));
> > >
> > > Instead of the above two lines, I would simply use the over function
> > > in vmx, which does exactly that. So:
> > > vdst = over(vsrc, via, load_128_aligned(dst))
> > > save_128_aligned (dst, vdst);
> > >
> > > I prefer this as it reuses an existing function which helps
> > > maintainability, and using it has no impact on performance.
> > >
> > > > +   dst += 4;
> > > > +   }
> > > > +
> > > > +   for (i = w % 4; --i >= 0;)
> > > > +   {
> > > > +   uint32_t d = dst[i];
> > > > +   UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> > > > +   dst[i] = d;
> > > > +   }
> > > > +}
> > > > +}
> > > > +
> > > > +static void
> > > >  vmx_composite_over__ (pixman_implementation_t *imp,
> > > > pixman_composite_info_t *info)
> > > >  {
> > > > @@ -3079,6 +3131,8 @@ FAST_NEAREST_MAINLOOP (vmx___normal_OVER,
> > > >
> > > >  static const pixman_fast_path_t vmx_fast_paths[] =
> > > >  {
> > > > +PIXMAN_STD_FAST_PATH (OVER, solid,null, a8r8g8b8, 
> > > > vmx_composite_over_n_),
> > > > +PIXMAN_STD_FAST_PATH (OVER, solid,null, x8r8g8b8, 
> > > > vmx_composite_over_n_),
> > > >  PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, 
> > > > vmx_composite_over__),
> > > >  PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, 
> > > > vmx_composite_over__),
> > > >  PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, 
> > > > vmx_composite_over__),
> > > > --
> > > > 1.7.8.6
> > > >
> > >
> > > Indeed, this implementation is much better than what I did.
> > > Apparently, converting sse2 to vmx calls isn't the optimal way.
> > > On my POWER8 machine, I get:
> > >
> > > reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
> > > L1  572.29  1539.47 +169.00%
> > > L2  1038.08  1549.04 +49.22%
> > > M  1104.1  1522.22 +37.87%
> > > HT  447.45  676.32 +51.15%
> > > VT  520.82  764.82 +46.85%
> > > R  407.92  570.54 +39.87%
> > > 

Re: [Pixman] [PATCH] vmx: implement fast path vmx_composite_over_n_8888

2015-09-10 Thread Oded Gabbay
On Sat, Sep 5, 2015 at 10:03 PM, Oded Gabbay  wrote:
>
> On Fri, Sep 4, 2015 at 3:39 PM, Siarhei Siamashka
>  wrote:
> > Running "lowlevel-blt-bench over_n_" on Playstation3 3.2GHz,
> > Gentoo ppc (32-bit userland) gave the following results:
> >
> > before:  over_n_ =  L1: 147.47  L2: 205.86  M:121.07
> > after:   over_n_ =  L1: 287.27  L2: 261.09  M:133.48
> >
> > Signed-off-by: Siarhei Siamashka 
> > ---
> >  pixman/pixman-vmx.c |   54 
> > +++
> >  1 files changed, 54 insertions(+), 0 deletions(-)
> >
> > diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> > index a9bd024..9e551b3 100644
> > --- a/pixman/pixman-vmx.c
> > +++ b/pixman/pixman-vmx.c
> > @@ -2745,6 +2745,58 @@ vmx_composite_src_x888_ (pixman_implementation_t 
> > *imp,
> >  }
> >
> >  static void
> > +vmx_composite_over_n_ (pixman_implementation_t *imp,
> > +   pixman_composite_info_t *info)
> > +{
> > +PIXMAN_COMPOSITE_ARGS (info);
> > +uint32_t *dst_line, *dst;
> > +uint32_t src, ia;
> > +int  i, w, dst_stride;
> > +vector unsigned int vdst, vsrc, via;
> > +
> > +src = _pixman_image_get_solid (imp, src_image, 
> > dest_image->bits.format);
> > +
> > +if (src == 0)
> > +   return;
> > +
> > +PIXMAN_IMAGE_GET_LINE (
> > +   dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
> > +
> > +vsrc = (vector unsigned int){src, src, src, src};
> > +via = negate (splat_alpha (vsrc));
> If we will use the over function (see my next comment), we need to
> remove the negate() from the above statement, as it is done in the
> over function.
>
> > +ia = ALPHA_8 (~src);
> > +
> > +while (height--)
> > +{
> > +   dst = dst_line;
> > +   dst_line += dst_stride;
> > +   w = width;
> > +
> > +   while (w && ((uintptr_t)dst & 15))
> > +   {
> > +   uint32_t d = *dst;
> > +   UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> > +   *dst++ = d;
> > +   w--;
> > +   }
> > +
> > +   for (i = w / 4; i > 0; i--)
> > +   {
> > +   vdst = pix_multiply (load_128_aligned (dst), via);
> > +   save_128_aligned (dst, pix_add (vsrc, vdst));
>
> Instead of the above two lines, I would simply use the over function
> in vmx, which does exactly that. So:
> vdst = over(vsrc, via, load_128_aligned(dst))
> save_128_aligned (dst, vdst);
>
> I prefer this as it reuses an existing function which helps
> maintainability, and using it has no impact on performance.
>
> > +   dst += 4;
> > +   }
> > +
> > +   for (i = w % 4; --i >= 0;)
> > +   {
> > +   uint32_t d = dst[i];
> > +   UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> > +   dst[i] = d;
> > +   }
> > +}
> > +}
> > +
> > +static void
> >  vmx_composite_over__ (pixman_implementation_t *imp,
> > pixman_composite_info_t *info)
> >  {
> > @@ -3079,6 +3131,8 @@ FAST_NEAREST_MAINLOOP (vmx___normal_OVER,
> >
> >  static const pixman_fast_path_t vmx_fast_paths[] =
> >  {
> > +PIXMAN_STD_FAST_PATH (OVER, solid,null, a8r8g8b8, 
> > vmx_composite_over_n_),
> > +PIXMAN_STD_FAST_PATH (OVER, solid,null, x8r8g8b8, 
> > vmx_composite_over_n_),
> >  PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, 
> > vmx_composite_over__),
> >  PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, 
> > vmx_composite_over__),
> >  PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, 
> > vmx_composite_over__),
> > --
> > 1.7.8.6
> >
>
> Indeed, this implementation is much better than what I did.
> Apparently, converting sse2 to vmx calls isn't the optimal way.
> On my POWER8 machine, I get:
>
> reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
> L1  572.29  1539.47 +169.00%
> L2  1038.08  1549.04 +49.22%
> M  1104.1  1522.22 +37.87%
> HT  447.45  676.32 +51.15%
> VT  520.82  764.82 +46.85%
> R  407.92  570.54 +39.87%
> RT  148.9  208.77 +40.21%
> Kops/s  1100  1418 +28.91%
>
> So, assuming the change above, this patch is:
>
> Reviewed-by: Oded Gabbay 


Hi Siarhei,

After I fixed my cairo setup (See
http://lists.freedesktop.org/archives/pixman/2015-September/003987.html),
I went and re-tested your patch with cairo trimmed benchmark against
current pixman master.
Unfortunately, it gives a minor slowdown:

Slowdowns
=
t-firefox-scrolling  1232.30 -> 1295.75 :  1.05x slowdown

even if I apply your patch over my latest patch-set (that was inspired
by your patch), I still get a slowdown, albeit in 

Re: [Pixman] [PATCH] vmx: implement fast path vmx_composite_over_n_8888

2015-09-10 Thread Siarhei Siamashka
On Thu, 10 Sep 2015 12:27:18 +0300
Oded Gabbay  wrote:

> On Sat, Sep 5, 2015 at 10:03 PM, Oded Gabbay  wrote:
> >
> > On Fri, Sep 4, 2015 at 3:39 PM, Siarhei Siamashka
> >  wrote:
> > > Running "lowlevel-blt-bench over_n_" on Playstation3 3.2GHz,
> > > Gentoo ppc (32-bit userland) gave the following results:
> > >
> > > before:  over_n_ =  L1: 147.47  L2: 205.86  M:121.07
> > > after:   over_n_ =  L1: 287.27  L2: 261.09  M:133.48
> > >
> > > Signed-off-by: Siarhei Siamashka 
> > > ---
> > >  pixman/pixman-vmx.c |   54 
> > > +++
> > >  1 files changed, 54 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> > > index a9bd024..9e551b3 100644
> > > --- a/pixman/pixman-vmx.c
> > > +++ b/pixman/pixman-vmx.c
> > > @@ -2745,6 +2745,58 @@ vmx_composite_src_x888_ 
> > > (pixman_implementation_t *imp,
> > >  }
> > >
> > >  static void
> > > +vmx_composite_over_n_ (pixman_implementation_t *imp,
> > > +   pixman_composite_info_t *info)
> > > +{
> > > +PIXMAN_COMPOSITE_ARGS (info);
> > > +uint32_t *dst_line, *dst;
> > > +uint32_t src, ia;
> > > +int  i, w, dst_stride;
> > > +vector unsigned int vdst, vsrc, via;
> > > +
> > > +src = _pixman_image_get_solid (imp, src_image, 
> > > dest_image->bits.format);
> > > +
> > > +if (src == 0)
> > > +   return;
> > > +
> > > +PIXMAN_IMAGE_GET_LINE (
> > > +   dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
> > > +
> > > +vsrc = (vector unsigned int){src, src, src, src};
> > > +via = negate (splat_alpha (vsrc));
> > If we will use the over function (see my next comment), we need to
> > remove the negate() from the above statement, as it is done in the
> > over function.
> >
> > > +ia = ALPHA_8 (~src);
> > > +
> > > +while (height--)
> > > +{
> > > +   dst = dst_line;
> > > +   dst_line += dst_stride;
> > > +   w = width;
> > > +
> > > +   while (w && ((uintptr_t)dst & 15))
> > > +   {
> > > +   uint32_t d = *dst;
> > > +   UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> > > +   *dst++ = d;
> > > +   w--;
> > > +   }
> > > +
> > > +   for (i = w / 4; i > 0; i--)
> > > +   {
> > > +   vdst = pix_multiply (load_128_aligned (dst), via);
> > > +   save_128_aligned (dst, pix_add (vsrc, vdst));
> >
> > Instead of the above two lines, I would simply use the over function
> > in vmx, which does exactly that. So:
> > vdst = over(vsrc, via, load_128_aligned(dst))
> > save_128_aligned (dst, vdst);
> >
> > I prefer this as it reuses an existing function which helps
> > maintainability, and using it has no impact on performance.
> >
> > > +   dst += 4;
> > > +   }
> > > +
> > > +   for (i = w % 4; --i >= 0;)
> > > +   {
> > > +   uint32_t d = dst[i];
> > > +   UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> > > +   dst[i] = d;
> > > +   }
> > > +}
> > > +}
> > > +
> > > +static void
> > >  vmx_composite_over__ (pixman_implementation_t *imp,
> > > pixman_composite_info_t *info)
> > >  {
> > > @@ -3079,6 +3131,8 @@ FAST_NEAREST_MAINLOOP (vmx___normal_OVER,
> > >
> > >  static const pixman_fast_path_t vmx_fast_paths[] =
> > >  {
> > > +PIXMAN_STD_FAST_PATH (OVER, solid,null, a8r8g8b8, 
> > > vmx_composite_over_n_),
> > > +PIXMAN_STD_FAST_PATH (OVER, solid,null, x8r8g8b8, 
> > > vmx_composite_over_n_),
> > >  PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, 
> > > vmx_composite_over__),
> > >  PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, 
> > > vmx_composite_over__),
> > >  PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, 
> > > vmx_composite_over__),
> > > --
> > > 1.7.8.6
> > >
> >
> > Indeed, this implementation is much better than what I did.
> > Apparently, converting sse2 to vmx calls isn't the optimal way.
> > On my POWER8 machine, I get:
> >
> > reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
> > L1  572.29  1539.47 +169.00%
> > L2  1038.08  1549.04 +49.22%
> > M  1104.1  1522.22 +37.87%
> > HT  447.45  676.32 +51.15%
> > VT  520.82  764.82 +46.85%
> > R  407.92  570.54 +39.87%
> > RT  148.9  208.77 +40.21%
> > Kops/s  1100  1418 +28.91%
> >
> > So, assuming the change above, this patch is:
> >
> > Reviewed-by: Oded Gabbay 
> 
> 
> Hi Siarhei,

Hi,
 
> After I fixed my cairo setup (See
> 

Re: [Pixman] [PATCH] vmx: implement fast path vmx_composite_over_n_8888

2015-09-04 Thread Siarhei Siamashka
On Fri,  4 Sep 2015 15:39:00 +0300
Siarhei Siamashka  wrote:

> Running "lowlevel-blt-bench over_n_" on Playstation3 3.2GHz,
> Gentoo ppc (32-bit userland) gave the following results:
> 
> before:  over_n_ =  L1: 147.47  L2: 205.86  M:121.07
> after:   over_n_ =  L1: 287.27  L2: 261.09  M:133.48
> 
> Signed-off-by: Siarhei Siamashka 
> ---
>  pixman/pixman-vmx.c |   54 
> +++
>  1 files changed, 54 insertions(+), 0 deletions(-)
> 
> diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> index a9bd024..9e551b3 100644
> --- a/pixman/pixman-vmx.c
> +++ b/pixman/pixman-vmx.c
> @@ -2745,6 +2745,58 @@ vmx_composite_src_x888_ (pixman_implementation_t 
> *imp,
>  }
>  
>  static void
> +vmx_composite_over_n_ (pixman_implementation_t *imp,
> +   pixman_composite_info_t *info)
> +{
> +PIXMAN_COMPOSITE_ARGS (info);
> +uint32_t *dst_line, *dst;
> +uint32_t src, ia;
> +int  i, w, dst_stride;
> +vector unsigned int vdst, vsrc, via;
> +
> +src = _pixman_image_get_solid (imp, src_image, dest_image->bits.format);
> +
> +if (src == 0)
> + return;
> +
> +PIXMAN_IMAGE_GET_LINE (
> + dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
> +
> +vsrc = (vector unsigned int){src, src, src, src};
> +via = negate (splat_alpha (vsrc));
> +ia = ALPHA_8 (~src);
> +
> +while (height--)
> +{
> + dst = dst_line;
> + dst_line += dst_stride;
> + w = width;
> +
> + while (w && ((uintptr_t)dst & 15))
> + {
> + uint32_t d = *dst;
> + UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> + *dst++ = d;
> + w--;
> + }
> +
> + for (i = w / 4; i > 0; i--)
> + {

And BTW, there is a major speed boost on Playstation3 (Cell BE)
if we add software prefetch here:

   __builtin_prefetch (dst + 32, 0, 0);

In this case we get the following result from lowlevel-blt-bench:

 over_n_ =  L1: 293.84  L2: 303.40  M:224.09 (153.67%)
 HT:131.41  VT:105.68  R: 74.14  RT: 25.80 ( 214Kops/s)

The 'M' performance increases from ~133 MPix/s to ~224 MPix/s.

But I guess, POWER8 and other big desktop/server powerpc systems should
have automatic hardware prefetch and have no need for software prefetch.

> + vdst = pix_multiply (load_128_aligned (dst), via);
> + save_128_aligned (dst, pix_add (vsrc, vdst));
> + dst += 4;
> + }
> +
> + for (i = w % 4; --i >= 0;)
> + {
> + uint32_t d = dst[i];
> + UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> + dst[i] = d;
> + }
> +}
> +}
> +
> +static void
>  vmx_composite_over__ (pixman_implementation_t *imp,
> pixman_composite_info_t *info)
>  {
> @@ -3079,6 +3131,8 @@ FAST_NEAREST_MAINLOOP (vmx___normal_OVER,
>  
>  static const pixman_fast_path_t vmx_fast_paths[] =
>  {
> +PIXMAN_STD_FAST_PATH (OVER, solid,null, a8r8g8b8, 
> vmx_composite_over_n_),
> +PIXMAN_STD_FAST_PATH (OVER, solid,null, x8r8g8b8, 
> vmx_composite_over_n_),
>  PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, 
> vmx_composite_over__),
>  PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, 
> vmx_composite_over__),
>  PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, 
> vmx_composite_over__),

-- 
Best regards,
Siarhei Siamashka
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman